Skip to content

Conversation

omertuc
Copy link
Contributor

@omertuc omertuc commented Apr 11, 2022

While the networkRange function calculates lastIP, it uses
net.IPv4zero or net.IPv6zero as a starting value from which it
builds the final last IP. The issue is that these are global
variables defined in the standard library, and modifying them affects
all future users of those variables across the program.

This leads to a not-so-rare race condition when creating multiple
libvirt networks. These networks are being created concurrently and they
both race modifying that global variable. Their lastIP DHCP
configuration gets mixed up as result. The manner in which the
configuration gets mixed up is unpredictable, but it could lead to a
corrupt configuration that cannot be applied.

The solution is to copy the zero IP addresses to a new variable rather
than use them directly. Instead for simplicity I just instantiate an
empty slice with the same length as the net IP, which would be filled
with zeroes anyway.

This commit also solves another unrelated bug in getNetworkIPConfig
where the ^ operator was used as if it's a power operator, while in
reality it's a bitwise-xor that leads to slightly incorrect results.
This makes the validation only slightly wrong and is overly strict (e.g.
it will not allow you to create /28 IPv4 network even though such
network has far more available addresses than the condition blocks)

A similar race condition can be simply reproduced and visualized with
this short go code:

package main

import (
	"fmt"
	"net"
	"sync"
)

func getNetMaskWithMax16Bits(m net.IPMask) net.IPMask {
	ones, bits := m.Size()
	if bits-ones > 16 {
		if bits == 128 {
			return net.CIDRMask(128-16, 128)
		}
		return net.CIDRMask(32-16, 32)
	}
	return m
}

func networkRange(network *net.IPNet) (net.IP, net.IP) {
	netIP := network.IP.To4()
	lastIP := net.IPv4zero.To4()
	if netIP == nil {
		netIP = network.IP.To16()
		lastIP = net.IPv6zero.To16()
	}
	firstIP := netIP.Mask(network.Mask)
	intMask := getNetMaskWithMax16Bits(network.Mask)
	for i := 0; i < len(lastIP); i++ {
		lastIP[i] = netIP[i] | ^intMask[i]
	}
	return firstIP, lastIP
}

func update(wg *sync.WaitGroup, cidr string, id int) {
	address := cidr
	_, ipNet, _ := net.ParseCIDR(address)
	start, end := networkRange(ipNet)
	start[len(start)-1]++
	start[len(start)-1]++
	end[len(end)-1]--
	fmt.Printf("Start %d: %s\n", id, start.String())
	fmt.Printf("End %d: %s\n", id, end.String())
	wg.Done()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go update(&wg, "192.168.145.0/24", 0)
	go update(&wg, "192.168.127.0/24", 1)
	wg.Wait()
}

Then run:

watch -n0.1 -d go run main.go

To see it happen for short moments.

Alternatively, create this main.tf file:

terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
      version = "0.6.14"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "net" {
  name = "omer-net"
  mode = "nat"
  addresses = ["192.168.127.0/24"]
  autostart = true
}

resource "libvirt_network" "secondary_net" {
  name = "omer-second-net"
  mode = "nat"
  addresses = ["192.168.145.0/24"]
  autostart = true
}

And run:

while true; do
    if ! (terraform apply -auto-approve && terraform destroy -auto-approve); then
        break
    fi
done

Until the bug reproduces with the following error:

│ Error: error defining libvirt network: internal error: range
192.168.127.2 - 192.168.145.254 is not entirely within network
    192.168.127.1/24 -   <network>
    │       <name>omer-net</name>
    │       <forward mode="nat"></forward>
    │       <bridge stp="on"></bridge>
    │       <dns enable="no"></dns>
    │       <ip address="192.168.127.1" family="ipv4" prefix="24">
    │           <dhcp>
    │               <range start="192.168.127.2"
    end="192.168.145.254"></range>
    │           </dhcp>
    │       </ip>
    │       <options
    xmlns="http://libvirt.org/schemas/network/dnsmasq/1.0"></options>
    │   </network>
    │
    │   with libvirt_network.net,
    │   on main.tf line 14, in resource "libvirt_network" "net":
    │   14: resource "libvirt_network" "net" {

Please make sure you read the contributor documentation before opening a Pull Request.

@omertuc omertuc force-pushed the race branch 2 times, most recently from cbf9d08 to 5143683 Compare April 12, 2022 08:25
While the `networkRange` function calculates `lastIP`, it uses
`net.IPv4zero` or `net.IPv6zero` as a starting value from which it
builds the final last IP. The issue is that these are global
variables defined in the standard library, and modifying them affects
all future users of those variables across the program.

This leads to a not-so-rare race condition when creating multiple
libvirt networks. These networks are being created concurrently and they
both race modifying that global variable. Their `lastIP` DHCP
configuration gets mixed up as result. The manner in which the
configuration gets mixed up is unpredictable, but it could lead to a
corrupt configuration that cannot be applied.

The solution is to copy the zero IP addresses to a new variable rather
than use them directly. Instead for simplicity I just instantiate an
empty slice with the same length as the net IP, which would be filled
with zeroes anyway.

This commit also solves another unrelated bug in `getNetworkIPConfig`
where the `^` operator was used as if it's a power operator, while in
reality it's a bitwise-xor that leads to slightly incorrect results.
This makes the validation only slightly wrong and is overly strict (e.g.
it will not allow you to create `/28` IPv4 network even though such
network has far more available addresses than the condition blocks)

A similar race condition can be simply reproduced and visualized with
this short go code:

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

func getNetMaskWithMax16Bits(m net.IPMask) net.IPMask {
	ones, bits := m.Size()
	if bits-ones > 16 {
		if bits == 128 {
			return net.CIDRMask(128-16, 128)
		}
		return net.CIDRMask(32-16, 32)
	}
	return m
}

func networkRange(network *net.IPNet) (net.IP, net.IP) {
	netIP := network.IP.To4()
	lastIP := net.IPv4zero.To4()
	if netIP == nil {
		netIP = network.IP.To16()
		lastIP = net.IPv6zero.To16()
	}
	firstIP := netIP.Mask(network.Mask)
	intMask := getNetMaskWithMax16Bits(network.Mask)
	for i := 0; i < len(lastIP); i++ {
		lastIP[i] = netIP[i] | ^intMask[i]
	}
	return firstIP, lastIP
}

func update(wg *sync.WaitGroup, cidr string, id int) {
	address := cidr
	_, ipNet, _ := net.ParseCIDR(address)
	start, end := networkRange(ipNet)
	start[len(start)-1]++
	start[len(start)-1]++
	end[len(end)-1]--
	fmt.Printf("Start %d: %s\n", id, start.String())
	fmt.Printf("End %d: %s\n", id, end.String())
	wg.Done()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go update(&wg, "192.168.145.0/24", 0)
	go update(&wg, "192.168.127.0/24", 1)
	wg.Wait()
}
```

Then run:

```bash
watch -n0.1 -d go run main.go
```

To see it happen for short moments.

Alternatively, create this `main.tf` file:

```terraform
terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
      version = "0.6.14"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "net" {
  name = "omer-net"
  mode = "nat"
  addresses = ["192.168.127.0/24"]
  autostart = true
}

resource "libvirt_network" "secondary_net" {
  name = "omer-second-net"
  mode = "nat"
  addresses = ["192.168.145.0/24"]
  autostart = true
}
```

And run:

```bash
while true; do
    if ! (terraform apply -auto-approve && terraform destroy -auto-approve); then
        break
    fi
done
```

Until the bug reproduces with the following error:

```
│ Error: error defining libvirt network: internal error: range
192.168.127.2 - 192.168.145.254 is not entirely within network
    192.168.127.1/24 -   <network>
    │       <name>omer-net</name>
    │       <forward mode="nat"></forward>
    │       <bridge stp="on"></bridge>
    │       <dns enable="no"></dns>
    │       <ip address="192.168.127.1" family="ipv4" prefix="24">
    │           <dhcp>
    │               <range start="192.168.127.2"
    end="192.168.145.254"></range>
    │           </dhcp>
    │       </ip>
    │       <options
    xmlns="http://libvirt.org/schemas/network/dnsmasq/1.0"></options>
    │   </network>
    │
    │   with libvirt_network.net,
    │   on main.tf line 14, in resource "libvirt_network" "net":
    │   14: resource "libvirt_network" "net" {
```
@ericroy
Copy link

ericroy commented Apr 27, 2022

Maintainers, can we merge this? This bug has been biting me semi-regularly for months. Thanks for tracking it down @omertuc !

@osherdp
Copy link

osherdp commented Jun 22, 2022

@dmacvicar

@dmacvicar dmacvicar merged commit e5bec5d into dmacvicar:main Jul 8, 2022
@dmacvicar
Copy link
Owner

dmacvicar commented Jul 8, 2022

Thanks for the patch. I will do a release once we get #950 in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants