
Conversation

Contributor

@robmry robmry commented Nov 27, 2023

- What I did

The original ticket reports that the daemon won't start if the fixed-cidr-v6 address is changed from fe80::/64 to, for example, fe80::/80 ... those are link-local addresses, which turn out to be a special case. But the same applies to any IPv6 subnet - if the daemon is started with IPv6 enabled on the default bridge, and fixed-cidr-v6 is then changed to an overlapping subnet, the daemon fails to start. (For example, 2001:db8::/64 to 2001:db8::/80.)

This change makes it possible to change the configured address for the default bridge to an overlapping subnet (without manually deleting the docker0 bridge).

Link-local addresses were a special case because they were not released when a container stopped. So, on the default bridge - where IP addresses are derived from MAC addresses, and MAC addresses are recycled - with fixed-cidr-v6=fe80::/64, once one container had stopped no new containers could be started. For non-default bridge networks with a link-local subnet, addresses leaked when containers stopped.

Linux always uses the fe80::/64 network; the 'veth' devices get addresses in that subnet. So, with this change, LL subnets that overlap (but aren't equal to) fe80::/64 are not allowed; other LL subnets are ok.
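To illustrate - a minimal sketch of that overlap test, with a hypothetical helper name (not the exact code in this PR). Two prefixes overlap iff either contains the other's base address:

package main

import (
	"fmt"
	"net"
)

// The prefix Linux assigns kernel link-local addresses from, on every interface.
var linkLocalNet = mustCIDR("fe80::/64")

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// clashesWithLinkLocal reports whether subnet overlaps fe80::/64
// without being equal to it.
func clashesWithLinkLocal(subnet *net.IPNet) bool {
	if subnet.String() == linkLocalNet.String() {
		return false // exactly fe80::/64 is allowed
	}
	return linkLocalNet.Contains(subnet.IP) || subnet.Contains(linkLocalNet.IP)
}

func main() {
	for _, s := range []string{"fe80::/64", "fe80::/80", "fe80:1234::/64"} {
		fmt.Println(s, clashesWithLinkLocal(mustCIDR(s))) // false, true, false
	}
}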

- How I did it

The code was doing this:

setupBridgeIPv6():
  - programIPv6Address(<default link local address>)
  - programIPv6Address(<configured address>)
if !config.InhibitIPv4 then setupVerifyAndReconcile():
  - some IPv4 stuff
  - remove global-unicast IPv6 addresses, apart from the configured one

So, it was trying to add the new configured address before deleting the old one. No part of the code had a complete picture of the IPv6 addresses the bridge needed, so it couldn't properly reconcile the existing and required configuration.

This change calculates the IPv6 addresses needed on the bridge, then reconciles them with the addresses on an existing bridge by deleting then adding addresses as required - all in setupBridgeIPv6().

If the configured subnet is link-local, it replaces the default link-local address/network.

Link-local addresses are now returned to their address pool when containers are stopped.
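A minimal sketch of that reconcile step, with a hypothetical helper name and simplified logic (the real work happens in setupBridgeIPv6(), which also has to decide what to do about kernel-assigned link-local addresses):

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// reconcileIPv6Addresses diffs the link's current IPv6 addresses
// against the desired set: stale addresses are deleted first, missing
// ones added after, so changing fixed-cidr-v6 from 2001:db8::/64 to
// 2001:db8::/80 never trips an EEXIST from the kernel.
func reconcileIPv6Addresses(nlh *netlink.Handle, link netlink.Link, wanted []netlink.Addr) error {
	existing, err := nlh.AddrList(link, netlink.FAMILY_V6)
	if err != nil {
		return fmt.Errorf("listing bridge IPv6 addresses: %w", err)
	}
	want := make(map[string]netlink.Addr, len(wanted))
	for _, a := range wanted {
		want[a.IPNet.String()] = a
	}
	for _, a := range existing {
		key := a.IPNet.String()
		if _, ok := want[key]; ok {
			delete(want, key) // already programmed; leave it alone
			continue
		}
		a := a
		if err := nlh.AddrDel(link, &a); err != nil {
			return fmt.Errorf("removing stale address %s: %w", key, err)
		}
	}
	for _, a := range want {
		a := a
		if err := nlh.AddrAdd(link, &a); err != nil {
			return fmt.Errorf("adding address %s: %w", a.IPNet, err)
		}
	}
	return nil
}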

Also...

The only remaining callers of bridgeInterface.addresses() want either IPv4 or IPv6 addresses, but never both. So, the address family is now a parameter.
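Presumably along these lines - a sketch of the reworked accessor, using the vishvananda/netlink family constants:

func (i *bridgeInterface) addresses(family int) ([]netlink.Addr, error) {
	addrs, err := i.nlh.AddrList(i.Link, family)
	if err != nil {
		return nil, fmt.Errorf("failed to retrieve addresses: %w", err)
	}
	return addrs, nil
}

// Callers then pick a family explicitly:
//   v4, err := i.addresses(netlink.FAMILY_V4)
//   v6, err := i.addresses(netlink.FAMILY_V6)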

Because the IPv6 addresses are now calculated and applied in one go, there's no need for setupVerifyAndReconcile() to check the set of IPv6 addresses on the bridge. And, that code was guarded by !config.InhibitIPv4, which can't have been right. So, removed its IPv6 parts, and added IPv4 to the function's name.

Fixes #46829

- How to verify it

New integration test, and updated unit tests.

Overlapping change to --fixed-cidr-v6:

# dockerd  --ipv6 --experimental --ip6tables --fixed-cidr-v6=fc00:1234::/64
...
# ip a
[...]
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:b8:f5:44:23 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fc00:1234::1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link
       valid_lft forever preferred_lft forever

# dockerd  --ipv6 --experimental --ip6tables --fixed-cidr-v6=fc00:1234::/80
...
# ip a
[...]
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:b8:f5:44:23 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fc00:1234::1/80 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link
       valid_lft forever preferred_lft forever

Explicit errors if --fixed-cidr-v6 is link local and overlaps with (but isn't equal to) fe80::/64:

# dockerd --ipv6 --experimental --ip6tables --fixed-cidr-v6=fe80::/65 --validate
unable to configure the Docker daemon with file /etc/docker/daemon.json: merged configuration validation from file and command line flags failed: invalid fixed-cidr-v6: clash with the Link-Local prefix 'fe80::/64'

# docker network create --subnet=fe80::/65 testnet
Error response from daemon: clash with the Link-Local prefix 'fe80::/64'

- Description for the changelog

Allow a bridge network's IPv6 subnet to be changed to an overlapping subnet; support link-local subnets.

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from c8fffaa to 1e837a9 on November 27, 2023 11:46
@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch 2 times, most recently from c8ac881 to e24b8d8 on November 30, 2023 15:32
@thaJeztah
Member

Looks like some issues with Windows CI; this part looks like a possible culprit: unix://C:\Users\RUNNER~1\AppData\Local\Temp\docker-integration\da6be70e5f239.sock

=== Failed
=== FAIL: github.com/docker/docker/integration/networking TestDefaultBridgeIPv6ULSubnetChange (0.01s)
    bridge_test.go:392: [da6be70e5f239] failed to start daemon with arguments [--data-root D:\a\moby\moby\go\src\github.com\docker\docker\bundles\tmp\TestDefaultBridgeIPv6ULSubnetChange\da6be70e5f239\root --exec-root C:\Users\RUNNER~1\AppData\Local\Temp\dxr\da6be70e5f239 --pidfile D:\a\moby\moby\go\src\github.com\docker\docker\bundles\tmp\TestDefaultBridgeIPv6ULSubnetChange\da6be70e5f239\docker.pid --userland-proxy=true --containerd-namespace da6be70e5f239 --containerd-plugins-namespace da6be70e5f239p --containerd /var/run/docker/containerd/containerd.sock --host unix://C:\Users\RUNNER~1\AppData\Local\Temp\docker-integration\da6be70e5f239.sock -D --experimental --ipv6 --ip6tables --fixed-cidr-v6=fc00:1234::/32] : protocol not available

=== FAIL: github.com/docker/docker/integration/networking TestDefaultBridgeIPv6LLSubnetChange (0.00s)
    bridge_test.go:407: [d698945a0cbce] failed to start daemon with arguments [--data-root D:\a\moby\moby\go\src\github.com\docker\docker\bundles\tmp\TestDefaultBridgeIPv6LLSubnetChange\d698945a0cbce\root --exec-root C:\Users\RUNNER~1\AppData\Local\Temp\dxr\d698945a0cbce --pidfile D:\a\moby\moby\go\src\github.com\docker\docker\bundles\tmp\TestDefaultBridgeIPv6LLSubnetChange\d698945a0cbce\docker.pid --userland-proxy=true --containerd-namespace d698945a0cbce --containerd-plugins-namespace d698945a0cbcep --containerd /var/run/docker/containerd/containerd.sock --host unix://C:\Users\RUNNER~1\AppData\Local\Temp\docker-integration\d698945a0cbce.sock -D --experimental --ipv6 --ip6tables --fixed-cidr-v6=fe80:1234::/32] : protocol not available

@thaJeztah
Member

Perhaps @akerouanton can have a look at the change "in general" though 👍

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from e24b8d8 to 84b0a98 on November 30, 2023 16:36
@robmry
Contributor Author

robmry commented Nov 30, 2023

Looks like some issues with Windows CI

Thank you - hopefully zapped now (second round of Windows-specific fixes for that push, oops).

@thaJeztah
Member

🤞 Ignore the other failing test for now; that one is known to be flaky (I'll kick CI once it finishes);

=== FAIL: libnetwork/networkdb TestNetworkDBCRUDTableEntries (7.60s)
    networkdb_test.go:309: Entry existence verification test failed for node2(e0f6cdba41a7)

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from 84b0a98 to 219ca02 on December 1, 2023 09:40
Member

@thaJeztah thaJeztah left a comment


Mostly "generic" comments; let me know if these are useful 😄

Comment on lines 179 to 181
if err := bridge.ValidateFixedCIDRV6(conf.FixedCIDRv6); err != nil {
return err
}
Member


Is bridge.ValidateFixedCIDRV6 only applicable to this specific option, or can it be considered a more "generic" validation of IPv6 CIDRs?

If it can be considered "generic", we could consider decorating the error here, instead of in that function, i.e.;

Suggested change:

-if err := bridge.ValidateFixedCIDRV6(conf.FixedCIDRv6); err != nil {
-	return err
-}
+if err := bridge.ValidateFixedCIDRV6(conf.FixedCIDRv6); err != nil {
+	return errors.Wrap(err, "invalid fixed-cidr-v6")
+}

I was also considering if we need to "type" the error with the correct errdefs type (to indicate it's an invalid parameter); we don't seem to be doing that for other ones though, so that may be something to look at in a wider scope;

moby/errdefs/defs.go, lines 8 to 11 in f179243:

// ErrInvalidParameter signals that the user input is invalid
type ErrInvalidParameter interface {
	InvalidParameter()
}

Contributor Author


Is bridge.ValidateFixedCIDRV6 only applicable to this specific option, or can it be considered a more "generic" validation of IPv6 CIDRs?

It's not generic IPv6 CIDR validation, because it checks that the network doesn't overlap with the default link-local subnet fe80::/64.

But wrapping is clearly better than what I did with IPv6ReservedLinkLocalError to make it work for both fixed-cidr-v6 and the --subnet given to a non-default bridge network... I'll sort that out, thank you.

I was also considering if we need to "type" the error with the correct errdefs type

Yes - the IPv6ReservedLinkLocalError that's returned if the address is v6 and overlaps with fe80::/64 has the InvalidParameter. I'll make sure the other errors returned by bridge.ValidateFixedCIDRV6() do too.
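Putting the two suggestions together, the call site could end up roughly like this (sketch only):

// errdefs is github.com/docker/docker/errdefs,
// errors is github.com/pkg/errors.
if err := bridge.ValidateFixedCIDRV6(conf.FixedCIDRv6); err != nil {
	return errdefs.InvalidParameter(errors.Wrap(err, "invalid fixed-cidr-v6"))
}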

if err != nil {
return fmt.Errorf("fixed-cidr-v6 is not a valid subnet '%s'", val)
}
if ipNet.IP.To4() != nil {
Member


To4() also looks to be supporting IPv4 in IPv6; https://pkg.go.dev/net#IP.To4
https://cs.opensource.google/go/go/+/refs/tags/go1.21.4:src/net/ip.go;l=212-223

Silly question; is that invalid? (this feels like a "if a tree falls in the forest ..." kind of question 🙈 😂)

Contributor


ipNet.IP is normalized (read: the host-part of the address is masked off) so this test would only evaluate to true if val is in the range ::ffff:0:0/96-::ffff:ffff:ffff/128 inclusive. In other words, it tests whether the configured prefix overlaps with the IPv4-mapped IPv6 address range. Addresses in that range represent the addresses of IPv4 nodes, which are special-cased in the kernel. Assigning such addresses as native IPv6 addresses is not standards-compliant and is liable to cause surprising problems. So yes, it should be invalid to set fixed-cidr-v6 to a prefix which overlaps the IPv4-mapped IPv6 address range.

Contributor Author


I think To4() is the usual way to check whether a net.IP is IPv4 or IPv6.

The IsZeros() stuff in its implementation is to cope with an IPv4 address that's been To16()'d, turning it into a 128-bit representation that makes life easier for a dual-stack IP implementation (it's an IPv4-Mapped IPv6 Address).

So, someone could say --fixed-cidr-v6=::ffff:a01:203 but it's still an IPv4 address, 10.1.2.3.
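A quick demonstration with the standard library:

package main

import (
	"fmt"
	"net"
)

func main() {
	// An IPv4-mapped IPv6 address still reports as IPv4 via To4() ...
	fmt.Println(net.ParseIP("::ffff:10.1.2.3").To4()) // 10.1.2.3
	// ... while a genuine IPv6 address does not.
	fmt.Println(net.ParseIP("2001:db8::1").To4()) // <nil>
}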

Comment on lines 55 to 56
assert.Check(t, addrs != nil)
assert.Check(t, len(addrs) == 0)
Member


Mostly curious here; in many situations, an empty slice is "roughly" identical to a nil slice; e.g., the following passes;

import (
	"testing"

	"gotest.tools/v3/assert"
	is "gotest.tools/v3/assert/cmp"
)

func foo() []string {
	return nil
}

func TestFoo(t *testing.T) {
	bla := foo()
	assert.Assert(t, is.Len(bla, 0))
}

For our use, is there an important distinction between the two? (empty slice vs nil)?

Contributor

@corhere corhere Dec 1, 2023


When designing interfaces, avoid making a distinction between a nil slice and a non-nil, zero-length slice, as this can lead to subtle programming errors.

Suggested change:

-assert.Check(t, addrs != nil)
-assert.Check(t, len(addrs) == 0)
+assert.Check(t, is.Len(addrs, 0))

Contributor Author


The old version of bridgeInterface.addresses() fetched IPv4 and IPv6 addresses, returning two slices. If there were no IPv4 addresses it converted that empty slice to a nil. But, it didn't do that for IPv6 addresses ...

	if len(v4addr) == 0 {
		return nil, v6addr, nil
	}
	return v4addr, v6addr, nil

I couldn't see why, but it seemed weird so I got rid of it. I was just making that change explicit really - I'm happy to remove it though, it is pointless! (I guess I over-reacted by explicitly returning an empty slice in the no-interface case! I'll get rid of that.)

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from 219ca02 to 75f3784 on December 1, 2023 19:24
@corhere
Contributor

corhere commented Dec 1, 2023

As discussed in this week's call, I don't think we should support resolving DNS names to LL addresses because it is brittle: attaching a container to a second network would break inter-container connectivity as zone identifiers cannot be included in AAAA records, so the LL addresses would become ambiguous. Furthermore, anything a user would want to do by assigning link-local container addresses could be done by assigning ULA container addresses instead, without any of the gotchas, brittleness or caveats. I think it's safe to say that configuring a bridge to allocate containers addresses which overlap the link-local range fe80::/64 has no legitimate use cases which outweigh the downsides, and so is always a misconfiguration that we should not accept as valid. (Explicitly assigning a secondary link-local address to a particular container endpoint could be useful in some advanced scenarios, however, as user-configured static addresses could be passed around out-of-band with the requisite zone identifier.)

The whole fe80::/10 prefix is reserved for link-local unicast addressing, but only the fe80::/64 prefix has been allocated for link-local unicast addresses. Addresses contained within the fe80::/10 prefix but not fe80::/64 are therefore not going to collide with auto-assigned LL addresses so it would not necessarily be a footgun if we allowed users to configure bridges to assign containers such prefixes. On the other hand, it might be to our benefit to reserve a link-local address range to use for internal implementation-detail purposes. Following the philosophy of "no is temporary but yes is forever," I'm thinking we should block users from using any prefix under fe80::/10 as an IPAM pool for now. And we should probably also disallow IPAM pools within the multicast prefix ff00::/8 to protect users from that particular footgun.
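If we went that way, the standard library's classifiers could do most of the work - a sketch of such a pool check (not something this PR implements):

import (
	"fmt"
	"net"
)

// rejectReservedV6Pool sketches the proposed rule: no IPAM pools in
// fe80::/10 (IsLinkLocalUnicast) or ff00::/8 (IsMulticast). Checking
// only the base address is a simplification; a pool that *contains*
// either range would need a containment test in both directions.
func rejectReservedV6Pool(pool *net.IPNet) error {
	if pool.IP.IsLinkLocalUnicast() {
		return fmt.Errorf("pool %s is within the link-local prefix fe80::/10", pool)
	}
	if pool.IP.IsMulticast() {
		return fmt.Errorf("pool %s is within the multicast prefix ff00::/8", pool)
	}
	return nil
}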

@robmry
Contributor Author

robmry commented Dec 1, 2023

Thank you @corhere, my thoughts ...

For the default bridge, a link-local fixed-cidr-v6 doesn't work properly in released code - once a container is stopped, because its MAC-based LL address isn't returned to the pool, it isn't possible to start any new containers without restarting the daemon. So, I don't think anyone is likely to be using that setup - it'd be best to disallow it.

For a non-default bridge, --subnet=fe80::/64 does work ... addresses leak, but the pool is massive so nobody is likely to notice/care.

Without connecting a second network, DNS will do what people expect. But even with a second network, it's not possible to create two networks that have overlapping subnets - link-local or not, before or after these changes. So, apart from the special case of fe80::/64, only one interface per container can have an LL address in a given subnet. So, while it is a strange thing to want to do - I don't think the lack of a zone index to disambiguate for DNS is part of the problem; the route to a subnet is always the interface on that subnet. (It's only fe80::/64 where all interfaces end up in the same subnet.)

To demonstrate - without the changes in this PR (using latest 'master', but equivalent to current releases) - I set up two networks with link local subnets fe80:1234::/64 and fe80:2345::/64, then attached two containers to both. In one of those containers ...

146: eth0@if147: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:16:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.3/16 brd 172.22.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe16:3/64 scope link
       valid_lft forever preferred_lft forever
    inet6 fe80:2345::5/64 scope link flags 02
       valid_lft forever preferred_lft forever
150: eth1@if151: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:15:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.21.0.3/16 brd 172.21.255.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe15:3/64 scope link
       valid_lft forever preferred_lft forever
    inet6 fe80:1234::5/64 scope link flags 02
       valid_lft forever preferred_lft forever

That container can ping the other using either of its pool-assigned link-local addresses, no zone id is needed ...

/ # ping fe80:2345::4
PING fe80:2345::4 (fe80:2345::4): 56 data bytes
64 bytes from fe80:2345::4: seq=0 ttl=64 time=0.469 ms
64 bytes from fe80:2345::4: seq=1 ttl=64 time=0.406 ms
^C
--- fe80:2345::4 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.406/0.437/0.469 ms
/ # ping fe80:1234::4
PING fe80:1234::4 (fe80:1234::4): 56 data bytes
64 bytes from fe80:1234::4: seq=0 ttl=64 time=0.489 ms
64 bytes from fe80:1234::4: seq=1 ttl=64 time=0.343 ms
^C

So, unfortunately, we've already said the forever-yes ...

Preventing any use of link-local subnets as address pools would take away some of the rope people can use to hang themselves. But, Hyrum's law ... we see from tickets and community discussion that the broken cases have been picked up; I guess we wouldn't hear about the currently-working cases.

I think this proposed change is the option least likely to break current usage - but changing the rules may be for the best. Perhaps dealing with the fallout once, and getting people switched over to ULAs, would involve the least pain.

If there's consensus that a breaking change is ok - I can update the change to block all use of LL address pools, fe80::/64 on the default bridge, or fe80::/64 everywhere. Otherwise, if we want to be certain we're not breaking any current usage, I could update this change to remove the block on networks that overlap fe80::/64.

Or, something in-between - perhaps prohibit LL pools by default, with the option to override if a footgun is what you want.

(In any case, I think we'll need to document our way out of it - either explaining the change in the rules, or the potential footgun.)

@corhere
Contributor

corhere commented Dec 1, 2023

So, unfortunately, we've already said the forever-yes ...

Oh, right; it's only --ip6tables that's experimental, not IPv6 IPAM. And as both the bridge and overlay drivers provide a virtual Ethernet network segment to containers, inter-container communications traffic through those networks is inherently link-local so there is not really a meaningful difference between ULA addresses and fe80::/10 addresses in those contexts. And since I do not currently have any concrete use case for reserving addresses in the fe80::/10 range (aside from fe80::/64, naturally) I can't justify making that particular breaking change.

I just noticed that in the Engine API, there already is a distinction made between the IP addresses for an endpoint and the link-local IP addresses for an endpoint. Maybe that is sufficient to justify disallowing addresses within fe80::/64 as either IPAM-pool addresses or manually-configured endpoint IPv6 addresses, and only allowing them as kernel-assigned addresses or manually-assigned additional addresses in EndpointIPAMConfig.LinkLocalIPs.
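For reference, that distinction is carried by EndpointIPAMConfig in api/types/network - a sketch with illustrative values:

// network is github.com/docker/docker/api/types/network.
cfg := &network.EndpointIPAMConfig{
	IPv6Address:  "2001:db8::5",          // pool/user-assigned address, resolvable via DNS
	LinkLocalIPs: []string{"fe80::1234"}, // extra LLAs, passed around out-of-band
}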

Brainstorming: since the main problem with fe80::/64 IPAM addressing is that DNS resolution breaks when a container is attached to more than one network, what if we only blocked attachments when doing so would fire the footgun? That is, we could continue to allow custom bridges to be created with an fe80::/64 pool, but containers attached to that network could not be concurrently attached to any other network. We could even return a warning in the network-create API response to inform users of that caveat.

@robmry
Contributor Author

robmry commented Dec 4, 2023

I just noticed that in the Engine API, there already is a distinction made between the IP addresses for an endpoint and the link-local IP addresses for an endpoint. Maybe that is sufficient to justify disallowing addresses within fe80::/64 as either IPAM-pool addresses or manually-configured endpoint IPv6 addresses, and only allowing them as kernel-assigned addresses or manually-assigned additional addresses in EndpointIPAMConfig.LinkLocalIPs.

That would certainly make things simpler, but it would probably break things for a few people.

Brainstorming: since the main problem with fe80::/64 IPAM addressing is that DNS resolution breaks when a container is attached to more than one network, what if we only blocked attachments when doing so would fire the footgun? That is, we could continue to allow custom bridges to be created with an fe80::/64 pool, but containers attached to that network could not be concurrently attached to any other network. We could even return a warning in the network-create API response to inform users of that caveat.

A warning on network-create sounds like a good idea ... I see the API's NetworkCreateResponse already has a Warning field, just no code to populate it at the moment (and it's not displayed by the Docker CLI).

Blocking further attachments when a container's already in a network with the fe80::/64 pool sounds good too - I'm just not sure whether there's a valid use-case for it. It'd break DNS, but it is allowed at the moment ... perhaps someone wanting to do something with a network where zone-ids are necessary? (The API currently returns a 200 OK with no response body on success, so there's no ready-made opportunity to return a warning instead of an error.)

I'm also dithering about whether I should have barred subnets within or including fe80::/64, given that they're allowed at the moment and won't lead to obvious problems in some cases... I don't yet have a feel for the consequences of a possibly-breaking change like that, even though it's unlikely to be something that's generally useful/sensible.

A summary might be useful ...

  • Before this change:

    • Changing fixed-cidr-v6 to any overlapping subnet prevented the daemon from starting.
    • Any link-local subnet is allowed, on any bridge network, but:
      • On the default bridge, any link-local subnet (fixed-cidr-v6) meant containers wouldn't start after one container stopped.
      • On other bridge networks, link-local addresses leaked when containers stopped.
  • With the current change:

    • Changes to fixed-cidr-v6 work as expected.
    • Subnet fe80::/64 is allowed on any bridge network.
    • Subnets that overlap fe80::/64 are not allowed on any bridge network.
    • Other link local subnets are allowed on any bridge network (and addresses no longer leak).
  • Further changes, TBD:

    • Return a warning when creating a network with subnet fe80::/64, to say that it'll break DNS if a container is attached to other networks as well.
      • fixed-cidr-v6 is a special case, I guess the warning would be in the daemon log rather than an API response.
      • (also, update the docker CLI to show the warning, figure out whether compose needs a change)
    • Disallow IPAM pools within the multicast prefix ff00::/8.
    • At-most one of:
      • Prevent a container attached to (or configured with) a network with subnet fe80::/64 from being added to any other network.
      • Disallow IPAM pools with subnet fe80::/64.
    • Remove the new restriction on subnets that overlap with fe80::/64. (Probably ok to leave it?)

Unless we decide to remove the new restriction on overlapping fe80::/64 - I think it'd be best to deal with the further changes separately, after merging this PR, rather than adding more here ... since they're changes/additions to pre-existing behaviour.

@robmry
Contributor Author

robmry commented Dec 5, 2023

Discussed in the maintainers call ... the feeling was that we should go ahead and prevent the use of subnet fe80::/64, in addition to any pools that overlap it, even though that might be a breaking change in some cases. It simplifies the logic, and there's probably always a good alternative (like using unique-local addresses, or explicitly setting up link-local addresses via EndpointIPAMConfig.LinkLocalIPs). We'd provide documentation to help with migration of any existing configurations that would no longer work.

However - I'd confused myself ... before this change, fe80::/64 was allowed - but pools that overlap fe80::/64 weren't. So, this change only preserves the existing behaviour.

In the old code there was no explicit check for overlapping LL pools, but fe80::1/64 was always explicitly added to the bridge, before the pool assigned address was added. If the pool was fe80::/64, it worked because the old code spotted the duplicate fe80::1 and skipped the second add. But, adding fe80::1 with any other subnet missed the duplicate-address check, so adding the address again resulted in a file exists error.

The new code works out which LL address to add, rather than always adding fe80::1/64. So, the error is generated by an explicit check against the pool, instead of hitting the file exists failure. The end result is the same.

Because of that - I think it'd be best to merge this change as-is (or with additional reviews), because it fixes bugs but doesn't otherwise change existing behaviour ... then follow up with further changes to prevent the use of fe80::/64 as an IPAM pool, and the other "Further changes" above. That'll give us the chance to change our minds on those things more easily.

@thaJeztah
Member

Thanks for summarizing! ❤️

Because of that - I think it'd be best to merge this change as-is (or with additional reviews), because it fixes bugs but doesn't otherwise change existing behaviour

Yes, definitely no objections against improving validation (and more actionable error messages)

@akerouanton would you be able to help review?

Contributor

@corhere corhere left a comment


https://datatracker.ietf.org/doc/html/rfc5942 is a very interesting read. And from what I can tell reading some of the kernel sources, Linux is compliant: the prefix part of a netlink.Addr used when adding or listing IPv6 addresses on a link is purely informational. That explains why the kernel returns EEXIST when attempting to add fe80::1/80 to a link which already has fe80::1/64 added: they are the same address when you ignore the irrelevant part. But because findIPv6Address compared netlink.Addr values by .String(), which includes the prefix length and netlink label, it would spuriously return false on addresses which only differ in prefix length. If we're going with "fixing bugs without otherwise changing existing behaviour" I would prefer a two-line targeted bugfix over a 400-line change.
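The mismatch is easy to reproduce with plain net.IPNet values (netlink.Addr.String() also appends the label, but the effect is the same):

package main

import (
	"fmt"
	"net"
)

func main() {
	a64 := net.IPNet{IP: net.ParseIP("fe80::1"), Mask: net.CIDRMask(64, 128)}
	a80 := net.IPNet{IP: net.ParseIP("fe80::1"), Mask: net.CIDRMask(80, 128)}
	fmt.Println(a64.String() == a80.String()) // false: "fe80::1/64" vs "fe80::1/80"
	fmt.Println(a64.IP.Equal(a80.IP))         // true: the same interface address
}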

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from 75f3784 to 6d1d69f on December 6, 2023 15:03
@robmry
Contributor Author

robmry commented Dec 6, 2023

If we're going with "fixing bugs without otherwise changing existing behaviour" I would prefer a two-line targeted bugfix over a 400-line change.

Excluding the additional tests, the change was about +113,-75 ... still more than two lines! But, refactoring to be able to delete unwanted addresses before trying to add new ones made that necessary I think.

Let me know if it needs to be broken down into more git commits or separate PRs though?

Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from 6d1d69f to ec0a1ab on December 6, 2023 17:14
Contributor

@corhere corhere left a comment


But, refactoring to be able to delete unwanted addresses before trying to add new ones made that necessary I think.

Which addresses are unwanted? The kernel-assigned LLAs coexist with the hardcoded fe80::1 added by the driver to the bridge's gateway interface, or IPAM-assigned LLAs on endpoints. Deleting the kernel-assigned LLAs is a change to the existing behaviour:

$ docker network create --ipv6 --subnet fe80::/64 llabridge
7d37cd1b7b7d3a6aa7caa4b180a5cf2620d499574d36db56c431146b87566b3f
$  docker run --rm -it --net llabridge alpine ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
22: eth0@if23: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 state UP
    inet6 fe80::42:acff:fe12:2/64 scope link tentative
       valid_lft forever preferred_lft forever
    inet6 fe80::2/64 scope link flags 02
       valid_lft forever preferred_lft forever
$ docker run -it --rm --privileged --net=host alpine ip -6 a
[...snip...]
17: br-7d37cd1b7b7d: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN
    inet6 fe80::42:5bff:fe6e:75a3/64 scope link
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link
       valid_lft forever preferred_lft forever

Since the discussion to merge this PR roughly as-is is predicated on bugs being fixed without otherwise changing existing behaviour, we need to talk about this change to existing behaviour that is not fixing a bug. Removing kernel-assigned LLAs from the interfaces is presumably a side effect of the reconcile-and-remove code. Is that particular change strictly necessary to fix the bugs? Or could we get away with just banning overlay networks with pool configurations that overlap the Link-Local prefix using validateIPv6Subnet and fixing the IPAM address leak? I would be thrilled to have a targeted bugfix that could be backported. Changing the driver to reconcile IPv6 addresses is a good idea, and one I would be happy to see, but it is a lot to review and think about. How feasible would it be to split the reconciliation changes into a follow-up PR?


As I understand it, the root of the problem with dockerd --fixed-cidr-v6=fe80::/65 is that findIPv6Address() considered fe80::1 != fe80::1 because it incorrectly compared prefix lengths, resulting in an attempt to add fe80::1 to the interface twice. Because it is added with (*netlink.Handle).AddrAdd() the kernel is instructed to add the duplicate address with the NLM_F_EXCL flag, the kernel returns an error to signal that the address already exists on the interface. So if the only change made to master was to fix findIPv6Address() such that that fe80::1 == fe80::1 irrespective of prefix lengths, dockerd --fixed-cidr-v6=fe80::/65 would...allow IPAM pools which overlap the link-local prefix to be used without error. Oh. That's the opposite of the direction we decided to go with. Anyway, this hopefully at least demonstrates that deleting addresses before attempting to add new ones may not strictly be necessary to fix the bugs.

@robmry
Contributor Author

robmry commented Dec 6, 2023

Which addresses are unwanted?

I'll read all that and respond properly in the morning (on my phone at the mo) ... but the problem isn't just with LL addresses - they're just the can of worms that got opened. The examples in the original ticket are LL, but they were a special case.

See the non-LL example in the first para of the description at the top of this PR.

Fixing the non-LL problem made it possible to start the daemon with a nonstandard LL prefix that overlapped the standard one - so I added a test to stop that from happening. But the leaking LL addresses were wrong too, and the fix there was the one-liner. (Then things really got out of hand!)

@corhere
Contributor

corhere commented Dec 7, 2023

This diff is sufficient to get the daemon to not crash on startup using the GUA-prefix example at the top of the PR description.

diff --git a/libnetwork/drivers/bridge/setup_verify_linux.go b/libnetwork/drivers/bridge/setup_verify_linux.go
index c05fa7e242..7c7c5f2b0b 100644
--- a/libnetwork/drivers/bridge/setup_verify_linux.go
+++ b/libnetwork/drivers/bridge/setup_verify_linux.go
@@ -7,7 +7,6 @@ import (
 
 	"github.com/containerd/log"
 	"github.com/docker/docker/libnetwork/ns"
-	"github.com/docker/docker/libnetwork/types"
 	"github.com/vishvananda/netlink"
 )
 
@@ -41,7 +40,7 @@ func setupVerifyAndReconcile(config *networkConfiguration, i *bridgeInterface) e
 	// Release any residual IPv6 address that might be there because of older daemon instances
 	for _, addrv6 := range addrsv6 {
 		addrv6 := addrv6
-		if addrv6.IP.IsGlobalUnicast() && !types.CompareIPNet(addrv6.IPNet, i.bridgeIPv6) {
+		if addrv6.IP.IsGlobalUnicast() && !addrv6.IP.Equal(i.bridgeIPv6.IP) {
 			if err := i.nlh.AddrDel(i.Link, &addrv6); err != nil {
 				log.G(context.TODO()).Warnf("Failed to remove residual IPv6 address %s from bridge: %v", addrv6.IPNet, err)
 			}
@@ -53,7 +52,7 @@ func setupVerifyAndReconcile(config *networkConfiguration, i *bridgeInterface) e
 
 func findIPv6Address(addr netlink.Addr, addresses []netlink.Addr) bool {
 	for _, addrv6 := range addresses {
-		if addrv6.String() == addr.String() {
+		if addrv6.IP.Equal(addr.IP) {
 			return true
 		}
 	}

As for removing stale assigned LLAs without also removing kernel-assigned LLAs, the kernel provides the necessary information as of v5.18. Unfortunately the netlink library we use has yet to be updated to expose that information. We perhaps could do some quick-and-dirty heuristic detection of LLAs that look like they were generated from EUI64, though that would be quite brittle. It's too bad we can't just nuke the bridge (or its gateway link) or flush all addresses on the gateway link on daemon startup, as that would disrupt the network connectivity of live-restored containers. I'm coming around to being okay with deleting the kernel-assigned LLA on the bridge gateway link, in which case reconciliation would be easier to reason about than the existing janky code. Hold off on splitting the reconciliation code out.
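For what it's worth, the heuristic would be something like this (illustrative only, and brittle as noted):

import "net"

// looksEUI64Derived guesses whether a link-local address was generated
// from a MAC via EUI-64: such addresses carry the inserted 0xff,0xfe
// marker at bytes 11-12 of the 16-byte form (e.g. the bridge's
// fe80::42:5bff:fe6e:75a3 above). IPAM-assigned or user-set LLAs would
// only match by coincidence.
func looksEUI64Derived(ip net.IP) bool {
	ip16 := ip.To16()
	return ip16 != nil && ip16.IsLinkLocalUnicast() &&
		ip16[11] == 0xff && ip16[12] == 0xfe
}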

Contributor

@corhere corhere left a comment


I've convinced myself that you're right, reconciliation is the way to go to fix these bugs for good. I'm done with dragging on design reviews; I'm ready to approve this PR once we sort out the fiddly aspects of the reconciliation logic.

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch 2 times, most recently from ce9a3d7 to 5b51848 on December 8, 2023 15:20
Contributor

@corhere corhere left a comment


LGTM! 🎉


@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from 5b51848 to ee78f3b on December 11, 2023 11:37
Member

@thaJeztah thaJeztah left a comment


Mostly some bike-shedding and suggestions that I apparently wrote, but didn't submit 😂 (happy to discuss those!).

Perhaps worth considering changing;

  • the naked returns (if possible); I really try to avoid those where possible
  • the use of the IPv6AddrAddError; it looks like those error-types no longer have a real purpose, and we should remove them (at least avoid using them in new code)
  • I'd suggest removing the TODO in this PR (to reduce code-churn) and open a separate PR already for those.

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch 2 times, most recently from afced11 to de9d5ec on December 18, 2023 11:00
Member

@thaJeztah thaJeztah left a comment


LGTM (general code)

Left some non-blocking comments, most of which would be for follow-ups (some of them looks like they're opening some new cans of worms).

Would love @akerouanton to have a look for the actual networking changes.

Comment on lines +158 to +164
nc := &networkConfiguration{}
i := prepTestBridge(t, nc)

// The bridge has no addresses, ask for a regular IPv6 network and expect it to
// be added to the bridge, with the default link local address.
nc.AddressIPv6 = cidrToIPNet(t, "2000:3000::1/64")
err := i.programIPv6Addresses(nc)
Member


I noticed prepTestBridge is always called with an empty struct, but in this test we are re-using that struct (pointer) for i.programIPv6Addresses(). This made me curious;

Do we need a pointer to the original nc created above? Or would this code be the exact equivalent of;

i := prepTestBridge(t, &networkConfiguration{})

// The bridge has no addresses, ask for a regular IPv6 network and expect it to
// be added to the bridge, with the default link local address.
err := i.programIPv6Addresses(&networkConfiguration{
	AddressIPv6: cidrToIPNet(t, "2000:3000::1/64"),
})

If it's equivalent, we could consider internalising the networkConfiguration argument in prepTestBridge().

Member


So, I did a quick check what networkConfiguration was used for here, and whether we need a pointer to it, and (LOL), looks like that's opening another can of potential worms; It looks like there's 2 fields used from that struct; BridgeName and DefaultBridge. Initially, I thought that only BridgeName was used (in which case we should consider just passing that as a string instead of the whole struct as these are non-exported (internal) functions).

HOWEVER, that's also where the fun starts; while newInterface only needs the BridgeName field, it also MUTATES the networkConfiguration struct that's passed, so newInterface has unexpected side-effects here;

func newInterface(nlh *netlink.Handle, config *networkConfiguration) (*bridgeInterface, error) {
var err error
i := &bridgeInterface{nlh: nlh}
// Initialize the bridge name to the default if unspecified.
if config.BridgeName == "" {
config.BridgeName = DefaultBridgeName
}

And there's more fun; setupDevice uses a different default; it uses a DOCKER_TEST_CREATE_DEFAULT_BRIDGE environment variable (used in tests?). But unlike newInterface above, it also takes the DefaultBridge boolean into account;

func setupDevice(config *networkConfiguration, i *bridgeInterface) error {
// We only attempt to create the bridge when the requested device name is
// the default one. The default bridge name can be overridden with the
// DOCKER_TEST_CREATE_DEFAULT_BRIDGE env var. It should be used only for
// test purpose.
var defaultBridgeName string
if defaultBridgeName = os.Getenv("DOCKER_TEST_CREATE_DEFAULT_BRIDGE"); defaultBridgeName == "" {
defaultBridgeName = DefaultBridgeName
}
if config.BridgeName != defaultBridgeName && config.DefaultBridge {
return NonDefaultBridgeExistError(config.BridgeName)
}

This means that;

  • newInterface has unexpected side-effects (should newInterface and setupDevice be combined to be an atomic operation? 🤔)
  • newInterface and setupDevice seem to be inconsistent at handling of the DefaultBridge field, and the default used for BridgeName (DefaultBridgeName or DOCKER_TEST_CREATE_DEFAULT_BRIDGE if set)

☝️ Let's have a more in-depth look at that, and look at that in a follow-up

Contributor Author


Yes, I think this PR has already opened enough cans of worms!

Member


I'll leave this one "unresolved", but we could probably copy it into a ticket.

@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from de9d5ec to dbaa895 on December 18, 2023 15:51
Calculate the IPv6 addresses needed on a bridge, then reconcile them
with the addresses on an existing bridge by deleting then adding as
required.

(Previously, required addresses were added one-by-one, then unwanted
addresses were removed. This meant the daemon failed to start if, for
example, an existing bridge had address '2000:db8::/64' and the config
was changed to '2000:db8::/80'.)

IPv6 addresses are now calculated and applied in one go, so there's no
need for setupVerifyAndReconcile() to check the set of IPv6 addresses on
the bridge. And, it was guarded by !config.InhibitIPv4, which can't have
been right. So, removed its IPv6 parts, and added IPv4 to its name.

Link local addresses, the example given in the original ticket, are now
released when containers are stopped. Not releasing them meant that
when using an LL subnet on the default bridge, no container could be
started after a container was stopped (because the calculated address
could not be re-allocated). In non-default bridge networks using an
LL subnet, addresses leaked.

Linux always uses the standard 'fe80::/64' LL network. So, if a bridge
is configured with an LL subnet prefix that overlaps with it, a config
error is reported. Non-overlapping LL subnet prefixes are allowed.

Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the 46829-allow_ipv6_subnet_change branch from dbaa895 to 27f3abd on December 18, 2023 16:10
Member

@thaJeztah thaJeztah left a comment


LGTM

@thaJeztah thaJeztah added this to the 25.0.0 milestone Dec 19, 2023
@thaJeztah
Copy link
Member

This PR had a ton of eyeing, so I'll bring it in, but if @akerouanton wants to have a look "post-merge" that would never hurt 🤗


Successfully merging this pull request may close these issues.

Cannot change ipv6-cidr in daemon.json anymore - dockerd fails to start