-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Description
Description
Background
In 26.0, we removed use of the OCI prestart hook that was used by the daemon to get a callback from the runtime, during container task creation, to SetKey
.
With the hook in place, the initial set of network interfaces were created in the container's namespace before task creation. Once it was removed, interfaces were only moved into the network namespace after task creation (before the task was started).
The prestart hook runs before --sysctl
options have been applied. We reverted that change, because it meant --sysctl
for an interface-specific setting (like --sysctl net.ipv6.conf.eth0.accept_ra=2
), prevented the container from starting because the interface did not exist when the runtime tried to apply the sysctl setting.
The immediate driver for the change was to be able to inspect the container to determine whether it had IPv6 enabled, after sysctls had been applied. Then, use that to set up /etc/hosts
with or without IPv6 hosts - and, in follow-up PRs, to decide whether to allocate IPv6 addresses etc. But, eliminating the SetKey "reexec" was desirable in any case.
(Discussed in the networking maintainers sync call on 26th March - @corhere, @akerouanton, @cpuguy83.)
Objectives
- Provide a more reliable way to set interface-specific sysctls, without having to rely on interface naming:
eth0
is the name of the first interface that gets set up, but the order isn't predictable on restart for a container with multiple interfaces.
- (Make it possible to set a sysctl on an interface created after the container, by a
network connect
.) - Enable removal of the SetKey prestart hook.
Proposal
- Add a
sysctl
option to the "advanced syntax" of--network
. For example:--network name=mynet,sysctl=ipv6.conf.forwarding=1,sysctl=ipv6.conf.accept_ra=2
- In the CLI, check the option's value has the form "k=v", pass it to the engine API's create endpoint as a
[]string
inEndpointSettings
.- Existing
--sysctl
options are passed as a map, but ordering might be important and should be preserved.
- Existing
- Expand the setting's value, assuming
net.
and the interface's generated name...- For example:
ipv6.conf.forwarding
->net.ipv6.conf.eth0.forwarding
. - The idea is to:
- Avoid potential issues with arbitrary non-endpoint-specific sysctls being set per-endpoint (for example, enabling or disabling IPv6 for the whole container using
net.ipv6.conf.all.disable_ipv6
would be difficult to deal with). - Avoid the issue of interface naming. As a separate change we could offer a per-endpoint
ifname
option, but we'd also need to deal with conflicting names, and unspecified names.
- Avoid potential issues with arbitrary non-endpoint-specific sysctls being set per-endpoint (for example, enabling or disabling IPv6 for the whole container using
- But, it's not obvious and won't be anyone's first guess at the option's usage - if a full sysctl name is provided, we'll need the API to return an error message that clearly explains the problem... "you asked for 'net.ipv6.conf.eth0.forwarding', did you mean 'ipv6.conf.forwarding'?"
- Is this enough reason to use the full sysctl name? Could generate an error if it's not
net.something
, and if the interface name isall
ordefault
. But, the real interface name is more tricky - could either ignore the given name as a placeholder and use the generated name (but that might be surprising), or require the interface name to beIFNAME
or similar (but then, back to the problem that the option's usage isn't clear)?
- Is this enough reason to use the full sysctl name? Could generate an error if it's not
- For example:
- Apply these sysctls after moving the interface into the container's namespace, before bringing it up.
- Once we re-remove the SetKey prestart hook, try to avoid breaking existing commands...
- In the API handler, migrate
net.*.*.eth0
settings from--sysctl
to the new field inEndpointSettings
, for the first--network
. - Deprecate the use of per-endpoint settings in
--sysctl
, add a warning to the create response, disable the migration in some future API version.
- In the API handler, migrate
- Add a
--sysctl
option todocker network connect
.- Same shortened format as above,
ipv6.conf.accept_ra=2
.
- Same shortened format as above,
Matching changes will be needed in 'compose'. (I don't think 'build' has provision for sysctl settings.)
Other notes ...
- Interfaces are moved into the network namespace by libnetwork, not by its network drivers. We may want to give network drivers a veto on specific sysctl settings that would break their operation. But, there's no immediate requirement.
- Existing network drivers only create a single interface, if a future driver creates more than one we'll need to extend the syntax of the new
sysctl
options.