Skip to content

Per-interface sysctls #47639

@robmry

Description

@robmry

Description

Background

In 26.0, we removed use of the OCI prestart hook that was used by the daemon to get a callback from the runtime, during container task creation, to SetKey.

With the hook in place, the initial set of network interfaces were created in the container's namespace before task creation. Once it was removed, interfaces were only moved into the network namespace after task creation (before the task was started).

The prestart hook runs before --sysctl options have been applied. We reverted that change, because it meant --sysctl for an interface-specific setting (like --sysctl net.ipv6.conf.eth0.accept_ra=2), prevented the container from starting because the interface did not exist when the runtime tried to apply the sysctl setting.

The immediate driver for the change was to be able to inspect the container to determine whether it had IPv6 enabled, after sysctls had been applied. Then, use that to set up /etc/hosts with or without IPv6 hosts - and, in follow-up PRs, to decide whether to allocate IPv6 addresses etc. But, eliminating the SetKey "reexec" was desirable in any case.

(Discussed in the networking maintainers sync call on 26th March - @corhere, @akerouanton, @cpuguy83.)

Objectives

  • Provide a more reliable way to set interface-specific sysctls, without having to rely on interface naming:
    • eth0 is the name of the first interface that gets set up, but the order isn't predictable on restart for a container with multiple interfaces.
  • (Make it possible to set a sysctl on an interface created after the container, by a network connect.)
  • Enable removal of the SetKey prestart hook.

Proposal

  • Add a sysctl option to the "advanced syntax" of --network. For example:
    • --network name=mynet,sysctl=ipv6.conf.forwarding=1,sysctl=ipv6.conf.accept_ra=2
    • In the CLI, check the option's value has the form "k=v", pass it to the engine API's create endpoint as a []string in EndpointSettings.
      • Existing --sysctl options are passed as a map, but ordering might be important and should be preserved.
    • Expand the setting's value, assuming net. and the interface's generated name...
      • For example:ipv6.conf.forwarding -> net.ipv6.conf.eth0.forwarding.
      • The idea is to:
        • Avoid potential issues with arbitrary non-endpoint-specific sysctls being set per-endpoint (for example, enabling or disabling IPv6 for the whole container using net.ipv6.conf.all.disable_ipv6 would be difficult to deal with).
        • Avoid the issue of interface naming. As a separate change we could offer a per-endpoint ifname option, but we'd also need to deal with conflicting names, and unspecified names.
      • But, it's not obvious and won't be anyone's first guess at the option's usage - if a full sysctl name is provided, we'll need the API to return an error message that clearly explains the problem... "you asked for 'net.ipv6.conf.eth0.forwarding', did you mean 'ipv6.conf.forwarding'?"
        • Is this enough reason to use the full sysctl name? Could generate an error if it's not net.something, and if the interface name is all or default. But, the real interface name is more tricky - could either ignore the given name as a placeholder and use the generated name (but that might be surprising), or require the interface name to be IFNAME or similar (but then, back to the problem that the option's usage isn't clear)?
    • Apply these sysctls after moving the interface into the container's namespace, before bringing it up.
  • Once we re-remove the SetKey prestart hook, try to avoid breaking existing commands...
    • In the API handler, migrate net.*.*.eth0 settings from --sysctl to the new field in EndpointSettings, for the first --network.
    • Deprecate the use of per-endpoint settings in --sysctl, add a warning to the create response, disable the migration in some future API version.
  • Add a --sysctl option to docker network connect.
    • Same shortened format as above, ipv6.conf.accept_ra=2.

Matching changes will be needed in 'compose'. (I don't think 'build' has provision for sysctl settings.)

Other notes ...

  • Interfaces are moved into the network namespace by libnetwork, not by its network drivers. We may want to give network drivers a veto on specific sysctl settings that would break their operation. But, there's no immediate requirement.
  • Existing network drivers only create a single interface, if a future driver creates more than one we'll need to extend the syntax of the new sysctl options.

Metadata

Metadata

Assignees

Labels

area/networkingNetworkingkind/featureFunctionality or other elements that the project doesn't currently have. Features are new and shinystatus/0-triage

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions