Per-interface sysctls

### Description

**Background**

In 26.0, we [removed use of the OCI prestart hook](https://github.com/moby/moby/pull/47062/commits/0046b16d87105d334b72f5e98efd28bbd94d9659) that was used by the daemon to get a callback from the runtime, during container task creation, to `SetKey`.

With the hook in place, the initial set of network interfaces were created in the container's namespace before task creation. Once it was removed, interfaces were only moved into the network namespace after task creation (before the task was started).

The prestart hook runs before `--sysctl` options have been applied. We [reverted that change](https://github.com/moby/moby/pull/47621), because it meant `--sysctl` for an interface-specific setting (like `--sysctl net.ipv6.conf.eth0.accept_ra=2`), [prevented the container from starting](https://github.com/moby/moby/issues/47619) because the interface did not exist when the runtime tried to apply the sysctl setting.

The immediate driver for the change was to be able to inspect the container to determine whether it had IPv6 enabled, after sysctls had been applied. Then, use that to set up `/etc/hosts` with or without IPv6 hosts - and, in follow-up PRs, to decide whether to allocate IPv6 addresses etc. But, eliminating the SetKey "reexec" was desirable in any case.

(Discussed in the networking maintainers sync call on 26th March - @corhere, @akerouanton, @cpuguy83.)

**Objectives**

- Provide a more reliable way to set interface-specific sysctls, without having to rely on interface naming:
  - `eth0` is the name of the first interface that gets set up, but the order isn't predictable on restart for a container with multiple interfaces.
- (Make it possible to set a sysctl on an interface created after the container, by a `network connect`.)
- Enable removal of the SetKey prestart hook.

**Proposal**

- Add a `sysctl` option to the "advanced syntax" of `--network`. For example:
  -  `--network name=mynet,sysctl=ipv6.conf.forwarding=1,sysctl=ipv6.conf.accept_ra=2`
  - In the CLI, check the option's value has the form "k=v", pass it to the engine API's create endpoint as a `[]string` in `EndpointSettings`.
    - _Existing `--sysctl` options are passed as a map, but ordering might be important and should be preserved._
  - Expand the setting's value, assuming `net.` and the interface's generated name...
    - For example:`ipv6.conf.forwarding` -> `net.ipv6.conf.eth0.forwarding`.
    - The idea is to:
      - Avoid potential issues with arbitrary non-endpoint-specific sysctls being set per-endpoint (for example, enabling or disabling IPv6 for the whole container using `net.ipv6.conf.all.disable_ipv6` would be difficult to deal with).
      - Avoid the issue of interface naming. _As a separate change we could offer a per-endpoint `ifname` option, but we'd also need to deal with conflicting names, and unspecified names._
    - But, it's not obvious and won't be anyone's first guess at the option's usage - if a full sysctl name is provided, we'll need the API to return an error message that clearly explains the problem... "you asked for 'net.ipv6.conf.eth0.forwarding', did you mean 'ipv6.conf.forwarding'?"
      - _Is this enough reason to use the full sysctl name? Could generate an error if it's not `net.something`, and if the interface name is `all` or `default`. But, the real interface name is more tricky - could either ignore the given name as a placeholder and use the generated name (but that might be surprising), or require the interface name to be `IFNAME` or similar (but then, back to the problem that the option's usage isn't clear)?_
  - Apply these sysctls after moving the interface into the container's namespace, before bringing it up.
- Once we re-remove the SetKey prestart hook, try to avoid breaking existing commands...
  - In the API handler, migrate `net.*.*.eth0` settings from `--sysctl` to the new field in `EndpointSettings`, for the first `--network`.
  - Deprecate the use of per-endpoint settings in `--sysctl`, add a warning to the create response, disable the migration in some future API version.
- Add a `--sysctl` option to `docker network connect`.
  - Same shortened format as above, `ipv6.conf.accept_ra=2`.

Matching changes will be needed in 'compose'. (I don't think 'build' has provision for sysctl settings.)

Other notes ...
- Interfaces are moved into the network namespace by libnetwork, not by its network drivers. We may want to give network drivers a veto on specific sysctl settings that would break their operation. But, there's no immediate requirement.
- Existing network drivers only create a single interface, if a future driver creates more than one we'll need to extend the syntax of the new `sysctl` options.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Per-interface sysctls #47639

Description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Per-interface sysctls #47639

Description

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions