Skip to content

Conversation

polarathene
Copy link
Contributor

@polarathene polarathene commented Sep 15, 2023

1. Why is this pull request needed and what does it do?

TL;DR:

  • Enforcing the NET_BIND_SERVICE capability for any user to bind privileged ports isn't required with Docker unless running with --network host?
  • You could have a much simpler Dockerfile and allow those who enforce dropping NET_BIND_SERVICE to still run the coredns binary on unprivileged ports which is more secure of a choice to support. setcap applying the capability with effective + permitted set is enforcing a requirement for privileges that aren't always necessary. You can still run as non-root without this.

Introduced in March 2023 to support non-root feature. It seems largely unnecessary though?

#! /bin/bash

# Custom config:
cat > /tmp/Corefile <<EOF
.:5353 {
  whoami
}
EOF


# No problems binding with non-root to port >1024
# in container, and < 1024 on host:
docker run --rm \
  --sysctl 'net.ipv4.ip_unprivileged_port_start=1024' \
  --cap-drop NET_BIND_SERVICE \
  --user 1000:1000
  --workdir / --volume '/tmp/Corefile:/Corefile' \
  -p 53:5353/udp
  coredns/coredns:1.10.1

On a host network, or perhaps a different container run-time you may need to opt for the setcap approach? However due to the change in Docker 20.10.0 (Dec 2020), non-root users can already bind to ports below 1024:

# --cap-add NET_BIND_SERVICE is default capability,
# Dropping it has no impact due to Docker modifying sysctl:
$ docker run --rm --cap-drop NET_BIND_SERVICE coredns/coredns:1.10.1
$ docker run --rm --cap-drop NET_BIND_SERVICE --user 1000:1000 coredns/coredns:1.10.1

.:53
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3

Restore the sysctl restriction and the capability will have an effect when not applied:

# --cap-add NET_BIND_SERVICE is default capability, reset sysctl to 1024:
$ docker run --rm --sysctl 'net.ipv4.ip_unprivileged_port_start=1024'  coredns/coredns:1.10.1
.:53
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3


# Root (drop capability) and user (does not inherit?) - Both fail to bind now:
$ docker run --rm --sysctl 'net.ipv4.ip_unprivileged_port_start=1024' --cap-drop NET_BIND_SERVICE  coredns/coredns:1.10.1
$ docker run --rm --sysctl 'net.ipv4.ip_unprivileged_port_start=1024'  --user '1000:1000'  coredns/coredns:1.10.1
Listen: listen tcp :53: bind: permission denied

The only reason to use the setcap call in Dockerfile is to make the /coredns binary treated as a "capability-dumb binary":

# `setcap cap_net_bind_service=ep` via 1.11.1 tag,
# Capability requirement is enforced, sysctl has no relevance:
$ docker run --rm \
  --sysctl 'net.ipv4.ip_unprivileged_port_start=1024'  \
  --cap-add NET_BIND_SERVICE \
  coredns/coredns:1.11.1

.:53
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2

# Capability dropped, fails check and refuses to run binary:
$ docker run --rm \
  --sysctl 'net.ipv4.ip_unprivileged_port_start=1024' \
  --cap-drop NET_BIND_SERVICE \
  coredns/coredns:1.11.1

exec /coredns: operation not permitted


# Even root:
$ docker run --rm \
  --cap-drop NET_BIND_SERVICE \
  --user '0:0'
  coredns/coredns:1.11.1

exec /coredns: operation not permitted
  • Not a helpful error message to encounter.
  • Root user does not necessarily need to bind port lower than 1024 (and with Docker can regardless by default)
  • When the capability is dropped, not even the root user can bypass the restriction.
  • setcap is only in use to bypass the non-root restriction being respected (for any user in the container to leverage, with any port).

2. Which issues (if any) are related?

Last issue raised requesting non-root feature:

Reference of unexpected problems from users affected by the non-root approach (which also changed default WORKDIR):

Earlier attempts to apply the change:

NOTE: PR #4528 at least included a comment providing better context of the capability intent, but it's not been necessary since Docker 20.10.0 (Dec 2020) which for Docker managed networks drops the <=1024 privileged ports.

3. Which documentation changes (if any) need to be made?

N/A?

4. Does this introduce a backward incompatible change or deprecation?

  • Reverts back to 1.10.1 Docker image expectation without mandatory capability enforced. The entire build stage is not needed, it's only purpose was to apply setcap to a pre-built coredns binary.
  • Has not reverted the expectation of 1.10.1 working directory (I could include that, should just be WORKDIR /).
  • The USER nonroot:nonroot directive shouldn't be necessary as that's already what the image should run with, I left it alone for the visibility. EXPOSE likewise is not necessary and is effectively documentation.

Everything else was to support `setcap` which should not be required to support running non-root.
@chrisohaver
Copy link
Member

Please look into the failed ci/circleci kubernetes-test. On the surface, the change appears to have rendered coredns unresponsive to dns queries, but could of course be some flaw or inadequacy with the test.

@polarathene
Copy link
Contributor Author

Please look into the failed ci/circleci kubernetes-test.

I don't have experience with k8s.

srv_test.go:52: Could not load corefile: timeout waiting for coredns to be ready. coredns log: Listen: listen tcp :53: bind: permission denied

I am familiar with the error though. I would say it's due to the non-root user binding below port 1024. This wouldn't normally be allowed on the host with CoreDNS would it? The binary isn't built/packaged with the capability enforced?

It is fine for Docker due to their internal bridge networks relaxing the port restriction.


My concern is from their comment in k8s by @vinayakankugoyal , where perhaps they relied on the container to bind to :53 internally for convenience and their network/runtime does not relax the privileged ports (which is what the capability would bypass).

It would seem that you can only keep one group happy 😅

  • For DNS I can understand wanting to favor binding to :53 and needing that ability in other contexts beyond Dockers internal bridge networks.
  • For those that bind to a > 1024 port internally and then publish that via :53, it's also warranted to not require NET_BIND_SERVICE capability in the image, since non-root user still works fine for that.

So I guess it's your choice as maintainer which one to favour. I don't have much input for k8s as I don't know how flexible that is with DNS configuration. Docker doesn't seem to allow changing the DNS port for containers with --dns for example. but is otherwise functional (like dig with -p which can choose the custom port to connect to).

@SuperQ
Copy link
Collaborator

SuperQ commented Sep 15, 2023

This needs a DCO sign-off. You can use git commit -s --amend to add it.

@polarathene
Copy link
Contributor Author

This needs a DCO sign-off.

I'm aware, but in light of the k8s test failure I'll wait for maintainer decision on which route the Docker image should cater to.

If setcap is kept, I'll just raise a new PR (or adjust this one if preferred) to handle the WORKDIR and remove the ca-certificates.crt copy.

@chrisohaver
Copy link
Member

#5969 to support non-root feature. It seems largely unnecessary though?

Perhaps we should roll that back. Thoughts?

@marc1404
Copy link

@polarathene I'm sorry for 'reviving' this PR after almost a year. Are you still interested in maintaining it and eventually merging it? As you and others have pointed out and as it has been referenced multiple times it's a bit cumbersome having to add the NET_BIND_SERVICE capability when running CoreDNS on non-privileged ports.

I tried retracing the discussions in #6249, #5969 and here. As I understand it, this change would be accepted and rolling back #5969 could also be done.

I'd be happy to see your change merged :)

@polarathene
Copy link
Contributor Author

Are you still interested in maintaining it and eventually merging it?

There's not much to maintain, as you can see from the diff it simplified it to just a COPY without any messing around with enforcing capabilities.

I'm presently at full capacity elsewhere and likely will be for a while. If someone else wants to pursue the change I welcome that, as it's been effectively stale for over a year since I attempted to push this forward. I'll otherwise eventually get back to this some time in 2025.


As you and others have pointed out and as it has been referenced multiple times it's a bit cumbersome having to add the NET_BIND_SERVICE capability when running CoreDNS on non-privileged ports.

The capability is always required for the functionality. The setcap approach grants it to the binary for a non-root user to have provided it's still within the bounding set (dropping all caps for the container would prevent that). This method is just a lazy way to go about enabling that for non-root.

The correct approach is for the devs to implement raising the capability when it's actually needed, otherwise failing. It should be roughly a few lines to implement that, I know how to do so with Rust but I'm not a Go dev, shouldn't be hard though.

It would then work regardless of in a container environment or not, no setcap enforcement needed (you'd still do this for non-root to be permitted to request the capability, just not to enforce it to run the binary). So you may want to do that for the container still, or those who want rootless could run the container as rootless (with a root user internally mapped to a non-root on the host).

AFAIK there isn't much concern about running as root if you drop the capabilities accordingly, the attacker would need some exploit to escape the container and that does not always require a root user within the container to do so. The capability required here isn't one that is needed for Docker IIRC, and I believe k8s also is making it redundant by setting the associated sysctl to 0 by default.


I tried retracing the discussions in #6249, #5969 and here. As I understand it, this change would be accepted and rolling back #5969 could also be done.

AFAIK the main benefit of running as non-root is the user does not need to think about extra precautions like dropping capabilities you don't need. I don't stay on top of CVEs related to this but to my knowledge the defaults (at least for Docker) are fairly reasonable and compromising the container should be quite difficult unless the user does something unwise, meanwhile those that want to take the additional precautions are likely to do so already rather than trust various upstreams do it correctly.

My experience with rootless is minimal, I've done so recently with Podman, but that seems a far better way to approach it for those that are more security conscious.

@polarathene
Copy link
Contributor Author

polarathene commented Nov 27, 2024

I'd be happy to see your change merged :)

Just to clarify, you don't seem to be marked as involved with the CoreDNS project directly?

The project maintainers still seem to be interested in the non-root approach they have in place?:

@marc1404
Copy link

Just to clarify, you don't seem to be marked as involved with the CoreDNS project directly?

@polarathene Correct, I'm not directly involved with the CoreDNS project. Our project (gardener/gardener) consumes CoreDNS, and while not critically important, it'd be nice if we weren't forced to grant the NET_BIND_SERVICE capability when running CoreDNS on ports > 1024.

Thank you for your quick response and detailed explanation. I'll try contributing a change where the capability is raised dynamically when required 🤞

@polarathene
Copy link
Contributor Author

it'd be nice if we weren't forced to grant the NET_BIND_SERVICE capability when running CoreDNS on ports > 1024.

Oh ok. The only reason you wouldn't be forced is when the related systctl makes it irrelevant I think, otherwise you'd need to grant the capability (even for root if it were dropped IIRC).

If it helps, this is from an old comment of mine on the topic, demonstrating a non-root user with setcap only applying +p:

# Now with `+p`:
$ docker run --rm -it --user 1000 \
  --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
  --cap-drop ALL --cap-add CAP_NET_BIND_SERVICE \
  cap-test-p inspect
Ambient: []
Inheritable: []
Effective: []
Permitted: ["CAP_NET_BIND_SERVICE"]
Bounding: ["CAP_NET_BIND_SERVICE"]

# Without the `--aware` flag this would fail
# since the cap would not be in the effective set:
$ docker run --rm -it --user 1000 \
  --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
  --cap-drop ALL --cap-add CAP_NET_BIND_SERVICE \
  cap-test-p --aware

CAP_NET_BIND_SERVICE is required to bind a privileged port
CAP_NET_BIND_SERVICE is permitted but missing from the effective set
Successfully added CAP_NET_BIND_SERVICE into the Effective set
Successfully bound to: TcpListener { addr: 127.0.0.1:80, fd: 3 }

So if the desired port to bind is below net.ipv4.ip_unprivileged_port_start, you'll need to check that CAP_NET_BIND_SERVICE is in the permitted set (it should be by default for the root user, unless it was dropped from the bounding set). If it's permitted, then you can raise to effective set (allowing you to use that capability).

The above output is from a small rust program I wrote a while back to support justifying the change to not require it with CoreDNS, but I didn't get around to publishing it 😓

You can also test similar outside of Docker with systemd with ambient capabilities:

$ sudo systemd-run --system --uid 1000 \
  --unit cap-test-none \
  --collect --pty --quiet \
  --property AmbientCapabilities=CAP_NET_BIND_SERVICE \
  target/x86_64-unknown-linux-musl/release/capability-aware --aware

CAP_NET_BIND_SERVICE is required to bind a privileged port
The effective set already includes CAP_NET_BIND_SERVICE
Successfully bound to: TcpListener { addr: 127.0.0.1:80, fd: 3 }

# The `inspect` subcommand output (bounding set excluded):
Ambient: ["CAP_NET_BIND_SERVICE"]
Inheritable: ["CAP_NET_BIND_SERVICE"]
Permitted: ["CAP_NET_BIND_SERVICE"]
Effective: ["CAP_NET_BIND_SERVICE"]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants