refactor: Simplify `Dockerfile` #6320

polarathene · 2023-09-15T09:59:24Z

1. Why is this pull request needed and what does it do?

TL;DR:

Enforcing the NET_BIND_SERVICE capability for any user to bind privileged ports isn't required with Docker unless running with --network host?
You could have a much simpler Dockerfile and allow those who enforce dropping NET_BIND_SERVICE to still run the coredns binary on unprivileged ports which is more secure of a choice to support. setcap applying the capability with effective + permitted set is enforcing a requirement for privileges that aren't always necessary. You can still run as non-root without this.

Introduced in March 2023 to support non-root feature. It seems largely unnecessary though?

Everything removed was to support setcap mandating NET_BIND_SERVICE on the /coredns binary.
There is no need to COPY for ca-certificates.crt, the image base static-debian11:nonroot already has that (unless the BASE reference ARG is changed). I assume that the base image won't be changing much due to the expectation of USER nonroot:nonroot to be successful, thus value of overwriting the contents seems questionable?
- EDIT: Originally this feature was added when the scratch image was used for the final stage, it was more compatible then, but was not merged until the switch to nonroot image which added the USER directive.
- EDIT: The ca-certificates.crt is also from earlier as an improvement for the final stage with scratch of the image build. Thus no longer important.

#! /bin/bash

# Custom config:
cat > /tmp/Corefile <<EOF
.:5353 {
  whoami
}
EOF


# No problems binding with non-root to port >1024
# in container, and < 1024 on host:
docker run --rm \
  --sysctl 'net.ipv4.ip_unprivileged_port_start=1024' \
  --cap-drop NET_BIND_SERVICE \
  --user 1000:1000
  --workdir / --volume '/tmp/Corefile:/Corefile' \
  -p 53:5353/udp
  coredns/coredns:1.10.1

On a host network, or perhaps a different container run-time you may need to opt for the setcap approach? However due to the change in Docker 20.10.0 (Dec 2020), non-root users can already bind to ports below 1024:

# --cap-add NET_BIND_SERVICE is default capability,
# Dropping it has no impact due to Docker modifying sysctl:
$ docker run --rm --cap-drop NET_BIND_SERVICE coredns/coredns:1.10.1
$ docker run --rm --cap-drop NET_BIND_SERVICE --user 1000:1000 coredns/coredns:1.10.1

.:53
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3

Restore the sysctl restriction and the capability will have an effect when not applied:

# --cap-add NET_BIND_SERVICE is default capability, reset sysctl to 1024:
$ docker run --rm --sysctl 'net.ipv4.ip_unprivileged_port_start=1024'  coredns/coredns:1.10.1
.:53
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3


# Root (drop capability) and user (does not inherit?) - Both fail to bind now:
$ docker run --rm --sysctl 'net.ipv4.ip_unprivileged_port_start=1024' --cap-drop NET_BIND_SERVICE  coredns/coredns:1.10.1
$ docker run --rm --sysctl 'net.ipv4.ip_unprivileged_port_start=1024'  --user '1000:1000'  coredns/coredns:1.10.1
Listen: listen tcp :53: bind: permission denied

The only reason to use the setcap call in Dockerfile is to make the /coredns binary treated as a "capability-dumb binary":

# `setcap cap_net_bind_service=ep` via 1.11.1 tag,
# Capability requirement is enforced, sysctl has no relevance:
$ docker run --rm \
  --sysctl 'net.ipv4.ip_unprivileged_port_start=1024'  \
  --cap-add NET_BIND_SERVICE \
  coredns/coredns:1.11.1

.:53
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2

# Capability dropped, fails check and refuses to run binary:
$ docker run --rm \
  --sysctl 'net.ipv4.ip_unprivileged_port_start=1024' \
  --cap-drop NET_BIND_SERVICE \
  coredns/coredns:1.11.1

exec /coredns: operation not permitted


# Even root:
$ docker run --rm \
  --cap-drop NET_BIND_SERVICE \
  --user '0:0'
  coredns/coredns:1.11.1

exec /coredns: operation not permitted

Not a helpful error message to encounter.
Root user does not necessarily need to bind port lower than 1024 (and with Docker can regardless by default)
When the capability is dropped, not even the root user can bypass the restriction.
setcap is only in use to bypass the non-root restriction being respected (for any user in the container to leverage, with any port).

2. Which issues (if any) are related?

Last issue raised requesting non-root feature:

coredns should run as a non-root user #5968

Reference of unexpected problems from users affected by the non-root approach (which also changed default WORKDIR):

Earlier attempts to apply the change:

NOTE: PR #4528 at least included a comment providing better context of the capability intent, but it's not been necessary since Docker 20.10.0 (Dec 2020) which for Docker managed networks drops the <=1024 privileged ports.

3. Which documentation changes (if any) need to be made?

N/A?

4. Does this introduce a backward incompatible change or deprecation?

Reverts back to 1.10.1 Docker image expectation without mandatory capability enforced. The entire build stage is not needed, it's only purpose was to apply setcap to a pre-built coredns binary.
Has not reverted the expectation of 1.10.1 working directory (I could include that, should just be WORKDIR /).
The USER nonroot:nonroot directive shouldn't be necessary as that's already what the image should run with, I left it alone for the visibility. EXPOSE likewise is not necessary and is effectively documentation.

Everything else was to support `setcap` which should not be required to support running non-root.

chrisohaver · 2023-09-15T12:08:48Z

Please look into the failed ci/circleci kubernetes-test. On the surface, the change appears to have rendered coredns unresponsive to dns queries, but could of course be some flaw or inadequacy with the test.

polarathene · 2023-09-15T12:37:35Z

Please look into the failed ci/circleci kubernetes-test.

I don't have experience with k8s.

srv_test.go:52: Could not load corefile: timeout waiting for coredns to be ready. coredns log: Listen: listen tcp :53: bind: permission denied

I am familiar with the error though. I would say it's due to the non-root user binding below port 1024. This wouldn't normally be allowed on the host with CoreDNS would it? The binary isn't built/packaged with the capability enforced?

It is fine for Docker due to their internal bridge networks relaxing the port restriction.

My concern is from their comment in k8s by @vinayakankugoyal , where perhaps they relied on the container to bind to :53 internally for convenience and their network/runtime does not relax the privileged ports (which is what the capability would bypass).

It would seem that you can only keep one group happy 😅

For DNS I can understand wanting to favor binding to :53 and needing that ability in other contexts beyond Dockers internal bridge networks.
For those that bind to a > 1024 port internally and then publish that via :53, it's also warranted to not require NET_BIND_SERVICE capability in the image, since non-root user still works fine for that.

So I guess it's your choice as maintainer which one to favour. I don't have much input for k8s as I don't know how flexible that is with DNS configuration. Docker doesn't seem to allow changing the DNS port for containers with --dns for example. but is otherwise functional (like dig with -p which can choose the custom port to connect to).

SuperQ · 2023-09-15T12:44:37Z

This needs a DCO sign-off. You can use git commit -s --amend to add it.

polarathene · 2023-09-15T22:37:52Z

This needs a DCO sign-off.

I'm aware, but in light of the k8s test failure I'll wait for maintainer decision on which route the Docker image should cater to.

If setcap is kept, I'll just raise a new PR (or adjust this one if preferred) to handle the WORKDIR and remove the ca-certificates.crt copy.

chrisohaver · 2023-12-14T15:19:04Z

#5969 to support non-root feature. It seems largely unnecessary though?

Perhaps we should roll that back. Thoughts?

marc1404 · 2024-11-27T10:02:41Z

@polarathene I'm sorry for 'reviving' this PR after almost a year. Are you still interested in maintaining it and eventually merging it? As you and others have pointed out and as it has been referenced multiple times it's a bit cumbersome having to add the NET_BIND_SERVICE capability when running CoreDNS on non-privileged ports.

I tried retracing the discussions in #6249, #5969 and here. As I understand it, this change would be accepted and rolling back #5969 could also be done.

I'd be happy to see your change merged :)

polarathene · 2024-11-27T23:41:29Z

Are you still interested in maintaining it and eventually merging it?

There's not much to maintain, as you can see from the diff it simplified it to just a COPY without any messing around with enforcing capabilities.

I'm presently at full capacity elsewhere and likely will be for a while. If someone else wants to pursue the change I welcome that, as it's been effectively stale for over a year since I attempted to push this forward. I'll otherwise eventually get back to this some time in 2025.

As you and others have pointed out and as it has been referenced multiple times it's a bit cumbersome having to add the NET_BIND_SERVICE capability when running CoreDNS on non-privileged ports.

The capability is always required for the functionality. The setcap approach grants it to the binary for a non-root user to have provided it's still within the bounding set (dropping all caps for the container would prevent that). This method is just a lazy way to go about enabling that for non-root.

The correct approach is for the devs to implement raising the capability when it's actually needed, otherwise failing. It should be roughly a few lines to implement that, I know how to do so with Rust but I'm not a Go dev, shouldn't be hard though.

It would then work regardless of in a container environment or not, no setcap enforcement needed (you'd still do this for non-root to be permitted to request the capability, just not to enforce it to run the binary). So you may want to do that for the container still, or those who want rootless could run the container as rootless (with a root user internally mapped to a non-root on the host).

AFAIK there isn't much concern about running as root if you drop the capabilities accordingly, the attacker would need some exploit to escape the container and that does not always require a root user within the container to do so. The capability required here isn't one that is needed for Docker IIRC, and I believe k8s also is making it redundant by setting the associated sysctl to 0 by default.

I tried retracing the discussions in #6249, #5969 and here. As I understand it, this change would be accepted and rolling back #5969 could also be done.

AFAIK the main benefit of running as non-root is the user does not need to think about extra precautions like dropping capabilities you don't need. I don't stay on top of CVEs related to this but to my knowledge the defaults (at least for Docker) are fairly reasonable and compromising the container should be quite difficult unless the user does something unwise, meanwhile those that want to take the additional precautions are likely to do so already rather than trust various upstreams do it correctly.

My experience with rootless is minimal, I've done so recently with Podman, but that seems a far better way to approach it for those that are more security conscious.

polarathene · 2024-11-27T23:43:33Z

I'd be happy to see your change merged :)

Just to clarify, you don't seem to be marked as involved with the CoreDNS project directly?

The project maintainers still seem to be interested in the non-root approach they have in place?:

marc1404 · 2024-11-28T08:12:19Z

Just to clarify, you don't seem to be marked as involved with the CoreDNS project directly?

@polarathene Correct, I'm not directly involved with the CoreDNS project. Our project (gardener/gardener) consumes CoreDNS, and while not critically important, it'd be nice if we weren't forced to grant the NET_BIND_SERVICE capability when running CoreDNS on ports > 1024.

Thank you for your quick response and detailed explanation. I'll try contributing a change where the capability is raised dynamically when required 🤞

polarathene · 2024-11-28T20:28:10Z

it'd be nice if we weren't forced to grant the NET_BIND_SERVICE capability when running CoreDNS on ports > 1024.

Oh ok. The only reason you wouldn't be forced is when the related systctl makes it irrelevant I think, otherwise you'd need to grant the capability (even for root if it were dropped IIRC).

If it helps, this is from an old comment of mine on the topic, demonstrating a non-root user with setcap only applying +p:

# Now with `+p`:
$ docker run --rm -it --user 1000 \
  --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
  --cap-drop ALL --cap-add CAP_NET_BIND_SERVICE \
  cap-test-p inspect
Ambient: []
Inheritable: []
Effective: []
Permitted: ["CAP_NET_BIND_SERVICE"]
Bounding: ["CAP_NET_BIND_SERVICE"]

# Without the `--aware` flag this would fail
# since the cap would not be in the effective set:
$ docker run --rm -it --user 1000 \
  --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
  --cap-drop ALL --cap-add CAP_NET_BIND_SERVICE \
  cap-test-p --aware

CAP_NET_BIND_SERVICE is required to bind a privileged port
CAP_NET_BIND_SERVICE is permitted but missing from the effective set
Successfully added CAP_NET_BIND_SERVICE into the Effective set
Successfully bound to: TcpListener { addr: 127.0.0.1:80, fd: 3 }

So if the desired port to bind is below net.ipv4.ip_unprivileged_port_start, you'll need to check that CAP_NET_BIND_SERVICE is in the permitted set (it should be by default for the root user, unless it was dropped from the bounding set). If it's permitted, then you can raise to effective set (allowing you to use that capability).

The above output is from a small rust program I wrote a while back to support justifying the change to not require it with CoreDNS, but I didn't get around to publishing it 😓

You can also test similar outside of Docker with systemd with ambient capabilities:

$ sudo systemd-run --system --uid 1000 \
  --unit cap-test-none \
  --collect --pty --quiet \
  --property AmbientCapabilities=CAP_NET_BIND_SERVICE \
  target/x86_64-unknown-linux-musl/release/capability-aware --aware

CAP_NET_BIND_SERVICE is required to bind a privileged port
The effective set already includes CAP_NET_BIND_SERVICE
Successfully bound to: TcpListener { addr: 127.0.0.1:80, fd: 3 }

# The `inspect` subcommand output (bounding set excluded):
Ambient: ["CAP_NET_BIND_SERVICE"]
Inheritable: ["CAP_NET_BIND_SERVICE"]
Permitted: ["CAP_NET_BIND_SERVICE"]
Effective: ["CAP_NET_BIND_SERVICE"]

Simplify Dockerfile

522ea68

Everything else was to support `setcap` which should not be required to support running non-root.

polarathene requested review from bradbeam, chrisohaver, dilyevsky, jameshartig, greenpau, Isolus, johnbelamaric, miekg, pmoroney, rajansandeep, stp-ip, SuperQ, yongtang and Tantalor93 as code owners September 15, 2023 09:59

polarathene mentioned this pull request Sep 16, 2023

Docker working directory changed #6249

Open

This was referenced Dec 15, 2023

Run coredns as non root. #5969

Merged

Use a UID rather than "nonroot" in the Dockerfile #6416

Open

prezha mentioned this pull request Mar 17, 2024

add default sysctls to allow privileged ports with no capabilities kubernetes/minikube#18421

Merged

marc1404 mentioned this pull request Nov 27, 2024

chore(deps): update registry.k8s.io/coredns/coredns docker tag to v1.12.0 gardener/gardener#10909

Merged

1 task

polarathene mentioned this pull request Aug 7, 2025

fix: Dockerfile - Remove setcap stage #7435

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor: Simplify `Dockerfile` #6320

refactor: Simplify `Dockerfile` #6320

Uh oh!

polarathene commented Sep 15, 2023 •

edited

Loading

Uh oh!

chrisohaver commented Sep 15, 2023

Uh oh!

polarathene commented Sep 15, 2023

Uh oh!

SuperQ commented Sep 15, 2023

Uh oh!

polarathene commented Sep 15, 2023

Uh oh!

chrisohaver commented Dec 14, 2023

Uh oh!

marc1404 commented Nov 27, 2024

Uh oh!

polarathene commented Nov 27, 2024

Uh oh!

polarathene commented Nov 27, 2024 •

edited

Loading

Uh oh!

marc1404 commented Nov 28, 2024

Uh oh!

polarathene commented Nov 28, 2024

Uh oh!

Uh oh!

refactor: Simplify Dockerfile #6320

Are you sure you want to change the base?

refactor: Simplify Dockerfile #6320

Uh oh!

Conversation

polarathene commented Sep 15, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Why is this pull request needed and what does it do?

2. Which issues (if any) are related?

3. Which documentation changes (if any) need to be made?

4. Does this introduce a backward incompatible change or deprecation?

Uh oh!

chrisohaver commented Sep 15, 2023

Uh oh!

polarathene commented Sep 15, 2023

Uh oh!

SuperQ commented Sep 15, 2023

Uh oh!

polarathene commented Sep 15, 2023

Uh oh!

chrisohaver commented Dec 14, 2023

Uh oh!

marc1404 commented Nov 27, 2024

Uh oh!

polarathene commented Nov 27, 2024

Uh oh!

polarathene commented Nov 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marc1404 commented Nov 28, 2024

Uh oh!

polarathene commented Nov 28, 2024

Uh oh!

Uh oh!

refactor: Simplify `Dockerfile` #6320

refactor: Simplify `Dockerfile` #6320

polarathene commented Sep 15, 2023 •

edited

Loading

polarathene commented Nov 27, 2024 •

edited

Loading