internal/cri: simplify netns setup with pinned userns #10607

fuweid · 2024-08-17T14:14:59Z

Motivation:

For pod-level user namespaces, it's impossible to force the container runtime
to join an existing network namespace after creating a new user namespace.

According to the capabilities section in user_namespaces(7), a network
namespace created by containerd is owned by the root user namespace. When the
container runtime (like runc or crun) creates a new user namespace, it becomes
a child of the root user namespace. Processes within this child user namespace
are not permitted to access resources owned by the parent user namespace.

If the network namespace is not owned by the new user namespace, the container
runtime will fail to mount /sys due to the sysfs: Restrict mounting sysfs
patch.

Referencing the cap_capable function in Linux, a process can access a
resource if:

* The resource is owned by the process's user namespace, and the process has
  the required capability.
* The resource is owned by a child of the process's user namespace, and the
  owner's user namespace was created by the process's UID.
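
The two conditions can be illustrated with a toy model. The sketch below is a simplified approximation of the namespace walk in cap_capable, not kernel code; the `userNS` and `cred` types and the `capable` function are hypothetical names made up for this example.

```go
package main

import "fmt"

// userNS is a toy stand-in for a kernel user namespace (hypothetical type).
type userNS struct {
	parent   *userNS // nil for the initial (root) user namespace
	ownerUID int     // UID, in the parent namespace, that created this namespace
	level    int     // nesting depth; the initial namespace is level 0
}

// cred is a toy stand-in for a process's credentials (hypothetical type).
type cred struct {
	ns     *userNS
	euid   int
	hasCap bool // whether the required capability is in the effective set
}

// capable reports whether c may act on a resource owned by target.
func capable(c cred, target *userNS) bool {
	for ns := target; ns != nil; ns = ns.parent {
		if ns == c.ns {
			// Same user namespace: the capability itself decides.
			return c.hasCap
		}
		if ns.level <= c.ns.level {
			// target is not a descendant of the process's namespace.
			return false
		}
		if ns.parent == c.ns && ns.ownerUID == c.euid {
			// The process's UID created this child namespace.
			return true
		}
	}
	return false
}

func main() {
	root := &userNS{}
	pod := &userNS{parent: root, ownerUID: 0, level: 1}

	containerd := cred{ns: root, euid: 0, hasCap: true}
	runtimeInPod := cred{ns: pod, euid: 0, hasCap: true}

	fmt.Println(capable(containerd, pod))    // parent acting on a child it created
	fmt.Println(capable(runtimeInPod, root)) // child acting on a root-owned resource
	fmt.Println(capable(runtimeInPod, pod))  // child acting on its own resource
}
```

In this model, a runtime process inside the pod's user namespace is denied access to a network namespace owned by the root user namespace, which is the failure the Motivation describes.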

In the context of pod-level user namespaces, the CRI plugin delegates the
creation of the network namespace to the container runtime when running the
pause container. After the pause container is initialized, the CRI plugin pins
the pause container's network namespace into /run/netns and then executes
the CNI_ADD command over it.

However, if the pause container is terminated during the pinning process, its
PID may be recycled, and the CNI_ADD command could then operate on an
incorrect network namespace.

Moreover, rolling back the RunPodSandbox API is complex due to the delegation
of network namespace creation. As highlighted in issue #10363, the CRI plugin
can lose IP information after a containerd restart, making it challenging to
maintain robustness in the RunPodSandbox API.

Solution:

Allow containerd to create a new user namespace and then create the network
namespace within that user namespace. This way, the CRI plugin can force the
container runtime to join both the user namespace and the network namespace.
Since the network namespace is owned by the newly created user namespace,
the container runtime will have the necessary permissions to mount /sys on
the container's root filesystem. As a result, delegation of network namespace
creation is no longer needed.

NOTE:

* The CRI plugin does not need to pin the newly created user namespace as it
  does with the network namespace, because the kernel allows retrieving a user
  namespace reference via ioctl_ns(2). As a result, the podsandbox
  implementation can obtain the user namespace using the netnsPath parameter.
* The pkg/sys package continues to use go:linkname to handle fork operations
  for efficiency, despite being a notable member of the hall of shame. If
  "core/mount: use ptrace instead of go:linkname" (#10611) works out, I will
  switch it back.
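
The ioctl_ns(2) lookup mentioned in the first note can be sketched with the standard library alone. This is a Linux-only illustration and not the pkg/sys implementation: `ownerUserns` and `inode` are hypothetical helper names, and the hard-coded request value assumes an architecture with the generic ioctl encoding (for example amd64 or arm64).

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// nsGetUserns is NS_GET_USERNS from <linux/nsfs.h>, i.e. _IO(0xb7, 0x1).
// Assumption: generic ioctl encoding (amd64, arm64, and similar).
const nsGetUserns = 0xb701

// ownerUserns opens the namespace file at nsPath and returns a file that
// refers to the user namespace owning it.
func ownerUserns(nsPath string) (*os.File, error) {
	f, err := os.Open(nsPath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	// On success the ioctl returns a new file descriptor for the owner.
	fd, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(), nsGetUserns, 0)
	if errno != 0 {
		return nil, fmt.Errorf("NS_GET_USERNS on %s: %w", nsPath, errno)
	}
	return os.NewFile(fd, "userns"), nil
}

// inode returns the inode number that identifies the namespace behind f.
func inode(f *os.File) (uint64, error) {
	var st syscall.Stat_t
	if err := syscall.Fstat(int(f.Fd()), &st); err != nil {
		return 0, err
	}
	return st.Ino, nil
}

func main() {
	userns, err := ownerUserns("/proc/self/ns/net")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	defer userns.Close()

	ino, err := inode(userns)
	if err != nil {
		fmt.Println("fstat failed:", err)
		return
	}
	fmt.Println("owning user namespace inode:", ino)
}
```

The lookup can fail with EPERM when the owning user namespace is outside the caller's own namespace scope, so callers should treat the error path as expected rather than exceptional.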

Signed-off-by: Wei Fu <fuweid89@gmail.com>

k8s-ci-robot · 2024-08-17T14:15:01Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

fuweid · 2024-08-18T12:39:56Z

ping @dmcgowan @AkihiroSuda @mikebrow @rata @dcantah @MikeZappa87 PTAL when you have time. Thanks

fuweid · 2024-08-19T14:56:59Z

/retest

rata

@fuweid thanks a lot for tackling this! I really like the simplicity overall. I added some ideas to simplify it further, let me know what you think.

Just to understand, the issue you mention (#10363) has another PR to fix it too: #10400. Do we want both fixes or what is the plan?

It seems the PR is changing a lot while I write this. GitHub still doesn't show those files as outdated, so I think the comments will still make sense.

Please let me know when this is more or less settled and I'll take another look

rata · 2024-08-19T15:34:48Z

internal/cri/server/sandbox_run_linux.go

+	if len(uidMaps) != 1 {
+		return nil, fmt.Errorf("required only one uid mapping, but got %d uid mapping(s)", len(uidMaps))
+	}
+	if uidMaps[0] == nil {


I think this is fine here for now. But I wanted to raise awareness that there are other PRs trying to support multiple mappings: #10307

rata · 2024-08-19T15:47:06Z

internal/cri/server/podsandbox/controller.go

@@ -79,13 +79,20 @@ func init() {
 				return nil, fmt.Errorf("unable to load CRI image service plugin dependency: %w", err)
 			}

+			usernsInode, err := getCurrentUserNamespaceInode()


I can't find where we use this. What am I missing?

Removed this since we don't use it.

It seems you added it back by mistake?

updated. Thanks

rata · 2024-08-19T15:57:03Z

internal/cri/server/sandbox_run.go

+			sandbox.NetNS, err = netns.NewNetNS(netnsMountDir)
+		} else {
+			usernsOpts := config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetUsernsOptions()
+			sandbox.NetNS, err = c.setupNetnsWithinUserns(netnsMountDir, usernsOpts)


I guess this is run before: 3f6ab6e#diff-00338fb60fb364520225a9a56d3e7cfdadffdb695d98e0970f5df964556cbf46R114 ? Can you confirm? Sorry, I don't remember everything by heart. I'll try to check out the code locally tomorrow :)

Yes. The internal/cri/server package runs before the podsandbox package.

rata · 2024-08-19T15:58:21Z

internal/cri/server/sandbox_run.go

+			sandbox.NetNS, err = netns.NewNetNS(netnsMountDir)
+		} else {
+			usernsOpts := config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetUsernsOptions()
+			sandbox.NetNS, err = c.setupNetnsWithinUserns(netnsMountDir, usernsOpts)


Why not have this be: setupUsernsAndNetns() and just setup both namespaces? I mean, pin (mount) the userns too.

Please check out comment #10607 (comment)

rata · 2024-08-19T15:59:06Z

internal/cri/server/sandbox_run_linux.go

+			netNs, err = netns.NewNetNSFromPID(netnsMountDir, uint32(pid))
+			if err != nil {
+				return fmt.Errorf("failed to mount netns from pid %d: %w", pid, err)
+			}
+			return nil


I'd just add here the code to mount the userns and just simplify the rest (no need for the ioctl). Does it make sense?

Please check out comment #10607 (comment)

rata · 2024-08-19T16:01:49Z

internal/cri/server/podsandbox/sandbox_run_linux.go

+
+			if err := c.pinUserNamespace(id, nsPath); err != nil {
+				return nil, fmt.Errorf("failed to pin user namespace: %w", err)
+			}
+			specOpts = append(specOpts, customopts.WithNamespacePath(runtimespec.UserNamespace, c.getSandboxPinnedUserNamespace(id)))


Why do we pin the userns here, instead of where we pin the netns? We really depend on the netns being created and populated, not just the path (as this function takes). This function seems to be more about creating the config.json, so it seems better to have only the path to the userns here, IMHO.

What do you think?

I was thinking that I should pin the userns in the internal/cri/server package. However, the podsandbox can be a remote plugin. If we handle the userns in the internal/cri/server package, we will have to extend the sandbox interface with a new usernsPath argument. I'm not sure it's good to add such information to the API, because the sandbox implementation can retrieve the user namespace from the network namespace. So I use ioctl to fetch the inode and mount it in podsandbox.

containerd/internal/cri/server/sandbox_service.go

Lines 51 to 57 in e8104a4

	func (c *criSandboxService) CreateSandbox(ctx context.Context, info sandbox.Sandbox, opts ...sandbox.CreateOpt) error {
		ctrl, err := c.SandboxController(info.Sandboxer)
		if err != nil {
			return err
		}
		return ctrl.Create(ctx, info, opts...)
	}

containerd/core/sandbox/controller.go

Lines 65 to 70 in e8104a4

	func WithNetNSPath(netNSPath string) CreateOpt {
		return func(co *CreateOptions) error {
			co.NetNSPath = netNSPath
			return nil
		}
	}

cc @mikebrow @abel-von @mxpv

I don't really know what a remote sandbox is, so I'll risk asking a question that doesn't make sense. Sorry in advance if that is the case :)

The userns is created when we create the netns now, that is not changing. Why can't we just persist it there too, when we mount the netns?

I mean, this way doesn't take a param to the userns path either, but it is just being created with the netns. What is the issue to persist it at that point in time too?

> The userns is created when we create the netns now, that is not changing. Why can't we just persist it there too, when we mount the netns?

There are two different plugins in containerd.

io.containerd.grpc.v1.cri

io.containerd.sandbox.controller.v1

Currently, io.containerd.sandbox.controller.v1 is the implementation that sets up the sandbox environment. io.containerd.grpc.v1.cri manages the CNI configuration, which means it is responsible for setting up the networking plugin. Based on this, it looks like io.containerd.grpc.v1.cri should be responsible for pinning the user namespace as well.

However, in my opinion, passing the user namespace path through the API, as is done for the network namespace, is not ideal. The user namespace path is redundant given the network namespace path. Therefore, I believe the sandbox implementation should retrieve the user namespace from the network namespace path.

containerd/api/services/sandbox/v1/sandbox.proto

Lines 101 to 105 in c8b095f

	message ControllerCreateRequest {
		string sandbox_id = 1;
		repeated containerd.types.Mount rootfs = 2;
		google.protobuf.Any options = 3;
		string netns_path = 4;

ping @containerd/committers @containerd/reviewers for more input on whether it's good to add a new param to the proto.

Makes sense, I'm not familiar (yet) with the new plugins, sorry for the noise :)

The pinning in this function looks odd to me, as this should just create a runtime-spec config. But I'm fine with it, of course, you are the maintainer =)

rata · 2024-08-19T16:24:01Z

pkg/sys/unshare_unsafe_linux.go

+//go:norace
+//go:noinline
+//go:nosplit
+func unshareAfterEnterUserns(usernsFd uintptr, flags uintptr, pipeFd uintptr) (_pid uintptr, _pidfd uintptr, _ syscall.Errno) {


Why did you change it to this? The one creating a process seems very easy to reason about, with no need to lock the OS thread in the runtime, etc.

Do you mean using os.StartProcess? Sorry, I force-pushed to clean up the history.

Yes. I plan to use os.StartProcess(ptrace: true), but I ran into two problems:

core/mount: use ptrace instead of go:linkname #10611 (comment)

2024-08-19T13:29:27.6655085Z pod_userns_linux_test.go:284:
2024-08-19T13:29:27.6656082Z Error Trace: /home/runner/work/containerd/containerd/integration/pod_userns_linux_test.go:284
2024-08-19T13:29:27.6656870Z Error: Received unexpected error:
2024-08-19T13:29:27.6662188Z rpc error: code = Unknown desc = failed to update container "16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e" state: failed to checkpoint status to "/var/lib/containerd-test/io.containerd.grpc.v1.cri/containers/16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e/status": close /var/lib/containerd-test/io.containerd.grpc.v1.cri/containers/16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e/.tmp-status359157146: bad file descriptor
2024-08-19T13:29:27.6666867Z Test: TestPodUserNS/volumes_permissions

I can't reproduce it locally. It may be related to go1.23.0: the Go runtime seems to close the fd at random.

I will switch to os.StartProcess if there are any workarounds from golang/go#68984, like dup or resetting the finalizer.

Oh, great finding! Wouldn't just doing a dup here be a valid workaround using unix.Dup()?

Will update it after #10611 is merged.

fuweid · 2024-08-20T09:19:13Z

> Just to understand, the issue you mention (#10363) has another PR to fix it too: #10400. Do we want both fixes or what is the plan?

@rata My plan is to simplify the user namespace logic. There have been a lot of changes since your first version. I revisited the code and found that the sandbox API design doesn't allow us to check whether the pause container is alive when we try to pin the netns by PID, which can easily cause a PID recycle issue. A pinned user namespace helps us avoid this case.

https://github.com/kinvolk/containerd/blob/ca69ae26567ca36f4a14d6896998d9130459ce4e/pkg/cri/server/sandbox_run.go#L408-L412

Yes. I think it can fix issue #10363. It's another option.

rata

LGTM. Thanks again for tackling this! :)

@fuweid makes sense, thanks! I think this PR is quite clean (userns creation in go can be tricky), it will be very clean after #10611.

I think it is also okay to merge this as it is and then do a follow-up PR once #10611 is done (or just do the clean-up there), if that is convenient, as the cleanup is very isolated and small.

btw, I have also spun up a cluster locally with this. But everything I did was already covered by tests :-D

fuweid · 2024-09-06T09:58:28Z

ping @rata I updated the pull request. PTAL, Thanks!

rata

@fuweid It seems you added back the unused helper about the inode of the namespace. Other than that, it seems good :)

rata · 2024-09-09T14:47:15Z

internal/cri/server/podsandbox/controller.go

@@ -79,13 +79,20 @@ func init() {
 				return nil, fmt.Errorf("unable to load CRI image service plugin dependency: %w", err)
 			}

+			usernsInode, err := getCurrentUserNamespaceInode()


It seems you added it back by mistake?

rata

LGTM, thanks!

rata · 2024-09-09T16:40:19Z

internal/cri/server/podsandbox/sandbox_run_linux.go

+
+			if err := c.pinUserNamespace(id, nsPath); err != nil {
+				return nil, fmt.Errorf("failed to pin user namespace: %w", err)
+			}
+			specOpts = append(specOpts, customopts.WithNamespacePath(runtimespec.UserNamespace, c.getSandboxPinnedUserNamespace(id)))


Makes sense, I'm not familiar (yet) with the new plugins, sorry for the noise :)

The pinning in this function looks odd to me, as this should just create a runtime-spec config. But I'm fine with it, of course, you are the maintainer =)

Motivation:

For pod-level user namespaces, it's impossible to force the container runtime
to join an existing network namespace after creating a new user namespace.

According to the capabilities section in [user_namespaces(7)][1], a network
namespace created by containerd is owned by the root user namespace. When the
container runtime (like runc or crun) creates a new user namespace, it becomes
a child of the root user namespace. Processes within this child user namespace
are not permitted to access resources owned by the parent user namespace.

If the network namespace is not owned by the new user namespace, the container
runtime will fail to mount /sys due to the [sysfs: Restrict mounting sysfs][2]
patch.

Referencing the [cap_capable][3] function in Linux, a process can access a
resource if:

* The resource is owned by the process's user namespace, and the process has
  the required capability.
* The resource is owned by a child of the process's user namespace, and the
  owner's user namespace was created by the process's UID.

In the context of pod-level user namespaces, the CRI plugin delegates the
creation of the network namespace to the container runtime when running the
pause container. After the pause container is initialized, the CRI plugin pins
the pause container's network namespace into `/run/netns` and then executes
the `CNI_ADD` command over it.

However, if the pause container is terminated during the pinning process, the
CRI plugin might encounter a PID cycle, leading to the `CNI_ADD` command
operating on an incorrect network namespace.

Moreover, rolling back the `RunPodSandbox` API is complex due to the delegation
of network namespace creation. As highlighted in issue containerd#10363, the
CRI plugin can lose IP information after a containerd restart, making it
challenging to maintain robustness in the RunPodSandbox API.

Solution:

Allow containerd to create a new user namespace and then create the network
namespace within that user namespace. This way, the CRI plugin can force the
container runtime to join both the user namespace and the network namespace.
Since the network namespace is owned by the newly created user namespace,
the container runtime will have the necessary permissions to mount `/sys` on
the container's root filesystem. As a result, delegation of network namespace
creation is no longer needed.

NOTE:

* The CRI plugin does not need to pin the newly created user namespace as it
  does with the network namespace, because the kernel allows retrieving a user
  namespace reference via [ioctl_ns(2)][4]. As a result, the podsandbox
  implementation can obtain the user namespace using the `netnsPath` parameter.

[1]: <https://man7.org/linux/man-pages/man7/user_namespaces.7.html>
[2]: <torvalds/linux@7dc5dbc>
[3]: <https://github.com/torvalds/linux/blob/2c85ebc57b3e1817b6ce1a6b703928e113a90442/security/commoncap.c#L65>
[4]: <https://man7.org/linux/man-pages/man2/ioctl_ns.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>

fuweid · 2024-09-14T00:12:51Z

ping @AkihiroSuda @dmcgowan may I have an approval on this? Thanks

rata · 2024-09-20T08:44:36Z

@fuweid does this fix #10363 too? Or do we still need a fix for that?

k8s-ci-robot added do-not-merge/work-in-progress size/XL labels Aug 17, 2024

fuweid force-pushed the pin-userns branch 8 times, most recently from 63004fc to 5e57b9f Compare August 18, 2024 10:06

fuweid marked this pull request as ready for review August 18, 2024 10:06

dosubot bot added the area/cri Container Runtime Interface (CRI) label Aug 18, 2024

fuweid changed the title ~~[WIP] internal/cri: Simplify network namespace setup when user namespaces are enabled~~ internal/cri: simplify netns setup with pinned userns Aug 18, 2024

k8s-ci-robot removed the do-not-merge/work-in-progress label Aug 18, 2024

fuweid added the ok-to-test label Aug 18, 2024

fuweid force-pushed the pin-userns branch 4 times, most recently from d8b4935 to 3f6ab6e Compare August 19, 2024 13:56

fuweid force-pushed the pin-userns branch from 3f6ab6e to 5e57b9f Compare August 19, 2024 16:04

MikeZappa87 self-assigned this Aug 19, 2024

rata reviewed Aug 19, 2024

fuweid force-pushed the pin-userns branch from 5e57b9f to 65a1ad9 Compare August 20, 2024 09:05

rata approved these changes Aug 23, 2024

k8s-ci-robot added the needs-rebase label Aug 30, 2024

fuweid force-pushed the pin-userns branch from 65a1ad9 to cd64666 Compare September 6, 2024 09:41

k8s-ci-robot removed the needs-rebase label Sep 6, 2024

rata reviewed Sep 9, 2024

fuweid force-pushed the pin-userns branch from cd64666 to d25de18 Compare September 9, 2024 15:17

rata approved these changes Sep 9, 2024

rata mentioned this pull request Sep 10, 2024

'runc exec' errors with 'failed to setns into net namespace: Operation not permitted' opencontainers/runc#4390

Closed

cpuguy83 approved these changes Sep 10, 2024

pkg/sys/unshare_linux.go (outdated, resolved)

fuweid added 3 commits September 11, 2024 07:21

pkg/sys: Add UnshareAfterEnterUserns function

490e45a

It allows to disassociate parts of its execution context within a user namespace.

Signed-off-by: Wei Fu <fuweid89@gmail.com>

pkg/sys: add GetUsernsForNamespace interface

fd3f3d5

Signed-off-by: Wei Fu <fuweid89@gmail.com>

fuweid force-pushed the pin-userns branch from d25de18 to ee0ed75 Compare September 10, 2024 23:23

AkihiroSuda approved these changes Sep 18, 2024

AkihiroSuda added this pull request to the merge queue Sep 18, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 19, 2024

AkihiroSuda added this pull request to the merge queue Sep 19, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 19, 2024

AkihiroSuda added this pull request to the merge queue Sep 19, 2024

Merged via the queue into containerd:main with commit 8c64a2f Sep 19, 2024
53 checks passed

mbaynton mentioned this pull request Sep 27, 2024

setgroups denied in user namespaces #10742

Closed

AkihiroSuda mentioned this pull request Oct 17, 2024

[v2.0.0] No CNI info for pod sandbox after containerd restart when using user namespaces #10363

Closed

This was referenced Oct 17, 2024

Allow setgroups in user namespaces #10741

Merged

fsGroup are not applied, when user-namespaces are enabled #10847

Closed

update runc binary to 1.2.1 #10877

Merged

AkihiroSuda mentioned this pull request Oct 23, 2024

containerd's TestPodUserNS fails with runc v1.2 (succeeds with crun) on SELinux distro: setxattr /[...]/dev/mqueue: operation not permitted opencontainers/runc#4466

Closed

fuweid mentioned this pull request Aug 7, 2025

sys: fix pidfd leak in UnshareAfterEnterUserns #12167

Merged
