Skip to content

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Aug 17, 2024

Motivation:

For pod-level user namespaces, it's impossible to force the container runtime
to join an existing network namespace after creating a new user namespace.

According to the capabilities section in user_namespaces(7), a network
namespace created by containerd is owned by the root user namespace. When the
container runtime (like runc or crun) creates a new user namespace, it becomes
a child of the root user namespace. Processes within this child user namespace
are not permitted to access resources owned by the parent user namespace.

If the network namespace is not owned by the new user namespace, the container
runtime will fail to mount /sys due to the sysfs: Restrict mounting sysfs
patch.

Referencing the cap_capable function in Linux, a process can access a
resource if:

  • The resource is owned by the process's user namespace, and the process has
    the required capability.

  • The resource is owned by a child of the process's user namespace, and the
    owner's user namespace was created by the process's UID.

In the context of pod-level user namespaces, the CRI plugin delegates the
creation of the network namespace to the container runtime when running the
pause container. After the pause container is initialized, the CRI plugin pins
the pause container's network namespace into /run/netns and then executes
the CNI_ADD command over it.

However, if the pause container is terminated during the pinning process, the
CRI plugin might encounter a PID cycle, leading to the CNI_ADD command
operating on an incorrect network namespace.

Moreover, rolling back the RunPodSandbox API is complex due to the delegation
of network namespace creation. As highlighted in issue #10363, the CRI plugin
can lose IP information after a containerd restart, making it challenging to
maintain robustness in the RunPodSandbox API.

Solution:

Allow containerd to create a new user namespace and then create the network
namespace within that user namespace. This way, the CRI plugin can force the
container runtime to join both the user namespace and the network namespace.
Since the network namespace is owned by the newly created user namespace,
the container runtime will have the necessary permissions to mount /sys on
the container's root filesystem. As a result, delegation of network namespace
creation is no longer needed.

NOTE:

  • The CRI plugin does not need to pin the newly created user namespace as it
    does with the network namespace, because the kernel allows retrieving a user
    namespace reference via ioctl_ns(2). As a result, the podsandbox
    implementation can obtain the user namespace using the netnsPath parameter.

  • The pkg/sys package continues to use go:linkname to handle fork operations
    due to efficiency, despite being a notable member of hall of shame. If core/mount: use ptrace instead of go:linkname #10611 can work, I will switch it back.

Signed-off-by: Wei Fu fuweid89@gmail.com

@k8s-ci-robot
Copy link

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@fuweid fuweid force-pushed the pin-userns branch 8 times, most recently from 63004fc to 5e57b9f Compare August 18, 2024 10:06
@fuweid fuweid marked this pull request as ready for review August 18, 2024 10:06
@dosubot dosubot bot added the area/cri Container Runtime Interface (CRI) label Aug 18, 2024
@fuweid fuweid changed the title [WIP] internal/cri: Simplify network namespace setup when user namespaces are enabled internal/cri: simplify netns setup with pinned userns Aug 18, 2024
@fuweid
Copy link
Member Author

fuweid commented Aug 18, 2024

ping @dmcgowan @AkihiroSuda @mikebrow @rata @dcantah @MikeZappa87 PTAL when you have time. Thanks

@fuweid fuweid force-pushed the pin-userns branch 4 times, most recently from d8b4935 to 3f6ab6e Compare August 19, 2024 13:56
@fuweid
Copy link
Member Author

fuweid commented Aug 19, 2024

/retest

Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid thanks a lot for tackling this! I really like the simplicity overall. I added some ideas to simplify it further, let me know what you think.

Just to understand, the issue you mention (#10363) has another PR to fix it too: #10400. Do we want both fixes or what is the plan?

It seems the PR is changing a lot just while I write this. Github doesn't still show those files as outdated, so I think the comments will still make sense.

Please let me know when this is more or less settled and I'll take another look

Comment on lines +50 to +53
if len(uidMaps) != 1 {
return nil, fmt.Errorf("required only one uid mapping, but got %d uid mapping(s)", len(uidMaps))
}
if uidMaps[0] == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine here for now. But I wanted to raise awareness that there are other PRs trying to support multiple mappings: #10307

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -79,13 +79,20 @@ func init() {
return nil, fmt.Errorf("unable to load CRI image service plugin dependency: %w", err)
}

usernsInode, err := getCurrentUserNamespaceInode()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find where we use this. What am I missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this since we don't it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you added it back by mistake?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated. Thanks

sandbox.NetNS, err = netns.NewNetNS(netnsMountDir)
} else {
usernsOpts := config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetUsernsOptions()
sandbox.NetNS, err = c.setupNetnsWithinUserns(netnsMountDir, usernsOpts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is run before: 3f6ab6e#diff-00338fb60fb364520225a9a56d3e7cfdadffdb695d98e0970f5df964556cbf46R114 ? Can you confirm? Sorry, I don't remember everything by heart. I'll try to check out the code locally tomorrow :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The internal/cri/server package run before podsanbox pkg.

sandbox.NetNS, err = netns.NewNetNS(netnsMountDir)
} else {
usernsOpts := config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetUsernsOptions()
sandbox.NetNS, err = c.setupNetnsWithinUserns(netnsMountDir, usernsOpts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have this be: setupUsernsAndNetns() and just setup both namespaces? I mean, pin (mount) the userns too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please checkout comment #10607 (comment)

Comment on lines +72 to +76
netNs, err = netns.NewNetNSFromPID(netnsMountDir, uint32(pid))
if err != nil {
return fmt.Errorf("failed to mount netns from pid %d: %w", pid, err)
}
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just add here the code to mount the userns and just simplify the rest (no need for the ioctl). Does it make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please checkout comment #10607 (comment)


if err := c.pinUserNamespace(id, nsPath); err != nil {
return nil, fmt.Errorf("failed to pin user namespace: %w", err)
}
specOpts = append(specOpts, customopts.WithNamespacePath(runtimespec.UserNamespace, c.getSandboxPinnedUserNamespace(id)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we pin the userns here, instead of where we pin the netns? We really depend on the netns being created and populated, not just the path (as this function takes). This function seems more creating the config.json, it seems better to have only the path to the userns here, IMHO.

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that I should pin the userns in internal/cri/server package. However, the podsandbox can be remote plugin. If we handle userns in internal/cri/server package, we will have to extend the sandbox interface with new argument usernsPath. I'm not sure it's good to add such information in the api, because the sandbox implementation can retrieve user namespace from network namespace. So, I use ioctl to fetch the inode and mount it in podsandbox.

func (c *criSandboxService) CreateSandbox(ctx context.Context, info sandbox.Sandbox, opts ...sandbox.CreateOpt) error {
ctrl, err := c.SandboxController(info.Sandboxer)
if err != nil {
return err
}
return ctrl.Create(ctx, info, opts...)
}

func WithNetNSPath(netNSPath string) CreateOpt {
return func(co *CreateOptions) error {
co.NetNSPath = netNSPath
return nil
}
}

cc @mikebrow @abel-von @mxpv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know what a remote sandbox is, so I'll risk making a question that doesn't make sense. Sorry in advance if that is the case :)

The userns is created when we create the netns now, that is not changing. Why can't we just persist it there too, when we mount the netns?

I mean, this way doesn't take a param to the userns path either, but it is just being created with the netns. What is the issue to persist it at that point in time too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The userns is created when we create the netns now, that is not changing. Why can't we just persist it there too, when we mount the netns?

There are two different plugins in containerd.

  • io.containerd.grpc.v1.cri
  • io.containerd.sandbox.controller.v1

Currently, the io.containerd.sandbox.controller.v1 is implementation to setup sandbox environment. The
io.containerd.grpc.v1.cri manages the CNI configuration, which means that it should take responsibility of setting up networking plugin. Based on this, it looks like that io.containerd.grpc.v1.cri should take responsibility to pin the user namespace as well.

However, in my opinion, passing the user namespace path through the API is not ideal, similar to the networking namespace. The user namespace path seems redundant when compared to the networking namespace. Therefore, I believe the sandbox implementation should retrieve the user namespace from the networking namespace path.

message ControllerCreateRequest {
string sandbox_id = 1;
repeated containerd.types.Mount rootfs = 2;
google.protobuf.Any options = 3;
string netns_path = 4;

ping @containerd/committers @containerd/reviewers for more input if it's good to add new param in proto.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'm not familiar (yet) with the new plugins, sorry for the noise :)

The pinning in this function looks odd to me, as this should just create a runtime-spec config. But I'm fine with it, of course, you are the maintainer =)

//go:norace
//go:noinline
//go:nosplit
func unshareAfterEnterUserns(usernsFd uintptr, flags uintptr, pipeFd uintptr) (_pid uintptr, _pidfd uintptr, _ syscall.Errno) {
Copy link
Contributor

@rata rata Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change it to this? The one creating a process seem very easy to reason about, no need to lock the OS in the runtime, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean using os.StartProcess? Sorry I force-push to clean-up the history.

Yes. I have plan to use os.StartProcess(ptrace: true). But I run into two problems:

  1. core/mount: use ptrace instead of go:linkname #10611 (comment)
2024-08-19T13:29:27.6655085Z     pod_userns_linux_test.go:284: 
2024-08-19T13:29:27.6656082Z         	Error Trace:	/home/runner/work/containerd/containerd/integration/pod_userns_linux_test.go:284
2024-08-19T13:29:27.6656870Z         	Error:      	Received unexpected error:
2024-08-19T13:29:27.6662188Z         	            	rpc error: code = Unknown desc = failed to update container "16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e" state: failed to checkpoint status to "/var/lib/containerd-test/io.containerd.grpc.v1.cri/containers/16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e/status": close /var/lib/containerd-test/io.containerd.grpc.v1.cri/containers/16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e/.tmp-status359157146: bad file descriptor
2024-08-19T13:29:27.6666867Z         	Test:       	TestPodUserNS/volumes_permissions

I can't reproduce it in my local. Maybe related to go1.23.0. It seems that the go runtime randomly closes fd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will switch to use os.StartProcess if there is any workrounds from golang/go#68984, like dup or reset finalizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, great finding! Wouldn't just doing a dup here be a valid workaround using unix.Dup()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update it after #10611 merged

@fuweid
Copy link
Member Author

fuweid commented Aug 20, 2024

Just to understand, the issue you mention (#10363) has another PR to fix it too: #10400. Do we want both fixes or what is the plan?

@rata My plan is to simplify the logic about user namespace. There is a lot of changes since your first version. I revisited the code and found that the sandbox API design doesn't allow us to check pause container is alive when we try to pin netns by pid. It's easy to cause pid recycle issue. So pinned user namespace can help us avoid this case.

https://github.com/kinvolk/containerd/blob/ca69ae26567ca36f4a14d6896998d9130459ce4e/pkg/cri/server/sandbox_run.go#L408-L412

Yes. I think it can fix the issue #10363. It's kind of other option.

Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks again for tackling this! :)

@fuweid makes sense, thanks! I think this PR is quite clean (userns creation in go can be tricky), it will be very clean after #10611.

I think it is also okay to merge this as it is and then fo a follow-up PR once #10611 is done (or just to the clean-up there), if that is convenient, as the cleanup is very isolated and small.

btw, I have also spin-up a cluster locally with this. But all I did was covered by tests already :-D

@fuweid
Copy link
Member Author

fuweid commented Sep 6, 2024

ping @rata I updated the pull request. PTAL, Thanks!

Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid It seems you added back the unused helper about the inode of the namespace. Other than that, it seems good :)

@@ -79,13 +79,20 @@ func init() {
return nil, fmt.Errorf("unable to load CRI image service plugin dependency: %w", err)
}

usernsInode, err := getCurrentUserNamespaceInode()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you added it back by mistake?

Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!


if err := c.pinUserNamespace(id, nsPath); err != nil {
return nil, fmt.Errorf("failed to pin user namespace: %w", err)
}
specOpts = append(specOpts, customopts.WithNamespacePath(runtimespec.UserNamespace, c.getSandboxPinnedUserNamespace(id)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'm not familiar (yet) with the new plugins, sorry for the noise :)

The pinning in this function looks odd to me, as this should just create a runtime-spec config. But I'm fine with it, of course, you are the maintainer =)

It allows to disassociate parts of its execution context within a user
namespace.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Motivation:

For pod-level user namespaces, it's impossible to force the container runtime
to join an existing network namespace after creating a new user namespace.

According to the capabilities section in [user_namespaces(7)][1], a network
namespace created by containerd is owned by the root user namespace. When the
container runtime (like runc or crun) creates a new user namespace, it becomes
a child of the root user namespace. Processes within this child user namespace
are not permitted to access resources owned by the parent user namespace.

If the network namespace is not owned by the new user namespace, the container
runtime will fail to mount /sys due to the [sysfs: Restrict mounting sysfs][2]
patch.

Referencing the [cap_capable][3] function in Linux, a process can access a
resource if:

* The resource is owned by the process's user namespace, and the process has
the required capability.

* The resource is owned by a child of the process's user namespace, and the
owner's user namespace was created by the process's UID.

In the context of pod-level user namespaces, the CRI plugin delegates the
creation of the network namespace to the container runtime when running the
pause container. After the pause container is initialized, the CRI plugin pins
the pause container's network namespace into `/run/netns` and then executes
the `CNI_ADD` command over it.

However, if the pause container is terminated during the pinning process, the
CRI plugin might encounter a PID cycle, leading to the `CNI_ADD` command
operating on an incorrect network namespace.

Moreover, rolling back the `RunPodSandbox` API is complex due to the delegation
of network namespace creation. As highlighted in issue containerd#10363, the CRI plugin
can lose IP information after a containerd restart, making it challenging to
maintain robustness in the RunPodSandbox API.

Solution:

Allow containerd to create a new user namespace and then create the network
namespace within that user namespace. This way, the CRI plugin can force the
container runtime to join both the user namespace and the network namespace.
Since the network namespace is owned by the newly created user namespace,
the container runtime will have the necessary permissions to mount `/sys` on
the container's root filesystem. As a result, delegation of network namespace
creation is no longer needed.

NOTE:

* The CRI plugin does not need to pin the newly created user namespace as it
does with the network namespace, because the kernel allows retrieving a user
namespace reference via [ioctl_ns(2)][4]. As a result, the podsandbox
implementation can obtain the user namespace using the `netnsPath` parameter.

[1]: <https://man7.org/linux/man-pages/man7/user_namespaces.7.html>
[2]: <torvalds/linux@7dc5dbc>
[3]: <https://github.com/torvalds/linux/blob/2c85ebc57b3e1817b6ce1a6b703928e113a90442/security/commoncap.c#L65>
[4]: <https://man7.org/linux/man-pages/man2/ioctl_ns.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@fuweid
Copy link
Member Author

fuweid commented Sep 14, 2024

ping @AkihiroSuda @dmcgowan may I have a approval on this? Thanks

@AkihiroSuda AkihiroSuda added this pull request to the merge queue Sep 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 19, 2024
@AkihiroSuda AkihiroSuda added this pull request to the merge queue Sep 19, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 19, 2024
@AkihiroSuda AkihiroSuda added this pull request to the merge queue Sep 19, 2024
Merged via the queue into containerd:main with commit 8c64a2f Sep 19, 2024
53 checks passed
@rata
Copy link
Contributor

rata commented Sep 20, 2024

@fuweid does this fix #10363 too? Or do we still need a fix for that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cri Container Runtime Interface (CRI) ok-to-test size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants