-
Notifications
You must be signed in to change notification settings - Fork 3.6k
internal/cri: simplify netns setup with pinned userns #10607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Skipping CI for Draft Pull Request. |
63004fc
to
5e57b9f
Compare
ping @dmcgowan @AkihiroSuda @mikebrow @rata @dcantah @MikeZappa87 PTAL when you have time. Thanks |
d8b4935
to
3f6ab6e
Compare
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fuweid thanks a lot for tackling this! I really like the simplicity overall. I added some ideas to simplify it further, let me know what you think.
Just to understand, the issue you mention (#10363) has another PR to fix it too: #10400. Do we want both fixes or what is the plan?
It seems the PR is changing a lot just while I write this. Github doesn't still show those files as outdated, so I think the comments will still make sense.
Please let me know when this is more or less settled and I'll take another look
if len(uidMaps) != 1 { | ||
return nil, fmt.Errorf("required only one uid mapping, but got %d uid mapping(s)", len(uidMaps)) | ||
} | ||
if uidMaps[0] == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine here for now. But I wanted to raise awareness that there are other PRs trying to support multiple mappings: #10307
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@@ -79,13 +79,20 @@ func init() { | |||
return nil, fmt.Errorf("unable to load CRI image service plugin dependency: %w", err) | |||
} | |||
|
|||
usernsInode, err := getCurrentUserNamespaceInode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find where we use this. What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this since we don't it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you added it back by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated. Thanks
sandbox.NetNS, err = netns.NewNetNS(netnsMountDir) | ||
} else { | ||
usernsOpts := config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetUsernsOptions() | ||
sandbox.NetNS, err = c.setupNetnsWithinUserns(netnsMountDir, usernsOpts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is run before: 3f6ab6e#diff-00338fb60fb364520225a9a56d3e7cfdadffdb695d98e0970f5df964556cbf46R114 ? Can you confirm? Sorry, I don't remember everything by heart. I'll try to check out the code locally tomorrow :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The internal/cri/server
package run before podsanbox
pkg.
sandbox.NetNS, err = netns.NewNetNS(netnsMountDir) | ||
} else { | ||
usernsOpts := config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetUsernsOptions() | ||
sandbox.NetNS, err = c.setupNetnsWithinUserns(netnsMountDir, usernsOpts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have this be: setupUsernsAndNetns()
and just setup both namespaces? I mean, pin (mount) the userns too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please checkout comment #10607 (comment)
netNs, err = netns.NewNetNSFromPID(netnsMountDir, uint32(pid)) | ||
if err != nil { | ||
return fmt.Errorf("failed to mount netns from pid %d: %w", pid, err) | ||
} | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just add here the code to mount the userns and just simplify the rest (no need for the ioctl). Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please checkout comment #10607 (comment)
|
||
if err := c.pinUserNamespace(id, nsPath); err != nil { | ||
return nil, fmt.Errorf("failed to pin user namespace: %w", err) | ||
} | ||
specOpts = append(specOpts, customopts.WithNamespacePath(runtimespec.UserNamespace, c.getSandboxPinnedUserNamespace(id))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we pin the userns here, instead of where we pin the netns? We really depend on the netns being created and populated, not just the path (as this function takes). This function seems more creating the config.json, it seems better to have only the path to the userns here, IMHO.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that I should pin the userns in internal/cri/server package. However, the podsandbox can be remote plugin. If we handle userns in internal/cri/server package, we will have to extend the sandbox interface with new argument usernsPath
. I'm not sure it's good to add such information in the api, because the sandbox implementation can retrieve user namespace from network namespace. So, I use ioctl
to fetch the inode and mount it in podsandbox.
containerd/internal/cri/server/sandbox_service.go
Lines 51 to 57 in e8104a4
func (c *criSandboxService) CreateSandbox(ctx context.Context, info sandbox.Sandbox, opts ...sandbox.CreateOpt) error { | |
ctrl, err := c.SandboxController(info.Sandboxer) | |
if err != nil { | |
return err | |
} | |
return ctrl.Create(ctx, info, opts...) | |
} |
containerd/core/sandbox/controller.go
Lines 65 to 70 in e8104a4
func WithNetNSPath(netNSPath string) CreateOpt { | |
return func(co *CreateOptions) error { | |
co.NetNSPath = netNSPath | |
return nil | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know what a remote sandbox is, so I'll risk making a question that doesn't make sense. Sorry in advance if that is the case :)
The userns is created when we create the netns now, that is not changing. Why can't we just persist it there too, when we mount the netns?
I mean, this way doesn't take a param to the userns path either, but it is just being created with the netns. What is the issue to persist it at that point in time too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The userns is created when we create the netns now, that is not changing. Why can't we just persist it there too, when we mount the netns?
There are two different plugins in containerd.
io.containerd.grpc.v1.cri
io.containerd.sandbox.controller.v1
Currently, the io.containerd.sandbox.controller.v1
is implementation to setup sandbox environment. The
io.containerd.grpc.v1.cri
manages the CNI configuration, which means that it should take responsibility of setting up networking plugin. Based on this, it looks like that io.containerd.grpc.v1.cri
should take responsibility to pin the user namespace as well.
However, in my opinion, passing the user namespace path through the API is not ideal, similar to the networking namespace. The user namespace path seems redundant when compared to the networking namespace. Therefore, I believe the sandbox implementation should retrieve the user namespace from the networking namespace path.
containerd/api/services/sandbox/v1/sandbox.proto
Lines 101 to 105 in c8b095f
message ControllerCreateRequest { | |
string sandbox_id = 1; | |
repeated containerd.types.Mount rootfs = 2; | |
google.protobuf.Any options = 3; | |
string netns_path = 4; |
ping @containerd/committers @containerd/reviewers for more input if it's good to add new param in proto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'm not familiar (yet) with the new plugins, sorry for the noise :)
The pinning in this function looks odd to me, as this should just create a runtime-spec config. But I'm fine with it, of course, you are the maintainer =)
pkg/sys/unshare_unsafe_linux.go
Outdated
//go:norace | ||
//go:noinline | ||
//go:nosplit | ||
func unshareAfterEnterUserns(usernsFd uintptr, flags uintptr, pipeFd uintptr) (_pid uintptr, _pidfd uintptr, _ syscall.Errno) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change it to this? The one creating a process seem very easy to reason about, no need to lock the OS in the runtime, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean using os.StartProcess
? Sorry I force-push to clean-up the history.
Yes. I have plan to use os.StartProcess(ptrace: true)
. But I run into two problems:
2024-08-19T13:29:27.6655085Z pod_userns_linux_test.go:284:
2024-08-19T13:29:27.6656082Z Error Trace: /home/runner/work/containerd/containerd/integration/pod_userns_linux_test.go:284
2024-08-19T13:29:27.6656870Z Error: Received unexpected error:
2024-08-19T13:29:27.6662188Z rpc error: code = Unknown desc = failed to update container "16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e" state: failed to checkpoint status to "/var/lib/containerd-test/io.containerd.grpc.v1.cri/containers/16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e/status": close /var/lib/containerd-test/io.containerd.grpc.v1.cri/containers/16dd29b770c6a7127b84ebf1228617b417498522281b391ef5bd35d05ea6c85e/.tmp-status359157146: bad file descriptor
2024-08-19T13:29:27.6666867Z Test: TestPodUserNS/volumes_permissions
I can't reproduce it in my local. Maybe related to go1.23.0. It seems that the go runtime randomly closes fd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will switch to use os.StartProcess
if there is any workrounds from golang/go#68984, like dup or reset finalizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, great finding! Wouldn't just doing a dup here be a valid workaround using unix.Dup()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update it after #10611 merged
@rata My plan is to simplify the logic about user namespace. There is a lot of changes since your first version. I revisited the code and found that the sandbox API design doesn't allow us to check pause container is alive when we try to pin netns by pid. It's easy to cause pid recycle issue. So pinned user namespace can help us avoid this case. Yes. I think it can fix the issue #10363. It's kind of other option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks again for tackling this! :)
@fuweid makes sense, thanks! I think this PR is quite clean (userns creation in go can be tricky), it will be very clean after #10611.
I think it is also okay to merge this as it is and then fo a follow-up PR once #10611 is done (or just to the clean-up there), if that is convenient, as the cleanup is very isolated and small.
btw, I have also spin-up a cluster locally with this. But all I did was covered by tests already :-D
ping @rata I updated the pull request. PTAL, Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fuweid It seems you added back the unused helper about the inode of the namespace. Other than that, it seems good :)
@@ -79,13 +79,20 @@ func init() { | |||
return nil, fmt.Errorf("unable to load CRI image service plugin dependency: %w", err) | |||
} | |||
|
|||
usernsInode, err := getCurrentUserNamespaceInode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you added it back by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
|
||
if err := c.pinUserNamespace(id, nsPath); err != nil { | ||
return nil, fmt.Errorf("failed to pin user namespace: %w", err) | ||
} | ||
specOpts = append(specOpts, customopts.WithNamespacePath(runtimespec.UserNamespace, c.getSandboxPinnedUserNamespace(id))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'm not familiar (yet) with the new plugins, sorry for the noise :)
The pinning in this function looks odd to me, as this should just create a runtime-spec config. But I'm fine with it, of course, you are the maintainer =)
It allows to disassociate parts of its execution context within a user namespace. Signed-off-by: Wei Fu <fuweid89@gmail.com>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Motivation: For pod-level user namespaces, it's impossible to force the container runtime to join an existing network namespace after creating a new user namespace. According to the capabilities section in [user_namespaces(7)][1], a network namespace created by containerd is owned by the root user namespace. When the container runtime (like runc or crun) creates a new user namespace, it becomes a child of the root user namespace. Processes within this child user namespace are not permitted to access resources owned by the parent user namespace. If the network namespace is not owned by the new user namespace, the container runtime will fail to mount /sys due to the [sysfs: Restrict mounting sysfs][2] patch. Referencing the [cap_capable][3] function in Linux, a process can access a resource if: * The resource is owned by the process's user namespace, and the process has the required capability. * The resource is owned by a child of the process's user namespace, and the owner's user namespace was created by the process's UID. In the context of pod-level user namespaces, the CRI plugin delegates the creation of the network namespace to the container runtime when running the pause container. After the pause container is initialized, the CRI plugin pins the pause container's network namespace into `/run/netns` and then executes the `CNI_ADD` command over it. However, if the pause container is terminated during the pinning process, the CRI plugin might encounter a PID cycle, leading to the `CNI_ADD` command operating on an incorrect network namespace. Moreover, rolling back the `RunPodSandbox` API is complex due to the delegation of network namespace creation. As highlighted in issue containerd#10363, the CRI plugin can lose IP information after a containerd restart, making it challenging to maintain robustness in the RunPodSandbox API. Solution: Allow containerd to create a new user namespace and then create the network namespace within that user namespace. This way, the CRI plugin can force the container runtime to join both the user namespace and the network namespace. Since the network namespace is owned by the newly created user namespace, the container runtime will have the necessary permissions to mount `/sys` on the container's root filesystem. As a result, delegation of network namespace creation is no longer needed. NOTE: * The CRI plugin does not need to pin the newly created user namespace as it does with the network namespace, because the kernel allows retrieving a user namespace reference via [ioctl_ns(2)][4]. As a result, the podsandbox implementation can obtain the user namespace using the `netnsPath` parameter. [1]: <https://man7.org/linux/man-pages/man7/user_namespaces.7.html> [2]: <torvalds/linux@7dc5dbc> [3]: <https://github.com/torvalds/linux/blob/2c85ebc57b3e1817b6ce1a6b703928e113a90442/security/commoncap.c#L65> [4]: <https://man7.org/linux/man-pages/man2/ioctl_ns.2.html> Signed-off-by: Wei Fu <fuweid89@gmail.com>
ping @AkihiroSuda @dmcgowan may I have a approval on this? Thanks |
Motivation:
For pod-level user namespaces, it's impossible to force the container runtime
to join an existing network namespace after creating a new user namespace.
According to the capabilities section in user_namespaces(7), a network
namespace created by containerd is owned by the root user namespace. When the
container runtime (like runc or crun) creates a new user namespace, it becomes
a child of the root user namespace. Processes within this child user namespace
are not permitted to access resources owned by the parent user namespace.
If the network namespace is not owned by the new user namespace, the container
runtime will fail to mount /sys due to the sysfs: Restrict mounting sysfs
patch.
Referencing the cap_capable function in Linux, a process can access a
resource if:
The resource is owned by the process's user namespace, and the process has
the required capability.
The resource is owned by a child of the process's user namespace, and the
owner's user namespace was created by the process's UID.
In the context of pod-level user namespaces, the CRI plugin delegates the
creation of the network namespace to the container runtime when running the
pause container. After the pause container is initialized, the CRI plugin pins
the pause container's network namespace into
/run/netns
and then executesthe
CNI_ADD
command over it.However, if the pause container is terminated during the pinning process, the
CRI plugin might encounter a PID cycle, leading to the
CNI_ADD
commandoperating on an incorrect network namespace.
Moreover, rolling back the
RunPodSandbox
API is complex due to the delegationof network namespace creation. As highlighted in issue #10363, the CRI plugin
can lose IP information after a containerd restart, making it challenging to
maintain robustness in the RunPodSandbox API.
Solution:
Allow containerd to create a new user namespace and then create the network
namespace within that user namespace. This way, the CRI plugin can force the
container runtime to join both the user namespace and the network namespace.
Since the network namespace is owned by the newly created user namespace,
the container runtime will have the necessary permissions to mount
/sys
onthe container's root filesystem. As a result, delegation of network namespace
creation is no longer needed.
NOTE:
The CRI plugin does not need to pin the newly created user namespace as it
does with the network namespace, because the kernel allows retrieving a user
namespace reference via ioctl_ns(2). As a result, the podsandbox
implementation can obtain the user namespace using the
netnsPath
parameter.The
pkg/sys
package continues to use go:linkname to handle fork operationsdue to efficiency, despite being a notable member of hall of shame. If core/mount: use ptrace instead of go:linkname #10611 can work, I will switch it back.
Signed-off-by: Wei Fu fuweid89@gmail.com