@justxuewei justxuewei commented Jul 20, 2025

The first commit implements get_thread_ids() for QEMU so that it returns
the real vCPU thread IDs. This is required because our tests use QEMU.

The second commit makes the shim ignore SIGTERM. When the systemd cgroup
driver and sandbox cgroup only are both enabled, the shim runs inside a
systemd unit. When the unit is stopped, systemd sends SIGTERM to the shim.
The shim can't exit immediately because it still has cleanup work to do,
so it must ignore SIGTERM. The shim should finish that work within the
stop timeout (Kata sets it to 300s by default). If the timeout expires,
systemd sends SIGKILL.

The third commit adds full cgroups support on the host.

Cgroups are managed by FsManager and SystemdManager. As the names
imply, the FsManager manages cgroups through cgroupfs, while the
SystemdManager manages cgroups through systemd. Both managers support
cgroup v1 and cgroup v2.

Two types of cgroup paths are supported:

  1. For colon paths, for example "foo.slice:bar:baz", the runtime manages
    cgroups through SystemdManager;
  2. For relative/absolute paths, the runtime manages cgroups through
    FsManager.
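
The path rule above can be sketched as follows. This is only an illustration, not the code from this PR: `ManagerKind` and `manager_for_path` are hypothetical names, and treating exactly three colon-separated fields as a systemd path is an assumption based on the "foo.slice:bar:baz" example.

```rust
/// Which manager would handle a cgroups path (hypothetical enum for illustration).
#[derive(Debug, PartialEq, Eq)]
enum ManagerKind {
    /// Colon paths such as "foo.slice:bar:baz".
    Systemd,
    /// Relative or absolute filesystem-style paths.
    Fs,
}

/// Hypothetical helper: a path with exactly three colon-separated fields is
/// assumed to be a systemd path; anything else is handled through cgroupfs.
fn manager_for_path(cgroups_path: &str) -> ManagerKind {
    if cgroups_path.split(':').count() == 3 {
        ManagerKind::Systemd
    } else {
        ManagerKind::Fs
    }
}
```

With this sketch, "foo.slice:bar:baz" maps to the systemd variant, while "kubepods/pod1/abc" and "/kubepods/pod1/abc" map to the cgroupfs variant.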

With cgroup v1 + cgroupfs, the vCPU threads are added to the sandbox
cgroup. In the other combinations (cgroup v1 + systemd, cgroup v2 +
cgroupfs, and cgroup v2 + systemd), the VMM process is added to the
sandbox cgroup instead.
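
As a rough sketch of that decision, assuming nothing beyond the four combinations listed above (all type and function names here are hypothetical and not taken from the PR):

```rust
/// Host cgroup hierarchy version (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CgroupVersion {
    V1,
    V2,
}

/// How cgroups are managed (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Driver {
    Cgroupfs,
    Systemd,
}

/// What ends up in the sandbox cgroup.
#[derive(Debug, PartialEq, Eq)]
enum Placement {
    VcpuThreads,
    VmmProcess,
}

/// Only cgroup v1 + cgroupfs places individual vCPU threads; every other
/// combination places the whole VMM process.
fn sandbox_placement(version: CgroupVersion, driver: Driver) -> Placement {
    match (version, driver) {
        (CgroupVersion::V1, Driver::Cgroupfs) => Placement::VcpuThreads,
        _ => Placement::VmmProcess,
    }
}
```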

systemd doesn't provide a way to add a thread to a unit, so add_thread()
in SystemdManager is equivalent to add_process().

Cgroup v2 supports threaded mode. However, threaded mode has to be enabled
iteratively from the leaf node up to the root node (/) [1], which means the
runtime would need to modify cgroups created by the container runtime (e.g.
containerd). Since cgroupfs + cgroup v2 is not a common combination, its
behavior is aligned with systemd + cgroup v2, which does not manage
processes at the thread level.

[1]: https://www.kernel.org/doc/html/v4.18/admin-guide/cgroup-v2.html#threads

Fixes: #11356

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>


justxuewei commented Jul 20, 2025

Testing Results

Starting a pod with 2 CPUs and 4 GiB of memory across all tests.

Systemd + cgroup v2 + overhead cgroup

Pod and cgroups information

  • cgroups path: kubepods-poddc430cd3_34b7_4ea6_be26_ddfd88d7ddff.slice:cri-containerd:8564224395a4fbf7f2792baa22ee16dfc8f8e862e20342d3b2bcc211a3939af6
  • pod id: 8564224395a4fbf7f2792baa22ee16dfc8f8e862e20342d3b2bcc211a3939af6

Show the systemd unit (part of the output)

$ UNIT="cri-containerd-8564224395a4fbf7f2792baa22ee16dfc8f8e862e20342d3b2bcc211a3939af6.scope"
$ systemctl show $UNIT
TimeoutStopUSec=5min
Slice=kubepods-poddc430cd3_34b7_4ea6_be26_ddfd88d7ddff.slice
ControlGroup=/kubepods.slice/kubepods-poddc430cd3_34b7_4ea6_be26_ddfd88d7ddff.slice/cri-containerd-8564224395a4fbf7f2792baa22ee16dfc8f8e862e20342d3b2bcc211a3939af6.scope
Delegate=yes
CPUAccounting=yes
IOAccounting=yes
MemoryAccounting=yes
TasksAccounting=yes
Requires=kubepods-poddc430cd3_34b7_4ea6_be26_ddfd88d7ddff.slice
ActiveState=active
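
The relationship between the colon path, the unit name, and the ControlGroup value shown above can be sketched like this. It is an illustration only: the helper names are hypothetical, it assumes the "slice:prefix:name" layout with a non-empty prefix, and it relies on systemd placing a slice named "a-b.slice" under "a.slice".

```rust
/// Parsed form of a "slice:prefix:name" cgroups path (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct SystemdPath {
    slice: String,
    unit: String,
}

/// Split "slice:prefix:name" and derive the scope unit name "prefix-name.scope".
/// Returns None for anything that does not have three non-empty fields.
fn parse_systemd_path(path: &str) -> Option<SystemdPath> {
    let mut parts = path.split(':');
    let slice = parts.next()?;
    let prefix = parts.next()?;
    let name = parts.next()?;
    if parts.next().is_some() || slice.is_empty() || prefix.is_empty() || name.is_empty() {
        return None;
    }
    Some(SystemdPath {
        slice: slice.to_string(),
        unit: format!("{prefix}-{name}.scope"),
    })
}

/// Expand a slice name into its hierarchy, e.g. "a-b.slice" -> "/a.slice/a-b.slice".
fn expand_slice(slice: &str) -> Option<String> {
    let stem = slice.strip_suffix(".slice")?;
    let mut out = String::new();
    let mut acc = String::new();
    for (i, part) in stem.split('-').enumerate() {
        if i > 0 {
            acc.push('-');
        }
        acc.push_str(part);
        out.push('/');
        out.push_str(&acc);
        out.push_str(".slice");
    }
    Some(out)
}

/// Control group path of the scope unit, relative to the cgroup root.
fn control_group(p: &SystemdPath) -> Option<String> {
    Some(format!("{}/{}", expand_slice(&p.slice)?, p.unit))
}
```

Applied to the cgroups path of this pod, the sketch yields the same unit name and ControlGroup path as the `systemctl show` output.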

Show the cgroup contents (QEMU process is under the sandbox cgroup)

$ systemd-cgls name
Control group /:
├─kata_overhead
│ └─8564224395a4fbf7f2792baa22ee16dfc8f8e862e20342d3b2bcc211a3939af6
│   ├─29743 /home/vagrant/kata-containers/src/runtime-rs/target/x86_64-unknown-linux-musl/debug/containerd-shim-kata-v2 -id 8564224395a4fbf7f2792ba>
│   └─29780 /home/vagrant/kata-static/kata/libexec/virtiofsd --socket-path virtiofsd.sock --shared-dir /run/kata-containers/shared/sandboxes/856422>
└─kubepods.slice
  ├─kubepods-poddc430cd3_34b7_4ea6_be26_ddfd88d7ddff.slice
  │ └─cri-containerd-8564224395a4fbf7f2792baa22ee16dfc8f8e862e20342d3b2bcc211a3939af6.scope
  │   └─29807 /usr/local/bin/qemu-system-x86_64 -name sandbox-8564224395a4fbf7f2792baa22ee16dfc8f8e862e20342d3b2bcc211a3939af6 -kernel /home/vagran>

Show the memory limit of the parent (memory limit = 4Gi)

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-poddc430cd3_34b7_4ea6_be26_ddfd88d7ddff.slice
4294967296

Systemd + cgroup v2 + sandbox cgroup only

Pod and cgroups information

  • cgroups path: kubepods-pod3c311657_556e_48de_92ac_998d761f36b0.slice:cri-containerd:431eaa4675c09cb0fc49746f6d6157e80c5418311bc4bf6ab75af7064a05dabb
  • pod id: 431eaa4675c09cb0fc49746f6d6157e80c5418311bc4bf6ab75af7064a05dabb

The overhead cgroup (kata_overhead/${podid}) does not exist

$ systemd-cgls name | grep kata_overhead | grep -v grep

The sandbox cgroup exists

$ systemd-cgls name
└─kubepods.slice
  ├─kubepods-pod3c311657_556e_48de_92ac_998d761f36b0.slice
  │ └─cri-containerd-431eaa4675c09cb0fc49746f6d6157e80c5418311bc4bf6ab75af7064a05dabb.scope
  │   ├─83236 /home/vagrant/kata-containers/src/runtime-rs/target/x86_64-unknown-linux-musl/debug/containerd-shim-kata-v2 -id 431eaa4675c09cb0fc497>
  │   ├─83257 /home/vagrant/kata-static/kata/libexec/virtiofsd --socket-path virtiofsd.sock --shared-dir /run/kata-containers/shared/sandboxes/431e>
  │   └─83260 /usr/local/bin/qemu-system-x86_64 -name sandbox-431eaa4675c09cb0fc49746f6d6157e80c5418311bc4bf6ab75af7064a05dabb -kernel /home/vagran>

Cgroupfs + cgroup v1 + overhead cgroup

Pod and cgroups information

  • cgroups path: kubepods/pode5352f9d-d513-4f73-80cf-03c11b34b870/bcac3c59c3ec4bdf3fb679be3f139e2f5e6ce94472239e2988db2bb0b5655035
  • pod id: bcac3c59c3ec4bdf3fb679be3f139e2f5e6ce94472239e2988db2bb0b5655035

The overhead cgroup (kata_overhead/${podid}) exists

$ pod=bcac3c59c3ec4bdf3fb679be3f139e2f5e6ce94472239e2988db2bb0b5655035
$ cat /sys/fs/cgroup/memory/kata_overhead/$pod/cgroup.procs
147038
147053
147091
147094
$ sudo ps aux | grep 147038
root      147038 11.0  2.5 706164 422888 ?       Sl   02:29   0:24 /home/vagrant/kata-containers/src/runtime-rs/target/x86_64-unknown-linux-musl/debug/containerd-shim-kata-v2 -id bcac3c59c3ec4bdf3fb679be3f139e2f5e6ce94472239e2988db2bb0b5655035 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /usr/local/bin/containerd -debug
$ sudo ps aux | grep 147053
root      147053  0.0  0.0 6300744 6240 ?        Sl   02:29   0:00 /home/vagrant/kata-static/kata/libexec/virtiofsd --socket-path virtiofsd.sock --shared-dir /run/kata-containers/shared/sandboxes/bcac3c59c3ec4bdf3fb679be3f139e2f5e6ce94472239e2988db2bb0b5655035/ro --cache auto --sandbox none --seccomp none --thread-pool-size=1 -o announce_submounts
$ sudo ps aux | grep 147091
root      147091 36.3  2.2 6931864 364580 ?      Sl   02:29   5:15 /usr/local/bin/qemu-system-x86_64 -name sandbox-bcac3c59c3ec4bdf3fb679be3f139e2f5e6t
$ sudo ps aux | grep 147094
root      147094  0.0  0.0      0     0 ?        S    02:29   0:00 [kvm-nx-lpage-recovery-147091]

The sandbox cgroup exists, and only vCPUs (2 threads) are added into that cgroup.

$ CGROUPS_PATH="kubepods/pode5352f9d-d513-4f73-80cf-03c11b34b870/bcac3c59c3ec4bdf3fb679be3f139e2f5e6ce94472239e2988db2bb0b5655035"
$ cat /sys/fs/cgroup/memory/$CGROUPS_PATH/cgroup.procs
147091
$ cat /sys/fs/cgroup/memory/$CGROUPS_PATH/tasks
147097
147098
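
In cgroup v1, cgroup.procs lists process (thread group) IDs while tasks lists individual thread IDs, which is why the QEMU PID shows up in the former and the two vCPU thread IDs in the latter. A minimal sketch for checking such output programmatically, with the hypothetical helper `parse_ids` taking the file contents as a string:

```rust
/// Parse the contents of a cgroup v1 `cgroup.procs` or `tasks` file:
/// one numeric ID per line. Blank or non-numeric lines are skipped.
fn parse_ids(content: &str) -> Vec<u32> {
    content
        .lines()
        .filter_map(|line| line.trim().parse::<u32>().ok())
        .collect()
}

/// True if the number of listed thread IDs equals the expected vCPU count.
fn matches_vcpu_count(tasks_content: &str, vcpus: usize) -> bool {
    parse_ids(tasks_content).len() == vcpus
}
```

For the 2-CPU pod used in these tests, the tasks output above would be expected to contain exactly two IDs.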

Cgroupfs + cgroup v1 + sandbox cgroup only

Pod and cgroups information

  • cgroups path: kubepods/poda729a9aa-5c8e-483a-8986-e67176150bf7/94e311d6639f1346368216035e5c13a8d4a12fa5671ed9f400c147cc28668340
  • pod id: 94e311d6639f1346368216035e5c13a8d4a12fa5671ed9f400c147cc28668340

The overhead cgroup does not exist

$ pod=94e311d6639f1346368216035e5c13a8d4a12fa5671ed9f400c147cc28668340
$ ls /sys/fs/cgroup/memory/kata_overhead/$pod
ls: cannot access '/sys/fs/cgroup/memory/kata_overhead/94e311d6639f1346368216035e5c13a8d4a12fa5671ed9f400c147cc28668340': No such file or directory

The sandbox cgroup exists

$ CGROUPS_PATH="kubepods/poda729a9aa-5c8e-483a-8986-e67176150bf7/94e311d6639f1346368216035e5c13a8d4a12fa5671ed9f400c147cc28668340"
$ cat /sys/fs/cgroup/memory/$CGROUPS_PATH/cgroup.procs
171635
171659
171666
171669
$ cat /sys/fs/cgroup/memory/$CGROUPS_PATH/../memory.limit_in_bytes
4294967296

@justxuewei

Hey @jepio @fidencio @Champ-Goblem, I need some input here.

@justxuewei justxuewei merged commit 6f6d646 into kata-containers:main Jul 25, 2025
465 of 478 checks passed
justxuewei added a commit to justxuewei/kata-containers that referenced this pull request Aug 4, 2025
This is a follow-up patch to
kata-containers#11598, aimed at
bumping cgroups-rs to 0.4.1, so that the two Rust components share the same
codebase to manage cgroups.

Introduce two new types, `SandboxCgroupManager` and
`ContainerCgroupManager`. The `SandboxCgroupManager` is used to manage
sandbox resources. So far, it supports device cgroups. The
`ContainerCgroupManager` is used to manage container resources. It holds a
copy of the sandbox cgroup manager, so it can update the sandbox if needed.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Successfully merging this pull request may close these issues.

runtime-rs: Support cgroup v2 on host