
Conversation

katexochen
Contributor

This enables guest pull via config, without the need for any external snapshotter. When the config enables runtime.force_guest_pull, instead of relying on annotations to select the way to share the root FS, we always use guest pull.
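For reference, here is a minimal sketch of what the toggle looks like in the Kata configuration TOML. The key name and placement are taken from the commit message and error log further down in this thread (where it appears as runtime.experimental_force_guest_pull); treat this as an illustration, not the literal shipped default:

[runtime]
# Ignore the rootfs-sharing annotations and always pull the container image inside the guest.
experimental_force_guest_pull = true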

@katacontainersbot katacontainersbot added the size/large label May 8, 2025
@fidencio
Member

fidencio commented May 8, 2025

https://github.com/kata-containers/kata-containers/actions/runs/14906205657/job/41871669346

Let me be more verbose here. I've changed the coco-non-tee test to run with force_guest_pull and tried a run from another branch, and it passed.

Member

@jepio jepio left a comment

LGTM, but I'm not up to speed on guest pull mechanics.

@Apokleos
Contributor

Apokleos commented May 9, 2025

Thx @katexochen @burgerdev for this new feature.
One key question I'd like to clarify: will the container image still be pulled on the host? Or will we have to modify the related container image management logic of containerd?

One more question: if this works well, will we still need the ephemeral-storage CSI as external storage for CoCo, to store the image pulled inside the guest?

@katexochen katexochen force-pushed the p/guest-pull-config branch from 993c5be to 3f0207f on May 9, 2025 05:32
@katexochen
Contributor Author

will the container image still be pulled on the host?

Yes, the container image is still pulled on the host, as it was with nydus. See the discussion in #11041.

Or will we have to modify the related container image management logic of containerd?

I don't understand this question.

will we still need the ephemeral-storage CSI as external storage for CoCo, to store the image pulled inside the guest?

This is an orthogonal problem. On the guest side, this PR does exactly the same as guest pull with nydus-snapshotter.

@mythi
Contributor

mythi commented May 9, 2025

Or will we have to modify the related container image management logic of containerd?

I don't understand this question.

If this is the question I had in mind, it's about what snapshotter configuration to use in containerd after this change.

@katexochen
Contributor Author

Or will we have to modify the related container image management logic of containerd?

I don't understand this question.

If this is the question I had in mind, it's about what snapshotter configuration to use in containerd after this change.

No snapshotter-specific configuration is needed in the containerd config anymore.
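For illustration, a sketch of what the containerd runtime section can then look like, derived from the nydus-based config quoted later in this thread with the snapshotter-specific parts dropped (not the literal shipped config):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-qemu-coco-dev]
runtime_type = "io.containerd.kata-qemu-coco-dev.v2"
runtime_path = "/opt/kata/bin/containerd-shim-kata-v2"
pod_annotations = ["io.katacontainers.*"]
# No snapshotter = "nydus" line, and no [proxy_plugins.nydus] section is required.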

@Apokleos
Contributor

Apokleos commented May 9, 2025

will the container image still be pulled on the host?

Yes, the container image is still pulled on the host, as it was with nydus. See the discussion in #11041.

OK. Currently, the nydus-snapshotter used in CoCo will not pull the image layers onto the host; only the manifest (metadata) is pulled, and the real layers of the container image are pulled inside the guest.
Let me try to understand: does force_guest_pull still need to cooperate with nydus-snapshotter or not? Sorry, I am still not clear on whether it needs nydus-snapshotter.

Or will we have to modify the related container image management logic of containerd?

I don't understand this question.

Sorry, let me be clearer: I want to understand how pulling layers onto the host is prevented without a remote snapshotter.

will we still need the ephemeral-storage CSI as external storage for CoCo, to store the image pulled inside the guest?

This is an orthogonal problem. On the guest side, this PR does exactly the same as guest pull with nydus-snapshotter.

OK, that raises a separate question: whether it is time to drop the ephemeral-storage CSI for the guest-pull case. If this is a good time and place, I will bring it up.

@mythi
Contributor

mythi commented May 9, 2025

OK, that raises a separate question: whether it is time to drop the ephemeral-storage CSI for the guest-pull case. If this is a good time and place, I will bring it up.

You'd want to keep it as long as you do guest pull, so that the image layers are stored on some ephemeral protected disk storage and your CVM RAM (tmpfs) is not used for them.

@Apokleos
Contributor

Apokleos commented May 9, 2025

OK, that raises a separate question: whether it is time to drop the ephemeral-storage CSI for the guest-pull case. If this is a good time and place, I will bring it up.

You'd want to keep it as long as you do guest pull, so that the image layers are stored on some ephemeral protected disk storage and your CVM RAM (tmpfs) is not used for them.

Yes, it acts as the ephemeral protected disk storage. One point, though: it has the same lifetime as the Kata/CoCo pod.

@burgerdev
Contributor

OK. Currently, the nydus-snapshotter used in CoCo will not pull the image layers onto the host; only the manifest (metadata) is pulled, and the real layers of the container image are pulled inside the guest.

This is a conjecture for which I'd like to see evidence.

The hypothesis underlying this PR is that containerd does pull layers on the host, and there is evidence for that. Refer to the reproducer in #11162 (comment). The assumption that containerd does not pull layers on the host does not seem to hold, because:

  1. How does containerd resolve the named user from the image metadata to the UID sent to the Kata shim?
  2. Why is there layer content of the guest-pulled image in /var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots?

Let me try to understand: does force_guest_pull still need to cooperate with nydus-snapshotter or not? Sorry, I am still not clear on whether it needs nydus-snapshotter.

No - the force-guest-pull setup does not need anything related to Nydus. The host uses the default snapshotter, and the guest uses image-rs without modification (which never interacted with Nydus, afaict).

@katexochen katexochen force-pushed the p/guest-pull-config branch from 3f0207f to afc1e50 on May 9, 2025 13:35
Member

@c3d c3d left a comment

Generally looks good to me; I would like the annotation to be documented.

@katexochen katexochen force-pushed the p/guest-pull-config branch from afc1e50 to d1963a7 on May 9, 2025 14:23
@katexochen katexochen requested a review from a team as a code owner May 9, 2025 14:23
@Apokleos
Contributor

Apokleos commented May 9, 2025

OK. Currently, the nydus-snapshotter used in CoCo will not pull the image layers onto the host; only the manifest (metadata) is pulled, and the real layers of the container image are pulled inside the guest.

This is a conjecture for which I'd like to see evidence.

The hypothesis underlying this PR is that containerd does pull layers on the host, and there is evidence for that. Refer to the reproducer in #11162 (comment). The assumption that containerd does not pull layers on the host does not seem to hold, because:

  1. How does containerd resolve the named user from the image metadata to the UID sent to the Kata shim?

Yeah, this seems like a hard problem to me; I have no idea how to address it.

  2. Why is there layer content of the guest-pulled image in /var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots?

Could you please confirm that the configuration is indeed correct?
AFAIK, with the nydus snapshotter in proxy mode, it's expected that no layers are pulled on the host if the configuration is correct.
I have tested this in my environment and I don't find any layer content in the snapshotter paths; results below:

root@pk001:/home/pk001/katadev # tree -L 6 /var/lib/containerd/io.containerd.snapshotter.v1.nydus/
/var/lib/containerd/io.containerd.snapshotter.v1.nydus/
├── cache
├── metadata.db
├── nydus.db
└── snapshots
    ├── 274
    │   ├── fs
    │   └── work
    ├── 315
    │   ├── fs
    │   └── work
    ├── 316
    │   ├── fs
    │   └── work
    │       └── work
    └── 47
        ├── fs
        └── work

16 directories, 2 files
root@pk001:/home/pk001/katadev/NYDUS# 

and the related nydus snapshotter logs:

May 09 22:05:55 tnt001 containerd-nydus-grpc[716237]: time="2025-05-09T22:05:55.310401793+08:00" level=info msg="encode kata volume {\"volume_type\":\"image_guest_pull\",\"source\":\"dummy-image-reference\",\"options\":[\"containerd.io/snapshot/cri.layer-digest=sha256:7d66b83ec869a899bc8364af9c9eb0f1a5ba6907f699ef52f3182e19e2598924\",\"containerd.io/snapshot/nydus-proxy-mode=true\"],\"image_pull\":{\"metadata\":{\"containerd.io/snapshot/cri.layer-digest\":\"sha256:7d66b83ec869a899bc8364af9c9eb0f1a5ba6907f699ef52f3182e19e2598924\",\"containerd.io/snapshot/nydus-proxy-mode\":\"true\"}}}"
May 09 22:05:55 tnt001 containerd-nydus-grpc[716237]: time="2025-05-09T22:05:55.310580156+08:00" level=debug msg="fuse.nydus-overlayfs mount options [workdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/316/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/316/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/274/fs io.katacontainers.volume=eyJ2b2x1bWVfdHlwZSI6ImltYWdlX2d1ZXN0X3B1bGwiLCJzb3VyY2UiOiJkdW1teS1pbWFnZS1yZWZlcmVuY2UiLCJvcHRpb25zIjpbImNvbnRhaW5lcmQuaW8vc25hcHNob3QvY3JpLmxheWVyLWRpZ2VzdD1zaGEyNTY6N2Q2NmI4M2VjODY5YTg5OWJjODM2NGFmOWM5ZWIwZjFhNWJhNjkwN2Y2OTllZjUyZjMxODJlMTllMjU5ODkyNCIsImNvbnRhaW5lcmQuaW8vc25hcHNob3QvbnlkdXMtcHJveHktbW9kZT10cnVlIl0sImltYWdlX3B1bGwiOnsibWV0YWRhdGEiOnsiY29udGFpbmVyZC5pby9zbmFwc2hvdC9jcmkubGF5ZXItZGlnZXN0Ijoic2hhMjU2OjdkNjZiODNlYzg2OWE4OTliYzgzNjRhZjljOWViMGYxYTViYTY5MDdmNjk5ZWY1MmYzMTgyZTE5ZTI1OTg5MjQiLCJjb250YWluZXJkLmlvL3NuYXBzaG90L255ZHVzLXByb3h5LW1vZGUiOiJ0cnVlIn19fQ==]"

Let me try to understand: does force_guest_pull still need to cooperate with nydus-snapshotter or not? Sorry, I am still not clear on whether it needs nydus-snapshotter.

No - the force-guest-pull setup does not need anything related to Nydus. The host uses the default snapshotter, and the guest uses image-rs without modification (which never interacted with Nydus, afaict).

In guest-pull mode, container images are pulled inside the guest, not on the host. Without a remote snapshotter to prevent image pulls on the host, I'm still confused how this will work. Is there anything I'm missing?

@fitzthum
Contributor

fitzthum commented May 9, 2025

One thing I have heard is that the nydus approach will pull the entire image on the host unless the annotation io.containerd.cri.runtime-handler is set. If it is set, only the manifest is pulled.

@burgerdev
Contributor

Thanks for the feedback @Apokleos, I think we're starting to get to the root of the issue here. I just ran the deployment script again (linked above), and I can reproduce my observations.

Layer content
root@aks-nodepool1-15520615-vmss000000:/var/lib/containerd/io.containerd.snapshotter.v1.nydus# tree -d
.
|-- cache
|-- logs
`-- snapshots
    |-- 1
    |   |-- fs
    |   `-- work
    |       `-- work
    |-- 10
    |   |-- fs
    |   |   `-- usr
    |   |       `-- bin
    |   `-- work
    |       `-- work
    |-- 11
    |   |-- fs
    |   |   `-- usr
    |   |       `-- bin
    |   `-- work
    |       `-- work
    |-- 12
    |   |-- fs
    |   |   |-- data
    |   |   |-- etc
    |   |   |   `-- redis
    |   |   |-- node-conf
    |   |   `-- run
    |   `-- work
    |       `-- work
    |-- 13
    |   |-- fs
    |   `-- work
    |       `-- work
    |-- 14
    |   |-- fs
    |   `-- work
    |       `-- work
    |-- 2
    |   |-- fs
    |   `-- work
    |       `-- work
    |-- 3
    |   |-- fs
    |   |   |-- bin
    |   |   |-- dev
    |   |   |-- etc
    |   |   |   |-- apk
    |   |   |   |   |-- keys
    |   |   |   |   `-- protected_paths.d
    |   |   |   |-- busybox-paths.d
    |   |   |   |-- conf.d
    |   |   |   |-- crontabs
    |   |   |   |-- init.d
    |   |   |   |-- logrotate.d
    |   |   |   |-- modprobe.d
    |   |   |   |-- modules-load.d
    |   |   |   |-- network
    |   |   |   |   |-- if-down.d
    |   |   |   |   |-- if-post-down.d
    |   |   |   |   |-- if-post-up.d
    |   |   |   |   |-- if-pre-down.d
    |   |   |   |   |-- if-pre-up.d
    |   |   |   |   `-- if-up.d
    |   |   |   |-- opt
    |   |   |   |-- periodic
    |   |   |   |   |-- 15min
    |   |   |   |   |-- daily
    |   |   |   |   |-- hourly
    |   |   |   |   |-- monthly
    |   |   |   |   `-- weekly
    |   |   |   |-- profile.d
    |   |   |   |-- secfixes.d
    |   |   |   |-- ssl
    |   |   |   |   |-- certs
    |   |   |   |   |-- misc
    |   |   |   |   `-- private
    |   |   |   |-- ssl1.1
    |   |   |   |   `-- certs -> /etc/ssl/certs
[...]
215 directories

Could you please confirm that the configuration is indeed correct?

To be honest, I don't know - this is what the gha-run.sh deploys on AKS. How did you set up your environment? Did you run a container before listing the snapshot dir? It might help to compare configurations.

`ctr version`
Client:
  Version:  1.7.27-1
  Revision: 05044ec0a9a75232cad458027ca83437aae3f4da
  Go version: go1.22.11

Server:
  Version:  1.7.27-1
  Revision: 05044ec0a9a75232cad458027ca83437aae3f4da
  UUID: 7e0ad939-ed03-41e1-b76a-52aeba3f3c7d
`/etc/containerd/config.toml`
version = 2
oom_score = -999

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "mcr.microsoft.com/oss/kubernetes/pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
discard_unpacked_layers = false
disable_snapshot_annotations = false
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = "/usr/bin/runc"
SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.untrusted]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.untrusted.options]
BinaryName = "/usr/bin/runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-qemu-coco-dev]
runtime_type = "io.containerd.kata-qemu-coco-dev.v2"
runtime_path = "/opt/kata/bin/containerd-shim-kata-v2"
privileged_without_host_devices = true
pod_annotations = ["io.katacontainers.*"]
snapshotter = "nydus"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata-qemu-coco-dev.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration-qemu-coco-dev.toml"

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"

[plugins."io.containerd.grpc.v1.cri".registry.headers]
X-Meta-Source-Client = ["azure/aks"]

[metrics]
address = "0.0.0.0:10257"

[proxy_plugins.nydus]
type = "snapshot"
address = "/run/containerd-nydus/containerd-nydus-grpc.sock"

[debug]
level = "debug"
`/etc/nydus/config.toml`
version = 1
# Snapshotter's own home directory where it stores and creates necessary resources
root = "/var/lib/containerd/io.containerd.snapshotter.v1.nydus"
# The snapshotter's GRPC server socket, containerd will connect to plugin on this socket
address = "/run/containerd-nydus/containerd-nydus-grpc.sock"
# The nydus daemon mode can be one of the following options: multiple, dedicated, shared, or none. 
# If `daemon_mode` option is not specified, the default value is multiple.
daemon_mode = "none"
# Whether snapshotter should try to clean up resources when it is closed
cleanup_on_close = false

[system]
# Snapshotter's debug and trace HTTP server interface
enable = true
# Unix domain socket path where system controller is listening on
address = "/run/containerd-nydus/system.sock"

[system.debug]
# Snapshotter can profile the CPU utilization of each nydusd daemon when it is being started.
# This option specifies the profile duration when nydusd is downloading and uncompressing data.
daemon_cpu_profile_duration_secs = 5
# Enable by assigning an address, empty indicates pprof server is disabled
pprof_address = ""

[daemon]
# Specify a configuration file for nydusd
nydusd_config = "/etc/nydus/nydusd-config.fusedev.json"
nydusd_path = "/usr/local/bin/nydusd"
nydusimage_path = "/usr/local/bin/nydus-image"
# The fs driver can be one of the following options: fusedev, fscache, blockdev, proxy, or nodev. 
# If `fs_driver` option is not specified, the default value is fusedev.
fs_driver = "proxy"
# How to process when daemon dies: "none", "restart" or "failover"
recover_policy = "restart"
# Nydusd worker thread number to handle FUSE or fscache requests, [0-1024].
# Setting to 0 will use the default configuration of nydusd.
threads_number = 4
# Log rotation size for nydusd, in unit MB(megabytes). (default 100MB)
log_rotation_size = 100

[cgroup]
# Whether to use separate cgroup for nydusd.
enable = true
# The memory limit for nydusd cgroup, which contains all nydusd processes.
# Percentage is supported as well, please ensure it is end with "%".
# The default unit is bytes. Acceptable values include "209715200", "200MiB", "200Mi" and "10%".
memory_limit = ""

[log]
# Print logs to stdout rather than logging files
log_to_stdout = false
# Snapshotter's log level
level = "info"
log_rotation_compress = true
log_rotation_local_time = true
# Max number of days to retain logs
log_rotation_max_age = 7
log_rotation_max_backups = 5
# In unit MB(megabytes)
log_rotation_max_size = 100

[metrics]
# Enable by assigning an address, empty indicates metrics server is disabled
address = ":9110"

[remote]
convert_vpc_registry = false

[remote.mirrors_config]
# Snapshotter will overwrite daemon's mirrors configuration
# if the values loaded from this directory are not null before starting a daemon.
# Set to "" or an empty directory to disable it.
#dir = "/etc/nydus/certs.d"

[remote.auth]
# Fetch the private registry auth by listening to K8s API server
enable_kubeconfig_keychain = false
# synchronize `kubernetes.io/dockerconfigjson` secret from kubernetes API server with specified kubeconfig (default `$KUBECONFIG` or `~/.kube/config`)
kubeconfig_path = ""
# Fetch the private registry auth as CRI image service proxy
enable_cri_keychain = false
# the target image service when using image proxy
#image_service_address = "/run/containerd/containerd.sock"

[snapshot]
# Let containerd use nydus-overlayfs mount helper
enable_nydus_overlayfs = false
# Insert Kata Virtual Volume option to `Mount.Options`
enable_kata_volume = true
# Whether to remove resources when a snapshot is removed
sync_remove = false

[cache_manager]
# Disable or enable recyclebin
disable = false
# How long to keep deleted files in recyclebin
gc_period = "24h"
# Directory to host cached files
cache_dir = ""

[image]
public_key_file = ""
validate_signature = false

# The configurations for features that are not production ready
[experimental]
# Whether to enable stargz support
enable_stargz = false
# Whether to enable referrers support
# The option enables trying to fetch the Nydus image associated with the OCI image and run it.
# Also see https://github.com/opencontainers/distribution-spec/blob/main/spec.md#listing-referrers
enable_referrer_detect = false
# Whether to enable authentication support
# The option enables nydus snapshot to provide backend information to nydusd.
enable_backend_source = false
[experimental.tarfs]
# Whether to enable nydus tarfs mode. Tarfs is supported by:
# - The EROFS filesystem driver since Linux 6.4
# - Nydus Image Service release v2.3
enable_tarfs = false
# Mount rafs on host by loopdev and EROFS
mount_tarfs_on_host = false
# Only enable nydus tarfs mode for images with `tarfs hint` label when true
tarfs_hint = false
# Maximum of concurrence to converting OCIv1 images to tarfs, 0 means default
max_concurrent_proc = 0
# Mode to export tarfs images:
# - "none" or "": do not export tarfs
# - "layer_verity_only": only generate disk verity information for a layer blob
# - "image_verity_only": only generate disk verity information for all blobs of an image
# - "layer_block": generate a raw block disk image with tarfs for a layer
# - "image_block": generate a raw block disk image with tarfs for an image
# - "layer_block_with_verity": generate a raw block disk image with tarfs for a layer with dm-verity info
# - "image_block_with_verity": generate a raw block disk image with tarfs for an image with dm-verity info
export_mode = ""

In guest-pull mode, container images are pulled inside the guest, not on the host. Without a remote snapshotter to prevent image pulls on the host, I'm still confused how this will work. Is there anything I'm missing?

Sorry, the reasoning could have been clearer. Let me try to explain:

  1. We observed that Nydus does pull layers on the host, in the configuration outlined above as well as in our downstream product.
  2. If Nydus is pulling on the host anyway, we are not making matters worse by pulling on the host with the default snapshotter. To answer your question: this PR does not try to prevent pulling on the host.
  3. Since the rest of the snapshotter is only static config, we might as well add it based on a Kata runtime flag instead of a magic mount option from the snapshotter.

This argument hinges on (2): if there is a configuration where Nydus does not pull layer content on the host, we do indeed have a trade-off. Personally, I am OK with pulling both on the host and in the guest, because (a) the host pull only happens once per node and is cached afterwards, and (b) I prefer wasting network bandwidth to dealing with the stability issues caused by Nydus. Others' views may differ.


@fitzthum:

One thing I have heard is that the nydus approach will pull the entire image on the host unless the annotation io.containerd.cri.runtime-handler is set. If it is set, only the manifest is pulled.

That is interesting; I wonder why. At least it does not seem to be a direct effect, because then I would have expected a reference to that annotation in the Nydus snapshotter repo, which I did not find.

@fitzthum
Contributor

fitzthum commented May 9, 2025

Personally, I think it is reasonable to support something like this in the short term, given all the issues that actual users have with Nydus. Hopefully we can also come to really understand the optimal solution.

@csegarragonz might remember something more about the behavior of the io.containerd.cri.runtime-handler with the snapshotter.

@Apokleos
Contributor

Thanks for the feedback @Apokleos, I think we're starting to get to the root of the issue here. I just ran the deployment script again (linked above), and I can reproduce my observations.

[...]

Hi @burgerdev, I have reproduced the issue you see, with both your and my nydus snapshotter configurations.

  • If I apply a pod without io.containerd.cri.runtime-handler, the container image content is pulled on the host; as you observed, the nydus snapshots/{fs, work} directories have content inside them.
root@pk001:/home/pk001/katadev/NYDUS# cat nydus-test.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: snap-03
  #annotations:
  #  "io.containerd.cri.runtime-handler": "kata"
spec:
  runtimeClassName: kata
  containers:
    - name: snapshotter-co3
      image: debian:10.11
      imagePullPolicy: Always
      volumeMounts:
      command: [ "sleep", "1000000" ]
        #resources:
        #  limits:
        #    cpu: "4"
        #    memory: 1Gi
root@pk001:/home/pk001/katadev/NYDUS#  kubectl apply -f nydus-test.yaml 
pod/snap-03 created
root@pk001:/home/pk001/katadev/NYDUS#  watch kubectl get po 
root@pk001:/home/pk001/katadev/NYDUS#  tree -L 3 /var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots
/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots
├── 11
│   ├── fs
│   │   └── pause
│   └── work
│       └── work
├── 33
│   ├── fs
│   └── work
├── 34
│   ├── fs
│   │   ├── bin
│   │   ├── boot
│   │   ├── dev
│   │   ├── etc
│   │   ├── home
│   │   ├── lib
│   │   ├── lib64
│   │   ├── media
│   │   ├── mnt
│   │   ├── opt
│   │   ├── proc
│   │   ├── root
│   │   ├── run
│   │   ├── sbin
│   │   ├── srv
│   │   ├── sys
│   │   ├── tmp
│   │   ├── usr
│   │   └── var
│   └── work
│       └── work
└── 35
    ├── fs
    └── work
        └── work

35 directories, 1 file
root@pk001:/home/pk001/katadev/NYDUS#  kubectl delete -f nydus-test.yaml 
root@pk001:/home/pk001/katadev/NYDUS# crictl rmi debian:10.11
Deleted: xxx/debian:10.11
root@pk001:/home/pk001/katadev/NYDUS# tree -L 3 /var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots
/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots
├── 11
│   ├── fs
│   │   └── pause
│   └── work
│       └── work
└── 33
    ├── fs
    └── work

May 11 23:07:21 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:07:21.372597077+08:00" level=info msg="proxy mount options [workdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/29/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/29/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/11/fs]"
May 11 23:07:21 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:07:21.372800511+08:00" level=info msg="encode kata volume {\"volume_type\":\"image_guest_pull\",\"source\":\"dummy-image-reference\",\"image_pull\":{\"metadata\":{}}}"
May 11 23:07:21 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:07:21.372839581+08:00" level=debug msg="fuse.nydus-overlayfs mount options [workdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/29/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/29/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/11/fs io.katacontainers.volume=eyJ2b2x1bWVfdHlwZSI6ImltYWdlX2d1ZXN0X3B1bGwiLCJzb3VyY2UiOiJkdW1teS1pbWFnZS1yZWZlcmVuY2UiLCJpbWFnZV9wdWxsIjp7Im1ldGFkYXRhIjp7fX19]"
  • If I apply a pod with the annotation io.containerd.cri.runtime-handler: kata added, the container image is not pulled on the host.
root@pk001:/home/pk001/katadev/NYDUS# kubectl apply -f nydus-test.yaml 
pod/snap-03 created
root@pk001:/home/pk001/katadev/NYDUS# watch kubectl get po 
root@pk001:/home/pk001/katadev/NYDUS# tree -L 3 /var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots
/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots
├── 11
│   ├── fs
│   │   └── pause
│   └── work
│       └── work
├── 36
│   ├── fs
│   └── work
├── 37
│   ├── fs
│   └── work
└── 38
    ├── fs
    └── work
        └── work

15 directories, 1 file
root@pk001:/home/pk001/katadev/NYDUS# cat nydus-test.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: snap-03
  annotations:
    "io.containerd.cri.runtime-handler": "kata"
spec:
  runtimeClassName: kata
  containers:
    - name: snapshotter-co3
      image: debian:10.11
      imagePullPolicy: Always
      volumeMounts:
      command: [ "sleep", "1000000" ]
        #resources:
        #  limits:
        #    cpu: "4"
        #    memory: 1Gi

May 11 23:11:30 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:11:30.259569640+08:00" level=debug msg="Prepare remote snapshot 37" key=k8s.io/45/66b079d3bb3c48b7831b20ac083bb12553c6c857f5525a6fef8663c4238fe1c5 parent="sha256:3f65e6373268ca82c99e8f5b5b8c1da353293c31f986c06e3344c76bf00eb24a"
May 11 23:11:30 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:11:30.270581924+08:00" level=info msg="Nydus remote snapshot 37 is ready" key=k8s.io/45/66b079d3bb3c48b7831b20ac083bb12553c6c857f5525a6fef8663c4238fe1c5 parent="sha256:3f65e6373268ca82c99e8f5b5b8c1da353293c31f986c06e3344c76bf00eb24a"
May 11 23:11:30 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:11:30.270646992+08:00" level=info msg="remote mount options [workdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/38/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/38/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/37/fs]"
May 11 23:11:30 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:11:30.270699841+08:00" level=info msg="encode kata volume {\"volume_type\":\"image_guest_pull\",\"source\":\"dummy-image-reference\",\"options\":[\"containerd.io/snapshot/cri.layer-digest=sha256:7d66b83ec869a899bc8364af9c9eb0f1a5ba6907f699ef52f3182e19e2598924\",\"containerd.io/snapshot/nydus-proxy-mode=true\"],\"image_pull\":{\"metadata\":{\"containerd.io/snapshot/cri.layer-digest\":\"sha256:7d66b83ec869a899bc8364af9c9eb0f1a5ba6907f699ef52f3182e19e2598924\",\"containerd.io/snapshot/nydus-proxy-mode\":\"true\"}}}"
May 11 23:11:30 tnt001 containerd-nydus-grpc[1714739]: time="2025-05-11T23:11:30.270749458+08:00" level=debug msg="fuse.nydus-overlayfs mount options [workdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/38/work upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/38/fs lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.nydus/snapshots/37/fs io.katacontainers.volume=eyJ2b2x1bWVfdHlwZSI6ImltYWdlX2d1ZXN0X3B1bGwiLCJzb3VyY2UiOiJkdW1teS1pbWFnZS1yZWZlcmVuY2UiLCJvcHRpb25zIjpbImNvbnRhaW5lcmQuaW8vc25hcHNob3QvY3JpLmxheWVyLWRpZ2VzdD1zaGEyNTY6N2Q2NmI4M2VjODY5YTg5OWJjODM2NGFmOWM5ZWIwZjFhNWJhNjkwN2Y2OTllZjUyZjMxODJlMTllMjU5ODkyNCIsImNvbnRhaW5lcmQuaW8vc25hcHNob3QvbnlkdXMtcHJveHktbW9kZT10cnVlIl0sImltYWdlX3B1bGwiOnsibWV0YWRhdGEiOnsiY29udGFpbmVyZC5pby9zbmFwc2hvdC9jcmkubGF5ZXItZGlnZXN0Ijoic2hhMjU2OjdkNjZiODNlYzg2OWE4OTliYzgzNjRhZjljOWViMGYxYTViYTY5MDdmNjk5ZWY1MmYzMTgyZTE5ZTI1OTg5MjQiLCJjb250YWluZXJkLmlvL3NuYXBzaG90L255ZHVzLXByb3h5LW1vZGUiOiJ0cnVlIn19fQ==]"

This also matches what Tobin @fitzthum said.

@csegarragonz
Contributor

csegarragonz commented May 11, 2025

Yes, what @fitzthum and @Apokleos mention confirms what I observed.

In summary, even if nydus is set as the snapshotter for a given runtime class, containerd may first PullImage with the overlayfs snapshotter unless the experimental annotation is set. This results in the image being pulled on the host and then in the guest.

If you enable the experimental annotation, containerd will honour it here.

As a rule of thumb, you should always see experimental: PullImage in your containerd logs when using the nydus-snapshotter.
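One quick way to check for that log line, assuming containerd on the node logs to the systemd journal (hostname and prompt are illustrative):

root@node:~# journalctl -u containerd | grep "experimental: PullImage"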

@Apokleos
Contributor

OK. Currently, the nydus-snapshotter used in CoCo will not pull the image layers onto the host; only the manifest (metadata) is pulled, and the real layers of the container image are pulled inside the guest.

This is a conjecture for which I'd like to see evidence.
The hypothesis underlying this PR is that containerd does pull layers on the host, and there is evidence for that. Refer to the reproducer in #11162 (comment). The assumption that containerd does not pull layers on the host does not seem to hold, because:

  1. How does containerd resolve the named user from the image metadata to the UID sent to the Kata shim?

Yeah, this seems like a hard problem to me; I have no idea how to address it.

The reason I said that resolving the named user is hard is that the relevant processing flow is deeply intertwined with containerd. This is why I previously speculated that we might need to modify containerd's logic.

In the previous issue #11162, @Camelron, @imeoer, and @burgerdev discussed temp mounts, which involve the flow WithUser -> mount.WithReadonlyTempMount -> WithTempMount -> UserFromPath(xxxx) for handling /etc/passwd. However, this flow relies on the prerequisite that the image has been pulled on the host and can be unpacked into the snapshotter's paths, such as {fs, work}. These unpacked contents are then temporarily mounted, and finally /etc/passwd is read from there.

However, with the nydus snapshotter for guest pull, we explicitly disallow pulling image layers on the host. This implies that the image cannot be unpacked into the fs and work paths, nor temporarily mounted, consequently making it impossible to read /etc/passwd.

Maybe some workaround is needed here, for example explicitly specifying the username and UID in the Dockerfile, but I am not sure whether there are side effects. See the sketch below.
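As a sketch of that workaround (a generic Dockerfile, not from this PR): specifying a numeric UID:GID in USER avoids any /etc/passwd lookup at container start, since no name resolution is needed:

FROM debian:10.11
# A numeric UID:GID needs no /etc/passwd resolution on the host.
USER 1000:1000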


@katexochen
Contributor Author

Okay, so to summarize the discussion:

guest-pull with nydus-snapshotter:

  • Doesn't pull the image on the host if correctly configured
  • Has issues with stability
  • Has issues with correctness (/etc/passwd not present)

force_guest_pull:

  • Doesn't require nydus-snapshotter
  • Image is pulled on the host, by containerd
  • Requires additional bandwidth/storage on the host
  • Better stability as nydus isn't used
  • Better correctness, as containerd can use the pulled image to derive users etc.
  • Is fully optional

As I understand it, there isn't really any argument against adding force_guest_pull as an option a user can enable. Switching guest pull to this mechanism by default is another discussion that doesn't need to happen on this PR.

@Apokleos
Contributor

Apokleos commented May 14, 2025

Okay, so to summarize the discussion:

...

  • Has issues with correctness (/etc/passwd not present)

Nice summary of the discussion!
I am now addressing this issue, and I have created a PR that implements in-guest correction of process user information for guest pull; PTAL, and I'm looking forward to your feedback.
Thx a lot. @katexochen @burgerdev


@katexochen katexochen force-pushed the p/guest-pull-config branch 2 times, most recently from 359f121 to 059e889 on May 15, 2025 08:44
Member

@fidencio fidencio left a comment

LGTM, but I'd wait to push this until after the release is done.
Thanks @katexochen!

Contributor

@burgerdev burgerdev left a comment

LGTM, thanks!

@fitzthum
Contributor

After our discussion in the CoCo meeting, I think it makes sense to support this at least until containerd supports multiple snapshotters (and we default to containerd 2.0). Until that point, it seems like whatever snapshotter we use will potentially run into synchronization issues.

@burgerdev
Contributor

I think we established that the force-guest-pull approach can be useful for users struggling with snapshotters, and it's an experimental API that we can remove once we have found the right™ way to do guest pulling. Now that 3.17 is released, are there any objections to merging this?

Contributor

@Apokleos Apokleos left a comment

LGTM!
There isn't a perfect existing solution, but I favor trying different approaches while simultaneously exploring new ones. Let's move it forward.

@fidencio
Member

Now that 3.17 is released, are there any objections to merging this?

Please, do!
Some checks were not passing, which may require a rebase though.

Once it's green, merge it in.

@katexochen katexochen force-pushed the p/guest-pull-config branch from 059e889 to be0c593 on May 27, 2025 07:29
@fidencio
Member

@katexochen, from the logs I can see:

#   Warning  FailedCreatePodSandBox  10s (x7 over 90s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: toml: line 698 (last key "runtime.experimental_force_guest_pull"): expected value but found '@' instead: unknown

This enables guest pull via config, without the need of any external
snapshotter. When the config enables runtime.experimental_force_guest_pull, instead of
relying on annotations to select the way to share the root FS, we always
use guest pull.

Co-authored-by: Markus Rudy <mr@edgeless.systems>
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
@katexochen katexochen force-pushed the p/guest-pull-config branch from be0c593 to c4815eb on May 27, 2025 10:42
@katexochen
Contributor Author

katexochen commented May 27, 2025

@katexochen, from the logs I can see:

#   Warning  FailedCreatePodSandBox  10s (x7 over 90s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: toml: line 698 (last key "runtime.experimental_force_guest_pull"): expected value but found '@' instead: unknown
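For context, that TOML error is what an unsubstituted build-template placeholder looks like once it lands in the generated config, i.e. roughly this line (a reconstruction, not the actual file contents):

experimental_force_guest_pull = @DEFFORCEGUESTPULL@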

USER_VARS += DEFFORCEGUESTPULL was missing in the Makefile; USER_VARS is used for the variable substitution. Should be fixed now, @fidencio.

@fidencio fidencio merged commit ac934e0 into kata-containers:main May 27, 2025
494 of 532 checks passed
@katexochen katexochen deleted the p/guest-pull-config branch May 27, 2025 14:00