Description
I'm not sure whether this is considered a bug or working intended, but thought it might still be worth reporting.
When containerd (as a client) pulls an image, it first tries the request without authentication. If the request fails with a 401 error, containerd retries it with authentication information. This works fine in normal cases, but I can fabricate a case that fails consistently. For example, if I manually delete an image layer from the content store and then try to re-pull the same image from GCR, the request fails with a 403 error. The reason is that containerd skips fetching the image manifest (which is already present on the node), and the manifest fetch is the request it usually relies on to authenticate. The sequence looks like this:
- containerd resolves the image.
- containerd tries to fetch the image manifest but finds it in the content store (no real fetch, so no authentication).
- containerd attempts to fetch the child digests listed in the manifest without authentication.
- GCR redirects (302) the anonymous request to the GCS bucket.
- GCS rejects the anonymous request and returns 403.
- containerd aborts the pull operation after seeing 403.
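The sequence above can be mimicked with a toy model that involves no real network traffic. This is only a sketch of the behavior described in this report, not containerd's code: `fake_registry`, `fetch`, `pull`, the paths, and the status codes are all assumptions made for illustration, and the resolve step is left out.

```python
# Toy model of the flow above; no real network traffic is involved.
# fake_registry and its status codes are assumptions for illustration.

def fake_registry(path, authenticated):
    """Return the final HTTP status a client would see for a request."""
    if path.startswith("/manifests/"):
        # The registry answers directly and challenges anonymous clients.
        return 200 if authenticated else 401
    # Blob requests are redirected (302) to the storage bucket,
    # which answers anonymous requests with 403 rather than 401.
    return 200 if authenticated else 403


def fetch(path, state):
    """Try with the current auth state; add credentials only after a 401."""
    status = fake_registry(path, state["authenticated"])
    if status == 401:
        state["authenticated"] = True
        status = fake_registry(path, state["authenticated"])
    return status


def pull(manifest_cached):
    state = {"authenticated": False}
    if not manifest_cached:
        # Fetching the manifest triggers the 401 challenge and authenticates.
        if fetch("/manifests/latest", state) != 200:
            return "manifest fetch failed"
    status = fetch("/blobs/sha256:layer", state)
    return "ok" if status == 200 else f"aborted with {status}"


print(pull(manifest_cached=False))  # ok
print(pull(manifest_cached=True))   # aborted with 403
```

With the manifest cached, the only request sent is the blob request, so the 401 challenge that would normally trigger authentication never occurs.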
Steps to reproduce the issue:
- Use crictl to pull an image from a private GCR repository.
- Delete a (non-image-manifest) layer/blob of the pulled image from the content store.
- Try re-pulling the same image.
Describe the results you received:
pulling image failed: rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/k8s-authenticated-test/serve-hostname-amd64:1.0": httpReaderSeeker: failed open: unexpected status code https://gcr.io/v2/k8s-authenticated-test/serve-hostname-amd64/blobs/sha256:080afe3806dc31a89e0cb073f487f203c093bcec0229f0c61342891d20f3b5c5: 403 Forbidden
Describe the results you expected:
containerd re-pulls the image successfully.
Output of containerd --version:
containerd github.com/containerd/containerd v1.2.0-beta.2 ce243288e27971e324363de8f322d221635a852
Deleting an image layer from the content store without also deleting the manifest should not happen in normal workflows, so maybe it's okay to ignore this edge case. On the other hand, if we want to tackle it, here are some random ideas: 1) try authenticating after receiving a 403, 2) reuse the authentication information from the resolver, 3) ask GCS to return 401 instead, or 4) always re-pull the manifest(?).
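To make idea 1 concrete, here is a minimal sketch of a retry policy that treats 403 the same way as 401. It is hypothetical and not containerd's implementation: `fetch_with_retry`, `gcs_backed_blob`, and the `"token"` credential are made-up names used only to illustrate the idea.

```python
# Hypothetical retry policy for idea 1; not containerd's actual code.
RETRY_WITH_AUTH = {401, 403}


def fetch_with_retry(request, credentials):
    """Call request(None) anonymously, then once more with credentials
    if the anonymous attempt returns 401 or 403."""
    status = request(None)
    if status in RETRY_WITH_AUTH and credentials is not None:
        status = request(credentials)
    return status


def gcs_backed_blob(credentials):
    # Stand-in for a blob endpoint that redirects to a bucket
    # which rejects anonymous requests with 403.
    return 200 if credentials == "token" else 403


print(fetch_with_retry(gcs_backed_blob, "token"))  # 200
print(fetch_with_retry(gcs_backed_blob, None))     # 403
```

The trade-off in this sketch is that a genuine 403 (valid credentials but no permission) costs one extra request before the pull fails.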
This doesn't happen with images that use a Docker schema 1 manifest. containerd always tries to pull down the schema 1 manifest, for reasons unknown to me.
/cc @Random-Liu