Edge cases where containerd would fail to re-pull an image from a private GCR repository

**Description**

I'm not sure whether this is considered a bug or working as intended, but I thought it might still be worth reporting.

When containerd (client) pulls an image, it first tries to pull without authentication. If the request fails with a 401 error, containerd retries the request with authentication information. This works fine in normal cases, but I can fabricate a case that fails consistently. For example, if I manually delete an image layer from the content store and then try to re-pull the same image from GCR, the request fails with a 403 error. The reason is that containerd skips fetching the image manifest (which is already present on the node), and that manifest fetch is what it usually relies on to authenticate.

1. containerd resolves the image.
2. containerd tries to fetch the image manifest, but finds it in the content store (no real fetch, no authentication).
3. containerd attempts to fetch the child blobs referenced by the manifest without authentication.
    - GCR redirects (302) the anonymous request to the GCS bucket.
    - GCS rejects the anonymous request and returns 403.
4. containerd aborts the pull operation after seeing the 403.


**Steps to reproduce the issue:**
1. Use `crictl` to pull an image from a private GCR repository.
2. Delete a (non-image-manifest) layer/blob of the pulled image from the content store.
3. Try re-pulling the same image.

**Describe the results you received:**
```
pulling image failed: rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/k8s-authenticated-test/serve-hostname-amd64:1.0": httpReaderSeeker: failed open: unexpected status code https://gcr.io/v2/k8s-authenticated-test/serve-hostname-amd64/blobs/sha256:080afe3806dc31a89e0cb073f487f203c093bcec0229f0c61342891d20f3b5c5: 403 Forbidden 
```
**Describe the results you expected:**

containerd re-pulls the image successfully.

**Output of `containerd --version`:**

```
containerd github.com/containerd/containerd v1.2.0-beta.2 ce243288e27971e324363de8f322d221635a852
```

Deleting an image layer from the content store without deleting the manifest should not happen in normal workflows, so maybe it's okay to ignore this edge case. On the other hand, if we want to tackle it, here are some ideas: 1) try authenticating after receiving a 403, 2) reuse the authentication information from the resolver, 3) ask GCS to return 401 instead, or 4) always re-pull the manifest(?).

This doesn't happen with images that use a Docker schema 1 manifest: containerd always pulls down the schema 1 manifest, for reasons unknown to me.

/cc @Random-Liu 
