cri:fix containerd panic when can't find sandbox extension #11576
Conversation
Hi @ningmingxiao. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 20abca1 to ad9d389
Force-pushed from ad9d389 to 90c144b
Force-pushed from 7d11626 to abe6dfe
@@ -106,6 +107,14 @@ func (c *criService) recover(ctx context.Context) error {
	metadata := sandboxstore.Metadata{}
	err := sbx.GetExtension(podsandbox.MetadataKey, &metadata)
	if err != nil {
		if errors.Is(err, errdefs.ErrNotFound) {
			err = c.client.SandboxStore().Delete(ctx, sbx.ID)
Can we add an error log here? Something like "could not recover sandbox X: missing metadata key"
Nod, we need to log the error message that is returned below ("failed to get metadata for stored sandbox %q: %w", sbx.ID, err) in this ErrNotFound case too, so we always get that output when we run into it.
see comments
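For context, here is a minimal, self-contained sketch of the handling described by the diff and the review comments above: if the metadata extension for a stored sandbox cannot be found, log the failure and delete the stale record instead of aborting recovery. The store type and helper names below are hypothetical stand-ins, not containerd's actual sandbox store API; only the control flow mirrors the patch.

```go
// Illustrative sketch only: "store" is a hypothetical stand-in for
// containerd's sandbox store; errNotFound stands in for errdefs.ErrNotFound.
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

var errNotFound = errors.New("not found")

type store struct {
	// extensions maps sandbox ID -> serialized metadata extension.
	extensions map[string][]byte
}

// getExtension models sbx.GetExtension(podsandbox.MetadataKey, &metadata).
func (s *store) getExtension(id string) ([]byte, error) {
	ext, ok := s.extensions[id]
	if !ok {
		return nil, fmt.Errorf("metadata extension for %q: %w", id, errNotFound)
	}
	return ext, nil
}

// deleteRecord models c.client.SandboxStore().Delete(ctx, sbx.ID).
func (s *store) deleteRecord(_ context.Context, id string) error {
	delete(s.extensions, id)
	return nil
}

// recoverSandbox mirrors the shape of the fix: a missing metadata extension is
// logged (as requested in review) and the stale record is removed, while any
// other error still fails recovery.
func recoverSandbox(ctx context.Context, s *store, id string) error {
	if _, err := s.getExtension(id); err != nil {
		if errors.Is(err, errNotFound) {
			log.Printf("failed to get metadata for stored sandbox %q: %v; removing stale record", id, err)
			return s.deleteRecord(ctx, id)
		}
		return fmt.Errorf("failed to get metadata for stored sandbox %q: %w", id, err)
	}
	return nil
}

func main() {
	// A sandbox record exists but its metadata extension was never written,
	// e.g. containerd stopped between the two store writes.
	s := &store{extensions: map[string][]byte{}}
	if err := recoverSandbox(context.Background(), s, "sandbox-1"); err != nil {
		log.Fatal(err)
	}
}
```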
Is there some operation that needs to be done in a transaction or something? Ideally it wouldn't be possible to introduce an inconsistency between stores that later needs to be cleaned up, just by stopping containerd at the wrong time. Feels like all these error-handling PRs are just to clean up after a problem that shouldn't be occurring in the first place?
We don't have transaction support across two stores. Work is in progress to ensure we only modify one store for both the object and the object's metadata. If memory serves, the split was introduced to support new sandbox types. Additionally, there was a bug in the CNI network teardown of a pod that caused the deferred removal of a partially created pod to fail; it was also broken in stop pod when the network was never successfully set up in the first place. Having no CNI at all, or just an error while running the CNI setup/teardown, would cause that problem. This was before we had optional support for doing the minimum of loopback internally. A number of PRs are all trying to fix the same problem.
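To illustrate the point about missing transactions, here is a toy sketch (not containerd code; the two maps are hypothetical stand-ins for the sandbox store and the metadata/extension store) of how stopping the daemon between two non-transactional writes leaves an object without its metadata, which is exactly the state the recovery path above has to clean up.

```go
// Toy illustration only: two independent "stores" with no transaction across
// them. If the process stops between the two writes, the first store has the
// sandbox but the second never receives its metadata.
package main

import "fmt"

func main() {
	sandboxStore := map[string]string{}  // stand-in for the sandbox store
	metadataStore := map[string]string{} // stand-in for the extension/metadata store

	id := "sandbox-1"
	sandboxStore[id] = "created"

	crashedBeforeSecondWrite := true // containerd restarted at exactly this point
	if !crashedBeforeSecondWrite {
		metadataStore[id] = "pod metadata"
	}

	if _, ok := metadataStore[id]; !ok {
		fmt.Printf("sandbox %q exists without metadata; recovery must repair or delete it\n", id)
	}
}
```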
Unrelated, but I feel like the errors in this section are switched: containerd/internal/cri/server/sandbox_run.go, lines 221 to 227 in d20482f.
Unless I'm misunderstanding, the first operation should log an
Do you have a link to the issue or PR for this one? Is it fixed in containerd 2.0.4?
#10744. Yes, it's fixed in all releases; see the linked PRs.
Original: https://github.com/containerd/containerd/blob/release/1.5/pkg/cri/server/sandbox_run.go#L372-L388
For real, what is going on here? This needs to be merged and backported to 2.0 ASAP. We have been shipping b9ab7a3 with k3s and rke2 for several months.
/ok-to-test
Closing and reopening to attempt to reset the test results that have a bad link / can't be restarted.
LGTM on green
cc @fuweid
/cherry-pick release/2.1
/cherry-pick release/2.0
@estesp: new pull request created: #12076
@estesp: new pull request created: #12077
Can you review my PR #11967? @mikebrow
Fixes #10848.
I add some sleep at
Create a pod while containerd is in that sleep 60, then restart containerd, and the panic will happen.