Skip to content

Conversation

ningmingxiao
Copy link
Contributor

@ningmingxiao ningmingxiao commented Mar 20, 2025

fix #10848
I add some sleep at


		if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
			return nil, fmt.Errorf("unable to save sandbox %q to store: %w", id, err)
		}
		// Save sandbox metadata to store
                time.Sleep(time.Second*60)
		if sandboxInfo, err = c.client.SandboxStore().Update(ctx, sandboxInfo, "extensions"); err != nil {
			return nil, fmt.Errorf("unable to update extensions for sandbox %q: %w", id, err)
		}
 

create a pod when containerd is running at sleep 60 restart containerd, it will happen.

@k8s-ci-robot
Copy link

Hi @ningmingxiao. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@@ -106,6 +107,14 @@ func (c *criService) recover(ctx context.Context) error {
metadata := sandboxstore.Metadata{}
err := sbx.GetExtension(podsandbox.MetadataKey, &metadata)
if err != nil {
if errors.Is(err, errdefs.ErrNotFound) {
err = c.client.SandboxStore().Delete(ctx, sbx.ID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add error log here? Something like "could not recover sandbox X: missing metadata key"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nod, need to log the error message that was being returned below "failed to get metadata for stored sandbox %q: %w", sbx.ID, err to this err not found case so we always get that as output when we run into it..

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

@@ -106,6 +107,14 @@ func (c *criService) recover(ctx context.Context) error {
metadata := sandboxstore.Metadata{}
err := sbx.GetExtension(podsandbox.MetadataKey, &metadata)
if err != nil {
if errors.Is(err, errdefs.ErrNotFound) {
err = c.client.SandboxStore().Delete(ctx, sbx.ID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nod, need to log the error message that was being returned below "failed to get metadata for stored sandbox %q: %w", sbx.ID, err to this err not found case so we always get that as output when we run into it..

@brandond
Copy link
Contributor

brandond commented Mar 20, 2025

Is there some operation that needs to be done in a transaction or something? Ideally it wouldn't be possible to introduce an inconsistency between stores that later needs to be cleaned up, just by stopping containerd at the wrong time.

Feels like all these error handling PRs are just to clean up after a problem that shouldn't be occurring in the first place?

@mikebrow
Copy link
Member

mikebrow commented Mar 20, 2025

Is there some operation that needs to be done in a transaction or something? Ideally it wouldn't be possible to introduce an inconsistency between stores that later needs to be cleaned up, just by stopping containerd at the wrong time.

Feels like all these error handling PRs are just to clean up after a problem that shouldn't be occurring in the first place?

we don't have transaction support across two stores.. wip to ensure we are only modifying one store for both the object and object's meta ... if memory serves there was a split to support new sandbox types..

additionally there was a bug in network CNI tear down of a pod that was causing failure to remove the partially created pod in defer.. was also broken in stop pod when the network was never successfully setup in the first place .. the issue was no cni at all or just an error in running the cni setup/teardown would cause that problem.. this before we had the optional support for doing the minimum of loopback internally....

a number of PRs are all trying to fix the same problem

@brandond
Copy link
Contributor

brandond commented Mar 20, 2025

Unrelated, but I feel like the errors in this section are switched:

if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
return nil, fmt.Errorf("unable to save sandbox %q to store: %w", id, err)
}
// Save sandbox metadata to store
if sandboxInfo, err = c.client.SandboxStore().Update(ctx, sandboxInfo, "extensions"); err != nil {
return nil, fmt.Errorf("unable to update extensions for sandbox %q: %w", id, err)
}

Unless I'm misunderstanding, the first operation should log an unable to update extensions error, and the second should log unable to save sandbox %q to store. Add the extensions does not save the store, that's what the second function is doing - as indicated by the comment on the line above it.

AddExtension is just a helper that sets sandboxInfo.Extensions[podsandbox.MetadataKey] = typeurl.MarshalAny(&sandbox.Metadata), it doesn't actually save anything as far as I can tell?

@brandond
Copy link
Contributor

brandond commented Mar 20, 2025

@mikebrow
additionally there was a bug in network CNI tear down of a pod that was causing failure to remove the partially created pod in defer.. was also broken in stop pod when the network was never successfully setup in the first place .. the issue was no cni at all or just an error in running the cni setup/teardown would cause that problem..

Do you have a link to the issue or PR for this one? Is it fixed in containerd 2.0.4?

@mikebrow
Copy link
Member

#10744 yes fixed in all releases see linked PRs..

@mikebrow
Copy link
Member

Unrelated, but I feel like the errors in this section are switched:

if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
return nil, fmt.Errorf("unable to save sandbox %q to store: %w", id, err)
}
// Save sandbox metadata to store
if sandboxInfo, err = c.client.SandboxStore().Update(ctx, sandboxInfo, "extensions"); err != nil {
return nil, fmt.Errorf("unable to update extensions for sandbox %q: %w", id, err)
}

Unless I'm misunderstanding, the first operation should log an unable to update extensions error, and the second should log unable to save sandbox %q to store. Add the extensions does not save the store, that's what the second function is doing - as indicated by the comment on the line above it.

AddExtension is just a helper that sets sandboxInfo.Extensions[podsandbox.MetadataKey] = typeurl.MarshalAny(&sandbox.Metadata), it doesn't actually save anything as far as I can tell?

original.. https://github.com/containerd/containerd/blob/release/1.5/pkg/cri/server/sandbox_run.go#L372-L388

@brandond
Copy link
Contributor

brandond commented Jul 1, 2025

for real what is going on here. This needs to be merged and backported to 2.0 asap. We have been shipping b9ab7a3 with k3s and rke2 for several months.

@mikebrow
Copy link
Member

mikebrow commented Jul 8, 2025

/ok-to-test

@mikebrow
Copy link
Member

mikebrow commented Jul 8, 2025

closing and reopening to attempt to reset the test results that have a bad link / can't be restarted...

@mikebrow mikebrow closed this Jul 8, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Done in Pull Request Review Jul 8, 2025
@mikebrow mikebrow reopened this Jul 8, 2025
@github-project-automation github-project-automation bot moved this from Done to Needs Triage in Pull Request Review Jul 8, 2025
Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on green
cc @fuweid

@mikebrow mikebrow added the cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch label Jul 8, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Review In Progress in Pull Request Review Jul 9, 2025
@estesp estesp added the cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch label Jul 9, 2025
@estesp estesp added this pull request to the merge queue Jul 9, 2025
Merged via the queue into containerd:main with commit b15270e Jul 9, 2025
142 of 147 checks passed
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Jul 9, 2025
@estesp
Copy link
Member

estesp commented Jul 9, 2025

/cherry-pick release/2.1

@estesp
Copy link
Member

estesp commented Jul 9, 2025

/cherry-pick release/2.0

@k8s-infra-cherrypick-robot

@estesp: new pull request created: #12076

In response to this:

/cherry-pick release/2.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot

@estesp: new pull request created: #12077

In response to this:

/cherry-pick release/2.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ningmingxiao
Copy link
Contributor Author

ningmingxiao commented Jul 9, 2025

can you review my pr 11967? @mikebrow

@austinvazquez austinvazquez added cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch and removed cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch labels Jul 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cri Container Runtime Interface (CRI) cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch kind/bug ok-to-test size/S
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

containerd fails on start: "failed to get metadata for stored sandbox"
8 participants