cri:fix containerd panic when can't find sandbox extension #11576
Conversation
Hi @ningmingxiao. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 20abca1 to ad9d389
Force-pushed from ad9d389 to 90c144b
Force-pushed from 7d11626 to abe6dfe
@@ -106,6 +107,14 @@ func (c *criService) recover(ctx context.Context) error {
	metadata := sandboxstore.Metadata{}
	err := sbx.GetExtension(podsandbox.MetadataKey, &metadata)
	if err != nil {
		if errors.Is(err, errdefs.ErrNotFound) {
			err = c.client.SandboxStore().Delete(ctx, sbx.ID)
Can we add an error log here? Something like "could not recover sandbox X: missing metadata key"
Nod, we need to log the error message that is returned below ("failed to get metadata for stored sandbox %q: %w", sbx.ID, err) in this ErrNotFound case too, so we always get that output when we run into it.
see comments
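For context, here is a minimal, self-contained sketch of the handling described by the diff and the review comments above: if the metadata extension for a stored sandbox cannot be found, log the failure and delete the stale record instead of aborting recovery. The store type and helper names below are hypothetical stand-ins, not containerd's actual sandbox store API; only the control flow mirrors the patch.

```go
// Illustrative sketch only: "store" is a hypothetical stand-in for
// containerd's sandbox store; errNotFound stands in for errdefs.ErrNotFound.
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

var errNotFound = errors.New("not found")

type store struct {
	// extensions maps sandbox ID -> serialized metadata extension.
	extensions map[string][]byte
}

// getExtension models sbx.GetExtension(podsandbox.MetadataKey, &metadata).
func (s *store) getExtension(id string) ([]byte, error) {
	ext, ok := s.extensions[id]
	if !ok {
		return nil, fmt.Errorf("metadata extension for %q: %w", id, errNotFound)
	}
	return ext, nil
}

// deleteRecord models c.client.SandboxStore().Delete(ctx, sbx.ID).
func (s *store) deleteRecord(_ context.Context, id string) error {
	delete(s.extensions, id)
	return nil
}

// recoverSandbox mirrors the shape of the fix: a missing metadata extension is
// logged (as requested in review) and the stale record is removed, while any
// other error still fails recovery.
func recoverSandbox(ctx context.Context, s *store, id string) error {
	if _, err := s.getExtension(id); err != nil {
		if errors.Is(err, errNotFound) {
			log.Printf("failed to get metadata for stored sandbox %q: %v; removing stale record", id, err)
			return s.deleteRecord(ctx, id)
		}
		return fmt.Errorf("failed to get metadata for stored sandbox %q: %w", id, err)
	}
	return nil
}

func main() {
	// A sandbox record exists but its metadata extension was never written,
	// e.g. containerd stopped between the two store writes.
	s := &store{extensions: map[string][]byte{}}
	if err := recoverSandbox(context.Background(), s, "sandbox-1"); err != nil {
		log.Fatal(err)
	}
}
```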
Is there some operation that needs to be done in a transaction or something? Ideally it wouldn't be possible to introduce an inconsistency between stores that later needs to be cleaned up, just by stopping containerd at the wrong time. Feels like all these error-handling PRs are just to clean up after a problem that shouldn't be occurring in the first place?
We don't have transaction support across two stores. Work is in progress to ensure we only modify one store for both the object and the object's metadata. If memory serves, the split was introduced to support new sandbox types. Additionally, there was a bug in the CNI network teardown of a pod that caused the deferred removal of a partially created pod to fail; it was also broken in stop pod when the network was never successfully set up in the first place. Having no CNI at all, or just an error while running the CNI setup/teardown, would cause that problem. This was before we had optional support for doing the minimum of loopback internally. A number of PRs are all trying to fix the same problem.
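To illustrate the point about missing transactions, here is a toy sketch (not containerd code; the two maps are hypothetical stand-ins for the sandbox store and the metadata/extension store) of how stopping the daemon between two non-transactional writes leaves an object without its metadata, which is exactly the state the recovery path above has to clean up.

```go
// Toy illustration only: two independent "stores" with no transaction across
// them. If the process stops between the two writes, the first store has the
// sandbox but the second never receives its metadata.
package main

import "fmt"

func main() {
	sandboxStore := map[string]string{}  // stand-in for the sandbox store
	metadataStore := map[string]string{} // stand-in for the extension/metadata store

	id := "sandbox-1"
	sandboxStore[id] = "created"

	crashedBeforeSecondWrite := true // containerd restarted at exactly this point
	if !crashedBeforeSecondWrite {
		metadataStore[id] = "pod metadata"
	}

	if _, ok := metadataStore[id]; !ok {
		fmt.Printf("sandbox %q exists without metadata; recovery must repair or delete it\n", id)
	}
}
```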
Unrelated, but I feel like the errors in this section are switched: containerd/internal/cri/server/sandbox_run.go, lines 221 to 227 in d20482f.
Unless I'm misunderstanding, the first operation should log an
Do you have a link to the issue or PR for this one? Is it fixed in containerd 2.0.4?
#10744. Yes, it's fixed in all releases; see the linked PRs.
Original: https://github.com/containerd/containerd/blob/release/1.5/pkg/cri/server/sandbox_run.go#L372-L388
For real, what is going on here? This needs to be merged and backported to 2.0 ASAP. We have been shipping b9ab7a3 with k3s and rke2 for several months.
/ok-to-test
Closing and reopening to attempt to reset the test results that have a bad link / can't be restarted.
LGTM on green
cc @fuweid
/cherry-pick release/2.1
/cherry-pick release/2.0
@estesp: new pull request created: #12076
@estesp: new pull request created: #12077
Can you review my PR #11967? @mikebrow
Fixes #10848.
I add some sleep at
Create a pod while containerd is in that sleep 60, then restart containerd, and the panic will happen.