Skip to content

Conversation

dcantah
Copy link
Member

@dcantah dcantah commented Jun 28, 2024

Fixes: #10363

Due to when we setup networking for userns containers (AFTER the OCI runtime has ran), this was causing the underlying object for the sandbox to not have the IP information saved in the metadata extension. Because we're trying to move away from modifying the underlying container object for sandboxes (or even just keeping around a reference to it), I don't want to update the underlying container afterwards to include this. So, to remedy this we can update the sandboxes metadata after network setup runs for userns containers, and then our recovery logic just needs to be altered a bit.

Our recovery logic already checks containers with the sandbox label first, and then checks our shiny new sandbox store. Our IP information is stored in the sandbox store objects, so all we need to do is check if it exists, and if so and we already found a container for it (although without the IP information) then just update the sandbox in our in-memory store.

@k8s-ci-robot
Copy link

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@dcantah
Copy link
Member Author

dcantah commented Jul 2, 2024

/retest

@dcantah dcantah force-pushed the cri/userns-persist-ips branch from 51dbc98 to c148cc0 Compare July 9, 2024 06:41
@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Jul 9, 2024
@dcantah dcantah force-pushed the cri/userns-persist-ips branch 3 times, most recently from c56d3d0 to 7bf5940 Compare July 9, 2024 09:21
@dcantah dcantah marked this pull request as draft July 9, 2024 21:18
@dcantah dcantah force-pushed the cri/userns-persist-ips branch 6 times, most recently from 12fe520 to b5a0bad Compare July 15, 2024 07:29
@dcantah dcantah force-pushed the cri/userns-persist-ips branch from b5a0bad to 82fdfad Compare July 15, 2024 08:15
@dcantah dcantah marked this pull request as ready for review July 15, 2024 19:15
@dosubot dosubot bot added the kind/bug label Jul 15, 2024
@dcantah
Copy link
Member Author

dcantah commented Jul 15, 2024

@fuweid Sorry for delay, un-drafted this and added an integration test. Only some of the CI machines support userns in full so I added a log so we can grep for if the testcase is actually being hit. "restart_linux_test.go:36: adding userns pods for containerd restart test"

@dcantah dcantah force-pushed the cri/userns-persist-ips branch from 82fdfad to 0feb455 Compare July 15, 2024 19:20
@dcantah
Copy link
Member Author

dcantah commented Jul 15, 2024

/retest

dcantah added 2 commits July 15, 2024 18:34
Fixes: containerd#10363

Due to when we setup networking for userns containers (AFTER the OCI runtime has ran),
this was causing the underlying object for the sandbox to not have the IP information
saved in the metadata extension. Because we're trying to move away from modifying the
underlying container object for sandboxes (or even just keeping around a reference to it),
I don't want to update the underlying container afterwards to include this. So, to remedy
this we can update the sandboxes metadata after network setup runs for userns containers,
and then our recovery logic just needs to be altered a bit.

Our recovery logic already checks containers with the sandbox label first, and then checks
our shiny new sandbox store. Our IP information is stored in the sandbox store objects, so
all we need to do is check if it exists, and if so and we already found a container for it
(although without the IP information) then just update the sandbox in our in-memory store.

Signed-off-by: Danny Canter <danny@dcantah.dev>
Add a new case to TestContainerdRestartSandboxRecover that will launch
a userns pod if the host and runtime supports it. The new case will just
test that we have networking info in the pod metadata on restart.

Signed-off-by: Danny Canter <danny@dcantah.dev>
@dcantah dcantah force-pushed the cri/userns-persist-ips branch from 0feb455 to 3f9c70c Compare July 16, 2024 01:34
@k8s-ci-robot
Copy link

@dcantah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-containerd-k8s-e2e-ec2 3f9c70c link false /test pull-containerd-k8s-e2e-ec2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -358,6 +350,15 @@ func (c *criService) RunPodSandbox(ctx context.Context, r *runtime.RunPodSandbox
sandboxCreateNetworkTimer.UpdateSince(netStart)
}

if err := sandboxInfo.AddExtension(podsandbox.MetadataKey, &sandbox.Metadata); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should move these two updates AddExtension and SandboxStore.Update into the end of if !hostNetwork(config) && userNsEnabled because non-user-ns case doesn't need to update it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. When I'm by my computer I'll see if there's any gotchas

@@ -81,7 +82,7 @@ func (c *criService) recover(ctx context.Context) error {
return nil
}
log.G(ctx2).Debugf("Loaded sandbox %+v", sb)
if err := c.sandboxStore.Add(sb); err != nil {
if err := c.sandboxStore.Add(sb); err != nil && !errors.Is(err, errdefs.ErrAlreadyExists) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it related to this fix?

@@ -100,12 +101,18 @@ func (c *criService) recover(ctx context.Context) error {
return fmt.Errorf("failed to list sandboxes from API: %w", err)
}
for _, sbx := range storedSandboxes {
if _, err := c.sandboxStore.Get(sbx.ID); err == nil {
sb, err := c.sandboxStore.Get(sbx.ID)
Copy link
Member

@fuweid fuweid Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: we should merge metadata-db.sandbox-store.metadata with metadata-db.container.metadata in function podSandboxLoader.RecoverContainer. This loop is designed for the sandboxes which are managed by remote sandbox plugin. I think we should not involve two sources in one loop.

In the RecoverContainer function, we should prefer to use metadata from metadata-db.sandbox. What do you think?


@dmcgowan @mxpv

I found that the sandboxes bucket wasn't in v1. Should we add the version before namespace for sandbox?
#10467

func getSandboxBucket(tx *bolt.Tx, namespace string) *bolt.Bucket {
return getBucket(
tx,
[]byte(namespace),
bucketKeyObjectSandboxes,
)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but that seemed a big shift for this PR. This "use the underlying container for the given sandbox" vs "use the sandbox api as the source of truth to grab the sandboxes" is very confusing atm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the underlying container for the given sandbox

It's to recover sandbox created by previous release actually.

but that seemed a big shift for this PR

OK. What about adding a condition that if IP is empty, we should use metadata from sandbox bucket 😂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can give that a try!

@k8s-ci-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AkihiroSuda
Copy link
Member

Removing from v2.0 milestone as staled

@AkihiroSuda AkihiroSuda modified the milestones: 2.0, 2.1 Oct 17, 2024
Copy link

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 7 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the Stale label Jan 16, 2025
Copy link

This PR was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this Jan 24, 2025
@dmcgowan dmcgowan removed this from the 2.1 milestone Mar 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[v2.0.0] No CNI info for pod sandbox after containerd restart when using user namespaces
5 participants