ipam: Fix overlapping/duplicate PodCIDR allocation when nodes are added while operator is down #21526
Conversation
Force-pushed from 7b2b22a to 12a94e9.
Force-pushed from 12a94e9 to fa32fd5.
/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed. If it is a flake and a GitHub issue doesn't already exist to track it, comment […]

Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed. If it is a flake and a GitHub issue doesn't already exist to track it, comment […]
Really nice find, thanks for fixing this!
Thanks for the fix & nice explanation!
The explanation makes sense to me and this is covered by a test, so I think we're good to go, but I'll note that I'm not very familiar with this logic and don't understand all the consequences this change can have.
Looking at the commit history, @aanm may be a good candidate to review this if nobody else feels confident enough.
Me neither. This change bypasses a queuing/rate-limiting mechanism on startup, so it might cause more load on the API server during startup. Digging into the commit history, that doesn't seem to be the reason the queue was introduced. The PR that introduces the queue says in its commit message: "this commit introduces the right […]"
Force-pushed from fa32fd5 to fda6b93.
@aanm is unavailable until next week. An alternative solution which would preserve rate limiting is to enqueue the "k8s cache synced" event as well. Since the queue already allows […], we could implement this via a third "resyncHandler" func in […].
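For illustration, here is a minimal sketch of what that alternative could look like, assuming a plain client-go workqueue. The `resyncMarker` type, the handler names, and the queue wiring are all invented for this example and are not Cilium's actual operator code:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// resyncMarker is a sentinel queue item standing in for the
// "k8s cache synced" event (hypothetical, for illustration only).
type resyncMarker struct{}

// worker drains the queue in FIFO order, dispatching node keys to onNode and
// the sentinel to onResync. Because the marker is enqueued after all initial
// node events, onResync only runs once those events have been processed.
func worker(queue workqueue.Interface, onNode func(key string), onResync func()) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		switch v := item.(type) {
		case resyncMarker:
			onResync() // the "third resyncHandler" mentioned above
		case string:
			onNode(v)
		}
		queue.Done(item)
	}
}

func main() {
	queue := workqueue.New()

	// The informer callbacks would enqueue node keys...
	queue.Add("node-1")
	queue.Add("node-2")
	// ...and once the informer cache reports synced, the marker goes in last.
	queue.Add(resyncMarker{})

	// Toy example: shut down so the worker exits after draining the queue.
	queue.ShutDown()
	worker(queue,
		func(key string) { fmt.Println("update", key) },
		func() { fmt.Println("resync: all pre-sync node events processed") },
	)
}
```

A real implementation would keep using the existing rate-limited queue; the point here is only the ordering guarantee: the resync handler runs after every node event that was enqueued before the marker.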
Yes, that is also something that crossed my mind. I initially went for this approach because it is the minimal change and reverts to the behavior of <=v1.10. Personally, I don't see the added benefit of the worker queue, since events are already queued internally in the informer. I would like to refactor this code at some point to use the new Resources. The event handling of the NodeManagers is currently split across packages and files and relies on these global variables to synchronize across at least 3 goroutines. Having each node manager be its own Cell with a dependency on Resource[CiliumNode], invoking one of them depending on config, and then making the KVStore synchronization its own Cell, would remove a lot of complexity.
Force-pushed from fda6b93 to 1022a66.
/test
All tests green, except for the connectivity test, which has the following error: […]
So I guess Cloudflare is having a bad day; this seems unrelated to these changes. We have decided to merge the fix into master now without waiting for André's feedback. Removing the backport labels for now in case there turns out to be an issue with this. Will re-add the labels as soon as we get the all clear next week.
In this case it's okay, but there's one case where we can cause a 503: when L7 policies are enforced.
The reason there's a workqueue instead of processing the event directly from within the informer callbacks is that such processing would delay the informer itself. With the workqueue, you receive events and push them into a queue so the work can be done later, without blocking the informer. The workqueue is different from the internal queue of the informer.
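As a rough sketch of that pattern (not Cilium's actual operator code; it uses client-go's fake controller source so it runs without a cluster, and the per-node processing is a placeholder), the informer callbacks only enqueue keys and a separate worker goroutine does the real work:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	fcache "k8s.io/client-go/tools/cache/testing"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Fake event source so the sketch runs without a cluster; a real operator
	// would use a ListWatch against the API server instead.
	source := fcache.NewFakeControllerSource()
	source.Add(&corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-1"}})

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	enqueue := func(obj interface{}) {
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key) // cheap: just enqueue, never block the informer
		}
	}

	_, informer := cache.NewInformer(source, &corev1.Node{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(_, newObj interface{}) { enqueue(newObj) },
		DeleteFunc: enqueue,
	})

	stop := make(chan struct{})
	defer close(stop)
	go informer.Run(stop)

	// Worker goroutine: the potentially slow processing happens here, so the
	// informer's event delivery is never held up by it.
	go func() {
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			fmt.Println("processing", key) // real per-node logic would go here
			queue.Done(key)
		}
	}()

	time.Sleep(time.Second) // toy example: give the worker a moment to run
}
```

The event handlers return almost immediately, so the informer's delivery loop is never blocked by slow processing; the trade-off, as discussed above, is that "enqueued" is not the same as "processed".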
Force-pushed from 35f63ff to 53f644c.
I updated the test to run up to 5 times now, which seems to make it reliable enough to remove the flake warnings. In my testing it now typically triggers the bug in run 1 or 2. This still leaks goroutines, but that doesn't impact this test case. I am writing an issue for the refactoring now.
@dylandreimerink I don't think my previous review was fully addressed 🤔
There exists an edge case in the operator's PodCIDR allocation code which causes the assignment of overlapping PodCIDRs. This commit adds a test which currently fails, proving that the bug exists. The test verifies that when the operator starts up, all assigned PodCIDRs are marked as allocated, so they will never be given out to nodes seeking a new PodCIDR. The next commit in this patch set will fix the bug, using this test to validate correct behavior.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
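For context, the property the test ultimately protects is "no two nodes end up with overlapping PodCIDRs". The snippet below is not the actual test; it is a self-contained, standard-library illustration of what such an overlap check boils down to, with the allocated CIDRs hard-coded as an assumption:

```go
package main

import (
	"fmt"
	"net"
)

// cidrsOverlap reports whether two CIDR blocks share any addresses. For
// properly aligned blocks this holds exactly when one contains the other's
// network address.
func cidrsOverlap(a, b *net.IPNet) bool {
	return a.Contains(b.IP) || b.Contains(a.IP)
}

func main() {
	// Hypothetical allocations as a buggy operator might have handed them out.
	allocated := []string{"10.0.0.0/24", "10.0.1.0/24", "10.0.1.0/24"}

	nets := make([]*net.IPNet, 0, len(allocated))
	for _, s := range allocated {
		_, n, err := net.ParseCIDR(s)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}

	for i := 0; i < len(nets); i++ {
		for j := i + 1; j < len(nets); j++ {
			if cidrsOverlap(nets[i], nets[j]) {
				fmt.Printf("overlap detected: %s and %s\n", nets[i], nets[j])
			}
		}
	}
}
```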
Force-pushed from 53f644c to a350d22.
@aanm I added my detailed comment on the wrong commit; I've changed that now. Does this address your previous review?
@dylandreimerink IMO the best way to make sure you didn't miss anything is to mark as resolved everything that you think you addressed. If something isn't marked resolved, then maybe it was missed?
This commit fixes an edge case in the `NodesPodCIDRManager`. If there were any nodes on operator startup which had no PodCIDRs, the operator would sometimes assign PodCIDRs to these nodes which had already been allocated to other nodes.

The operator assumed that when `k8sCiliumNodesCacheSynced` closes, all node events have been processed, and it proceeds to call `Resync` on the `nodeManager`. The `NodesPodCIDRManager` will queue any nodes without PodCIDRs to be allocated once the `canAllocatePodCIDRs` variable is set; this variable is set by `Resync`. So, the assumption/expected behavior is that the `NodesPodCIDRManager.Update` function has been called for all nodes in the cache before `Resync` is called.

However, this wasn't the case. The `startSynchronizingCiliumNodes` function starts the informer and connects the nodeManager to it, but instead of handling the events immediately, the callbacks enqueue the events to be handled by a separate goroutine. This means that `k8sCiliumNodesCacheSynced` is closed once all of the node events are enqueued, not when they have been processed by the `nodeManager`.

This commit fixes this behavior by processing all events directly in the informer callbacks until the full sync is complete, at which point we switch over to using the workqueue.

Fixes: cilium#21482
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
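A minimal sketch of the approach this commit message describes, assuming a plain client-go workqueue; the `nodeEventHandler` type and its methods are invented for this illustration and are not the actual Cilium code:

```go
package main

import (
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

// nodeEventHandler is a made-up type standing in for the operator code that
// receives CiliumNode events from the informer.
type nodeEventHandler struct {
	synced atomic.Bool
	queue  workqueue.Interface
}

// onEvent is what the informer callbacks would call with a node key.
func (h *nodeEventHandler) onEvent(key string) {
	if !h.synced.Load() {
		// Before the initial cache sync: handle the event inline, so that
		// returning from the informer callback really means the node has
		// been processed. This is what makes "cache synced" imply
		// "all pre-existing nodes handled".
		h.process(key)
		return
	}
	// After the initial sync: hand the work to the worker goroutine (not
	// shown) via the queue, so the informer is never blocked.
	h.queue.Add(key)
}

func (h *nodeEventHandler) markSynced() { h.synced.Store(true) }

func (h *nodeEventHandler) process(key string) {
	// The real code would call into the NodesPodCIDRManager here.
	fmt.Println("processed", key)
}

func main() {
	h := &nodeEventHandler{queue: workqueue.New()}
	h.onEvent("node-1") // pre-sync: processed inline
	h.markSynced()      // corresponds to k8sCiliumNodesCacheSynced closing
	h.onEvent("node-2") // post-sync: enqueued for the worker
}
```

The key property is that, before the sync point, returning from the informer callback implies the node has actually been handled, so closing the "cache synced" channel regains its intended meaning.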
Force-pushed from a350d22 to 172a3ff.
/test
This PR fixes an edge case in the `NodesPodCIDRManager`. If there were any nodes on operator startup which have no PodCIDRs, the operator would sometimes assign PodCIDRs to these nodes which have already been allocated to other nodes.

The operator assumed that when `k8sCiliumNodesCacheSynced` closes, all node events have been processed, and it proceeds to call `Resync` on the `nodeManager`. The `NodesPodCIDRManager` will queue any nodes without PodCIDRs to be allocated once the `canAllocatePodCIDRs` variable is set; this variable is set by `Resync`. So, the assumption/expected behavior is that the `NodesPodCIDRManager.Update` function has been called for all nodes in the cache before `Resync` is called.

However, this wasn't the case. The `startSynchronizingCiliumNodes` function starts the informer and connects the nodeManager to it, but instead of handling the events immediately, the callbacks enqueue the events to be handled by a separate goroutine. This means that `k8sCiliumNodesCacheSynced` is closed once all of the node events are enqueued, not when they have been processed by the `nodeManager`.

This PR fixes this behavior by processing all events directly in the informer callbacks until the full sync is complete, at which point we switch over to using the workqueue.

Fixes: #21482
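As background for why the original assumption breaks, here is a hedged sketch (not Cilium code; it uses client-go's fake controller source so it runs without a cluster): `WaitForCacheSync` only guarantees that the informer's local store has been populated from the initial List, and says nothing about whether work that the event handlers merely enqueued has already been carried out:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	fcache "k8s.io/client-go/tools/cache/testing"
)

func main() {
	// A fake event source so the sketch runs without a real cluster.
	source := fcache.NewFakeControllerSource()
	source.Add(&corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-1"}})
	source.Add(&corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-2"}})

	store, informer := cache.NewInformer(source, &corev1.Node{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// In the buggy pattern described above, this callback would only
			// push a key onto a workqueue (omitted here); returning from it
			// does not mean the node has been handled.
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	go informer.Run(stop)

	// WaitForCacheSync returns once the informer's store has been populated
	// from the initial List; it does not wait for any work that the event
	// handlers merely enqueued.
	if !cache.WaitForCacheSync(stop, informer.HasSynced) {
		panic("informer never synced")
	}
	fmt.Printf("store synced with %d nodes; queued work may still be pending\n", len(store.List()))
}
```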