Skip to content

Conversation

dylandreimerink
Copy link
Member

This PR fixes an edge case in the NodesPodCIDRManager.
If there were any nodes on operator startup which have no PodCIDRs, the
operator would sometimes assign PodCIDRs to these nodes which have
already been allocated to other nodes.

The operator assumed that when k8sCiliumNodesCacheSynced closes, all
node events have been processed. And it proceeds to call Resync on
the nodeManager.

The NodesPodCIDRManager will queue any nodes
without PodCIDRs to be allocated once the canAllocatePodCIDRs variable
is set. This variable is set by the Resync.

So, the assumption/expected behavior is that the
NodesPodCIDRManager.Update function has been called for all nodes in
the cache before Resync is called. However, this wasn't the case.
The startSynchronizingCiliumNodes function starts the informer
and connects the nodeManager to it. But instead of handling the events
at once, the callbacks enqueue the events, to be handled by a separate
go routine. This means that k8sCiliumNodesCacheSynced is closed once
all of the node events are enqueued, not when they have been processed
by the nodeManager.

This commit fixes this behavior by processing all events at once
in the informer callbacks until the full sync is complete, at which
point we will switch over to using the workqueue.

Fixes: #21482

Fix overlapping/duplicate PodCIDR allocation when nodes are added while operator is down

@dylandreimerink dylandreimerink added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/operator Impacts the cilium-operator component sig/ipam labels Sep 30, 2022
@dylandreimerink dylandreimerink requested review from pchaigno, a team and gandro and removed request for a team September 30, 2022 13:56
@dylandreimerink dylandreimerink force-pushed the feature/podcidr-overlap-fix branch from 7b2b22a to 12a94e9 Compare September 30, 2022 14:33
@dylandreimerink dylandreimerink marked this pull request as ready for review September 30, 2022 14:33
@dylandreimerink dylandreimerink requested a review from a team as a code owner September 30, 2022 14:33
@dylandreimerink dylandreimerink force-pushed the feature/podcidr-overlap-fix branch from 12a94e9 to fa32fd5 Compare September 30, 2022 14:44
@dylandreimerink
Copy link
Member Author

dylandreimerink commented Oct 1, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:

Click to show.

Test Name

K8sAgentPolicyTest External services To Services first endpoint creation match service by labels

Failure Output

FAIL: Error creating resource /home/jenkins/workspace/Cilium-PR-K8s-1.16-kernel-4.9/src/github.com/cilium/cilium/test/k8s/manifests/external-policy-labeled.yaml: Timed out while waiting for policies to be enforced: 4m0s timeout expired

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathConfig Check BPF masquerading with ip-masq-agent DirectRouting

Failure Output

FAIL: Failed to add ip route

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-net-next so I can create one.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice find, thanks for fixing this!

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix & nice explanation!

The explanation makes sense to me and this is covered by a test, so I think we're good to go, but I'll note I'm not very familiar with this logic and don't understand all consequences this change can have.
Looking at the commit history, @aanm may be a good candidate to review this if nobody else feels confident enough.

@dylandreimerink
Copy link
Member Author

note I'm not very familiar with this logic and don't understand all consequences this change can have

Me neither. This change bypasses a queuing / rate limiting mechanism on startup. So it might cause more load on the API server on startup. Digging into the commit history, that doesn't seem to be the reason for introducing the queue.

This PR introduces the queue, the commit says "this commit introduces the right
way of handling Kubernetes events, with a workqueue" but I am not sure why this is the right way. @aanm could you clarify?

@dylandreimerink dylandreimerink force-pushed the feature/podcidr-overlap-fix branch from fa32fd5 to fda6b93 Compare October 4, 2022 13:54
@gandro
Copy link
Member

gandro commented Oct 4, 2022

@aanm Is unavailable till next week.

An alternative solution which would preserve rate limiting is to enqueue the "k8s cached synced" event as well. Since the queue already allows any type to be enqueued, we could enqueue a special "resync" key and invoke nodeManager.Resync when we dequeue the key.

We could implement this via a third "resyncHandler" func in syncHandlerConstructor that is invoked when the special resync key is hit.

@dylandreimerink
Copy link
Member Author

An alternative solution which would preserve rate limiting is to enqueue the "k8s cached synced" event as well.

Yes, that is also something that crossed my mind. I initially went for this approach because it is the minimal change, and reverts to the behavior of <=v1.10. I don't see the added benefit of the worker queue personally since events are already queued internally in the informer.

I would personally like to refactor this code at some point to use the new Resources. The event handling of the NodeManagers is currently split across packages and files and relies on these global variables to synchronize across at least 3 goroutines. Having each node manager be its own Cell with a dependency on Resource[CiliumNode] and Invoking one of them depending on config. Then making the KVStore synchronization its own Cell would remove a lot of complexity.

@dylandreimerink dylandreimerink force-pushed the feature/podcidr-overlap-fix branch from fda6b93 to 1022a66 Compare October 4, 2022 14:38
@dylandreimerink
Copy link
Member Author

/test

@dylandreimerink
Copy link
Member Author

All tests green, except for the connectivity test, which has the following error:

curl: (22) The requested URL returned error: 503

So I guess cloudflare has a bad day, seem unrelated to these changes.

We have decided to merge the fix into master now without waiting for André's feedback. Removing the backport labels for now in case there turns out to be an issue with this. Will re-add the labels as soon as we get the all clear next week.

@dylandreimerink dylandreimerink added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed needs-backport/1.11 labels Oct 5, 2022
@pchaigno
Copy link
Member

pchaigno commented Oct 5, 2022

curl: (22) The requested URL returned error: 503

In this case it's okay, but there's one case where we can cause a 503, that's when L7 policies are enforced.

@christarazi
Copy link
Member

@dylandreimerink

This #17873 introduces the queue, the commit says "this commit introduces the right
way of handling Kubernetes events, with a workqueue" but I am not sure why this is the right way.
...
I don't see the added benefit of the worker queue personally since events are already queued internally in the informer.

The reason why there's a workqueue instead of processing the event directly from within the informer callbacks is because that processing will actually delay the informer itself. With the workqueue, you receive events and push them into a queue for the work to be done later, without blocking the informer itself.

The workqueue is different than the internal queue of the informer.

@dylandreimerink dylandreimerink force-pushed the feature/podcidr-overlap-fix branch from 35f63ff to 53f644c Compare October 20, 2022 08:58
@dylandreimerink
Copy link
Member Author

I updated the test, to run up to 5 times now which seems to make this test reliable enough to remove the flake warnings. In my testing it typically triggers the bug now in run 1 or 2. This still leaks goroutines, but it doesn't impact this test case. I am writing an issue for the refactoring now

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylandreimerink I don't think my previous review was fully addressed 🤔

There exists an edge case in the operators pod CIDR allocation code
which causes the assignment of overlapping PodCIDRs. This commit adds
a test which currently fails, proving that the bug exists.

The test, tests that when the operator starts up, all assigned PodCIDR
are marked as allocated so they will never be given out to nodes seeking
a new PodCIDR.

The next commit in this patch set will fix the bug, using this test to
validate correct behavior.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
@dylandreimerink dylandreimerink force-pushed the feature/podcidr-overlap-fix branch from 53f644c to a350d22 Compare October 24, 2022 10:16
@dylandreimerink
Copy link
Member Author

@aanm I added my detailed comment on the wrong commit, changed that now, does this address your previous review?

@dylandreimerink dylandreimerink requested a review from aanm October 24, 2022 10:17
@pchaigno
Copy link
Member

@dylandreimerink IMO the best way to make sure you didn't miss anything is to mark resolved everything that you think you addressed. If something isn't marked resolved then maybe it was missed?

This commit fixes an edge case in the `NodesPodCIDRManager`.
If there were any nodes on operator startup which have no PodCIDRs, the
operator would sometimes assign PodCIDRs to these nodes which have
already been allocated to other nodes.

The operator assumed that when `k8sCiliumNodesCacheSynced` closes, all
node events have been processed. And it proceeds to call `Resync` on
the `nodeManager`.

The `NodesPodCIDRManager` will queue any nodes
without PodCIDRs to be allocated once the `canAllocatePodCIDRs` variable
is set. This variable is set by the `Resync`.

So, the assumption/expected behavior is that the
`NodesPodCIDRManager.Update` function has been called for all nodes in
the cache before `Resync` is called. However, this wasn't the case.
The `startSynchronizingCiliumNodes` function starts the informer
and connects the nodeManager to it. But instead of handling the events
at once, the callbacks enqueue the events, to be handled by a separate
go routine. This means that `k8sCiliumNodesCacheSynced` is closed once
all of the node events are enqueued, not when they have been processed
by the `nodeManager`.

This commit fixes this behavior by processing all events at once
in the informer callbacks until the full sync is complete, at which
point we will switch over to using the workqueue.

Fixes: cilium#21482

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
@dylandreimerink dylandreimerink force-pushed the feature/podcidr-overlap-fix branch from a350d22 to 172a3ff Compare October 24, 2022 11:08
@dylandreimerink
Copy link
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 24, 2022
@aanm aanm merged commit 4c9c1d3 into cilium:master Oct 25, 2022
@joestringer joestringer added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.11 labels Nov 18, 2022
@gandro gandro mentioned this pull request Feb 27, 2023
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operator Impacts the cilium-operator component backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cilium-operator with IPAM cluster-pool handing out duplicate CIDRs
8 participants