ipam: Fix overlapping/duplicate PodCIDR allocation when nodes are added while operator is down #21526
Conversation
Force-pushed from 7b2b22a to 12a94e9.
Force-pushed from 12a94e9 to fa32fd5.
/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed. If it is a flake and a GitHub issue doesn't already exist to track it, comment […]

Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed. If it is a flake and a GitHub issue doesn't already exist to track it, comment […]
Really nice find, thanks for fixing this!
Thanks for the fix & nice explanation!
The explanation makes sense to me and this is covered by a test, so I think we're good to go, but I'll note that I'm not very familiar with this logic and don't understand all the consequences this change can have.
Looking at the commit history, @aanm may be a good candidate to review this if nobody else feels confident enough.
Me neither. This change bypasses a queuing/rate-limiting mechanism on startup, so it might cause more load on the API server during startup. Digging into the commit history, that doesn't seem to be the reason the queue was introduced. The PR that introduces the queue says in its commit message: "this commit introduces the right […]"
Force-pushed from fa32fd5 to fda6b93.
@aanm is unavailable until next week. An alternative solution which would preserve rate limiting is to enqueue the "k8s cache synced" event as well. Since the queue already allows […], we could implement this via a third "resyncHandler" func in […].
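For illustration, here is a minimal sketch of what that alternative could look like, assuming a plain client-go workqueue. The `resyncMarker` type, the handler names, and the queue wiring are all invented for this example and are not Cilium's actual operator code:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// resyncMarker is a sentinel queue item standing in for the
// "k8s cache synced" event (hypothetical, for illustration only).
type resyncMarker struct{}

// worker drains the queue in FIFO order, dispatching node keys to onNode and
// the sentinel to onResync. Because the marker is enqueued after all initial
// node events, onResync only runs once those events have been processed.
func worker(queue workqueue.Interface, onNode func(key string), onResync func()) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		switch v := item.(type) {
		case resyncMarker:
			onResync() // the "third resyncHandler" mentioned above
		case string:
			onNode(v)
		}
		queue.Done(item)
	}
}

func main() {
	queue := workqueue.New()

	// The informer callbacks would enqueue node keys...
	queue.Add("node-1")
	queue.Add("node-2")
	// ...and once the informer cache reports synced, the marker goes in last.
	queue.Add(resyncMarker{})

	// Toy example: shut down so the worker exits after draining the queue.
	queue.ShutDown()
	worker(queue,
		func(key string) { fmt.Println("update", key) },
		func() { fmt.Println("resync: all pre-sync node events processed") },
	)
}
```

A real implementation would keep using the existing rate-limited queue; the point here is only the ordering guarantee: the resync handler runs after every node event that was enqueued before the marker.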
Yes, that is also something that crossed my mind. I initially went for this approach because it is the minimal change and reverts to the behavior of <=v1.10. Personally, I don't see the added benefit of the worker queue, since events are already queued internally in the informer. I would like to refactor this code at some point to use the new Resources. The event handling of the NodeManagers is currently split across packages and files and relies on these global variables to synchronize across at least 3 goroutines. Having each node manager be its own Cell with a dependency on Resource[CiliumNode], invoking one of them depending on config, and then making the KVStore synchronization its own Cell, would remove a lot of complexity.
Force-pushed from fda6b93 to 1022a66.
/test
All tests green, except for the connectivity test, which has the following error: […]
So I guess Cloudflare is having a bad day; this seems unrelated to these changes. We have decided to merge the fix into master now without waiting for André's feedback. Removing the backport labels for now in case there turns out to be an issue with this. Will re-add the labels as soon as we get the all clear next week.
In this case it's okay, but there's one case where we can cause a 503: when L7 policies are enforced.
The reason there's a workqueue instead of processing the event directly from within the informer callbacks is that such processing would delay the informer itself. With the workqueue, you receive events and push them into a queue so the work can be done later, without blocking the informer. The workqueue is different from the internal queue of the informer.
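As a rough sketch of that pattern (not Cilium's actual operator code; it uses client-go's fake controller source so it runs without a cluster, and the per-node processing is a placeholder), the informer callbacks only enqueue keys and a separate worker goroutine does the real work:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	fcache "k8s.io/client-go/tools/cache/testing"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Fake event source so the sketch runs without a cluster; a real operator
	// would use a ListWatch against the API server instead.
	source := fcache.NewFakeControllerSource()
	source.Add(&corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-1"}})

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	enqueue := func(obj interface{}) {
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key) // cheap: just enqueue, never block the informer
		}
	}

	_, informer := cache.NewInformer(source, &corev1.Node{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(_, newObj interface{}) { enqueue(newObj) },
		DeleteFunc: enqueue,
	})

	stop := make(chan struct{})
	defer close(stop)
	go informer.Run(stop)

	// Worker goroutine: the potentially slow processing happens here, so the
	// informer's event delivery is never held up by it.
	go func() {
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			fmt.Println("processing", key) // real per-node logic would go here
			queue.Done(key)
		}
	}()

	time.Sleep(time.Second) // toy example: give the worker a moment to run
}
```

The event handlers return almost immediately, so the informer's delivery loop is never blocked by slow processing; the trade-off, as discussed above, is that "enqueued" is not the same as "processed".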
Force-pushed from 35f63ff to 53f644c.
I updated the test to run up to 5 times now, which seems to make it reliable enough to remove the flake warnings. In my testing it now typically triggers the bug in run 1 or 2. This still leaks goroutines, but that doesn't impact this test case. I am writing an issue for the refactoring now.
@dylandreimerink I don't think my previous review was fully addressed 🤔
There exists an edge case in the operator's PodCIDR allocation code which causes the assignment of overlapping PodCIDRs. This commit adds a test which currently fails, proving that the bug exists. The test verifies that when the operator starts up, all assigned PodCIDRs are marked as allocated, so they will never be given out to nodes seeking a new PodCIDR. The next commit in this patch set will fix the bug, using this test to validate correct behavior.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
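For context, the property the test ultimately protects is "no two nodes end up with overlapping PodCIDRs". The snippet below is not the actual test; it is a self-contained, standard-library illustration of what such an overlap check boils down to, with the allocated CIDRs hard-coded as an assumption:

```go
package main

import (
	"fmt"
	"net"
)

// cidrsOverlap reports whether two CIDR blocks share any addresses. For
// properly aligned blocks this holds exactly when one contains the other's
// network address.
func cidrsOverlap(a, b *net.IPNet) bool {
	return a.Contains(b.IP) || b.Contains(a.IP)
}

func main() {
	// Hypothetical allocations as a buggy operator might have handed them out.
	allocated := []string{"10.0.0.0/24", "10.0.1.0/24", "10.0.1.0/24"}

	nets := make([]*net.IPNet, 0, len(allocated))
	for _, s := range allocated {
		_, n, err := net.ParseCIDR(s)
		if err != nil {
			panic(err)
		}
		nets = append(nets, n)
	}

	for i := 0; i < len(nets); i++ {
		for j := i + 1; j < len(nets); j++ {
			if cidrsOverlap(nets[i], nets[j]) {
				fmt.Printf("overlap detected: %s and %s\n", nets[i], nets[j])
			}
		}
	}
}
```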
Force-pushed from 53f644c to a350d22.
@aanm I added my detailed comment on the wrong commit; I've changed that now. Does this address your previous review?
@dylandreimerink IMO the best way to make sure you didn't miss anything is to mark as resolved everything that you think you addressed. If something isn't marked resolved, then maybe it was missed?
This commit fixes an edge case in the `NodesPodCIDRManager`. If there were any nodes on operator startup which had no PodCIDRs, the operator would sometimes assign PodCIDRs to these nodes which had already been allocated to other nodes.

The operator assumed that when `k8sCiliumNodesCacheSynced` closes, all node events have been processed, and it proceeds to call `Resync` on the `nodeManager`. The `NodesPodCIDRManager` will queue any nodes without PodCIDRs to be allocated once the `canAllocatePodCIDRs` variable is set; this variable is set by `Resync`. So, the assumption/expected behavior is that the `NodesPodCIDRManager.Update` function has been called for all nodes in the cache before `Resync` is called.

However, this wasn't the case. The `startSynchronizingCiliumNodes` function starts the informer and connects the nodeManager to it, but instead of handling the events immediately, the callbacks enqueue the events to be handled by a separate goroutine. This means that `k8sCiliumNodesCacheSynced` is closed once all of the node events are enqueued, not when they have been processed by the `nodeManager`.

This commit fixes this behavior by processing all events directly in the informer callbacks until the full sync is complete, at which point we switch over to using the workqueue.

Fixes: cilium#21482
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
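A minimal sketch of the approach this commit message describes, assuming a plain client-go workqueue; the `nodeEventHandler` type and its methods are invented for this illustration and are not the actual Cilium code:

```go
package main

import (
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

// nodeEventHandler is a made-up type standing in for the operator code that
// receives CiliumNode events from the informer.
type nodeEventHandler struct {
	synced atomic.Bool
	queue  workqueue.Interface
}

// onEvent is what the informer callbacks would call with a node key.
func (h *nodeEventHandler) onEvent(key string) {
	if !h.synced.Load() {
		// Before the initial cache sync: handle the event inline, so that
		// returning from the informer callback really means the node has
		// been processed. This is what makes "cache synced" imply
		// "all pre-existing nodes handled".
		h.process(key)
		return
	}
	// After the initial sync: hand the work to the worker goroutine (not
	// shown) via the queue, so the informer is never blocked.
	h.queue.Add(key)
}

func (h *nodeEventHandler) markSynced() { h.synced.Store(true) }

func (h *nodeEventHandler) process(key string) {
	// The real code would call into the NodesPodCIDRManager here.
	fmt.Println("processed", key)
}

func main() {
	h := &nodeEventHandler{queue: workqueue.New()}
	h.onEvent("node-1") // pre-sync: processed inline
	h.markSynced()      // corresponds to k8sCiliumNodesCacheSynced closing
	h.onEvent("node-2") // post-sync: enqueued for the worker
}
```

The key property is that, before the sync point, returning from the informer callback implies the node has actually been handled, so closing the "cache synced" channel regains its intended meaning.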
Force-pushed from a350d22 to 172a3ff.
/test
This PR fixes an edge case in the `NodesPodCIDRManager`. If there were any nodes on operator startup which have no PodCIDRs, the operator would sometimes assign PodCIDRs to these nodes which have already been allocated to other nodes.

The operator assumed that when `k8sCiliumNodesCacheSynced` closes, all node events have been processed, and it proceeds to call `Resync` on the `nodeManager`. The `NodesPodCIDRManager` will queue any nodes without PodCIDRs to be allocated once the `canAllocatePodCIDRs` variable is set; this variable is set by `Resync`. So, the assumption/expected behavior is that the `NodesPodCIDRManager.Update` function has been called for all nodes in the cache before `Resync` is called.

However, this wasn't the case. The `startSynchronizingCiliumNodes` function starts the informer and connects the nodeManager to it, but instead of handling the events immediately, the callbacks enqueue the events to be handled by a separate goroutine. This means that `k8sCiliumNodesCacheSynced` is closed once all of the node events are enqueued, not when they have been processed by the `nodeManager`.

This PR fixes this behavior by processing all events directly in the informer callbacks until the full sync is complete, at which point we switch over to using the workqueue.

Fixes: #21482
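As background for why the original assumption breaks, here is a hedged sketch (not Cilium code; it uses client-go's fake controller source so it runs without a cluster): `WaitForCacheSync` only guarantees that the informer's local store has been populated from the initial List, and says nothing about whether work that the event handlers merely enqueued has already been carried out:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	fcache "k8s.io/client-go/tools/cache/testing"
)

func main() {
	// A fake event source so the sketch runs without a real cluster.
	source := fcache.NewFakeControllerSource()
	source.Add(&corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-1"}})
	source.Add(&corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-2"}})

	store, informer := cache.NewInformer(source, &corev1.Node{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// In the buggy pattern described above, this callback would only
			// push a key onto a workqueue (omitted here); returning from it
			// does not mean the node has been handled.
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	go informer.Run(stop)

	// WaitForCacheSync returns once the informer's store has been populated
	// from the initial List; it does not wait for any work that the event
	// handlers merely enqueued.
	if !cache.WaitForCacheSync(stop, informer.HasSynced) {
		panic("informer never synced")
	}
	fmt.Printf("store synced with %d nodes; queued work may still be pending\n", len(store.List()))
}
```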