ipam: Fix bug where IP lease did not expire #29443
Conversation
/test
Force-pushed from aa68241 to 87eac13.
/ci-ginkgo
Somehow, CI ginkgo is now failing suspiciously often with

Edit: It seems this bugfix uncovered another problem in CI: due to #27253 we're now expiring IP leases after 5 seconds instead of 10 minutes in CI. That is probably too short a period between CNI ADD and the first regeneration to succeed.
This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a Go routine calling `time.Sleep` to wait for expiration to occur. The default expiration timeout is 10 minutes. This meant that for every IP allocated via CNI ADD, we had a Go routine unconditionally sleeping for 10 minutes, only to (in most cases) wake up and learn that the expiration timer was stopped. This commit improves that situation by having the expiration Go routine wake up and exit early if it was stopped (either via IP release or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5 seconds (see cilium#27253). This meant that the `time.Sleep` expiration timer would effectively be 5 seconds instead of 10 minutes. 5 seconds, however, is not enough for an endpoint to be created via CNI ADD and complete its first endpoint regeneration. This therefore led to endpoint IPs being released while the endpoint was still being created. Due to another bug (fixed in the next commit), the expiration timer failed to actually release the IP, which is why this bug was not discovered earlier when we introduced the 5 second limit. This commit addresses the issue by adding an escape hatch to `pkg/time`, allowing the creation of a timer which is not subject to `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
The `AllocateNextWithExpiration` [1] function is used to allocate an IP via API from the CNI plugin. Such IPs are always allocated with an expiration timer, which means that if the CNI ADD fails later on and is never retried, the IP is automatically released. Only once an endpoint is created do we stop the expiration timer during the endpoint creation request [2], making the allocation of the IP permanent until it is explicitly freed.

The current expiration implementation however has a bug: instead of releasing the IP back into the IPAM pool from which the IP was actually allocated, we forwarded the desired pool, which can be empty and is later overwritten with the actual pool. Because we passed an empty pool into `StartExpirationTimer`, this led to IP expiration being broken in almost all cases:

```
2023-11-24T06:24:37.089657953Z level=warning msg="Unable to release IP after expiration" error="no IPAM pool provided for IP release of 10.0.1.41" ip=10.0.1.41 subsys=ipam uuid=2320c5c1-b4c0-4a2e-8f3d-2b906330ab55
```

This commit fixes that by using the realized pool (from the result) instead of the desired pool from the request. In addition, the unit tests are adapted to cover this case to ensure we don't regress.

[1] https://github.com/cilium/cilium/blob/0fcd1c86e347b2701880c9034e7ea3a74cd6b13e/daemon/cmd/ipam.go#L46
[2] https://github.com/cilium/cilium/blob/95a7d1288d5a13a5a216dcdb09383f1f483e5ac1/daemon/cmd/endpoint.go#L536

Reported-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Force-pushed from 87eac13 to d40a3af.
Pushed another commit which hopefully should fix the premature IP release in Ginkgo CI:
/test
Mirroring some thoughts from the community meeting regarding the newly introduced
As I understand it the max delay is currently global, so I assume it's not possible to specify separate timeouts for different operations? It might be worth introducing a second option for longer-running operations (such as endpoint regeneration). But then again I think this would complicate things in
The advantage of the current change is that it is fairly minimal and (mostly) contained to
Nice find(s)!
As I understand it the max delay is currently global, so I assume it's not possible to specify separate timeouts for different operations? It might be worth introducing a second option for longer-running operations (such as endpoint regeneration).
My main concern here is that the whole point of the `pkg/time` changes is to ensure that all timers are triggered and the corresponding code is executed within regular CI test times. If we introduce a longer period and that period does not elapse within a single agent runtime, we may just be excluding this timer (and its related cleanup code) from testing. That then increases the likelihood of timer-related bugs in this feature.
One of my goals with `pkg/time` is also to trigger these discussions, so that we can primarily rely on non-timer mechanisms for signalling, with the hope that this increases the reliability of both prod and test code. I've opened a thread to explore these possibilities below.
In the meantime, as an escape hatch, I don't necessarily mind this, but I feel like the existence of `NewTimerWithoutMaxDelay` suggests that there's unresolved tech debt. For a bugfix we'd want a targeted fix and not necessarily more invasive refactoring, but for main it would be nice to explore possibilities to improve the code. I wouldn't want ongoing contributions to the tree to increasingly use this new function, as that would slowly undermine the testing benefits of `pkg/time`.
Something to note is that the
(no need to block merging on my discussion above, it's more forward-looking. We can move it to a dedicated issue if that's easier)
[Warning, a bit of a philosophical response] I'm not sure I agree. Using timeouts to determine that a communication peer is dead (which I would argue is how the timeout is used here) is a rather fundamental building block in distributed systems. Now, sometimes there are better options: e.g. if both peers are running on the same machine, we can rely on the operating system to tell us that the peer process has died, which could be an option here (see the last part of #29443 (comment)). But in other cases, the peers a cilium-agent process talks to are going to be remote, and then I think there is no alternative to timeouts. Even consensus protocols (at least to my knowledge) fundamentally rely on them.

Going back to

So instead of
👍 I appreciate your argument, and I agree with you that there are good reasons for timeouts. My statement about tech debt was ignoring an important set of use cases. Given the architectural background to the design behind this particular timer, I think it does make sense to retain it unless we make the significant further changes you describe.
This is an interesting question and lends itself to further philosophy ;) I agree that 'one size fits all' won't fit all use cases. However, I think it's worth evaluating every operation that we think takes more than 5s and reasoning about why that is. That time scale is an incredibly long time on a modern CPU; perhaps trillions of instructions can be executed. Obviously if we're waiting on network operations and external systems that's one thing, but if there's horribly inefficient code inside Cilium that we're waiting on, then that is something we should catch and avoid. Though probably the more common case is "we expect something to happen within a few seconds or so, so let's just wait 10m and then clean up".

If we take this specific example into consideration, for argument's sake consider that an endpoint creation should complete successfully within 1 second. When should we consider the endpoint creation attempt unsuccessful and release the IP? After 5x the ETA? 10x the ETA? 100x the ETA? 600x the ETA? (Current default: 10m; even if we think an average endpoint creation takes 6 seconds of Cilium's time, we're giving it enough time to handle 100 creations by the time we release the IP.) I think there's probably a statistics answer out there involving the probability of completion over time, with an asymptote approaching forever, which could give us a sense of the confidence we have that the operation failed. But let's say in production we stick with ~100x by default; then in testing maybe we lower that to ~10x? Maybe if we're basing these timer calculations on explicit "expected timeframes", then we can also pick up on problems where the actual time things take doesn't match the programmer's expectations.
Good point, I added this as an item to #28844. This would make for an interesting further investigation in time. 🕵️

As a separate thought here, perhaps instead of using timers, all cases where the timeout is relying on an external system should just use
The short/medium/long idea occurred to me as well. There are a few things I don't particularly like about it:
Arguably, improvements to (2) might be to (a) implement more system testing ( unit < system < e2e ) to validate the logic beyond the timer, and/or (b) introduce dedicated soak testing to run a bunch of logic and ensure the agent is running for a longer time and will trigger all of the timers. However, the
I'd be more open to scaling the time values, but point (2) above still plays a significant role. If I think about the background issue that inspired the

The other thought I had was what if instead we have something like
Thanks for the response Joe! I think your point about use-case driven timer classes is an interesting one. I particularly like it because it forces the caller to think about what kind of operation they are waiting for (which hopefully also nudges them into thinking about potential alternative notification mechanisms).
No, we can just use a regular timer. See #29443 (comment). I'll do the backport!