
Conversation

gandro
Member

@gandro gandro commented Nov 28, 2023

The AllocateNextWithExpiration function is used to allocate an IP via the API from the CNI plugin. Such IPs are always allocated with an expiration timer, which means that if the CNI ADD fails later on and is never retried, the IP is automatically released. Only once an endpoint is created do we stop the expiration timer during the endpoint creation request, making the allocation of the IP permanent until it is explicitly freed.

The current expiration implementation, however, has a bug: instead of releasing the IP back into the IPAM pool from which the IP was actually allocated, we forwarded the desired pool, which can be empty and is later overwritten with the actual pool.

Because we passed an empty pool into StartExpirationTimer, this led to IP expiration being broken in almost all cases:

```
2023-11-24T06:24:37.089657953Z level=warning msg="Unable to release IP after expiration" error="no IPAM pool provided for IP release of 10.0.1.41" ip=10.0.1.41 subsys=ipam uuid=2320c5c1-b4c0-4a2e-8f3d-2b906330ab55
```

This commit fixes that by using the realized pool (from the result) instead of the desired pool from the request. In addition, the unit tests are also adapted to cover this case to ensure we don't regress.

Reported-by: Yutaro Hayakawa yutaro.hayakawa@isovalent.com

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. sig/ipam area/ipam IP address management, including cloud IPAM labels Nov 28, 2023
@gandro gandro requested a review from a team as a code owner November 28, 2023 12:07
@gandro gandro requested a review from tommyp1ckles November 28, 2023 12:07
@gandro
Member Author

gandro commented Nov 28, 2023

/test

@gandro gandro force-pushed the pr/gandro/ipam-fix-expiration branch from aa68241 to 87eac13 Compare November 28, 2023 16:08
@gandro
Member Author

gandro commented Nov 28, 2023

@gandro
Member Author

gandro commented Nov 29, 2023

/ci-ginkgo

@gandro
Member Author

gandro commented Nov 29, 2023

Somehow, CI ginkgo is now failing suspiciously often with:

```
2023-11-29T10:03:56.967849326Z level=error msg="endpoint regeneration failed" ciliumEndpointName=kube-system/coredns-5d78c9869d-ct7ss containerID=2e1ea172f1 containerInterface= datapathPolicyRevision=1 desiredPolicyRevision=1 endpointID=2 error="Error while configuring proxy redirects: context cancelled before waiting for proxy updates: context canceled" identity=49024 ipv4=10.0.1.206 ipv6="fd02::1ae" k8sPodName=kube-system/coredns-5d78c9869d-ct7ss subsys=endpoint
```

Edit:

It seems this bugfix uncovered another problem in CI: due to #27253 we're now expiring IP leases after 5 seconds instead of 10 minutes in CI. That is probably too short a period between CNI ADD and the first endpoint regeneration for endpoint creation to succeed.

```
2023-11-29T10:03:49.395858492Z level=debug msg="Allocated random IP" ip="fd02::1ae" owner=kube-system/coredns-5d78c9869d-ct7ss pool=default subsys=ipam
2023-11-29T10:03:49.429032753Z level=info msg="New endpoint" ciliumEndpointName=kube-system/coredns-5d78c9869d-ct7ss containerID=2e1ea172f1 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2 ipv4=10.0.1.206 ipv6="fd02::1ae" k8sPodName=kube-system/coredns-5d78c9869d-ct7ss subsys=endpoint
2023-11-29T10:03:49.429076123Z level=info msg="Waiting for endpoint to be generated" ciliumEndpointName=kube-system/coredns-5d78c9869d-ct7ss containerID=2e1ea172f1 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=2 ipv4=10.0.1.206 ipv6="fd02::1ae" k8sPodName=kube-system/coredns-5d78c9869d-ct7ss subsys=endpoint
2023-11-29T10:03:54.397425361Z level=debug msg="Released IP" ip="fd02::1ae" owner=kube-system/coredns-5d78c9869d-ct7ss subsys=ipam
2023-11-29T10:03:54.397601610Z level=warning msg="Released IP after expiration" ip="fd02::1ae" subsys=ipam uuid=41b5583d-b813-4ebd-bb31-1fe3f9013b9d
2023-11-29T10:03:56.967788703Z level=info msg="Rewrote endpoint BPF program" ciliumEndpointName=kube-system/coredns-5d78c9869d-ct7ss containerID=2e1ea172f1 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2 identity=49024 ipv4=10.0.1.206 ipv6="fd02::1ae" k8sPodName=kube-system/coredns-5d78c9869d-ct7ss subsys=endpoint
```

This commit addresses two problems with the IPAM expiration timer:

1. Before this commit, each timer consisted of a goroutine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant that for every IP allocated via
   CNI ADD, we had a goroutine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer
   had been stopped. This commit improves that situation by having the
   expiration goroutine wake up and exit early if it is stopped (either
   via IP release or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see cilium#27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10
   minutes. 5 seconds, however, is not enough for an endpoint to be
   created via CNI ADD and complete its first endpoint regeneration.
   This therefore led to endpoint IPs being released while the endpoint
   was still being created. Due to another bug (fixed in the next
   commit), the expiration timer failed to actually release the IP,
   which is why this issue was not discovered earlier when we
   introduced the 5 second limit. This commit addresses the issue by
   adding an escape hatch to `pkg/time`, allowing the creation of a
   timer which is not subject to `max-internal-timer-delay`.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
The `AllocateNextWithExpiration`[1] function is used to allocate an IP
via the API from the CNI plugin. Such IPs are always allocated with an
expiration timer, which means that if the CNI ADD fails later on and is
never retried, the IP is automatically released. Only once an endpoint
is created do we stop the expiration timer during the endpoint creation
request [2], making the allocation of the IP permanent until it is
explicitly freed.

The current expiration implementation, however, has a bug: instead of
releasing the IP back into the IPAM pool from which the IP was actually
allocated, we forwarded the desired pool, which can be empty and is
later overwritten with the actual pool.

Because we passed an empty pool into `StartExpirationTimer`, this led
to IP expiration being broken in almost all cases:

```
2023-11-24T06:24:37.089657953Z level=warning msg="Unable to release IP after expiration" error="no IPAM pool provided for IP release of 10.0.1.41" ip=10.0.1.41 subsys=ipam uuid=2320c5c1-b4c0-4a2e-8f3d-2b906330ab55
```

This commit fixes that by using the realized pool (from the result)
instead of the desired pool from the request. In addition, the unit
tests are also adapted to cover this case to ensure we don't regress.

[1] https://github.com/cilium/cilium/blob/0fcd1c86e347b2701880c9034e7ea3a74cd6b13e/daemon/cmd/ipam.go#L46
[2] https://github.com/cilium/cilium/blob/95a7d1288d5a13a5a216dcdb09383f1f483e5ac1/daemon/cmd/endpoint.go#L536

Reported-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/ipam-fix-expiration branch from 87eac13 to d40a3af Compare November 29, 2023 13:56
@gandro gandro requested a review from a team as a code owner November 29, 2023 13:56
@gandro
Member Author

gandro commented Nov 29, 2023

Pushed another commit which hopefully should fix the premature IP release in Ginkgo CI:

1. Before this commit, each timer consisted of a goroutine calling
   `time.Sleep` to wait for expiration to occur. The default expiration
   timeout is 10 minutes. This meant that for every IP allocated via
   CNI ADD, we had a goroutine unconditionally sleeping for 10 minutes,
   only to (in most cases) wake up and learn that the expiration timer
   had been stopped. This commit improves that situation by having the
   expiration goroutine wake up and exit early if it is stopped (either
   via IP release or `StopExpirationTimer`).

2. In CI, we set the hidden `max-internal-timer-delay` option to 5
   seconds (see cilium/cilium#27253). This meant that the `time.Sleep`
   expiration timer would effectively be 5 seconds instead of 10
   minutes. 5 seconds, however, is not enough for an endpoint to be
   created via CNI ADD and complete its first endpoint regeneration.
   This therefore led to endpoint IPs being released while the endpoint
   was still being created. Due to another bug (fixed in the next
   commit), the expiration timer failed to actually release the IP,
   which is why this issue was not discovered earlier when we
   introduced the 5 second limit. This commit addresses the issue by
   adding an escape hatch to `pkg/time`, allowing the creation of a
   timer which is not subject to `max-internal-timer-delay`.

@gandro
Member Author

gandro commented Nov 29, 2023

/test

@gandro
Member Author

gandro commented Nov 29, 2023

Mirroring some thoughts from the community meeting regarding the newly introduced pkg/time.NewTimerWithoutMaxDelay:

  • This now simply ignores the CI max delay. While the current 5 seconds are clearly too short, a discussion point is whether it makes sense to introduce a longer alternative delay for known long-running operations. For example, 1 minute might be a reasonable CI value for endpoint regeneration here.
  • Does the current lease-based mechanism even make sense? An alternative option would be for the IPAM subsystem to periodically run garbage collection on IPs that are not backed by any endpoint and reclaim "unused" IPs that way. But that would require some more fundamental restructuring, which is not something we'd want to backport.

@tklauser tklauser self-requested a review November 29, 2023 16:38
@tklauser
Member

Mirroring some thoughts from the community meeting regarding the newly introduced pkg/time.NewTimerWithoutMaxDelay:

  • This now simply ignores the CI max delay. While the current 5 seconds are clearly too short, a discussion point is whether it makes sense to introduce a longer alternative delay for known long-running operations. For example, 1 minute might be a reasonable CI value for endpoint regeneration here.

As I understand it, the max delay is currently global, so I assume it's not possible to specify separate timeouts for different operations? It might be worth introducing a second option for longer-running operations (such as endpoint regeneration). But then again, I think this would complicate things in pkg/time, i.e. how does it decide which max delay to apply without changing the API surface too much? IMO the current solution is fine for this PR and we can defer potential improvements to the pkg/time package to a future PR.

  • Does the current lease-based mechanism even make sense? An alternative option would be for the IPAM subsystem to periodically run garbage collection on IPs that are not backed by any endpoint and reclaim "unused" IPs that way. But that would require some more fundamental restructuring, which is not something we'd want to backport.

The advantage of the current change is that it is fairly minimal and (mostly) contained to (*IPAM).StopExpirationTimer and (*IPAM).releaseIPLocked. In terms of being able to backport the change, I think this is fine, and as with the previous point, we can defer thinking about a potentially better mechanism to a follow-up.

Member

@tklauser tklauser left a comment


Nice find(s)!

Member

@joestringer joestringer left a comment


As I understand it the max delay is currently global, so I assume it's not possible to specify separate timeouts for different operations? Might be worth to introduce a second option for longer-running operations (such as endpoint regeneration).

My main concern here is that the whole point of the pkg/time changes is to ensure that all timers are triggered and the corresponding code is executed within regular CI test times. If we introduce a longer period and that period does not execute within a single agent runtime, we may just be excluding this timer (and related cleanup code) from the testing. That then increases the likelihood of timer-related bugs relating to this feature.

One of my goals with pkg/time is also to trigger these discussions so that we can primarily rely on non-timer mechanisms for signalling with the hope that this increases reliability of both prod and test code. I've opened a thread to explore these possibilities below.

In the meantime, as an escape hatch, I don't necessarily mind this, but I feel like the existence of NewTimerWithoutMaxDelay suggests that there's unresolved tech debt. For a bugfix we'd want a targeted fix and not necessarily more invasive refactoring, but for main it'd be nice to explore possibilities to improve the code. I wouldn't want ongoing contributions to the tree to increasingly use this new function, as this would slowly undermine the testing benefits of pkg/time.

@joestringer
Member

Something to note is that the pkg/time changes are specifically trying to ensure that deferred code logic is run during CI runtimes. By moving the timer back out, we are likely reducing coverage over the affected functions. Maybe there's an additional work item there to improve the testing of this IPAM GC code.

@joestringer
Member

(no need to block merging on my discussion above, it's more forward-looking. We can move it to a dedicated issue if that's easier)

@gandro
Member Author

gandro commented Nov 30, 2023

In the meantime, as an escape hatch, I don't necessarily mind this, but I feel like the existence of NewTimerWithoutMaxDelay suggests that there's unresolved tech debt.

[Warning, a bit philosophical response]

I'm not sure I agree. Using timeouts to determine that a communication peer is dead (which I would argue is how the timeout is used here) is a rather fundamental building block in distributed systems.

Now, sometimes there are better options, e.g. if both peers are running on the same machine, we can rely on the operating system to tell us that the peer process has died, which could be an option here (see the last part of #29443 (comment)).

But in other cases, the peers a cilium-agent process talks to are going to be remote, and then I think there is no alternative to timeouts. Even consensus protocols (at least to my knowledge) fundamentally rely on them.

Going back to pkg/time: assuming we'll have to keep using timeouts to declare remote peers (or resource owners) as dead, I think the discussion then becomes: is a one-size-fits-all upper bound of 5 seconds really enough to cover all cases? Some resource owners can provide a sign of life every 5 seconds, but I think others won't. In particular, I'd be interested to see what context.WithTimeout values we're using in Cilium. This is often used in Go to declare peers as dead and is currently not covered by pkg/time. That could then inform what "max timeout" we want to accept in CI. My suspicion is that it will vary between use cases and that there is no single one-size-fits-all value.

So instead of NewTimerWithoutMaxDelay I'd advocate for something like NewTimerWith{Short,Medium,Long}MaxDelay. Or have the max delay scaled based on the input timeout, e.g. a timeout of 30 seconds becomes 10 seconds in CI, a timeout of 10 minutes becomes 1 minute in CI etc.

@gandro gandro removed the request for review from tommyp1ckles November 30, 2023 10:56
@joestringer
Member

joestringer commented Dec 1, 2023

👍 I appreciate your argument, and I agree with you that there are good reasons for timeouts. My statement about tech debt was ignoring an important set of use cases. Given the architectural background to the design behind this particular timer, I think it does make sense to retain unless we make significant further changes as you describe.

Is a one-size-fits-all upper bound of 5 seconds really enough to cover all cases?

This is an interesting question and lends itself to further philosophy ;) I agree that 'one size fits all' won't fit all use cases. However, I think it's worth evaluating every operation that we think takes more than 5s and reasoning about why that is. That time scale is an incredibly long time on a modern CPU; perhaps trillions of instructions can be executed. Obviously, waiting on network operations and external systems is one thing, but if there's horribly inefficient code inside Cilium that we're waiting on, then that is something we should catch and avoid. Though probably the more common cases are "we expect something to happen within a few seconds or so, so let's just wait 10m and then clean up".

If we take this specific example into consideration, for argument's sake, consider that an endpoint creation should complete successfully within 1 second. When should we consider the endpoint creation attempt unsuccessful and release the IP? After 5x the ETA? 10x? 100x? 600x? (Current default: 10m; even if we think an average endpoint creation takes 6 seconds of Cilium's time, we're giving it enough time to handle 100 creations by the time we release the IP.) I think there's probably a statistics answer out there involving probability of completion over time, with an asymptote approaching forever, which could give us a sense of the confidence we have that the operation failed. But let's say in production we stick with ~100x by default; then in testing maybe we lower that to ~10x? Maybe if we're basing these timer calculations on explicit "expected timeframes", we can also pick up on problems where the actual time things take doesn't match the programmer's expectations.

[context.WithTimeout()] is often used in Golang to declare peers as dead and something currently not covered by pkg/time.

Good point, I added this as an item to #28844 . This would make for an interesting further investigation in time. 🕵️

As a separate thought: perhaps instead of using timers, all cases where the timeout relies on an external system should just use context.WithTimeout()? We could then avoid having any special cases or semantics for NewTimer. Then, if we decide that context.WithTimeout should also get a similar timer override, we can evaluate the timeouts with more sensitivity while implementing that wrapper.

So instead of NewTimerWithoutMaxDelay I'd advocate for something like NewTimerWith{Short,Medium,Long}MaxDelay.

The short/medium/long idea occurred to me as well. There's a few things I don't particularly like about it:

  1. Short/Medium/Long aren't strongly tied to the underlying mechanisms that motivate the timer. The time is usually linked somehow to an expectation of another event occurring first, so if possible we should just wait on the signal that the other thing completes; if that's not possible, we should consider the expected timeframe for completion and pick a time based on that. If we then encourage using short/medium/long, we have to do another logical jump to decide which category it fits into.
  2. There's no guarantee that any of those timers are bounded within the run time of CI. If timers are related to endpoints, well then maybe we'll keep the Pods around for a few minutes during smoke tests and the timers are OK. If they're related to policies, then maybe we'll create+delete the policies within a few seconds during the individual test, and never trigger the 'eventual' code.
  3. This diverges from standard library APIs, which imposes additional burden on every developer interacting with timer logic in the agent.

Arguably, improvements to (2) might be to (a) implement more system testing ( unit < system < e2e ) to validate the logic beyond the timer, and/or (b) introduce dedicated soak testing to run a bunch of logic and ensure the agent runs for a longer time and triggers all of the timers. However, the pkg/time based testing is not intended as the primary testing device for this sort of logic; it is intended as a fallback mechanism to ensure that there is some sort of systematic testing, even if we fail to properly implement tests for deferred logic like this.

Or have the max delay scaled based on the input timeout, e.g. a timeout of 30 seconds becomes 10 seconds in CI, a timeout of 10 minutes becomes 1 minute in CI etc.

I'd be more open to scaling the time values, but point (2) above still plays a significant role. If I think about the background issue that inspired the pkg/time override, it was a 10 minute timer on startup to defer some cleanup until some other operation completed. As it turned out, the other operation completed fine (admittedly without signalling when it was done, which is arguably one of the problems), but then the logic that ran after 10 minutes was buggy. I think that case is similar to this case. I wonder also about whether there might be cases where the value is relatively small but it can't scale down. Maybe we could put lower and upper bounds and then scale the timers within that range. For example, times can be < 1 second but if the initial time is > 1s then pkg/time will only scale it down to 1s. Similarly, the initial time could be 1 hour for garbage collection but pkg/time will scale it down to a maximum of 1 minute.

The other thought I had was what if instead we have something like NewTimerWithNetworkDependency, where we document that the point behind this particular API is to specifically handle the cases where timers are relying on an external component and that we can't scale/limit those timers as aggressively as all the other timers in the agent. This would discourage its use unless the particular user really needs to make use of this property, and we could document the sorts of scenarios where developers may choose to use it.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 1, 2023
@aanm aanm enabled auto-merge December 1, 2023 09:56
@aanm aanm added this pull request to the merge queue Dec 4, 2023
Merged via the queue into cilium:main with commit cf76f89 Dec 4, 2023
@gandro
Member Author

gandro commented Dec 4, 2023

Thanks for the response, Joe! I think your point about use-case driven timer classes is an interesting one. I particularly like it because it forces the caller to think about what kind of operation they are waiting for (which hopefully also nudges them into thinking about potential alternative notification mechanisms).

@nbusseneau nbusseneau mentioned this pull request Dec 5, 2023
10 tasks
@nbusseneau nbusseneau added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels Dec 5, 2023
@nbusseneau
Member

@gandro This PR depends on the time wrapper introduced in #27253, which was not backported to 1.14, adding the backport/author label. Should we backport the time wrapper to 1.14?

@nbusseneau nbusseneau added needs-backport/1.14 backport/author The backport will be carried out by the author of the PR. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Dec 5, 2023
@gandro
Member Author

gandro commented Dec 6, 2023

@gandro This PR depends on the time wrapper introduced in #27253, which was not backported to 1.14, adding the backport/author label. Should we backport the time wrapper to 1.14?

No, we can just use a regular timer. See #29443 (comment)

I'll do the backport!

@gandro gandro added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels Dec 6, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Dec 6, 2023