Fix regression to avoid freeing alive IPs #10207

Merged (4 commits) on Feb 18, 2020

tgraf
Member

@tgraf tgraf commented Feb 17, 2020

This is a replacement fix for ab61853, which turned out to be racy and could
lead to alive PodIPs being freed. Instead of attempting to reconstruct the
context for a potential release in CNI DEL for every failure scenario,
introduce an expiration timer that is started whenever an IP is allocated via
the CNI plugin. The endpoint creation that follows the allocation must
explicitly stop the expiration timer; otherwise the IP is released again after
a timeout. This ensures that IPs are always either used or released, without
requiring any cleanup from the CNI plugin itself.

In the event that an IP is released due to expiration and the endpoint
creation request for it still arrives afterwards, the endpoint creation will
fail, triggering a re-creation of the endpoint.

Fixes: #10065
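
The allocate, confirm, or expire lifecycle described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `pool` and `entry` types and the `allocate`, `confirm`, `expire`, and `inUse` methods are hypothetical names chosen for the example, and IPs are plain strings.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// entry tracks one allocated IP. All names here are hypothetical.
type entry struct {
	timer     *time.Timer // pending expiration timer
	confirmed bool        // set once endpoint creation has claimed the IP
}

// pool is a toy IP pool, not Cilium's IPAM implementation.
type pool struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func newPool() *pool { return &pool{entries: map[string]*entry{}} }

// allocate marks ip as used and arms a timer that releases it again
// unless confirm is called before the timeout elapses.
func (p *pool) allocate(ip string, timeout time.Duration) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, used := p.entries[ip]; used {
		return errors.New("ip already allocated: " + ip)
	}
	e := &entry{}
	e.timer = time.AfterFunc(timeout, func() { p.expire(ip, e) })
	p.entries[ip] = e
	return nil
}

// expire releases ip if e is still its current, unconfirmed allocation.
func (p *pool) expire(ip string, e *entry) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.entries[ip] == e && !e.confirmed {
		delete(p.entries, ip)
	}
}

// confirm is what endpoint creation would call: it stops the timer and
// fails if the IP has already been released by expiration.
func (p *pool) confirm(ip string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	e, ok := p.entries[ip]
	if !ok {
		return errors.New("ip expired or never allocated: " + ip)
	}
	e.timer.Stop()
	e.confirmed = true
	return nil
}

// inUse reports whether ip is currently allocated.
func (p *pool) inUse(ip string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.entries[ip]
	return ok
}

func main() {
	p := newPool()

	// Happy path: endpoint creation confirms the IP in time.
	_ = p.allocate("10.0.0.1", time.Minute)
	fmt.Println("confirm 10.0.0.1:", p.confirm("10.0.0.1"))

	// Failure path: nobody confirms, so the IP is released on timeout
	// and a late endpoint creation is rejected.
	_ = p.allocate("10.0.0.2", 10*time.Millisecond)
	time.Sleep(100 * time.Millisecond)
	fmt.Println("confirm 10.0.0.2:", p.confirm("10.0.0.2"))
	fmt.Println("10.0.0.2 in use:", p.inUse("10.0.0.2"))
}
```

In this sketch the failing `confirm` call stands in for the failing endpoint creation: the caller learns that its IP is gone and has to start over with a fresh allocation.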



@tgraf tgraf added the kind/bug, wip, release-note/bug, and kind/regression labels Feb 17, 2020
@tgraf tgraf requested a review from a team February 17, 2020 12:36
@tgraf tgraf requested review from a team as code owners February 17, 2020 12:36
@tgraf tgraf requested a review from a team February 17, 2020 12:36
@tgraf
Member Author

tgraf commented Feb 17, 2020

test-me-please

EDIT: All relevant tests passed

@coveralls

coveralls commented Feb 17, 2020

Coverage increased (+0.04%) to 44.494% when pulling 442323c on pr/tgraf/fix-ip-release into 9f492a1 on master.

@aanm aanm changed the title from "Fix regression to avoid freeing alive Ps" to "Fix regression to avoid freeing alive IPs" Feb 17, 2020
Introduce the ability to mark an IP to be expired after a certain time unless
usage is confirmed in a later step. Use a UUID to protect against IP reuse
while an expiration timer is still active.

Updates: #10065

Signed-off-by: Thomas Graf <thomas@cilium.io>
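
One way to read the UUID protection in the commit message above: every allocation gets a fresh identifier, and both the expiration and the later confirmation must present it. The sketch below illustrates that idea only; it is not the code from this commit. The `allocator` type, its methods, and `newToken` are hypothetical, a random 128-bit hex string stands in for a real UUID, and timers are left out so that expiry is modelled by calling `expire` directly.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// newToken returns a random 128-bit identifier, standing in for a UUID.
func newToken() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// allocator is a hypothetical toy, not Cilium's IPAM implementation.
type allocator struct {
	mu      sync.Mutex
	used    map[string]bool   // IPs currently allocated
	pending map[string]string // IP -> token of the allocation awaiting confirmation
}

func newAllocator() *allocator {
	return &allocator{used: map[string]bool{}, pending: map[string]string{}}
}

// allocate reserves ip and returns the token identifying this allocation.
func (a *allocator) allocate(ip string) (string, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.used[ip] {
		return "", errors.New("ip already allocated: " + ip)
	}
	token := newToken()
	a.used[ip] = true
	a.pending[ip] = token
	return token, nil
}

// expire is what a timer callback would invoke. It releases ip only if
// token still identifies the pending allocation, and reports whether it did.
func (a *allocator) expire(ip, token string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	cur, ok := a.pending[ip]
	if !ok || cur != token {
		return false // stale timer: the IP was confirmed or reallocated
	}
	delete(a.pending, ip)
	delete(a.used, ip)
	return true
}

// stopExpiration confirms usage of ip for the allocation named by token.
func (a *allocator) stopExpiration(ip, token string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	cur, ok := a.pending[ip]
	if !ok || cur != token {
		return errors.New("no pending expiration matches token for " + ip)
	}
	delete(a.pending, ip)
	return nil
}

func main() {
	a := newAllocator()

	old, _ := a.allocate("10.0.0.5")
	fmt.Println("first allocation expired:", a.expire("10.0.0.5", old))

	// The same IP is handed out again with a new token.
	cur, _ := a.allocate("10.0.0.5")

	// Anything still holding the old token can no longer affect the IP.
	fmt.Println("stale expire released IP:", a.expire("10.0.0.5", old))
	fmt.Println("stale stop:", a.stopExpiration("10.0.0.5", old))
	fmt.Println("current stop:", a.stopExpiration("10.0.0.5", cur))
}
```

Comparing tokens rather than IPs is the point of the design: an IP can be reused by a new owner, whereas the token names exactly one allocation.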
Adds a helper function to allocate required addresses and start the expiration
timer for all allocated IPs.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
This is a replacement fix for ab61853, which turned out to be racy and could
lead to alive PodIPs being freed. Instead of attempting to reconstruct the
context for a potential release in CNI DEL for every failure scenario,
introduce an expiration timer that is started whenever an IP is allocated via
the CNI plugin. The endpoint creation that follows the allocation must
explicitly stop the expiration timer; otherwise the IP is released again after
a timeout. This ensures that IPs are always either used or released, without
requiring any cleanup from the CNI plugin itself.

In the event that an IP is released due to expiration and the endpoint
creation request for it still arrives afterwards, the endpoint creation will
fail, triggering a re-creation of the endpoint.

Fixes: #10065
Fixes: ab61853 ("cni: Release IP even when endpoint deletion fails")
Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/fix-ip-release branch from bb8a074 to 442323c February 17, 2020 16:16
@tgraf
Member Author

tgraf commented Feb 17, 2020

test-me-please

@tgraf tgraf added pending-review and removed wip labels Feb 17, 2020
@tgraf tgraf merged commit 82a8c71 into master Feb 18, 2020
@tgraf tgraf deleted the pr/tgraf/fix-ip-release branch February 18, 2020 08:17
@vadorovsky
Member

I'm trying to backport this one, but I think we also need #8016, since it depends on the Allocator interface.

@vadorovsky
Member

Probably also needs:

#8146
#9102

@joestringer
Member

Due to the complexity of backporting, I nominate dropping the 1.5 backport for this PR.

Users who are affected can update to a recent v1.6.x or v1.7.x release to address this issue.

Linked issue: Alive IP can be released if CNI DELETE is called with stale result