lbipam: Fix deletion of multi-block IP pool #40013
Merged: dylandreimerink merged 2 commits into cilium:main from pippolo84:pr/pippolo84/fix-lbipam-pool-multi-cidr-del on Jun 13, 2025.
Conversation
Force-pushed from 1273a30 to 8d4743c.
/test
Whenever a pool is deleted, the operator deletes all LB ranges of that pool and then strips off each address allocation related to the removed ranges. While doing so, it attempts to immediately reconcile each service that lost an address, trying to allocate another address from a live range. The reason to do so, instead of running a full reconciliation, is efficiency: since each k8s object describing a service that lost an allocation must be patched anyway, reallocating another address in a later pass would lead to an additional update immediately after.

However, trying to immediately reconcile a service while deleting the ranges can lead to allocation issues. In the case of a multi-range IP pool, if we try to reconcile a service while deleting the first range, we might end up allocating an IP from another range that belongs to the same pool, leading to a further removal and reconciliation in the next iteration. Moreover, the internal LBIPAM manager state is not consistent during the loop: the pool has already been deleted, but its ranges are being deleted one at a time. A reconciliation attempt in the middle of the range deletion might therefore allocate an IP from a range with a non-existent parent pool, leading to spurious warnings like the following:

`Bad state detected, store contains lbRange for pool 'pool-a' but missing the pool`

To solve this, the commit breaks the current deleteRangeAllocations in two parts. In the first, we loop over the ranges to remove each of them along with any allocation for an address contained in the range; each modified service is only collected and returned. Then, after deleting the removed pool from the LBIPAM internal state, we finally reconcile each modified service, trying to reallocate an address immediately to avoid the double update.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
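As a rough illustration of the two-phase approach described above, here is a self-contained toy sketch; all type and function names are hypothetical, not Cilium's actual ones:

```go
package main

import "fmt"

type lbRange struct{ cidr string }
type service struct{ name string }

type manager struct {
	poolToRanges map[string][]*lbRange
	allocations  map[*lbRange][]*service
}

// Phase 1: strip allocations range by range, collecting (but not yet
// reconciling) every service that lost an address.
func (m *manager) deleteRangeAllocations(pool string) []*service {
	var modified []*service
	for _, r := range m.poolToRanges[pool] {
		modified = append(modified, m.allocations[r]...)
		delete(m.allocations, r)
	}
	return modified
}

// Phase 2: remove the pool from the internal state first, then reconcile,
// so a replacement address can never come from a range of the pool that
// is being deleted.
func (m *manager) handlePoolDeleted(pool string) {
	modified := m.deleteRangeAllocations(pool)
	delete(m.poolToRanges, pool)
	for _, svc := range modified {
		// One immediate reallocation means a single k8s update per service.
		fmt.Printf("reconciling %s from a surviving pool\n", svc.name)
	}
}

func main() {
	r1, r2 := &lbRange{cidr: "10.0.10.0/24"}, &lbRange{cidr: "10.0.11.0/24"}
	m := &manager{
		poolToRanges: map[string][]*lbRange{"pool-a": {r1, r2}},
		allocations:  map[*lbRange][]*service{r1: {{name: "svc-1"}}},
	}
	m.handlePoolDeleted("pool-a")
}
```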
When handling a pool deletion event, the LBIPAM manager loops over the removed pool's ranges to strip off each address allocation with an IP contained in one of the ranges. While doing that, it also updates the poolToRanges internal map, removing the range it is iterating on from the slice keyed by the deleted pool name. Removing items from a slice while iterating over it with a for range loop is incorrect, since the range expression is evaluated just once at the loop start. For a pool with multiple ranges, this may lead to the following panic:

```
$ go test -trimpath -v ./operator/pkg/lbipam/... -run="TestPoolDelete"
=== RUN   TestPoolDelete
--- FAIL: TestPoolDelete (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x2db94b8]

goroutine 98 [running]:
testing.tRunner.func1.2({0x3203820, 0x61279f0})
	testing/testing.go:1734 +0x21c
testing.tRunner.func1()
	testing/testing.go:1737 +0x35e
panic({0x3203820?, 0x61279f0?})
	runtime/panic.go:792 +0x132
github.com/cilium/cilium/operator/pkg/lbipam.(*rangesStore).Delete(0xc0003b0bf8, 0x0)
	github.com/cilium/cilium/operator/pkg/lbipam/range_store.go:34 +0x1d8
github.com/cilium/cilium/operator/pkg/lbipam.(*LBIPAM).handlePoolDeleted(0xc0003b0b60, {0x3d66d08, 0xc000653540}, 0xc0006e6b00)
	github.com/cilium/cilium/operator/pkg/lbipam/lbipam.go:1763 +0x1df
github.com/cilium/cilium/operator/pkg/lbipam.(*LBIPAM).poolOnDelete(0xc0003b0b60, {0x3d66d08, 0xc000653540}, 0xc0006e6b00)
	github.com/cilium/cilium/operator/pkg/lbipam/lbipam.go:345 +0xe7
github.com/cilium/cilium/operator/pkg/lbipam.(*LBIPAM).handlePoolEvent(0xc0003b0b60, {0x3d66d08?, 0xc000653540?}, {{0x38bfe05, 0x6}, {{0xc00051e6d8, 0x6}, {0x0, 0x0}}, 0xc0006e6b00, ...})
	github.com/cilium/cilium/operator/pkg/lbipam/lbipam.go:262 +0x87
github.com/cilium/cilium/operator/pkg/lbipam.(*newFixture).DeletePool(0xc00008bef0, 0xc0006836c0, 0xc0006e6b00)
	github.com/cilium/cilium/operator/pkg/lbipam/lbipam_fixture_test.go:192 +0x1a5
github.com/cilium/cilium/operator/pkg/lbipam.TestPoolDelete(0xc0006836c0)
	github.com/cilium/cilium/operator/pkg/lbipam/lbipam_test.go:1814 +0x385
testing.tRunner(0xc0006836c0, 0x3a35298)
	testing/testing.go:1792 +0xf4
created by testing.(*T).Run in goroutine 1
	testing/testing.go:1851 +0x413
FAIL	github.com/cilium/cilium/operator/pkg/lbipam	0.031s
FAIL
```

The commit extends the TestDisablePool unit test to have multiple ranges in the fixture pools and then fixes the issue by shallow-copying the ranges slice before starting the iteration.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
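The pitfall and the fix can be reproduced in isolation. Below is a minimal, self-contained sketch; the store type and data are hypothetical, and in this toy the bug surfaces as stale reads rather than the nil pointer dereference above:

```go
package main

import "fmt"

type store struct{ poolToRanges map[string][]string }

// remove deletes a single range from the pool's slice, mutating the very
// map entry the caller may currently be ranging over.
func (s *store) remove(pool, rng string) {
	ranges := s.poolToRanges[pool]
	for i, r := range ranges {
		if r == rng {
			s.poolToRanges[pool] = append(ranges[:i], ranges[i+1:]...)
			return
		}
	}
}

func main() {
	s := &store{poolToRanges: map[string][]string{
		"pool-a": {"10.0.10.0/24", "10.0.11.0/24"},
	}}

	// Buggy: the range expression is evaluated once, so after remove()
	// shrinks the backing slice, later iterations read stale, shifted
	// elements:
	//
	//   for _, r := range s.poolToRanges["pool-a"] { s.remove("pool-a", r) }
	//
	// Fixed: shallow-copy the slice first, then iterate over the copy.
	ranges := make([]string, len(s.poolToRanges["pool-a"]))
	copy(ranges, s.poolToRanges["pool-a"])
	for _, r := range ranges {
		s.remove("pool-a", r)
	}
	fmt.Println(s.poolToRanges["pool-a"]) // []
}
```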
Force-pushed from 8d4743c to a1d6cc0.
dylandreimerink approved these changes on Jun 13, 2025.
👍 thank you for the detailed explanation of the issue and its fix
/test
ci-clustermesh hit #39855, rerunning
Labels
area/operator: Impacts the cilium-operator component.
backport-done/1.15: The backport for Cilium 1.15.x for this PR is done.
backport-done/1.16: The backport for Cilium 1.16.x for this PR is done.
backport-done/1.17: The backport for Cilium 1.17.x for this PR is done.
feature/lb-ipam
kind/bug: This is a bug in the Cilium logic.
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/bug: This PR fixes an issue in a previous release of Cilium.
Currently, the operator panics when a multi-block CiliumLoadBalancerIPPool is deleted. To reproduce it, run Cilium in a local kind cluster, then apply and delete the following pool:
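A minimal multi-block pool that exercises this path might look like the following sketch; the pool name and CIDRs are hypothetical, and the cilium.io/v2alpha1 API with spec.blocks is assumed:

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: pool-a
spec:
  blocks:
    # Two blocks make this a multi-block pool, which is what triggers
    # the panic on deletion.
    - cidr: 10.0.10.0/24
    - cidr: 10.0.11.0/24
```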
The PR fixes the LBIPAM reconciliation following the deletion of a CiliumLoadBalancerIPPool with multiple IP blocks. To do so, it rewrites the reconciliation to first collect the services that lost an address while the pool ranges are deleted and, only after the pool has been removed from the internal state, reconcile those services (see the first commit message above).
Please review commit by commit.