-
Notifications
You must be signed in to change notification settings - Fork 3.4k
ipcache: Fix metadata access from CIDR allocation #21565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ipcache: Fix metadata access from CIDR allocation #21565
Conversation
The locking order for metadata -> IPCache should be first grabbing the metadata lock then the IPCache lock, according to the documentation in the metadata structure. Correct this lock ordering to conform to the documented pattern. This sort of improper lock ordering could theoretically cause a deadlock if for instance the label injector ran at the same time as this function is called. Found by inspection. Fixes: 40e13ea ("ipcache: Fix race in identity/ipcache release") Signed-off-by: Joe Stringer <joe@cilium.io>
/test |
Marking for v1.11 backport so that we can include it in #21564. |
/test Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the commit description, I think the changes seem straight forward. Any chance a unit test can be added/extended in order to prevent such regressions in future? I'm looking at ipcache/metadata_test.go: TestRemoveLabelsFromIPs
.
👍 reasonable ask in general, I'll have to think about it a bit more. The trick is that we're looking for a lock race condition, so depending on when the goroutines are scheduled it may not trigger. This sort of test is maybe more appropriate for a stress test that runs a bunch of logic constantly for some period. AFAIK we don't really have test infrastructure for that sort of issue currently. For what it's worth, I did try a relatively naive test like this, but even with
|
Hit #21120, otherwise CI is passing. Reviews are in, merging. |
Is there any scenario where |
@gandro you're right, it's not safe. |
@joestringer shouldn't the same fix be applied to |
As [previously discussed][1], the `ipc.metadata.get` method is not safe, as it returns a reference to a map which could be mutated, without holding on to the lock protecting that map. This commit therefore removes that method and ensures that the caller holds on to the lock while the data is being read. The `ToLabels` method creates a copy of the data, so it is safe to be read after the metadata lock has been released. [1]: cilium#21565 (comment) Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@codablock are you looking for #20721 or something more than that? The change here relates to the conflict between the ipcache overall lock and the ipcache metadata lock. The ipcache metadata lock is not used during release. EDIT: Ah I see more detailed analysis in #21596, we can follow up there. |
@joestringer What I wrote was just a wild guess before I did the detailed analysis of the stack traces, so you can ignore my comment in this PR. |
As [previously discussed][1], the `ipc.metadata.get` method is not safe, as it returns a reference to a map which could be mutated, without holding on to the lock protecting that map. This commit therefore removes that method and ensures that the caller holds on to the lock while the data is being read. The `ToLabels` method creates a copy of the data, so it is safe to be read after the metadata lock has been released. [1]: #21565 (comment) Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Given this doesn't backport cleanly, I've removed the labels and assigned myself. When the other subsequent fixes are ready I can prepare the backports (or collaborate with Chris on the 1.11 variation) |
[ upstream commit 820e348 ] As [previously discussed][1], the `ipc.metadata.get` method is not safe, as it returns a reference to a map which could be mutated, without holding on to the lock protecting that map. This commit therefore removes that method and ensures that the caller holds on to the lock while the data is being read. The `ToLabels` method creates a copy of the data, so it is safe to be read after the metadata lock has been released. [1]: cilium#21565 (comment) Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 820e348 ] As [previously discussed][1], the `ipc.metadata.get` method is not safe, as it returns a reference to a map which could be mutated, without holding on to the lock protecting that map. This commit therefore removes that method and ensures that the caller holds on to the lock while the data is being read. The `ToLabels` method creates a copy of the data, so it is safe to be read after the metadata lock has been released. [1]: #21565 (comment) Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
The locking order for metadata -> IPCache should be first grabbing the
metadata lock then the IPCache lock, according to the documentation in
the metadata structure. Correct this lock ordering to conform to the
documented pattern.
This sort of improper lock ordering could theoretically cause a deadlock
if for instance the label injector ran at the same time as this function
is called.
Found by inspection.
Fixes: 40e13ea ("ipcache: Fix race in identity/ipcache release")
Fixes: #20721