-
Notifications
You must be signed in to change notification settings - Fork 3.4k
pkg/bpf: Synchronize access to map in DumpReliablyWithCallback #38590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/test |
There's a couple other places we access that directly, such as in the batch iterator: IterateAll function. |
e39de9c
to
113a14e
Compare
/test |
113a14e
to
eb0173e
Compare
/ci-runtime |
eb0173e
to
3636997
Compare
/test |
The test failures are due to a deadlock where map entries are being deleted while iterating over the map. The tricky thing is that the cilium agent has a common pattern where map entries are deleted while the map is being iterated. The logic is written with the assumption that
|
3636997
to
d0f0563
Compare
/test |
d0f0563
to
c2795bd
Compare
/test |
c2795bd
to
9fccc7b
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see other functions that call m.Delete
, e.g. doFlush4/6
, which are passed as a callback to DumpReliablyWithCallback
. Shouldn't those call m.DeleteLocked
instead?
As a nit, there is a typo in the second commit message, referring to nat as "nap".
0373c4d
to
7f198b3
Compare
/test |
Thanks, fixed. I also added a unit test for the |
/test |
ci-e2e-upgrade shows a failure/flake when deleting CT entries on the first run. Is that an existing flake or are we potentially creating a new one here? |
It's similar to this one - #38498. @tommyp1ckles We could have similar error reports during ct flush operation as LRU might've kicked some keys. |
@aditighag @rgo3 The ctmap flush command uses the same underlying code (i.e. doGCForFamily) as the general GC. So that should be captured by the same recent fix: https://github.com/cilium/cilium/blob/main/pkg/maps/ctmap/ctmap.go#L462 |
Add a Delete method that can be called with the map lock acquired. This allows synchronizing access to the map attributes in DumpReliablyWithCallback where the callback invokes map key delete operation. Signed-off-by: Aditi Ghag <aditi@cilium.io>
PurgeOrphanNATEntries previously iterated the nat map via DumpReliablyWithCallback, and provided a callback function that also deleted entries in the nat map. The previous commit synchronized access to the Map construct by acquiring the Map lock, and would lead to a deadlock in PurgeOrphanNATEntries. Hence, move the orphan entries deletion logic outside the nat map iteration. Signed-off-by: Aditi Ghag <aditi@cilium.io>
The method previously accessed the map construct directly, which could result in race conditions with other operations on the map, such as in the `Close` method. One such data race was reported in the ctmap, where the map was closed by a periodically running controller while it was simultaneously accessed in `DumpReliablyWithCallback` - Currently, callers can invoke map operations in the callback that acquire the map (write) lock. To allow such callbacks without causing a deadlock, acquire the (write) lock instead of RLock. (snippet) ``` 2025-01-30T21:27:22.005325276Z Read at 0x00c0099535a8 by goroutine 6096710: 2025-01-30T21:27:22.005373476Z github.com/cilium/ebpf/internal/sys.(*FD).Uint() 2025-01-30T21:27:22.005379677Z /go/src/github.com/cilium/cilium/vendor/github.com/cilium/ebpf/internal/sys/fd.go:71 +0x144 2025-01-30T21:27:22.005446912Z github.com/cilium/ebpf.(*Map).nextKey() 2025-01-30T21:27:22.005454006Z /go/src/github.com/cilium/cilium/vendor/github.com/cilium/ebpf/map.go:993 +0x112 2025-01-30T21:27:22.005553101Z github.com/cilium/ebpf.(*Map).NextKey() 2025-01-30T21:27:22.005560745Z /go/src/github.com/cilium/cilium/vendor/github.com/cilium/ebpf/map.go:950 +0xae 2025-01-30T21:27:22.005640643Z github.com/cilium/cilium/pkg/bpf.(*Map).NextKey() 2025-01-30T21:27:22.005647146Z /go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:606 +0x11e 2025-01-30T21:27:22.005712027Z github.com/cilium/cilium/pkg/bpf.(*Map).DumpReliablyWithCallback() 2025-01-30T21:27:22.005718629Z /go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:768 +0x71b 2025-01-30T21:27:22.005815760Z github.com/cilium/cilium/pkg/bpf.(*Map).GetModel() 2025-01-30T21:27:22.005822322Z /go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:1413 +0x511 2025-01-30T21:27:22.005880481Z github.com/cilium/cilium/pkg/bpf.GetOpenMaps() 2025-01-30T21:27:22.005886152Z /go/src/github.com/cilium/cilium/pkg/bpf/map_register_linux.go:64 +0x2ad 2025-01-30T21:27:22.006024930Z github.com/cilium/cilium/pkg/maps.(*getMapHandler).Handle() 2025-01-30T21:27:22.006034067Z /go/src/github.com/cilium/cilium/pkg/maps/api.go:146 +0x24 ``` ``` 2025-01-30T21:27:22.006410985Z Previous write at 0x00c0099535a8 by goroutine 4182: 2025-01-30T21:27:22.006418068Z github.com/cilium/ebpf/internal/sys.(*FD).Disown() 2025-01-30T21:27:22.006423969Z /go/src/github.com/cilium/cilium/vendor/github.com/cilium/ebpf/internal/sys/fd.go:93 +0x8e 2025-01-30T21:27:22.006429549Z github.com/cilium/ebpf/internal/sys.(*FD).Close() 2025-01-30T21:27:22.006435180Z /go/src/github.com/cilium/cilium/vendor/github.com/cilium/ebpf/internal/sys/fd.go:84 +0x47 2025-01-30T21:27:22.006870411Z github.com/cilium/ebpf.(*Map).Close() 2025-01-30T21:27:22.006879989Z /go/src/github.com/cilium/cilium/vendor/github.com/cilium/ebpf/map.go:1371 +0x1e6 2025-01-30T21:27:22.006884858Z github.com/cilium/cilium/pkg/bpf.(*Map).Close() 2025-01-30T21:27:22.006890338Z /go/src/github.com/cilium/cilium/pkg/bpf/map_linux.go:591 +0x1ac 2025-01-30T21:27:22.006895778Z github.com/cilium/cilium/pkg/maps/ctmap.CalculateCTMapPressure.func1.deferwrap1() 2025-01-30T21:27:22.006901959Z /go/src/github.com/cilium/cilium/pkg/maps/ctmap/ctmap.go:858 +0x33 2025-01-30T21:27:22.006907119Z runtime.deferreturn() 2025-01-30T21:27:22.006957073Z /usr/local/go/src/runtime/panic.go:605 +0x5d 2025-01-30T21:27:22.006962683Z github.com/cilium/cilium/pkg/controller.(*controller).runController() 2025-01-30T21:27:22.007064754Z /go/src/github.com/cilium/cilium/pkg/controller/controller.go:251 +0xa2 2025-01-30T21:27:22.007072437Z github.com/cilium/cilium/pkg/controller.(*Manager).createControllerLocked.gowrap1() 2025-01-30T21:27:22.007077677Z /go/src/github.com/cilium/cilium/pkg/controller/manager.go:111 +0xbc ``` Consequently, call DeleteLocked in callbacks passed to DumpReliablyWithCallback. Signed-off-by: Aditi Ghag <aditi@cilium.io>
1b0d99f
to
26b4ba1
Compare
/test |
@aditighag given this is |
The method previously accessed the map construct directly, which could result in race conditions with other operations on the map, such as in the
Close
method.One such data race was reported in the ctmap, where the map was closed by a periodically running controller while it was simultaneously accessed in
DumpReliablyWithCallback
. Currently, callers can invoke map operations in the callbackthat acquire the map (write) lock. To allow such callbacks without causing a deadlock, acquire the (write) lock instead
of RLock.
(snippet)
Fixes: #37383