-
Notifications
You must be signed in to change notification settings - Fork 3.4k
v1.6 backports 2020-03-10 #10532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.6 backports 2020-03-10 #10532
Conversation
[ upstream commit 5bf90eb ] UpdateGenerateDNS updated the cache and generate selector updates as two separate lock-holding sections. When updates happened concurrently, via DNS lookups, DNS GC, or endpoint restores, the snapshot of data that eventually updated the selectors could be out-of-order. This created an A-B-A race for selectors that needed an update and those that needed to be cleared. These races have been observed on agent starts with pending pods but may manifest between multiple pods making DNS updates or the DNS garbage collector controller and any pod. The code now holds the NameManager lock for the entirety of UpdateGenerateDNS and ForceRegenerateDNS. This forces the update to complete in its entirety before allowing another to modify the FQDN cache or registered selectors. The locking order between the NameManager lock and the SelectorCache lock has been accidental so far. When removing a selector user from the cache, the SelectorCache lock is taken, and if an FQDN selector has no more users, the NameManager is notified of this, which internally takes the NameManager lock. Reverse this by explicitly locking NamaManager for the duration of selector cache operations. From now on, if the two locks are to be held at the same time, the NameManager mutex MUST be taken first, the SelectorCache mutex second, as the name manager is a higher level construct than the selector cache. Exposing 'Lock()' and 'Unlock()' operations in an interface is not ideal, but it helps make the lockinng order explicit, so it also serves as documentation on the required locking order. NameManager now return IDs also if the selector has already been registered. This should help not get into a state where FQDNs are not passed to the selector cache in case of a race condition. Signed-off-by: Ray Bejjani <ray@isovalent.com> Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
delete(d.selectors, selector) | ||
} | ||
|
||
func (d *DummyIdentityNotifier) InjectIdentitiesForSelector(fqdnSel api.FQDNSelector, ids []identity.NumericIdentity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported method DummyIdentityNotifier.InjectIdentitiesForSelector should have comment or be unexported
selectors map[api.FQDNSelector][]identity.NumericIdentity | ||
} | ||
|
||
func NewDummyIdentityNotifier() *DummyIdentityNotifier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported function NewDummyIdentityNotifier should have comment or be unexported
"github.com/cilium/cilium/pkg/policy/api" | ||
) | ||
|
||
type DummyIdentityNotifier struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported type DummyIdentityNotifier should have comment or be unexported
test-me-please |
Provisioning failure: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/17835/ |
test-me-please |
@jrajahalme I noticed that this hunk was dropped from the backport, was this intentional?
|
Hit known flake #10278 : |
test-me-please EDIT: Provisioning failure: |
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awaiting confirmation whether the omitted hunk is deliberate or not:
@joestringer v1.6 |
As the previous commit mentions, this does not occur on main, v1.16, and v1.15 due to cilium#29036. However, in v1.14 and v1.13, we need to take special care because the NameManager and SelectorCache lock can be taken while the Endpoint lock is held during Endpoint deletion. Here are the relevant stacktraces regarding the deadlock: ``` 1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*uint32(0x5), false, 43690) sync mutex.go:171 (*Mutex).lockSlow(*Mutex(cilium#1733)) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*RWMutex(0xb0)) fqdn name_manager.go:70 (*NameManager).Lock(0xffffffffffffffff) policy selectorcache.go:964 (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582}) policy l4.go:810 (*L4Filter).removeSelectors(cilium#32582, cilium#29992) policy l4.go:817 (*L4Filter).detach(cilium#719, cilium#29993) policy l4.go:988 L4PolicyMap.Detach(...) policy l4.go:1179 (*L4Policy).Detach(cilium#20318, cilium#1383) policy resolve.go:103 (*selectorPolicy).Detach(...) policy distillery.go:81 (*PolicyCache).delete(cilium#1354, cilium#19354) policy distillery.go:138 (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163) identitymanager manager.go:167 (*IdentityManager).remove(cilium#706, cilium#19354) identitymanager manager.go:147 (*IdentityManager).Remove(cilium#706, cilium#19354) identitymanager manager.go:52 Remove(...) endpoint endpoint.go:1146 (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0}) endpoint endpoint.go:2192 (*Endpoint).Delete(cilium#1883, {0x80, 0xaa}) endpointmanager manager.go:380 (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff}) endpointmanager manager.go:394 (*EndpointManager).RemoveEndpoint(...) cmd endpoint.go:684 (*Daemon).deleteEndpointQuiet(...) cmd endpoint.go:666 (*Daemon).deleteEndpoint(cilium#1155, cilium#1883) cmd endpoint.go:713 (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108}) cmd endpoint.go:770 (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}}) endpoint delete_endpoint_id.go:66 (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3}) middleware router.go:78 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59}) middleware redoc.go:72 Redoc.func1({cilium#666, cilium#2242}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45}) middleware spec.go:46 Spec.func1({cilium#666, cilium#2242}, cilium#4921) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015}) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922) http server.go:2938 serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6) http server.go:2009 (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554}) 8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*, 0x47, cilium#1154) sync mutex.go:171 (*Mutex).lockSlow(cilium#706) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*) identitymanager manager.go:99 (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154) identitymanager manager.go:123 RemoveOldAddNew(...) endpoint policy.go:852 (*Endpoint).SetIdentity(*, cilium#1154, 0) endpoint endpoint.go:1932 (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1) endpoint endpoint.go:1780 (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1) endpoint endpoint.go:1720 (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8) cmd endpoint.go:477 (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *) cmd endpoint.go:542 (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}}) endpoint put_endpoint_id.go:58 (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware router.go:78 NewRouter.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware redoc.go:72 Redoc.func1({cilium#666, *}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45) middleware spec.go:46 Spec.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *) http server.go:2938 serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6) http server.go:2009 (*conn).serve(*, {cilium#673, cilium#1554}) 5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] sync sema.go:77 runtime_SemacquireMutex(cilium#142, 0xe8, *) sync mutex.go:171 (*Mutex).lockSlow(cilium#1733) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(0x68) fqdn name_manager.go:70 (*NameManager).Lock(*) policy selectorcache.go:798 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, *}) api l4.go:284 PortRules.Iterate({*, 1, cilium#546}, *) policy rule.go:624 mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, *) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, *) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(*) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(*, *, *) endpoint bpf.go:589 (*Endpoint).regenerateBPF(*, *) endpoint policy.go:457 (*Endpoint).regenerate(*, *) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(*, *) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*, *) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(*) 1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] semaphore semaphore.go:60 (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766) lock semaphored_mutex.go:30 (*SemaphoredMutex).Lock(...) ipcache ipcache.go:140 (*IPCache).Lock(...) ipcache cidr.go:56 (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0) ipcache cidr.go:103 (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff) cmd identity.go:114 cachingIdentityAllocator.AllocateCIDRsForIPs(...) policy selectorcache.go:509 (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45) policy selectorcache.go:843 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, cilium#24510}) api l4.go:284 PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741) policy rule.go:624 mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(cilium#1748) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781) endpoint bpf.go:589 (*Endpoint).regenerateBPF(cilium#1748, cilium#27542) endpoint policy.go:457 (*Endpoint).regenerate(cilium#1748, cilium#27542) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*Once(#45414), func(cilium#69)) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(cilium#6023) ``` Generated from pp tool: https://github.com/maruel/panicparse Signed-off-by: Chris Tarazi <chris@isovalent.com>
As the previous commit mentions, this does not occur on main, v1.16, and v1.15 due to cilium#29036. However, in v1.14 and v1.13, we need to take special care because the NameManager and SelectorCache lock can be taken while the Endpoint lock is held during Endpoint deletion. Here are the relevant stacktraces regarding the deadlock: ``` 1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*uint32(0x5), false, 43690) sync mutex.go:171 (*Mutex).lockSlow(*Mutex(cilium#1733)) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*RWMutex(0xb0)) fqdn name_manager.go:70 (*NameManager).Lock(0xffffffffffffffff) policy selectorcache.go:964 (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582}) policy l4.go:810 (*L4Filter).removeSelectors(cilium#32582, cilium#29992) policy l4.go:817 (*L4Filter).detach(cilium#719, cilium#29993) policy l4.go:988 L4PolicyMap.Detach(...) policy l4.go:1179 (*L4Policy).Detach(cilium#20318, cilium#1383) policy resolve.go:103 (*selectorPolicy).Detach(...) policy distillery.go:81 (*PolicyCache).delete(cilium#1354, cilium#19354) policy distillery.go:138 (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163) identitymanager manager.go:167 (*IdentityManager).remove(cilium#706, cilium#19354) identitymanager manager.go:147 (*IdentityManager).Remove(cilium#706, cilium#19354) identitymanager manager.go:52 Remove(...) endpoint endpoint.go:1146 (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0}) endpoint endpoint.go:2192 (*Endpoint).Delete(cilium#1883, {0x80, 0xaa}) endpointmanager manager.go:380 (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff}) endpointmanager manager.go:394 (*EndpointManager).RemoveEndpoint(...) cmd endpoint.go:684 (*Daemon).deleteEndpointQuiet(...) cmd endpoint.go:666 (*Daemon).deleteEndpoint(cilium#1155, cilium#1883) cmd endpoint.go:713 (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108}) cmd endpoint.go:770 (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}}) endpoint delete_endpoint_id.go:66 (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3}) middleware router.go:78 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59}) middleware redoc.go:72 Redoc.func1({cilium#666, cilium#2242}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45}) middleware spec.go:46 Spec.func1({cilium#666, cilium#2242}, cilium#4921) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015}) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922) http server.go:2938 serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6) http server.go:2009 (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554}) 8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*, 0x47, cilium#1154) sync mutex.go:171 (*Mutex).lockSlow(cilium#706) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*) identitymanager manager.go:99 (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154) identitymanager manager.go:123 RemoveOldAddNew(...) endpoint policy.go:852 (*Endpoint).SetIdentity(*, cilium#1154, 0) endpoint endpoint.go:1932 (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1) endpoint endpoint.go:1780 (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1) endpoint endpoint.go:1720 (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8) cmd endpoint.go:477 (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *) cmd endpoint.go:542 (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}}) endpoint put_endpoint_id.go:58 (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware router.go:78 NewRouter.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware redoc.go:72 Redoc.func1({cilium#666, *}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45) middleware spec.go:46 Spec.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *) http server.go:2938 serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6) http server.go:2009 (*conn).serve(*, {cilium#673, cilium#1554}) 5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] sync sema.go:77 runtime_SemacquireMutex(cilium#142, 0xe8, *) sync mutex.go:171 (*Mutex).lockSlow(cilium#1733) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(0x68) fqdn name_manager.go:70 (*NameManager).Lock(*) policy selectorcache.go:798 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, *}) api l4.go:284 PortRules.Iterate({*, 1, cilium#546}, *) policy rule.go:624 mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, *) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, *) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(*) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(*, *, *) endpoint bpf.go:589 (*Endpoint).regenerateBPF(*, *) endpoint policy.go:457 (*Endpoint).regenerate(*, *) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(*, *) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*, *) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(*) 1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] semaphore semaphore.go:60 (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766) lock semaphored_mutex.go:30 (*SemaphoredMutex).Lock(...) ipcache ipcache.go:140 (*IPCache).Lock(...) ipcache cidr.go:56 (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0) ipcache cidr.go:103 (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff) cmd identity.go:114 cachingIdentityAllocator.AllocateCIDRsForIPs(...) policy selectorcache.go:509 (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45) policy selectorcache.go:843 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, cilium#24510}) api l4.go:284 PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741) policy rule.go:624 mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(cilium#1748) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781) endpoint bpf.go:589 (*Endpoint).regenerateBPF(cilium#1748, cilium#27542) endpoint policy.go:457 (*Endpoint).regenerate(cilium#1748, cilium#27542) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*Once(#45414), func(cilium#69)) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(cilium#6023) ``` Generated from pp tool: https://github.com/maruel/panicparse Signed-off-by: Chris Tarazi <chris@isovalent.com>
As the previous commit mentions, this does not occur on main, v1.16, and v1.15 due to cilium#29036. However, in v1.14 and v1.13, we need to take special care because the NameManager and SelectorCache lock can be taken while the Endpoint lock is held during Endpoint deletion. Here are the relevant stacktraces regarding the deadlock: ``` 1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*uint32(0x5), false, 43690) sync mutex.go:171 (*Mutex).lockSlow(*Mutex(cilium#1733)) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*RWMutex(0xb0)) fqdn name_manager.go:70 (*NameManager).Lock(0xffffffffffffffff) policy selectorcache.go:964 (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582}) policy l4.go:810 (*L4Filter).removeSelectors(cilium#32582, cilium#29992) policy l4.go:817 (*L4Filter).detach(cilium#719, cilium#29993) policy l4.go:988 L4PolicyMap.Detach(...) policy l4.go:1179 (*L4Policy).Detach(cilium#20318, cilium#1383) policy resolve.go:103 (*selectorPolicy).Detach(...) policy distillery.go:81 (*PolicyCache).delete(cilium#1354, cilium#19354) policy distillery.go:138 (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163) identitymanager manager.go:167 (*IdentityManager).remove(cilium#706, cilium#19354) identitymanager manager.go:147 (*IdentityManager).Remove(cilium#706, cilium#19354) identitymanager manager.go:52 Remove(...) endpoint endpoint.go:1146 (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0}) endpoint endpoint.go:2192 (*Endpoint).Delete(cilium#1883, {0x80, 0xaa}) endpointmanager manager.go:380 (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff}) endpointmanager manager.go:394 (*EndpointManager).RemoveEndpoint(...) cmd endpoint.go:684 (*Daemon).deleteEndpointQuiet(...) cmd endpoint.go:666 (*Daemon).deleteEndpoint(cilium#1155, cilium#1883) cmd endpoint.go:713 (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108}) cmd endpoint.go:770 (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}}) endpoint delete_endpoint_id.go:66 (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3}) middleware router.go:78 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59}) middleware redoc.go:72 Redoc.func1({cilium#666, cilium#2242}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45}) middleware spec.go:46 Spec.func1({cilium#666, cilium#2242}, cilium#4921) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015}) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922) http server.go:2938 serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6) http server.go:2009 (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554}) 8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*, 0x47, cilium#1154) sync mutex.go:171 (*Mutex).lockSlow(cilium#706) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*) identitymanager manager.go:99 (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154) identitymanager manager.go:123 RemoveOldAddNew(...) endpoint policy.go:852 (*Endpoint).SetIdentity(*, cilium#1154, 0) endpoint endpoint.go:1932 (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1) endpoint endpoint.go:1780 (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1) endpoint endpoint.go:1720 (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8) cmd endpoint.go:477 (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *) cmd endpoint.go:542 (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}}) endpoint put_endpoint_id.go:58 (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware router.go:78 NewRouter.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware redoc.go:72 Redoc.func1({cilium#666, *}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45) middleware spec.go:46 Spec.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *) http server.go:2938 serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6) http server.go:2009 (*conn).serve(*, {cilium#673, cilium#1554}) 5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] sync sema.go:77 runtime_SemacquireMutex(cilium#142, 0xe8, *) sync mutex.go:171 (*Mutex).lockSlow(cilium#1733) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(0x68) fqdn name_manager.go:70 (*NameManager).Lock(*) policy selectorcache.go:798 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, *}) api l4.go:284 PortRules.Iterate({*, 1, cilium#546}, *) policy rule.go:624 mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, *) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, *) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(*) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(*, *, *) endpoint bpf.go:589 (*Endpoint).regenerateBPF(*, *) endpoint policy.go:457 (*Endpoint).regenerate(*, *) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(*, *) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*, *) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(*) 1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] semaphore semaphore.go:60 (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766) lock semaphored_mutex.go:30 (*SemaphoredMutex).Lock(...) ipcache ipcache.go:140 (*IPCache).Lock(...) ipcache cidr.go:56 (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0) ipcache cidr.go:103 (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff) cmd identity.go:114 cachingIdentityAllocator.AllocateCIDRsForIPs(...) policy selectorcache.go:509 (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45) policy selectorcache.go:843 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, cilium#24510}) api l4.go:284 PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741) policy rule.go:624 mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(cilium#1748) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781) endpoint bpf.go:589 (*Endpoint).regenerateBPF(cilium#1748, cilium#27542) endpoint policy.go:457 (*Endpoint).regenerate(cilium#1748, cilium#27542) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*Once(#45414), func(cilium#69)) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(cilium#6023) ``` Generated from pp tool: https://github.com/maruel/panicparse Signed-off-by: Chris Tarazi <chris@isovalent.com>
Once this PR is merged, you can update the PR labels via:
This change is