Skip to content

Conversation

aanm
Copy link
Member

@aanm aanm commented Aug 16, 2023

To prevent breaking the main branch after merging two PRs that passed CI but conflicted which each other, we should enable merge queue which should prevent this from ever happening.

More info: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue

To prevent breaking the main branch after merging two PRs that passed CI
but conflicted which each other, we should enable merge queue which
should prevent this from ever happening.

More info: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue

Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm requested review from a team as code owners August 16, 2023 15:09
@aanm aanm requested a review from nbusseneau August 16, 2023 15:09
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 16, 2023
@aanm aanm added release-note/misc This PR makes changes that have no direct user impact. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 16, 2023
Copy link
Member

@nbusseneau nbusseneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooooh since we can enable it per workflow, then let's try it!

@aanm aanm merged commit 78a5f17 into cilium:main Aug 16, 2023
@aanm aanm deleted the pr/fix-add-merge-queue-setting branch August 16, 2023 19:00
christarazi added a commit to christarazi/cilium that referenced this pull request Jul 17, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 7, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 13, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants