Skip to content

Conversation

tklauser
Copy link
Member

@tklauser tklauser commented Jun 27, 2022

While updating eksctl to v0.101.0 in master/v1.11 as part of commit
396eb1b ("ci: update eksctl to v0.101.0 and AWS VPC CNI plugin to
v1.11"), the AWS VPC CNI was updated to 1.11 on master, but not on the
v1.11 branch. Do that now to fix the following error on that workflow:

error: unable to recognize "https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.10/config/v1.7/aws-k8s-cni.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

Since v1.12 was branched of in the meantime, also using the outdated AWS
VPC CNI version bump it in the corresponding workflow as well.

Fixes: 396eb1b ("ci: update eksctl to v0.101.0 and AWS VPC CNI plugin to v1.11")

Reported-by: Martynas Pumputis m@lambda.lt

@tklauser tklauser added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. labels Jun 27, 2022
@tklauser tklauser requested a review from brb June 27, 2022 15:07
@tklauser tklauser requested review from a team as code owners June 27, 2022 15:07
@tklauser tklauser requested a review from pchaigno June 27, 2022 15:07
…nches

While updating eksctl to v0.101.0 in master/v1.11 as part of commit
396eb1b ("ci: update eksctl to v0.101.0 and AWS VPC CNI plugin to
v1.11"), the AWS VPC CNI was updated to 1.11 on master, but not on the
v1.11 branch. Do that now to fix the following error on that workflow:

    error: unable to recognize "https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.10/config/v1.7/aws-k8s-cni.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

Since v1.12 was branched of in the meantime, also using the outdated AWS
VPC CNI version bump it in the corresponding workflow as well.

Fixes: 396eb1b ("ci: update eksctl to v0.101.0 and AWS VPC CNI plugin to v1.11")

Reported-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser force-pushed the pr/tklauser/aws-cni-1.11-vpc-cni-bump branch from 72d9fc2 to 9399326 Compare June 27, 2022 15:16
@tklauser tklauser changed the title ci: update AWS VPC CNI plugin for AWS-CNI workflow in v1.11 branch ci: update AWS VPC CNI plugin for AWS-CNI workflow in v1.11/v1.12 branches Jun 27, 2022
Copy link
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why I thought that was on purpose 🤦

@tklauser tklauser added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 28, 2022
@kkourt kkourt merged commit cc0739c into master Jun 28, 2022
@kkourt kkourt deleted the pr/tklauser/aws-cni-1.11-vpc-cni-bump branch June 28, 2022 11:40
@brb
Copy link
Member

brb commented Jun 28, 2022

@tklauser I think we need to update v1.10 too: https://github.com/cilium/cilium/runs/7089474018?check_suite_focus=true#step:10:24

@tklauser
Copy link
Member Author

tklauser commented Jun 28, 2022

@tklauser I think we need to update v1.10 too: https://github.com/cilium/cilium/runs/7089474018?check_suite_focus=true#step:10:24

not sure how I missed that - again 🤦 will send a follow-up PR.

EDIT: addressed in #20329

christarazi added a commit to christarazi/cilium that referenced this pull request Jul 17, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 7, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 13, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants