Skip to content

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Mar 10, 2020

Once this PR is merged, you can update the PR labels via:

$ for pr in 10501; do contrib/backporting/set-labels.py $pr done 1.6; done

This change is Reviewable

[ upstream commit 5bf90eb ]

UpdateGenerateDNS updated the cache and generate selector updates as two
separate lock-holding sections. When updates happened concurrently, via
DNS lookups, DNS GC, or endpoint restores, the snapshot of data that
eventually updated the selectors could be out-of-order. This created an
A-B-A race for selectors that needed an update and those that needed to
be cleared.

These races have been observed on agent starts with pending pods but may
manifest between multiple pods making DNS updates or the DNS garbage
collector controller and any pod.

The code now holds the NameManager lock for the entirety of
UpdateGenerateDNS and ForceRegenerateDNS. This forces the update to
complete in its entirety before allowing another to modify the FQDN
cache or registered selectors.

The locking order between the NameManager lock and the SelectorCache
lock has been accidental so far. When removing a selector user from
the cache, the SelectorCache lock is taken, and if an FQDN selector
has no more users, the NameManager is notified of this, which
internally takes the NameManager lock. Reverse this by explicitly
locking NamaManager for the duration of selector cache operations.

From now on, if the two locks are to be held at the same time, the
NameManager mutex MUST be taken first, the SelectorCache mutex second,
as the name manager is a higher level construct than the selector
cache.

Exposing 'Lock()' and 'Unlock()' operations in an interface is not
ideal, but it helps make the lockinng order explicit, so it also
serves as documentation on the required locking order.

NameManager now return IDs also if the selector has already been
registered. This should help not get into a state where FQDNs are not
passed to the selector cache in case of a race condition.

Signed-off-by: Ray Bejjani <ray@isovalent.com>
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
@jrajahalme jrajahalme added kind/backports This PR provides functionality previously merged into master. backport/1.6 release-note/misc This PR makes changes that have no direct user impact. labels Mar 10, 2020
@jrajahalme jrajahalme requested a review from a team as a code owner March 10, 2020 15:17
delete(d.selectors, selector)
}

func (d *DummyIdentityNotifier) InjectIdentitiesForSelector(fqdnSel api.FQDNSelector, ids []identity.NumericIdentity) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported method DummyIdentityNotifier.InjectIdentitiesForSelector should have comment or be unexported

selectors map[api.FQDNSelector][]identity.NumericIdentity
}

func NewDummyIdentityNotifier() *DummyIdentityNotifier {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function NewDummyIdentityNotifier should have comment or be unexported

"github.com/cilium/cilium/pkg/policy/api"
)

type DummyIdentityNotifier struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported type DummyIdentityNotifier should have comment or be unexported

@jrajahalme
Copy link
Member Author

test-me-please

@joestringer
Copy link
Member

@joestringer
Copy link
Member

test-me-please

@joestringer
Copy link
Member

@jrajahalme I noticed that this hunk was dropped from the backport, was this intentional?

diff --git a/pkg/ipcache/cidr.go b/pkg/ipcache/cidr.go
index 995b2497c19c..feea372140e7 100644
--- a/pkg/ipcache/cidr.go
+++ b/pkg/ipcache/cidr.go
@@ -71,6 +71,9 @@ func allocateCIDRs(prefixes []*net.IPNet) ([]*identity.Identity, error) {
                allocateCtx, cancel := context.WithTimeout(context.Background(), option.Config.IPAllocationTimeout)
                defer cancel()

+               if IdentityAllocator == nil {
+                       return nil, fmt.Errorf("IdentityAllocator not initialized!")
+               }
                id, isNew, err := IdentityAllocator.AllocateIdentity(allocateCtx, cidr.GetCIDRLabels(prefix), false)
                if err != nil {
                        IdentityAllocator.ReleaseSlice(context.Background(), nil, usedIdentities)

@joestringer
Copy link
Member

@joestringer
Copy link
Member

joestringer commented Mar 10, 2020

test-me-please

EDIT: Provisioning failure:
https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/17846/

@joestringer
Copy link
Member

test-me-please

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awaiting confirmation whether the omitted hunk is deliberate or not:

#10532 (comment)

@jrajahalme
Copy link
Member Author

@joestringer v1.6 pkg/ipcache calls directly to the cache package, so there is no chance of nil pointer dereference through (the nonexistent) ipcache.IdentityAllocator. AllocateIdentity() in pkg/identity/cache already has the nil test for it's internal IdentityAllocator.

@joestringer joestringer merged commit c7771d3 into v1.6 Mar 13, 2020
@joestringer joestringer deleted the pr/v1.6-backport-2020-03-10 branch March 13, 2020 21:38
christarazi added a commit to christarazi/cilium that referenced this pull request Jul 17, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 7, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium that referenced this pull request Aug 13, 2024
As the previous commit mentions, this does not occur on main, v1.16, and
v1.15 due to cilium#29036.

However, in v1.14 and v1.13, we need to take special care because the
NameManager and SelectorCache lock can be taken while the Endpoint lock
is held during Endpoint deletion.

Here are the relevant stacktraces regarding the deadlock:

```
1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*uint32(0x5), false, 43690)
    sync            mutex.go:171                 (*Mutex).lockSlow(*Mutex(cilium#1733))
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*RWMutex(0xb0))
    fqdn            name_manager.go:70           (*NameManager).Lock(0xffffffffffffffff)
    policy          selectorcache.go:964         (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582})
    policy          l4.go:810                    (*L4Filter).removeSelectors(cilium#32582, cilium#29992)
    policy          l4.go:817                    (*L4Filter).detach(cilium#719, cilium#29993)
    policy          l4.go:988                    L4PolicyMap.Detach(...)
    policy          l4.go:1179                   (*L4Policy).Detach(cilium#20318, cilium#1383)
    policy          resolve.go:103               (*selectorPolicy).Detach(...)
    policy          distillery.go:81             (*PolicyCache).delete(cilium#1354, cilium#19354)
    policy          distillery.go:138            (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163)
    identitymanager manager.go:167               (*IdentityManager).remove(cilium#706, cilium#19354)
    identitymanager manager.go:147               (*IdentityManager).Remove(cilium#706, cilium#19354)
    identitymanager manager.go:52                Remove(...)
    endpoint        endpoint.go:1146             (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0})
    endpoint        endpoint.go:2192             (*Endpoint).Delete(cilium#1883, {0x80, 0xaa})
    endpointmanager manager.go:380               (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff})
    endpointmanager manager.go:394               (*EndpointManager).RemoveEndpoint(...)
    cmd             endpoint.go:684              (*Daemon).deleteEndpointQuiet(...)
    cmd             endpoint.go:666              (*Daemon).deleteEndpoint(cilium#1155, cilium#1883)
    cmd             endpoint.go:713              (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108})
    cmd             endpoint.go:770              (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}})
    endpoint        delete_endpoint_id.go:66     (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3})
    middleware      router.go:78                 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59})
    middleware      redoc.go:72                  Redoc.func1({cilium#666, cilium#2242}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45})
    middleware      spec.go:46                   Spec.func1({cilium#666, cilium#2242}, cilium#4921)
    http            server.go:2136               HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015})
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922)
    http            server.go:2938               serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6)
    http            server.go:2009               (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554})
8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086]
    sync            sema.go:77                   runtime_SemacquireMutex(*, 0x47, cilium#1154)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#706)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(*)
    identitymanager manager.go:99                (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154)
    identitymanager manager.go:123               RemoveOldAddNew(...)
    endpoint        policy.go:852                (*Endpoint).SetIdentity(*, cilium#1154, 0)
    endpoint        endpoint.go:1932             (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1)
    endpoint        endpoint.go:1780             (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1)
    endpoint        endpoint.go:1720             (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8)
    cmd             endpoint.go:477              (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *)
    cmd             endpoint.go:542              (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}})
    endpoint        put_endpoint_id.go:58        (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *)
    middleware      operation.go:28              (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      router.go:78                 NewRouter.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    middleware      redoc.go:72                  Redoc.func1({cilium#666, *}, cilium#1251)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45)
    middleware      spec.go:46                   Spec.func1({cilium#666, *}, *)
    http            server.go:2136               HandlerFunc.ServeHTTP(*, {cilium#666, *}, *)
    metrics         middleware.go:64             (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *)
    api             apipanic.go:42               (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *)
    http            server.go:2938               serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6)
    http            server.go:2009               (*conn).serve(*, {cilium#673, cilium#1554})
5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    sync            sema.go:77                   runtime_SemacquireMutex(cilium#142, 0xe8, *)
    sync            mutex.go:171                 (*Mutex).lockSlow(cilium#1733)
    sync            mutex.go:90                  (*Mutex).Lock(...)
    sync            rwmutex.go:147               (*RWMutex).Lock(0x68)
    fqdn            name_manager.go:70           (*NameManager).Lock(*)
    policy          selectorcache.go:798         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, *})
    api             l4.go:284                    PortRules.Iterate({*, 1, cilium#546}, *)
    policy          rule.go:624                  mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, *)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, *)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(*)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(*, *, *)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(*, *)
    endpoint        policy.go:457                (*Endpoint).regenerate(*, *)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(*, *)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*, *)
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(*)
1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229]
    semaphore       semaphore.go:60              (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766)
    lock            semaphored_mutex.go:30       (*SemaphoredMutex).Lock(...)
    ipcache         ipcache.go:140               (*IPCache).Lock(...)
    ipcache         cidr.go:56                   (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0)
    ipcache         cidr.go:103                  (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff)
    cmd             identity.go:114              cachingIdentityAllocator.AllocateCIDRsForIPs(...)
    policy          selectorcache.go:509         (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45)
    policy          selectorcache.go:843         (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}})
    policy          l4.go:628                    (*L4Filter).cacheFQDNSelector(...)
    policy          l4.go:623                    (*L4Filter).cacheFQDNSelectors(cilium#27811, {#42936, 0x287, cilium#193}, cilium#536)
    policy          l4.go:725                    createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...)
    policy          l4.go:879                    createL4EgressFilter(...)
    policy          rule.go:717                  mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...)
    policy          rule.go:672                  mergeEgress.func1({cilium#660, cilium#24510})
    api             l4.go:284                    PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741)
    policy          rule.go:624                  mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...)
    policy          rule.go:753                  (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...})
    policy          rules.go:103                 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687)
    policy          repository.go:718            (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461)
    policy          distillery.go:119            (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461)
    policy          distillery.go:153            (*PolicyCache).UpdatePolicy(...)
    endpoint        policy.go:262                (*Endpoint).regeneratePolicy(cilium#1748)
    endpoint        bpf.go:744                   (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781)
    endpoint        bpf.go:589                   (*Endpoint).regenerateBPF(cilium#1748, cilium#27542)
    endpoint        policy.go:457                (*Endpoint).regenerate(cilium#1748, cilium#27542)
    endpoint        events.go:53                 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703)
    eventqueue      eventqueue.go:245            (*EventQueue).run.func1()
    sync            once.go:74                   (*Once).doSlow(*Once(#45414), func(cilium#69))
    sync            once.go:65                   (*Once).Do(...)
    eventqueue      eventqueue.go:233            (*EventQueue).run(cilium#6023)
```

Generated from pp tool: https://github.com/maruel/panicparse

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants