Skip to content

Conversation

fristonio
Copy link
Member

See commit message for more details.

Relevant crash stack trace.

2025-08-04T00:40:52.104526177Z panic: runtime error: invalid memory address or nil pointer dereference
2025-08-04T00:40:52.104914523Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x28fea25]
2025-08-04T00:40:52.104925173Z 
2025-08-04T00:40:52.105147488Z goroutine 378 [running]:
2025-08-04T00:40:52.105285226Z github.com/cilium/cilium/pkg/policy.(*selectorPolicy).detach(...)
2025-08-04T00:40:52.105461936Z 	/go/src/github.com/cilium/cilium/pkg/policy/resolve.go:290
2025-08-04T00:40:52.105590045Z github.com/cilium/cilium/pkg/policy.(*policyCache).delete(0xc0015f2840, 0xc00a9fc4c0)
2025-08-04T00:40:52.105937734Z 	/go/src/github.com/cilium/cilium/pkg/policy/distillery.go:73 +0xc5
2025-08-04T00:40:52.106142488Z github.com/cilium/cilium/pkg/policy.(*policyCache).LocalEndpointIdentityRemoved(0xc00a9fc4c0?, 0x4e5f7d0?)
2025-08-04T00:40:52.106267340Z 	/go/src/github.com/cilium/cilium/pkg/policy/distillery.go:128 +0x13
2025-08-04T00:40:52.106565598Z github.com/cilium/cilium/pkg/identity/identitymanager.(*IdentityManager).remove(0xc000fd65a0, 0xc00a9fc4c0)
2025-08-04T00:40:52.106652430Z 	/go/src/github.com/cilium/cilium/pkg/identity/identitymanager/manager.go:155 +0x16a
2025-08-04T00:40:52.107387694Z github.com/cilium/cilium/pkg/identity/identitymanager.(*IdentityManager).RemoveOldAddNew(0xc000fd65a0, 0xc00a9fc4c0, 0xc00a72cac0)
2025-08-04T00:40:52.107506195Z 	/go/src/github.com/cilium/cilium/pkg/identity/identitymanager/manager.go:109 +0x185
2025-08-04T00:40:52.107905230Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).SetIdentity(0xc001948f08, 0xc00a72cac0, 0x0?)
2025-08-04T00:40:52.107919347Z 	/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:1080 +0xb5
2025-08-04T00:40:52.108031977Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).identityLabelsChanged(0xc001948f08, {0x55fcab0, 0x8c5d640})
2025-08-04T00:40:52.108347346Z 	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:2305 +0xaa5
2025-08-04T00:40:52.109019978Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runIdentityResolver(0xc001948f08, {0x55fcab0, 0x8c5d640}, 0x0, 0x6fc23ac00)
2025-08-04T00:40:52.109152971Z 	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:2151 +0x47b
2025-08-04T00:40:52.109494860Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).ModifyIdentityLabels(0xc001948f08, {0x4e5f6e3, 0x3}, 0xc0085e9830, 0xc0085e98f0, 0x6fc23ac00)
2025-08-04T00:40:52.109637957Z 	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1973 +0x493
2025-08-04T00:40:52.110190490Z github.com/cilium/cilium/pkg/endpoint.(*Endpoint).UpdateLabelsFrom(0xc001948f08, 0xc0090ca120, 0xba7e0c6bd8ea023e?, {0x4e5f6e3, 0x3})
2025-08-04T00:40:52.110411623Z 	/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:2103 +0xc5
2025-08-04T00:40:52.110517080Z github.com/cilium/cilium/pkg/endpointmanager.(*endpointManager).updateHostEndpointLabels(0xc000f96380, 0xc0090ca120, 0xc0085e9500)
2025-08-04T00:40:52.110668543Z 	/go/src/github.com/cilium/cilium/pkg/endpointmanager/host.go:62 +0x56
2025-08-04T00:40:52.111712839Z github.com/cilium/cilium/pkg/endpointmanager.(*endpointManager).startNodeLabelsObserver.func1({0xc00171d3e0, {{0xc002ca6420, 0xc}, {0xc001c96c50, 0x7}, {0xc0034123c0, 0x4, 0x4}, 0xc003892ae8, {0x0, ...}, ...}, ...})
2025-08-04T00:40:52.111993688Z 	/go/src/github.com/cilium/cilium/pkg/endpointmanager/host.go:43 +0x632
2025-08-04T00:40:52.112400258Z github.com/cilium/stream.Multicast[...].func1()
2025-08-04T00:40:52.112413462Z 	/go/src/github.com/cilium/cilium/vendor/github.com/cilium/stream/sources.go:205 +0x1fd
2025-08-04T00:40:52.112881452Z github.com/cilium/cilium/pkg/node.(*LocalNodeStore).Update(0xc000f26780, 0xc003655f80)
2025-08-04T00:40:52.113486146Z 	/go/src/github.com/cilium/cilium/pkg/node/local_node_store.go:221 +0x13d
2025-08-04T00:40:52.113732046Z github.com/cilium/cilium/daemon/cmd.(*localNodeSynchronizer).SyncLocalNode(0xc000f263c0, {0x55fd3c8?, 0xc000dc3950?}, 0xc000f26780)
2025-08-04T00:40:52.113922582Z 	/go/src/github.com/cilium/cilium/daemon/cmd/local_node_sync.go:91 +0x289
2025-08-04T00:40:52.114098431Z github.com/cilium/cilium/pkg/node.NewLocalNodeStore.func1.1()
2025-08-04T00:40:52.114193037Z 	/go/src/github.com/cilium/cilium/pkg/node/local_node_store.go:150 +0x31
2025-08-04T00:40:52.114366832Z created by github.com/cilium/cilium/pkg/node.NewLocalNodeStore.func1 in goroutine 1
2025-08-04T00:40:52.114454466Z 	/go/src/github.com/cilium/cilium/pkg/node/local_node_store.go:149 +0x37b

This crash is related to a race condition between endpoint policy computation and node identity update(due to labels change). Currently when host endpoint identity labels are updated we generate a new cached selector policy. The old cached selector policy is removed from the policy cache and a new selector policy is expected to be created when endpoint regeneration happens. In this case the policy cache and corresponding cached selector policy is updated in two steps:

  1. lookupOrCreate new cached selector policy in policy cache
  2. Resolve the selector policy and set it in cached selector policy
func (cache *policyCache) updateSelectorPolicy(identity *identityPkg.Identity, endpointID uint64) (*selectorPolicy, bool, error) {
	cip := cache.lookupOrCreate(identity) // 1. Create new cached selector policy in policy cache

	cip.Lock()
	defer cip.Unlock()

	if selPolicy := cip.getPolicy(); selPolicy != nil && selPolicy.Revision >= cache.repo.GetRevision() {
		return selPolicy, false, nil
	}

	selPolicy, err := cache.repo.resolvePolicyLocked(identity)
	if err != nil {
		return nil, false, err
	}

	cip.setPolicy(selPolicy, endpointID) // 2. Set resolved selector policy in cached selector policy instance from policy cache. 

	return selPolicy, true, nil
}

This selector policy resolution for the endpoint is done as part of endpoint regeneration, which is asynchronous to node identity update handler.
If between step 1 and 2 a new host endpoint labels change is processed, we will attempt to first remove the old cached selector policy from the policy cache. This will cause a nil pointer dereference since the underlying selector policy is nil.

func (cache *policyCache) delete(identity *identityPkg.Identity) bool {
	cache.Lock()
	defer cache.Unlock()
	cip, ok := cache.policies[identity.ID]
	if ok {
		delete(cache.policies, identity.ID)
		cip.getPolicy().detach(true, 0) // getPolicy() returns nil.
	}
	return ok
}

To reproduce the issue we can add an artificial delay and trigger two host endpoint labels updates.

diff --git a/pkg/policy/distillery.go b/pkg/policy/distillery.go
index ea4ff9dec9..f8bd110dc3 100644
--- a/pkg/policy/distillery.go
+++ b/pkg/policy/distillery.go
@@ -5,6 +5,7 @@ package policy

 import (
        "sync/atomic"
+       "time"

        "github.com/cilium/cilium/pkg/container/versioned"
        identityPkg "github.com/cilium/cilium/pkg/identity"
@@ -89,6 +90,8 @@ func (cache *policyCache) delete(identity *identityPkg.Identity) bool {
 func (cache *policyCache) updateSelectorPolicy(identity *identityPkg.Identity, endpointID uint64) (*selectorPolicy, bool, error) {
        cip := cache.lookupOrCreate(identity)

+       time.Sleep(5 * time.Second)
+
        // As long as UpdatePolicy() is triggered from endpoint
        // regeneration, it's possible for two endpoints with the
        // *same* identity to race to update the policy here. Such
kubectl label nodes kind-worker test-label=value
sleep 1 // To make sure host endpoint regeneration is triggered and we reach our sleep.
kubectl label nodes kind-worker test-label-

Since policy cache operation to create new cached selector policy doesn't assume underlying selector policy is set, we shouldn't make this assumption in delete operation as well.
Adding a simple nil pointer check should fix the issue. I will create a PR.

@fristonio fristonio added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 11, 2025
@fristonio fristonio marked this pull request as ready for review August 11, 2025 22:13
@fristonio fristonio requested a review from a team as a code owner August 11, 2025 22:13
@fristonio fristonio requested a review from bimmlerd August 11, 2025 22:13
@fristonio
Copy link
Member Author

/test

Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fix is fine for the specific crash but I'm concerned about other callers also crashing.

This commit fixes a nil pointer dereference in cilium-agent. The
segfault results due to a race condition during host endpoint identity
labels update processing.

When host endpoint identity labels are updated we first delete the
existing entry from policy cache and then trigger async endpoint
regeneration to populate the cache again with resolved selector policy from
repository.
During endpoint policy regeneration the policy resolution happens in two
steps:
1. Lookup or create the cached selector policy entry in policy cache.
2. Resolve the selector policy for endpoint and set it in the corresponding
   cached selector policy.

Currently when cachedSelectorPolicy entry is deleted from the
policycache we assume that the underlying selector policy is set.
If two host endpoint identity labels updates are close together then the
policy cache delete operation might happen between the above 2 steps
leading to a nil pointer deref for underlying selector policy.

This commit makes sure that the underlying selector policy is not nil
before attempting to detach. This behavior is now consistent with how
`policyCache.lookupOrCreate` creates the entry in cache where the
underlying selector policy is nil unless set explicitly.

Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
@fristonio fristonio force-pushed the pr/fristonio/fix/host-ep-id-update-segfault branch from 2a2a4fc to dcb6ac5 Compare August 12, 2025 16:26
@fristonio
Copy link
Member Author

fristonio commented Aug 12, 2025

/test
ci-clustermesh - #39370
K8s E2E tests - #39968

@fristonio fristonio enabled auto-merge August 13, 2025 16:05
@fristonio fristonio disabled auto-merge August 13, 2025 18:48
@fristonio fristonio enabled auto-merge August 13, 2025 18:48
@fristonio fristonio added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 13, 2025
@fristonio fristonio added this pull request to the merge queue Aug 13, 2025
Merged via the queue into cilium:main with commit a32a6c8 Aug 13, 2025
68 of 69 checks passed
@fristonio fristonio deleted the pr/fristonio/fix/host-ep-id-update-segfault branch August 13, 2025 19:00
@joamaki joamaki mentioned this pull request Aug 19, 2025
19 tasks
@joamaki joamaki added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 19, 2025
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Aug 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants