-
Notifications
You must be signed in to change notification settings - Fork 3.4k
policy: fix agent crash due to policy cache update-delete race #41079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
fristonio
merged 1 commit into
cilium:main
from
fristonio:pr/fristonio/fix/host-ep-id-update-segfault
Aug 13, 2025
Merged
policy: fix agent crash due to policy cache update-delete race #41079
fristonio
merged 1 commit into
cilium:main
from
fristonio:pr/fristonio/fix/host-ep-id-update-segfault
Aug 13, 2025
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/test |
bimmlerd
requested changes
Aug 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fix is fine for the specific crash but I'm concerned about other callers also crashing.
This commit fixes a nil pointer dereference in cilium-agent. The segfault results due to a race condition during host endpoint identity labels update processing. When host endpoint identity labels are updated we first delete the existing entry from policy cache and then trigger async endpoint regeneration to populate the cache again with resolved selector policy from repository. During endpoint policy regeneration the policy resolution happens in two steps: 1. Lookup or create the cached selector policy entry in policy cache. 2. Resolve the selector policy for endpoint and set it in the corresponding cached selector policy. Currently when cachedSelectorPolicy entry is deleted from the policycache we assume that the underlying selector policy is set. If two host endpoint identity labels updates are close together then the policy cache delete operation might happen between the above 2 steps leading to a nil pointer deref for underlying selector policy. This commit makes sure that the underlying selector policy is not nil before attempting to detach. This behavior is now consistent with how `policyCache.lookupOrCreate` creates the entry in cache where the underlying selector policy is nil unless set explicitly. Signed-off-by: Deepesh Pathak <deepesh.pathak@isovalent.com>
2a2a4fc
to
dcb6ac5
Compare
bimmlerd
approved these changes
Aug 13, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
backport-done/1.18
The backport for Cilium 1.18.x for this PR is done.
kind/bug
This is a bug in the Cilium logic.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
sig/policy
Impacts whether traffic is allowed or denied based on user-defined policies.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
See commit message for more details.
Relevant crash stack trace.
This crash is related to a race condition between endpoint policy computation and node identity update(due to labels change). Currently when host endpoint identity labels are updated we generate a new cached selector policy. The old cached selector policy is removed from the policy cache and a new selector policy is expected to be created when endpoint regeneration happens. In this case the policy cache and corresponding cached selector policy is updated in two steps:
This selector policy resolution for the endpoint is done as part of endpoint regeneration, which is asynchronous to node identity update handler.
If between step 1 and 2 a new host endpoint labels change is processed, we will attempt to first remove the old cached selector policy from the policy cache. This will cause a nil pointer dereference since the underlying selector policy is nil.
To reproduce the issue we can add an artificial delay and trigger two host endpoint labels updates.
Since policy cache operation to create new cached selector policy doesn't assume underlying selector policy is set, we shouldn't make this assumption in
delete
operation as well.Adding a simple nil pointer check should fix the issue. I will create a PR.