Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
I have CiliumClusterwideNetworkPolicy deployed and Host Firewall enabled. Among other CCNPs that allow Cilium to function normally, I have a CCNP that allows access from trusted subnets (private network, home VPN, etc.) and it looks like this:
```yaml
apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "whitelisted-cidrs"
spec:
  description: "Allow access from trusted CIDRs"
  nodeSelector:
    matchLabels: {}
  ingress:
    - fromCIDR:
        - "127.0.0.0/8"
        - "10.0.0.0/8"
        - "172.16.0.0/12"
        - "192.168.0.0/16"
        - "<trusted public ip>/32"
```
When this policy is applied for the first time, everything works as expected. The problem occurs 10 minutes after the Cilium agent is restarted and manifests as lost connectivity to the cluster from the addresses defined in the whitelisted-cidrs CCNP.
Since this is a huge problem for me, because I rely heavily on the Cilium Host Firewall feature, I did some investigation, and these are my findings:
- Identities associated with CIDRs in the `fromCIDR` list are cleared 10 minutes after the Cilium agent starts
- This part of the code is responsible for the 10-minute delay
- What actually releases the Identity is the `localIdentityCache` release function
- It releases the Identity if the refcount for the provided Identity is equal to 1
- In Cilium versions prior to v1.14.0 this function is called only once, from the delayed goroutine mentioned in the second bullet point, here
- That call does not cause the Identity to be released because, at that point, the refcount for the given Identity is 2, as it is incremented here beforehand
- The difference between v1.14.0 and v1.13.4 is that in v1.14 the release function is called once more while the refcount is 1, which causes the given Identity to be released and makes the cluster unavailable
- The release function is called a second time from `ipcache.InjectLabels`, here
- The reason for this is the changed logic inside the `releaseIdentity` block, here; this change was introduced in this commit
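To make the sequence easier to follow, here is a minimal, self-contained Go sketch of the refcount behavior described above. All names in it (`cache`, `identityEntry`, `release`, the grace period duration) are simplified placeholders of my own, not Cilium's actual types or API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type identityEntry struct {
	id       int
	refcount int
}

type cache struct {
	mu  sync.Mutex
	ids map[int]*identityEntry
}

// release mirrors a refcount-guarded release: the entry is removed only
// when the last reference is dropped, otherwise the refcount is decremented.
func (c *cache) release(e *identityEntry) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e.refcount == 1 {
		delete(c.ids, e.id)
		return true // identity released
	}
	e.refcount--
	return false
}

func main() {
	c := &cache{ids: map[int]*identityEntry{}}

	// After a restart, the restored CIDR identity holds refcount 2.
	restored := &identityEntry{id: 16777217, refcount: 2}
	c.ids[restored.id] = restored

	// The extra release on the v1.14.0 InjectLabels path drops it to 1...
	fmt.Println("released:", c.release(restored)) // released: false

	// ...so the delayed cleanup that fires after the grace period
	// (10 minutes in Cilium, shortened here) now removes the identity.
	time.AfterFunc(10*time.Millisecond, func() {
		fmt.Println("released:", c.release(restored)) // released: true
	})
	time.Sleep(50 * time.Millisecond)
}
```

On v1.13.4 only the delayed release runs, so the refcount goes 2 -> 1 and the identity survives; on v1.14.0 the extra release from `ipcache.InjectLabels` happens first, so the delayed one finds refcount 1 and removes the identity.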
After I figured out what was happening, I reverted the logic in `releaseIdentity` to look like this:

```go
releaseIdentity:
	if entryExists {
		// ...
		// Only remember the old identity for a later release if it is not
		// about to be re-added for this prefix.
		if _, ok := idsToAdd[oldID.ID]; !ok {
			previouslyAllocatedIdentities[prefix] = oldID
		}
		// ...
		if prefixInfo == nil && oldID.createdFromMetadata {
			entriesToDelete[prefix] = oldID
		}
	}
```
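As far as I can tell, with that guard back in place an old identity that is about to be re-added for the same prefix is only recorded in `previouslyAllocatedIdentities` instead of being released outright, so `InjectLabels` no longer performs the extra release that lets the delayed goroutine drop the restored CIDR identities.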
This change solved the issue, but since I'm new to the Cilium code base, I'm not sure whether it will have unwanted side effects. If this change looks good, I can submit a PR.
How to reproduce in a local Kind cluster
- Check out the Cilium v1.14.0 tag
- Add the following snippet to kind-values.yaml:
```yaml
extraArgs:
  - '--identity-restore-grace-period'
  - '2m'
hostFirewall:
  enabled: "true"
```
- Run `make kind && make kind-image && make kind-install-cilium`
- Apply a CCNP with a `fromCIDR` rule (like the one above)
- Restart the Cilium agent
- Wait for 2 minutes
- The cluster should become unavailable since the `fromCIDR` identities are released
Cilium Version
v1.14.0
Kernel Version
.
Kubernetes Version
v1.27.3
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct