
Valid CIDR Identities are getting released upon Cilium Agent restart #27210

@carnerito

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I have a CiliumClusterwideNetworkPolicy (CCNP) deployed and Host Firewall enabled. Among other CCNPs that allow Cilium to function normally, I have a CCNP that allows access from trusted subnets (private network, home VPN, etc.) and it looks like this:

apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "whitelisted-cidrs"
spec:
  description: "Allow access from trusted CIDRs"
  nodeSelector:
    matchLabels: {}
  ingress:
  - fromCIDR:
    - "127.0.0.0/8"
    - "10.0.0.0/8"
    - "172.16.0.0/12"
    - "192.168.0.0/16"
    - "<trusted public ip>:32"

When this policy is applied for the first time everything works as expected. The problem occurs 10 minutes after the Cilium agent is restarted and manifests as a loss of connectivity to the cluster from addresses defined in the whitelisted-cidrs CCNP.

Since this is a huge problem for me, as I rely heavily on the Cilium Host Firewall feature, I did some investigation and these are my findings:

  • Identities associated with the CIDRs in the fromCIDR list are cleared 10 minutes after the Cilium agent starts
  • This part of the code is responsible for the 10-minute delay
  • What actually releases the Identity is the localIdentityCache release function
  • It releases the Identity if the refcount for the provided Identity is equal to 1 (a toy sketch of this refcount interaction follows this list)
  • In Cilium versions prior to v1.14.0 this function is called only once, from the delayed goroutine mentioned in the second bullet point, here
  • That call does not cause the Identity to be released because at that point the refcount for the given Identity is 2, as it is increased here beforehand
  • The difference between v1.14.0 and v1.13.4 is that in v1.14.0 the release function is called once more while the refcount is 1, which causes the given Identity to be released and makes the cluster unavailable
  • The release function is called the second time from ipcache.InjectLabels, here
  • The reason for this is the changed logic inside the releaseIdentity block, here
  • This change was introduced in this commit
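
To make the refcount interaction easier to follow, here is a self-contained toy model in Go. This is only a sketch and not the actual Cilium implementation (the real localIdentityCache is considerably more involved); it just illustrates how one extra release call on a restored CIDR identity drops the last reference:

package main

import "fmt"

type identity struct {
	refcount int
}

type identityCache struct {
	ids map[string]*identity
}

// acquire bumps the refcount for a CIDR identity, allocating it on first use.
func (c *identityCache) acquire(cidr string) {
	id, ok := c.ids[cidr]
	if !ok {
		id = &identity{}
		c.ids[cidr] = id
	}
	id.refcount++
}

// release drops one reference and frees the identity when the last reference
// is gone (mirroring "release if the refcount is equal to 1" from the findings).
func (c *identityCache) release(cidr string) {
	id, ok := c.ids[cidr]
	if !ok {
		return
	}
	if id.refcount == 1 {
		delete(c.ids, cidr) // identity released; the CIDR rule stops matching
		return
	}
	id.refcount--
}

func main() {
	c := &identityCache{ids: map[string]*identity{}}

	// In this toy scenario the restored CIDR identity holds one reference from
	// the policy and one temporary reference taken for the restore window.
	c.acquire("10.0.0.0/8")
	c.acquire("10.0.0.0/8")

	c.release("10.0.0.0/8") // extra release (the label-injection path in v1.14.0)
	c.release("10.0.0.0/8") // delayed release after the grace period

	_, alive := c.ids["10.0.0.0/8"]
	fmt.Println("identity still allocated:", alive) // prints false: released twice
}

Running this prints "identity still allocated: false", which mirrors the symptom: once the second release lands, the CIDR identity is gone and traffic from the whitelisted subnets is dropped. With a single release, as in v1.13.4, the identity would survive with a refcount of 1.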

After I figured out what was happening, I reverted the logic in releaseIdentity to look like this:

	releaseIdentity:
		if entryExists {
			// ...
			// Keep track of the previously allocated identity unless it is
			// also being re-added in this pass.
			if _, ok := idsToAdd[oldID.ID]; !ok {
				previouslyAllocatedIdentities[prefix] = oldID
			}

			// ...
			// Only schedule the entry for deletion when no metadata remains
			// for the prefix and the old entry was created from metadata.
			if prefixInfo == nil && oldID.createdFromMetadata {
				entriesToDelete[prefix] = oldID
			}
		}

This change solved the issue, but since I'm new to the Cilium code base, I'm not sure whether it will have unwanted side effects. If this change looks good, I can submit a PR.

How to reproduce in local Kind cluster

  • Checkout Cilium v1.14.0 tag
  • Add the following snippet to kind-values.yaml:
extraArgs:
  - '--identity-restore-grace-period'
  - '2m'
hostFirewall:
  enabled: "true"
  • Run make kind && make kind-image && make kind-install-cilium
  • Apply a CCNP with a fromCIDR rule (like the one above)
  • Restart the Cilium agent
  • Wait for 2 minutes
  • The cluster should become unavailable since the fromCIDR identities are released (see the timing sketch after this list)
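
To connect this to the timing above, here is another purely illustrative Go sketch (again, not Cilium's code) of the restore-grace-period pattern: a restored identity keeps a temporary reference that a delayed goroutine drops once the grace period expires, and the extra release from the label-injection path is what pushes the refcount to zero. The grace period is shortened to seconds here only so the output is immediate:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Stands in for --identity-restore-grace-period (set to 2m in the repro above).
	gracePeriod := 2 * time.Second

	// One reference held by the policy, one temporary reference from the restore path.
	refcount := 2

	release := func(who string) {
		refcount--
		fmt.Printf("%s released a reference, refcount=%d\n", who, refcount)
		if refcount == 0 {
			fmt.Println("identity freed: traffic from the trusted CIDRs is now dropped")
		}
	}

	// The extra release observed in v1.14.0 (the ipcache label-injection path).
	release("label injection")

	// The delayed release that the grace period controls; after it fires,
	// nothing is left holding the identity.
	time.AfterFunc(gracePeriod, func() { release("grace-period goroutine") })

	time.Sleep(gracePeriod + time.Second)
}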

Cilium Version

v1.14.0

Kernel Version

.

Kubernetes Version

v1.27.3

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • area/agent: Cilium agent related.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  • sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.
