Cilium cannot recover from a failed VXLAN interface device #38581

@rapour

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.17.2 and lower than v1.18.0

What happened?

If another VXLAN device with the same destination port already exists when you install Cilium (e.g. when fan networking is enabled on the host), the cilium-agent fails to set up its VXLAN interface because the ensureDevice function returns:

failed to setup vxlan tunnel device: setting up vxlan device: creating vxlan device cilium_vxlan: setting up device cilium_vxlan: address already in use

That is expected. However, if you then change the tunnel-port for VXLAN and restart Cilium, it still fails. This is because setupVxlanDevice first tries to set up the previously failed cilium_vxlan device through ensureDevice, and only afterwards checks for a destination port change and recreates the interface. As a result, the deployment stays stuck until cilium_vxlan is deleted manually.

How can we reproduce the issue?

  1. Install kind.
  2. Set up a VXLAN device inside kind's control-plane node:
docker exec -it kind-control-plane sh
ip link add vxlan0 type vxlan id 100 dev eth0 dstport 8472
ip link set vxlan0 up
  3. Install Cilium on kind using Helm (Cilium fails to start, which is expected).
  4. Edit cilium-config and add tunnel-port: "8473":
kubectl edit cm/cilium-config -n kube-system
  5. Restart the agent's DaemonSet:
kubectl rollout restart ds/cilium -n kube-system

The Cilium agent still fails, and the destination port change is not applied to cilium_vxlan.

Cilium Version

cilium-cli: v0.18.2 compiled with go1.24.0 on linux/amd64
cilium image (default): v1.17.0
cilium image (stable): v1.17.2
cilium image (running): 1.17.2

Kernel Version

Linux localhost 6.8.0-53-generic #55-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 17 15:37:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.32.2

Regression

No response

Sysdump

No response

Relevant log output

Anything else?

I have added a DRAFT PR for a possible fix to better demonstrate the issue. If the issue and the general direction to fix it get accepted, I'll be more than happy to implement a complete fix.

Metadata

Labels

  • area/agent: Cilium agent related.
  • area/loader: Impacts the loading of BPF programs into the kernel.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
