Description
Expected Behavior
Pod-pod and pod-service communication across nodes should work.
Current Behavior
All traffic between pods across nodes is dropped (with the exception of ICMP).
Possible Solution
VMware recommends either of the following:
- Change the VXLAN port to 8472 (when NSX is not used) or 4789 (when NSX is used)
- Disable the VXLAN hardware offload feature on the VMXNET3 NIC (which recent Linux driver versions enable by default)
A port change is not feasible for Calico for Windows (which requires 4789), so disabling the hardware offload feature is the only workable solution. Earlier Linux versions did not support this feature for that NIC at all, so disabling it should have no performance impact compared to those versions.
Since NIC offload configuration is not something most users are used to managing, I suggest implementing a transparent solution in Calico that disables the offload feature whenever Calico configures VXLAN on host interfaces backed by a VMXNET3 device.
To that end, it looks like Calico already configures NIC driver settings through ethtool: https://github.com/projectcalico/felix/blob/master/ethtool/ethtool.go
Steps to Reproduce (for bugs)
- Provision VMs on vSphere version 6.7u2 or later using one of the following operating systems: CentOS/RHEL/Oracle 8.3, SLES 15 SP2/SP3
- Install Kubernetes cluster on the nodes
- Install Calico with VXLAN overlay following official docs, e.g.:
- https://docs.projectcalico.org/networking/vxlan-ipip
- https://docs.projectcalico.org/getting-started/windows-calico/kubernetes/standard
Context
VXLAN packets are dropped by the Linux network stack because the inner packets carry incorrect checksums. The incorrect checksums occur when VXLAN hardware offload is enabled on the VMXNET3 interface (which recent Linux versions do by default) and a VXLAN overlay network is created in the guest OS on a port other than 8472 (when NSX is not used) or 4789 (when NSX is used).
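The conditions above can be condensed into a small predicate, sketched here purely for illustration. The `hostConfig` type and its field names are hypothetical, and the port rule is taken as stated in this report rather than from VMware documentation.

```go
package main

// VXLAN UDP ports named in this report.
const (
	portWithoutNSX = 8472 // port that works with offload when NSX is not used
	portWithNSX    = 4789 // port that works with offload when NSX is used
)

// hostConfig is a hypothetical summary of the settings that matter here.
type hostConfig struct {
	Driver        string // kernel driver of the uplink NIC, e.g. "vmxnet3"
	TunnelOffload bool   // VXLAN hardware offload enabled on the NIC
	VXLANPort     int    // UDP port of the overlay created in the guest
	NSX           bool   // whether NSX is in use
}

// safePort returns the VXLAN port that, per this report, does not
// trigger the checksum problem for the given NSX setting.
func safePort(nsx bool) int {
	if nsx {
		return portWithNSX
	}
	return portWithoutNSX
}

// affected reports whether a host matches all conditions described in
// the report: vmxnet3 NIC, offload enabled, and a VXLAN port other than
// the safe one.
func affected(c hostConfig) bool {
	return c.Driver == "vmxnet3" &&
		c.TunnelOffload &&
		c.VXLANPort != safePort(c.NSX)
}
```

For a Calico VXLAN setup on port 4789 without NSX, `affected` returns true while offload is enabled and false once it is disabled, which matches the behavior reported here.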
References:
- https://github.com/rancher/rancher/issues/33399
- https://bugzilla.redhat.com/show_bug.cgi?id=1935539
- https://bugzilla.redhat.com/show_bug.cgi?id=1941714 (not public)
Your Environment
- Calico version 3.19.1
- Orchestrator version: Kubernetes 1.19.12 (RKE)
- Operating System and version: CentOS/RHEL 8.3, SLES 15 SP2