ChecksumOffloadBroken autodetection doesn't necessarily detect all cases #4727

@janeczku

Expected Behavior

Pod-pod and pod-service communication across nodes should work.

Current Behavior

All traffic between pods across nodes is dropped (with the exception of ICMP).

Possible Solution

VMware recommends either:

  • Changing the VXLAN port to 8472 (when NSX is not used) or 4789 (when NSX is used)
  • Disabling the VXLAN hardware offload feature on the VMXNET3 NIC (which recent Linux driver versions enable by default)

Since a port change is not feasible for Calico Windows (which requires 4789), disabling the hardware offload feature is the only feasible solution. Because earlier Linux versions did not support this feature for that particular NIC device at all, disabling it should not cause any performance regression compared to those versions.
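To check whether a node currently has the offload enabled, the output of `ethtool -k <iface>` can be inspected. The snippet below is only a rough sketch of such a check. The feature names `tx-udp_tnl-segmentation` and `tx-udp_tnl-csum-segmentation` are my assumption of what the VXLAN offload shows up as in ethtool (they are not taken from the VMware recommendation), and the helper names are made up for illustration.

```python
from typing import Dict

# Assumed ethtool feature names for UDP tunnel (VXLAN) offload.
TUNNEL_FEATURES = ("tx-udp_tnl-segmentation", "tx-udp_tnl-csum-segmentation")


def parse_features(ethtool_k_output: str) -> Dict[str, bool]:
    """Parse `ethtool -k <iface>` output into {feature name: enabled}."""
    features: Dict[str, bool] = {}
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        # The value looks like "on", "off" or "off [fixed]"; the
        # "Features for <iface>:" header has an empty value and is skipped.
        state = value.split()
        if state and state[0] in ("on", "off"):
            features[name.strip()] = state[0] == "on"
    return features


def tunnel_offload_enabled(ethtool_k_output: str) -> bool:
    """True if any of the assumed tunnel offload features is reported as on."""
    features = parse_features(ethtool_k_output)
    return any(features.get(name, False) for name in TUNNEL_FEATURES)
```

The text passed in would be whatever `ethtool -k` prints for the VMXNET3-backed interface on the node.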

Given that NIC firmware configuration is not something most users are used to managing, I suggest implementing a transparent solution in Calico that disables the offload feature when Calico configures VXLAN on host interfaces backed by a VMXNET3 device.
To that effect, it looks like Calico already configures NIC driver settings: https://github.com/projectcalico/felix/blob/master/ethtool/ethtool.go
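To make the suggestion a bit more concrete, here is a minimal sketch of the logic I have in mind, written in Python for brevity rather than as a patch to Felix. It assumes the driver can be identified through the `/sys/class/net/<iface>/device/driver` symlink and that the offload is switched off through the ethtool features `tx-udp_tnl-segmentation` and `tx-udp_tnl-csum-segmentation`. The function names and the interface name `ens192` are hypothetical, and this is not how Calico currently behaves.

```python
import os
from typing import List


def nic_driver(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the kernel driver bound to iface, or "" if it cannot be read."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    try:
        # The symlink target ends in the driver name, e.g. ".../drivers/vmxnet3".
        return os.path.basename(os.readlink(link))
    except OSError:
        return ""


def offload_disable_cmd(iface: str) -> List[str]:
    """Build the ethtool invocation that turns the tunnel offload off.

    The feature names are an assumption, not taken from Calico or VMware docs.
    """
    return [
        "ethtool", "-K", iface,
        "tx-udp_tnl-segmentation", "off",
        "tx-udp_tnl-csum-segmentation", "off",
    ]


def plan_workaround(iface: str, sysfs_root: str = "/sys/class/net") -> List[str]:
    """Return the command to run for VMXNET3-backed interfaces, else []."""
    if nic_driver(iface, sysfs_root) == "vmxnet3":
        return offload_disable_cmd(iface)
    return []
```

For example, `plan_workaround("ens192")` would return the ethtool command on an affected VM and an empty list on hosts using a different NIC driver, so other setups are left untouched.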

Steps to Reproduce (for bugs)

  1. Provision VMs on vSphere version 6.7u2 or later using one of the following operating systems: CentOS/RHEL/Oracle 8.3, SLES 15 SP2/SP3
  2. Install a Kubernetes cluster on the nodes
  3. Install Calico with the VXLAN overlay following the official docs, e.g.:

Context

VXLAN packets are dropped in the Linux network stack due to incorrect checksums of the inner packets. These incorrect checksums occur when VXLAN hardware offload is enabled on the VMXNET3 interface (which recent Linux versions do by default) and a VXLAN overlay network is created in the guest OS on a port other than 8472 (when NSX is not used) or 4789 (when NSX is used).
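The condition described above can be written down as a small predicate. This is just my reading of the VMware guidance expressed as code, with a made-up function name, not something taken from Calico or the driver.

```python
def is_affected(offload_enabled: bool, vxlan_port: int, nsx_in_use: bool) -> bool:
    """True if inner checksums are expected to be wrong for this setup.

    Per the description above, the only unaffected port is 8472 without NSX
    and 4789 with NSX; any other port combined with the offload is a problem.
    """
    unaffected_port = 4789 if nsx_in_use else 8472
    return offload_enabled and vxlan_port != unaffected_port
```

With Calico on port 4789, no NSX, and the offload enabled by the driver, `is_affected(True, 4789, False)` evaluates to `True`, which matches the environment in this report.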

References:

Your Environment

  • Calico version 3.19.1
  • Orchestrator version: Kubernetes 1.19.12 (RKE)
  • Operating System and version: CentOS/RHEL 8.3, SLES 15 SP2
