
Conntrack tables having stale entries for UDP connection #125467

@mohideenibrahim08

Description


What happened?

We experienced an EC2 node failure within our EKS cluster. The affected node was running two CoreDNS pods, which are responsible for DNS resolution in our Kubernetes cluster. Envoy talks to CoreDNS over UDP. After these CoreDNS pods were terminated, Envoy kept attempting DNS queries against the terminated pod IP.
kube-proxy failed to remove the corresponding entries from the conntrack table, so some Envoy pods continued to send DNS traffic to the terminated CoreDNS pod IP. Once we restarted the Envoy pods, the entries were refreshed and the DNS timeout issue was resolved.

Mapping in the conntrack table for source pod IP 10.103.83.53, UDP protocol.

Query: conntrack -p udp -L --src 10.103.83.53

Response: udp 17 27 src=10.103.83.53 dst= sport=21667 dport=53 [UNREPLIED] src=10.103.78.37 dst=10.103.83.53 sport=53 dport=21667 mark=0 use=1 conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown
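For reference, the [UNREPLIED] flag in the response above is what marks the entry as stale: the flow never saw a return packet from the dead CoreDNS pod. A minimal sketch that checks for the flag in the captured entry (the final conntrack -D command is a possible manual workaround, shown commented out because it must run as root on the node):

```shell
# The conntrack entry captured above, stored for inspection.
entry='udp 17 27 src=10.103.83.53 dst= sport=21667 dport=53 [UNREPLIED] src=10.103.78.37 dst=10.103.83.53 sport=53 dport=21667 mark=0 use=1'

# [UNREPLIED] means no return traffic was ever seen on this flow --
# exactly what a query to a terminated CoreDNS pod looks like.
printf '%s\n' "$entry" | grep -o '\[UNREPLIED\]'

# Possible manual workaround (root on the node): delete the stale flows so
# the next DNS packet creates a fresh entry toward a live CoreDNS endpoint.
# conntrack -D -p udp --src 10.103.83.53
```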

What did you expect to happen?

kube-proxy should update or refresh this conntrack table.
conntrack should not retain stale UDP connection entries.

How can we reproduce it (as minimally and precisely as possible)?

kube-proxy version we tested with: kube-proxy:v1.29.4-minimal-eksbuild.1

which includes this fix as well: #119249

Steps we followed in our EKS cluster to simulate this issue:

  • Remove podAntiAffinity from the CoreDNS deployment
  • Identify the node on which we want to concentrate the CoreDNS pods, and cordon it
  • Evict workloads from that node
  • Annotate that node so it does not get scaled down
  • Uncordon that node, and cordon the rest of the nodes
  • Delete the two CoreDNS pods
  • Ensure they get scheduled onto the targeted node
  • Log in to that node, install tmux, and bring down all the network interfaces to simulate the 2/2 failure
  • Notice the node going into the NotReady state, and check for the DNS c-ares errors
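The steps above roughly correspond to the following command sequence (a sketch, not the exact commands we ran; <target-node>, <other-node> and eth0 are placeholders, and the scale-down annotation assumes cluster-autoscaler):

```shell
# Cordon the target node, evict its workloads, and pin it against scale-down
$ kubectl cordon <target-node>
$ kubectl drain <target-node> --ignore-daemonsets --delete-emptydir-data
$ kubectl annotate node <target-node> cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# Make the target node the only schedulable node
$ kubectl uncordon <target-node>
$ kubectl cordon <other-node-1> <other-node-2>

# Force both CoreDNS pods to reschedule onto the target node
$ kubectl -n kube-system delete pod -l k8s-app=kube-dns

# On the target node: bring the network down to simulate the 2/2 failure
$ sudo ip link set eth0 down
```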

Anything else we need to know?

Kubernetes version

$ kubectl version
Server Version: v1.29.4-eks-036c24b

Cloud provider

AWS

OS version

# On Linux:
$ cat /etc/os-release
      NAME="Amazon Linux"
      VERSION="2"
      ID="amzn"
      ID_LIKE="centos rhel fedora"
      VERSION_ID="2"
      PRETTY_NAME="Amazon Linux 2"
      ANSI_COLOR="0;33"
      CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
      HOME_URL="https://amazonlinux.com/"
      SUPPORT_END="2025-06-30"
$ uname -a
Linux ip-10-185-97-105.ec2.internal 5.10.215-203.850.amzn2.aarch64 #1 SMP Tue Apr 23 20:32:21 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Metadata

Labels

area/kube-proxy
kind/bug: Categorizes issue or PR as related to a bug.
sig/network: Categorizes an issue or PR as relevant to SIG Network.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
