
Conversation

@julianwiedmann (Member) commented Nov 8, 2024

Once this PR is merged, a GitHub action will update the labels of these PRs:

 #35098

@maintainer-s-little-helper bot added the labels backport/1.16 (This PR represents a backport for Cilium 1.16.x of a PR that was merged to main) and kind/backports (This PR provides functionality previously merged into master) on Nov 8, 2024
@julianwiedmann (Member Author) commented:

/test-backport-1.16

@jschwinger233 (Member) commented:

The broken connection is from a pod to a local pod via ingress (pod-to-ingress-service/pod-to-ingress-service). The expected flow is:

forward direction:

  1. from_lxc recognizes that the dst IP is an ingress IP, so it sets the proxy mark on skb1 and passes it to the stack
  2. the iptables TPROXY rule delivers the marked skb1 to the Envoy proxy
  3. Envoy establishes a new connection to the real backend, which is another local pod, and sends skb2 to the stack
  4. skb2 is routed to cilium_host (tunnel routing); from_host there will bpf_redirect skb2 to the destination lxc device
  5. the stack takes care of the remaining delivery

return direction:

  1. from_lxc recognizes that skb3 needs to be hijacked by the proxy, sets the proxy mark, and passes it to the stack
  2. iptables tproxy ...
  3. envoy sends skb4 ...
  4. from_host at cilium_host delivers skb4 to lxc ...

This PR breaks step 1 in the return direction. The expected datapath behavior is to punt to the stack for the proxy hijack, but this patch causes the skb to be fib_redirected to eth0.

But why main doesn't suffer from this issue is still unclear to me 😬

Looking at handle_ipv4_from_lxc(), the if (ct->proxy_redirect) { redirect_to_proxy4(); } check (https://github.com/cilium/cilium/blob/v1.16.3/bpf/bpf_lxc.c#L919) sits ahead of goto maybe_pass_to_stack (https://github.com/cilium/cilium/blob/v1.16.3/bpf/bpf_lxc.c#L1231). My guess is that the main-branch datapath stores the correct CT entry, so the return packet takes the redirect_to_proxy4() path, but 1.16 somehow misses the ct->proxy_redirect condition. Maybe it has something to do with proxy behavior, e.g. transparent vs. non-transparent, so the CT entry doesn't match? A simplified sketch of the intended ordering follows below.
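
For illustration, here is a minimal, self-contained model of that ordering. The struct and helpers are stubs with assumed names, not the actual bpf_lxc.c code; the real handle_ipv4_from_lxc() is a BPF program with far more branches:

```c
#include <stdbool.h>
#include <stdio.h>

struct ct_entry {
	bool proxy_redirect; /* set when the connection is hijacked by the L7 proxy */
};

static const char *lxc_egress_verdict(const struct ct_entry *ct, bool bpf_host_routing)
{
	/* Step 1 of the return direction: a reply on a proxy-hijacked
	 * connection must be marked and punted to the stack so the iptables
	 * TPROXY rule can deliver it to Envoy (redirect_to_proxy4()). */
	if (ct->proxy_redirect)
		return "punt to stack for proxy (redirect_to_proxy4)";

	/* Only non-proxy traffic should reach the host-routing fast path.
	 * Sending proxy replies down this branch (fib_redirect to eth0)
	 * is exactly the breakage described above. */
	if (bpf_host_routing)
		return "fib_redirect to native device (eth0)";

	return "pass to stack (legacy host routing)";
}

int main(void)
{
	struct ct_entry proxied = { .proxy_redirect = true };
	struct ct_entry plain = { .proxy_redirect = false };

	printf("proxy reply:   %s\n", lxc_egress_verdict(&proxied, true));
	printf("plain egress:  %s\n", lxc_egress_verdict(&plain, true));
	return 0;
}
```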

@thejosephstevens (Contributor) commented Nov 16, 2024

Hey all! I was pointed here from #35977 since I encountered this bug in my GKE clusters. Since it looked like there were still some open questions around exactly what's going on, I wanted to provide details on what we saw. We're running the same Cilium version in EKS and GKE, but only saw the failures in GKE.

edit: Note that this issue does not occur in AKS with BYOCNI; it is limited to GKE.

What I can provide are the Helm values we use for each cloud (below). The differences I know of are:

# Differences (overrides in GKE)
cni.binPath = /home/kubernetes/bin
nodeInit.enabled = true
nodeInit.reconfigureKubelet = true
nodeInit.removeCbrBridge = true
ipam.mode = kubernetes
bpf.hostLegacyRouting = true # this was added to fix the bug in GKE

# EKS - no issues
values:
  agentNotReadyTaintKey: ignore-taint.cluster-autoscaler.kubernetes.io/cilium-agent-not-ready
  aksbyocni:
    enabled: false
  authentication:
    mutual:
      spire:
        enabled: true
        install:
          enabled: true
          existingNamespace: true
          namespace: kube-system
  bpf:
    masquerade: true
  cni:
    binPath: /opt/cni/bin
  devices: eth+
  encryption:
    enabled: true
    nodeEncryption: true
    type: wireguard
  envoy:
    enabled: true
  gatewayAPI:
    enableAlpn: true
    enableAppProtocol: true
    enabled: true
    secretsNamespace:
      create: false
      name: kube-system
  hubble:
    listenAddress: :4244
    metrics:
      enabled:
      - dns
      - drop
      - tcp
      - flow
      - icmp
      - http
    relay:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: pool
                operator: In
                values:
                - default
                - control
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                k8s-app: cilium
            topologyKey: kubernetes.io/hostname
      enabled: true
    tls:
      auto:
        method: cronJob
    ui:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: pool
                operator: In
                values:
                - default
                - control
      enabled: true
  ingressController:
    enabled: false
  ipam:
    mode: cluster-pool
  k8sServiceHost: REDACTED
  k8sServicePort: 443
  kubeProxyReplacement: true
  l7Proxy: true
  loadBalancer:
    serviceTopology: true
  localRedirectPolicy: true
  nodeinit:
    enabled: false
    reconfigureKubelet: false
    removeCbrBridge: false
  operator:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: pool
              operator: In
              values:
              - default
              - control
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              io.cilium/app: operator
          topologyKey: kubernetes.io/hostname
    prometheus:
      enabled: true
  prometheus:
    enabled: true
  upgradeCompatibility: "1.9"
  wellKnownIdentities:
    enabled: true
---
# GKE - this failed until `bpf.hostLegacyRouting` was set to `true`
values:
  agentNotReadyTaintKey: ignore-taint.cluster-autoscaler.kubernetes.io/cilium-agent-not-ready
  aksbyocni:
    enabled: false
  authentication:
    mutual:
      spire:
        enabled: true
        install:
          enabled: true
          existingNamespace: true
          namespace: kube-system
  bpf:
    hostLegacyRouting: true
    masquerade: true
  cni:
    binPath: /home/kubernetes/bin
  devices: eth+
  encryption:
    enabled: true
    nodeEncryption: true
    type: wireguard
  envoy:
    enabled: true
  gatewayAPI:
    enableAlpn: true
    enableAppProtocol: true
    enabled: true
    secretsNamespace:
      create: false
      name: kube-system
  hubble:
    listenAddress: :4244
    metrics:
      enabled:
      - dns
      - drop
      - tcp
      - flow
      - icmp
      - http
    relay:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: pool
                operator: In
                values:
                - default
                - control
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                k8s-app: cilium
            topologyKey: kubernetes.io/hostname
      enabled: true
    tls:
      auto:
        method: cronJob
    ui:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: pool
                operator: In
                values:
                - default
                - control
      enabled: true
  ingressController:
    enabled: false
  ipam:
    mode: kubernetes
  k8sServiceHost: REDACTED
  k8sServicePort: 443
  kubeProxyReplacement: true
  l7Proxy: true
  loadBalancer:
    serviceTopology: true
  localRedirectPolicy: true
  nodeinit:
    enabled: true
    reconfigureKubelet: true
    removeCbrBridge: true
  operator:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: pool
              operator: In
              values:
              - default
              - control
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              io.cilium/app: operator
          topologyKey: kubernetes.io/hostname
    prometheus:
      enabled: true
  prometheus:
    enabled: true
  upgradeCompatibility: "1.9"
  wellKnownIdentities:
    enabled: true

Hopefully this is helpful!

@sayboras mentioned this pull request Nov 20, 2024
[ upstream commit 1adcc15 ]

When BPF host routing is enabled together with tunnel routing,
encap_and_redirect_lxc() returns DROP_NO_TUNNEL_ENDPOINT for pod-to-world
traffic, which then goes up the stack instead of being fib_redirected.
This patch ensures that to-world traffic follows the expected path.

Please note that we already handled this correctly for IPv6, so only IPv4
is being amended.

Fixes: cilium#35023

Signed-off-by: gray <gray.liang@isovalent.com>
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
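
For context, the decision the commit message describes can be sketched as follows. The return codes mirror the upstream names, but everything else is a stub with assumed names; this is not the actual diff:

```c
#include <stdio.h>

enum verdict {
	CTX_ACT_REDIRECT,        /* packet already redirected to the tunnel device */
	DROP_NO_TUNNEL_ENDPOINT, /* destination has no tunnel peer (outside the cluster) */
};

/* Stub for encap_and_redirect_lxc(): only in-cluster destinations have a
 * tunnel endpoint. */
static enum verdict encap_and_redirect_lxc(int dst_in_cluster)
{
	return dst_in_cluster ? CTX_ACT_REDIRECT : DROP_NO_TUNNEL_ENDPOINT;
}

static const char *ipv4_egress_path(int dst_in_cluster, int bpf_host_routing)
{
	enum verdict ret = encap_and_redirect_lxc(dst_in_cluster);

	if (ret == DROP_NO_TUNNEL_ENDPOINT) {
		/* Pod-to-world traffic: with BPF host routing the packet
		 * should fall through to the fib redirect on the native
		 * device, as the IPv6 path already does, instead of going
		 * up the stack. */
		return bpf_host_routing ? "fib_redirect to native device"
					: "pass to stack";
	}
	return "redirected via tunnel";
}

int main(void)
{
	printf("in-cluster dst: %s\n", ipv4_egress_path(1, 1));
	printf("world dst:      %s\n", ipv4_egress_path(0, 1));
	return 0;
}
```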
@julianwiedmann force-pushed the v1.16-host-routing-tunnel branch from e0a4463 to 6ba07bd on November 25, 2024 06:55
@julianwiedmann (Member Author) commented:

/test-backport-1.16

@julianwiedmann (Member Author) commented:

(my PR branch didn't have Jarno's endpoint changes 😬)

@jschwinger233 (Member) commented:

ci-clustermesh turns green! 🟢

@julianwiedmann marked this pull request as ready for review November 29, 2024 12:23
@julianwiedmann requested a review from a team as a code owner November 29, 2024 12:23
@julianwiedmann added this pull request to the merge queue Nov 29, 2024
Merged via the queue into cilium:v1.16 with commit 2932ac4 Nov 29, 2024
64 checks passed
@julianwiedmann deleted the v1.16-host-routing-tunnel branch November 29, 2024 12:31