Conversation

ajmmm
Member

@ajmmm ajmmm commented Jul 23, 2025

PR to resolve a forwarding bug in data path.

This work is derived from work undertaken by Mikael Knutsson (mikael.knutsson@gmail.com), who has been credited in the relevant commit logs.

Summary of issue

Given a LoadBalancer or ClusterIP configured on a specific L4 destination port and protocol, ingress traffic towards that IP on an alternate destination port or protocol will not match the eBPF service map lookup, so the traffic is punted back to the network. Where the upstream network re-delivers the traffic to Cilium, a packet loop forms.

Summary of Changes

  • Introduces the concept of "wildcard" service entries in eBPF service maps: a service entry with zero destination port and protocol fields.
  • LoadBalancer and ClusterIP frontend services with external scope will be cross-referenced to a wildcard entry:
    • The first reference will trigger addition of wildcard entry in eBPF
    • The last reference deleted will trigger removal of wildcard entry from eBPF
  • Data path behaviour modified on traffic ingress. Where service lookup fails, an additional lookup is undertaken with zero destination and protocol. On successful match to such a wildcard entry, traffic is dropped.
  • Removal of data path logic that maintained compatibility with eBPF maps programmed by older versions of Cilium. Protocol differentiation was deprecated in 1.17, and automatic conversion of undifferentiated protocol was added in 1.18.
  • Appropriate unit test updates for the new logic
  • Minor typo fixes in unit test logic
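The lookup fallback described above can be sketched as follows. This is a minimal illustration of the two-stage lookup, not Cilium's actual datapath code; the types, names, and verdict strings are hypothetical stand-ins for the eBPF service map structures:

```go
package main

import "fmt"

// svcKey mirrors the shape of a service map key: a wildcard entry
// uses a zero destination port and zero protocol.
type svcKey struct {
	addr  string
	dport uint16
	proto uint8
}

type verdict string

const (
	forward verdict = "FORWARD"
	punt    verdict = "PUNT" // hand the packet back to the network stack
	drop    verdict = "DROP_NO_SERVICE"
)

// lookup emulates the two-stage service lookup: exact match first,
// then a wildcard probe with DPort=0 and Protocol=0. A wildcard hit
// means the VIP is known but the port/protocol combination is not,
// so the packet is dropped rather than punted back to the network.
func lookup(services map[svcKey]bool, k svcKey) verdict {
	if services[k] {
		return forward
	}
	if services[svcKey{addr: k.addr}] { // wildcard: dport=0, proto=0
		return drop
	}
	return punt
}

func main() {
	services := map[svcKey]bool{
		{"172.18.255.200", 8000, 6}: true, // TCP/8000 frontend
		{"172.18.255.200", 0, 0}:    true, // wildcard entry
	}
	fmt.Println(lookup(services, svcKey{"172.18.255.200", 8000, 6})) // FORWARD
	fmt.Println(lookup(services, svcKey{"172.18.255.200", 8001, 6})) // DROP_NO_SERVICE
	fmt.Println(lookup(services, svcKey{"172.18.0.99", 80, 6}))      // PUNT
}
```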

Release Notes:

Introduce wildcard service entries to ensure traffic towards LoadBalancer and ClusterIP addresses with an unknown protocol/port combination is dropped by the data path, rather than being forwarded back to the network.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 23, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jul 23, 2025
@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch 12 times, most recently from bd5a7e7 to f5a6f60 on July 29, 2025 14:01
@ajmmm ajmmm marked this pull request as ready for review July 29, 2025 14:31
@ajmmm ajmmm requested review from a team as code owners July 29, 2025 14:31
@ajmmm ajmmm requested a review from dylandreimerink July 29, 2025 14:31
@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch from fb75646 to adf633a on July 29, 2025 16:02
@ajmmm ajmmm requested a review from joamaki July 29, 2025 16:10
@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch from adf633a to c391c35 on July 31, 2025 08:50
@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch 4 times, most recently from 8b5e388 to f408d79 on August 1, 2025 08:42
Member

@dylandreimerink dylandreimerink left a comment

This looks great overall. One point of feedback though.

@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch from f408d79 to 6b2266c on August 1, 2025 11:27
@ajmmm
Member Author

ajmmm commented Aug 6, 2025

Notes from testing for general info:

Setup

I've used a kind cluster with an app deployment allowing me to run a TCP service on port 8000. I've also had to configure Cilium with:

kubeProxyReplacement: true
l2announcements:
  enabled: true
l2NeighDiscovery:
  enabled: true

This is so that I can get host-to-LB-VIP access directly, e.g.:

$ kubectl -n default get pod netshoot-server-6b478b859-g5djl -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                     NOMINATED NODE   READINESS GATES
netshoot-server-6b478b859-g5djl   1/1     Running   0          14m   10.244.1.145   cilium-testing-worker2   <none>           <none>

$ kubectl -n default get service netshoot-service -owide
NAME               TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)          AGE   SELECTOR
netshoot-service   LoadBalancer   10.96.80.251   172.18.255.200   8000:30990/TCP   15m   app=netshoot-server

$ arp
Address                  HWtype  HWaddress           Flags Mask            Iface
_gateway                 ether   86:2f:57:c7:2e:64   C                     enp0s1
172.18.0.4               ether   ce:b8:de:ea:41:c7   C                     br-ec86dd730695
172.18.0.3               ether   da:a4:c5:dd:95:50   C                     br-ec86dd730695
172.18.255.200           ether   da:a4:c5:dd:95:50   C                     br-ec86dd730695
172.18.0.2               ether   6a:06:2c:eb:29:4d   C                     br-ec86dd730695

[ajm@el9-dev default-cluster]$ kubectl -n kube-system get pods -owide | grep -E 'NAME|proxy'
NAME                                                   READY   STATUS    RESTARTS   AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
kube-proxy-hvr9r                                       1/1     Running   0          17m   172.18.0.4    cilium-testing-control-plane   <none>           <none>
kube-proxy-l4tnx                                       1/1     Running   0          17m   172.18.0.2    cilium-testing-worker          <none>           <none>
kube-proxy-v5tnd                                       1/1     Running   0          17m   172.18.0.3    cilium-testing-worker2         <none>           <none>

At this stage I can't recreate the packet loop described, but I can see evidence of the correct behaviour change.

Test

Server: iperf3 running in the netshoot-server pod on TCP port 8000
Client: netcat at host level with L2 connectivity into the 172.18/16 network

I can't yet see the issue in tcpdump but I can see some evidence in Hubble.

Before PR

OSS main as of commit 2d5a128

Happy path:
nc 172.18.255.200 8000 results in good TCP conversation:

Aug  6 15:21:02.426: 172.18.0.1:33488 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: SYN)
Aug  6 15:21:02.429: 172.18.0.1:33488 (world-ipv4) <- default/netshoot-service:8000 (ID:52286) to-network FORWARDED (TCP Flags: SYN, ACK)
Aug  6 15:21:02.429: 172.18.0.1:33488 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: ACK)

Aug  6 15:21:04.117: 172.18.0.1:33488 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: ACK, FIN)
Aug  6 15:21:04.119: 172.18.0.1:33488 (world-ipv4) <- default/netshoot-service:8000 (ID:52286) to-network FORWARDED (TCP Flags: ACK)
Aug  6 15:21:04.120: 172.18.0.1:33488 (world-ipv4) <- default/netshoot-service:8000 (ID:52286) to-network FORWARDED (TCP Flags: ACK, FIN)
Aug  6 15:21:04.120: 172.18.0.1:33488 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: ACK)

Unhappy path:
nc 172.18.255.200 8001 results in repeated TCP SYNs with no evidence of action, eg:

Aug  6 15:21:07.908: 172.18.0.1:56880 (world-ipv4) <> 172.18.255.200:8001 (world-ipv4) from-network FORWARDED (TCP Flags: SYN)

Aug  6 15:21:08.948: 172.18.0.1:56880 (world-ipv4) <> 172.18.255.200:8001 (world-ipv4) from-network FORWARDED (TCP Flags: SYN)

After PR

Repeating the same tests

Happy path:
nc 172.18.255.200 8000 results in good TCP conversation:

Aug  6 15:12:09.688: 172.18.0.1:42462 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: SYN)
Aug  6 15:12:09.692: default/netshoot-service:8000 (world-ipv4) <> 172.18.0.1:42462 (world-ipv4) to-network FORWARDED (TCP Flags: SYN, ACK)
Aug  6 15:12:09.692: 172.18.0.1:42462 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: ACK)

Aug  6 15:12:15.289: 172.18.0.1:42462 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: ACK, FIN)
Aug  6 15:12:15.294: default/netshoot-service:8000 (world-ipv4) <> 172.18.0.1:42462 (world-ipv4) to-network FORWARDED (TCP Flags: ACK, FIN)
Aug  6 15:12:15.294: 172.18.0.1:42462 (world-ipv4) <> default/netshoot-service:8000 (world-ipv4) from-network FORWARDED (TCP Flags: ACK)

Unhappy path:
nc 172.18.255.200 8001 results in repeated TCP SYNs but the data path now signals DROP_NO_SERVICE

Aug  6 15:13:17.590: 172.18.0.1:39256 (world-ipv4) <> 172.18.255.200:8001 (world-ipv4) from-network FORWARDED (TCP Flags: SYN)
Aug  6 15:13:17.590: 172.18.0.1:39256 (world-ipv4) <> 172.18.255.200:8001 (world-ipv4) Service backend not found DROPPED (TCP Flags: SYN)

Aug  6 15:13:18.619: 172.18.0.1:39256 (world-ipv4) <> 172.18.255.200:8001 (world-ipv4) from-network FORWARDED (TCP Flags: SYN)
Aug  6 15:13:18.619: 172.18.0.1:39256 (world-ipv4) <> 172.18.255.200:8001 (world-ipv4) Service backend not found DROPPED (TCP Flags: SYN)

@mikn

mikn commented Aug 6, 2025

To actually get the packet loop, the interface that receives the packet (when forwarded) must make the routing decision to forward it to an external router, and the external router must make the routing decision to forward the packet back to the origin.

This happens in standard public internet configurations if using the LB IPAM.

I am not sure how to recreate the loop itself on a single host without significant scaffolding however.

@msune
Member

msune commented Aug 7, 2025

nc 172.18.255.200 8001 results in repeated TCP SYNs with no evidence of action, eg:

Do you know where the packet ends? You can use pwru to check whether it's dropped or routed back.

I am not sure how to recreate the loop itself on a single host without significant scaffolding however.

First, if I understood the issue correctly - and @ajmmm and I have discussed it a bit - as of today, the host NS should route the packet back via the default GW, as the packet is punted to the kernel network stack.

If the routing were to be the problem, we can always add an explicit route in the host NS:

ip route add <service_VIP>/32 via <router_ip> dev <device_ingress>

But I suspect the packet is not looped back for other reasons in the kind setup of @ajmmm, or is dropped somewhere before reaching the router (maybe outside of the kind cluster).

Member

@borkmann borkmann left a comment

(lgtm, still needs a rebase)

Member

@borkmann borkmann left a comment

bpf side lgtm! some small nits when you rebase anyway.

@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch from 23c7aed to 5cf22be on August 13, 2025 10:43
@ajmmm
Member Author

ajmmm commented Aug 13, 2025

/test

@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch 2 times, most recently from 5f46cbb to da50a72 on August 13, 2025 16:02
@parlakisik
Contributor

/test

@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch from da50a72 to 1d29ad9 on August 22, 2025 08:33
@ajmmm
Member Author

ajmmm commented Aug 22, 2025

/test

@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch from 1d29ad9 to fe966e5 on August 22, 2025 11:14
@ajmmm
Member Author

ajmmm commented Aug 22, 2025

/test

@ajmmm
Member Author

ajmmm commented Aug 22, 2025

/ci-e2e-upgrade

@ajmmm
Member Author

ajmmm commented Aug 27, 2025

/ci-ginkgo

ajmmm added 8 commits August 27, 2025 10:28
…ice().

Previously, if the lookup of a service derived from the ingress 5-tuple resulted
in no match, we undertook an additional lookup with IPPROTO_ANY. This was
necessary to maintain backwards compatibility with earlier versions of Cilium
that did not support protocol differentiation when programming the data path.

Cilium 1.18 introduced logic to migrate IPPROTO_ANY service/backend BPF map
entries to an appropriate protocol automatically, such that the additional
lookups are no longer required.

Appropriate tests updated to include explicit lookups for IPPROTO_TCP,
IPPROTO_UDP, and mismatched protocol checks.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
This change modifies the behaviour of lbXX_lookup_service() such that if an external
service lookup does not find an entry, we undertake an additional lookup with DPort=0
and Protocol=0. This will facilitate a wildcard lookup that, when programmed with a
suitable wildcard entry by the control plane, will trigger the data path to drop
unmatched ingress traffic.

This change also modifies the semantics of these routines such that they no longer
propagate changes to the key protocol field unless a wildcard service entry was
found in the external scope and returned; in that case, the dport and protocol
fields will be zero.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
Signed-off-by: Mikael Knutsson <mikael.knutsson@gmail.com>
…paths.

Previously, lbX_lookup_service() would zeroise the key protocol field,
such that callers would need to reinstate the key protocol field if their
logic re-used the key.

Following the previous commit, the lbX_lookup_service() routines will now
only zeroise the key protocol (and dport) fields if an external wildcard
entry is returned. Thus, the restoration of key values can be removed.

This commit also removes lbX_key_set_protocol() helpers as they became
broadly unnecessary.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
This set of tests verifies that if we receive a packet towards a VIP on either
an unknown destination port or an unknown protocol, the data plane will correctly
drop it if a wildcard entry is present in the service map.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
This change extends the nodePortAddrKey structure to include the L4 protocol
number of an address, such that addresses with differing protocols will be
better distributed in the nodePortAddrByPort map.

This required the introduction of a helper routine to translate the L4Addr
protocol field into the real IANA number.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
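The helper described in the commit above can be sketched like this. This is an illustrative mapping using the standard IANA protocol numbers; the function name and signature are assumptions, not Cilium's actual helper:

```go
package main

import "fmt"

// protoToIANA maps an L4 protocol name (as carried in an L4Addr-style
// string field) to its IANA protocol number. The numeric values are
// the standard IANA assignments; the name/shape of this helper is
// hypothetical.
func protoToIANA(proto string) (uint8, error) {
	switch proto {
	case "TCP":
		return 6, nil
	case "UDP":
		return 17, nil
	case "SCTP":
		return 132, nil
	case "ANY", "NONE", "":
		return 0, nil // undifferentiated / wildcard
	default:
		return 0, fmt.Errorf("unknown protocol %q", proto)
	}
}

func main() {
	for _, p := range []string{"TCP", "UDP", "SCTP", "ANY"} {
		n, _ := protoToIANA(p)
		fmt.Printf("%s=%d\n", p, n)
	}
}
```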
Introduce logic in bpf_reconciler to map frontend service entries to wildcard
service entries, and program them in the data path, for LoadBalancer and ClusterIP
frontends of external scope. This will ensure the data path drops traffic towards
unknown destination ports and/or protocols.

Wildcard state is tracked as a map (indexed by IP Address) that contains a slice
of Frontend ServiceIDs. The first entry that requires a wildcard will trigger
programming of the data path entry. The last entry that depends on a wildcard
entry to be removed will trigger removal of the wildcard entry.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
Signed-off-by: Mikael Knutsson <mikael.knutsson@gmail.com>
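The first-reference/last-reference semantics described above can be sketched as a small reference tracker. This is a hedged illustration, not the actual bpf_reconciler code; the type and field names are hypothetical:

```go
package main

import "fmt"

// wildcardTracker tracks which frontend service IDs reference a
// wildcard entry for each VIP. The first reference for a VIP should
// trigger programming of the eBPF wildcard entry; removing the last
// reference should trigger its removal.
type wildcardTracker struct {
	refs map[string][]uint16 // VIP -> referencing frontend service IDs
}

// addRef records a reference and returns true when the wildcard
// entry must be programmed (this was the first reference).
func (t *wildcardTracker) addRef(vip string, id uint16) bool {
	first := len(t.refs[vip]) == 0
	t.refs[vip] = append(t.refs[vip], id)
	return first
}

// delRef removes a reference and returns true when the wildcard
// entry must be deleted (this was the last reference).
func (t *wildcardTracker) delRef(vip string, id uint16) bool {
	ids := t.refs[vip]
	for i, v := range ids {
		if v == id {
			ids = append(ids[:i], ids[i+1:]...)
			break
		}
	}
	if len(ids) == 0 {
		delete(t.refs, vip)
		return true
	}
	t.refs[vip] = ids
	return false
}

func main() {
	t := &wildcardTracker{refs: map[string][]uint16{}}
	fmt.Println(t.addRef("172.18.255.200", 1)) // true: program wildcard
	fmt.Println(t.addRef("172.18.255.200", 2)) // false: already programmed
	fmt.Println(t.delRef("172.18.255.200", 1)) // false: a reference remains
	fmt.Println(t.delRef("172.18.255.200", 2)) // true: remove wildcard
}
```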
Update the unit tests for loadbalancer and ciliumenvoyconfig to include
wildcard entries.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
Remove lb_v4_add_mixed_proto_service_with_flags() as it no longer makes sense.

Signed-off-by: Alasdair McWilliam <alasdair.mcwilliam@isovalent.com>
@ajmmm ajmmm force-pushed the pr/slb-drop-wildcard-vip branch from fe966e5 to abf0fec on August 27, 2025 09:28
@ajmmm
Member Author

ajmmm commented Aug 27, 2025

/test

@ajmmm
Member Author

ajmmm commented Aug 27, 2025

The ci-ginkgo failure is likely a flake - it happens across different tests on main, e.g.:

https://github.com/cilium/cilium/actions/workflows/conformance-ginkgo.yaml?query=is%3Afailure+branch%3Amain

@borkmann borkmann merged commit 3091b2e into cilium:main Aug 27, 2025
67 of 68 checks passed
Labels
kind/community-contribution This was a contribution made by a community member. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
8 participants