Skip to content

Conversation

borkmann
Copy link
Member

@borkmann borkmann commented Jan 16, 2025

(see commit desc)

Fixes: #26709

bpf: Address backend selection under session affinity with Maglev

No functional change, refactor the helper so that custom sport can be
passed into the hashing function.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann borkmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/misc This PR makes changes that have no direct user impact. area/loadbalancing Impacts load-balancing and Kubernetes service implementations needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 16, 2025
@borkmann borkmann requested review from aspsk and mhofstetter January 16, 2025 09:28
@borkmann borkmann requested a review from a team as a code owner January 16, 2025 09:28
Kubernetes doc says [0]:

  If you want to make sure that connections from a particular client are
  passed to the same Pod each time, you can select the session affinity
  based on the client's IP addresses by setting `.spec.sessionAffinity`
  to `ClientIP` for a Service (the default is `None`).

In Cilium we do support session affinity natively [1]. Now, we also support
Maglev consistent hashing. The latter ensures consistent backend selection
across nodes (as opposed to a random pick).

When session affinity is in play with Maglev, then while it consistently
picks backends respecting client IP on a per-node basis, it does not pick
the same backend across nodes. This is due to Maglev taking in the source
port as part of the hash for the Maglev LUT lookup, so if the source port
changes, then backend selection becomes inconsistent.

This has been reported by Sebastian Zengerle:

  I am testing Cilium L4 LB with Maglev Algorithm. To test Maglev I've
  deployed 3 nodes connected via BGP to the uplink router and used the
  LB IPAM feature. To test Maglev under routing changes (e.g. node
  rolling upgrade will lead to re-hashing on the uplink router) I am
  influencing the prefix metrics on the uplink router in a way that I can
  control to which worker the ECMP anycast traffic is sent. When I use
  static source ports session Affinity is working fine. All 3 workers
  select the same backend_id and sessionAffinity is working.

  But if I am using dynamic ports while I still shuffle the selected
  worker by the uplink router sessionAffinity is not working. I replaced
  dynamic ports with different local-ports in curl to have reproducible
  results. The sessionAffinity of the svc is set to ClientIP - so my
  expectation was at the beginning that all workers will select the
  same backend based on the clientId - but source port influences backend
  selection. More a problem is that the last step would break a connection
  when a routing change is happening. In the maps I can see that the
  backend_id decision on worker 1 was persisted (which was based on
  source port 12000) - while on the others it is based on 12001.
  Therefore workers are using different backend_id.

In order to keep the promise of Maglev, we need to remove the source port
from the hash to fix this situation.

Fixes: #26709
Reported-by: Sebastian Zengerle <sebastian.zengerle@googlemail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://kubernetes.io/docs/reference/networking/virtual-ips/#session-affinity [0]
Link: https://docs.cilium.io/en/latest/network/kubernetes/kubeproxy-free/#session-affinity [1]
@borkmann borkmann force-pushed the pr/bpf-session-maglev branch from 12747dc to f3c6d32 Compare January 16, 2025 09:30
@borkmann
Copy link
Member Author

/test

Copy link
Contributor

@aspsk aspsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Member

@mhofstetter mhofstetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Maybe worth to add a note to the Cilium docs (chapter maglev & session afinity) about this special case when they are used in combination?

@mhofstetter mhofstetter added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/misc This PR makes changes that have no direct user impact. labels Jan 16, 2025
@mhofstetter
Copy link
Member

Also changed release note to minor to highlight this a little bit more

Add a note into the session affinity section about the behavior under
Maglev.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@borkmann borkmann requested review from a team as code owners January 16, 2025 11:13
@qmonnet
Copy link
Member

qmonnet commented Jan 16, 2025

/test

@qmonnet qmonnet enabled auto-merge January 16, 2025 12:00
@borkmann
Copy link
Member Author

(envoy embedded failure tracked in #36998)

@qmonnet qmonnet added this pull request to the merge queue Jan 16, 2025
Merged via the queue into main with commit 10f1de3 Jan 16, 2025
311 checks passed
@qmonnet qmonnet deleted the pr/bpf-session-maglev branch January 16, 2025 14:41
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jan 16, 2025
@rastislavs rastislavs mentioned this pull request Jan 21, 2025
45 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 21, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Jan 22, 2025
carlos-abad added a commit to carlos-abad/cilium that referenced this pull request Jan 28, 2025
The Maglev functionality doesn't have especific tests fro Maglev. This
change adds a test that checks that a client that uses different
source ports get routed to the same backend.

I had to split the Maglev eBPF maps in the main lb.h file because
BPF doesn't allow to create the inner map during unit tests.

I also had to modify node_config.h to program some preprocessor
macros differently during Maglev unit tests. Otherwise I couldn't
modify LB_MAGLEV_LUT_SIZE and LB_SELECTION before it was
interpreted by the proeprocessor in lb.h.

Tests: cilium#37005

Signed-off-by: Carlos Abad <carlosab@google.com>
carlos-abad added a commit to carlos-abad/cilium that referenced this pull request Mar 26, 2025
The Maglev functionality doesn't have especific tests fro Maglev. This
change adds a test that checks that a client that uses different
source ports get routed to the same backend.

I had to split the Maglev eBPF maps in the main lb.h file because
BPF doesn't allow to create the inner map during unit tests.

I also had to modify node_config.h to program some preprocessor
macros differently during Maglev unit tests. Otherwise I couldn't
modify LB_MAGLEV_LUT_SIZE and LB_SELECTION before it was
interpreted by the proeprocessor in lb.h.

Tests: cilium#37005

Signed-off-by: Carlos Abad <carlosab@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Backend Selection with sessionAffinity is failing with dynamic source ports
6 participants