cilium: Fix session affinity + maglev #37005
Conversation
No functional change, refactor the helper so that custom sport can be passed into the hashing function. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The Kubernetes docs say [0]: "If you want to make sure that connections from a particular client are passed to the same Pod each time, you can select the session affinity based on the client's IP addresses by setting `.spec.sessionAffinity` to `ClientIP` for a Service (the default is `None`)."

Cilium supports session affinity natively [1]. We also support Maglev consistent hashing, which ensures consistent backend selection across nodes (as opposed to a random pick). When session affinity is in play together with Maglev, backends are picked consistently per client IP only on a per-node basis, not across nodes. This is because Maglev takes the source port into the hash used for the Maglev LUT lookup, so if the source port changes, backend selection becomes inconsistent.

This has been reported by Sebastian Zengerle:

"I am testing Cilium L4 LB with the Maglev algorithm. To test Maglev I've deployed 3 nodes connected via BGP to the uplink router and used the LB IPAM feature. To test Maglev under routing changes (e.g. a node rolling upgrade will lead to re-hashing on the uplink router), I am influencing the prefix metrics on the uplink router in a way that lets me control which worker the ECMP anycast traffic is sent to. When I use static source ports, session affinity works fine: all 3 workers select the same backend_id and sessionAffinity is working. But if I use dynamic ports while I shuffle the selected worker via the uplink router, sessionAffinity is not working. I replaced dynamic ports with different local ports in curl to get reproducible results. The sessionAffinity of the svc is set to ClientIP, so my expectation was that all workers would select the same backend based on the client IP; instead, the source port influences backend selection. More of a problem is that the last step would break a connection when a routing change happens.

In the maps I can see that the backend_id decision on worker 1 was persisted (based on source port 12000), while on the others it is based on 12001. Therefore the workers are using different backend_ids."

In order to keep the promise of Maglev, we need to remove the source port from the hash to fix this situation.

Fixes: #26709
Reported-by: Sebastian Zengerle <sebastian.zengerle@googlemail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://kubernetes.io/docs/reference/networking/virtual-ips/#session-affinity [0]
Link: https://docs.cilium.io/en/latest/network/kubernetes/kubeproxy-free/#session-affinity [1]
Compare: 12747dc to f3c6d32
/test
Looks great!
LGTM, thanks!
Maybe it's worth adding a note to the Cilium docs (Maglev and session affinity chapters) about this special case when the two are used in combination?
Also changed the release note accordingly.
Add a note into the session affinity section about the behavior under Maglev. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
/test
(envoy embedded failure tracked in #36998)
The Maglev functionality doesn't have specific unit tests. This change adds a test that checks that a client using different source ports gets routed to the same backend. I had to split the Maglev eBPF maps out of the main lb.h file because BPF doesn't allow creating the inner map during unit tests. I also had to modify node_config.h to set some preprocessor macros differently during Maglev unit tests; otherwise I couldn't override LB_MAGLEV_LUT_SIZE and LB_SELECTION before lb.h was interpreted by the preprocessor. Tests: cilium#37005 Signed-off-by: Carlos Abad <carlosab@google.com>
(see commit desc)
Fixes: #26709