loadbalancer: Implement UDP socket termination #39012
Conversation
Force-pushed from 7b9b15d to 58afb77
Force-pushed from 58afb77 to c4cede4
Force-pushed from ad75187 to 26593a1
Force-pushed from 26593a1 to 2f3b136
/test
I cherry-picked my changes from #38693 on top of this branch and tested socket termination manually. Everything seemed to work; I saw the UDP socket get destroyed repeatedly.
ping @aditighag @tommyp1ckles
Force-pushed from 2f3b136 to d0fcf23
/test
ping @aditighag @tommyp1ckles
Sorry for the delay. Posted some comments from the first pass.
(1) Can you split the 2nd commit into preparatory changes (e.g., changes related to sock rev nat map creation/utility functions, etc) and changes porting the socket termination logic to the new LB control plane?
(2) How do we ensure the socket termination loop runs only on the relevant backend DB updates?
/test
There are multiple backend attributes, state being one of them - https://github.com/cilium/cilium/blob/main/pkg/loadbalancer/backend.go#L25-L25. Do we want to run the socket termination loop for backend updates corresponding to changes to attributes such as weight, clusterID, UnhealthyUpdatedAt, etc. that don't change its state?
It waits 50ms for changes to accumulate and then does a tree traversal to find the relevant ones. This is fast and lockless, so that should be fine. If we want to reduce the overhead of context switching and tree traversal, we can bump the wait up a bit to make this even cheaper. Alternatively, we could tie this into the reconciler so that when it deletes a backend it also queues up the socket destruction. You've got me questioning now whether this might be burning enough cycles looking through irrelevant changes that it's worth saving them. I'll run a benchmark today to quantify it.
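For illustration, here is a minimal sketch of the batch-and-filter pattern described above, using a plain channel and ticker in place of StateDB's change iterator and rate limiter; the `backendChange` type, its fields and `processChanges` are hypothetical names, not Cilium's actual API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backendChange is a stand-in for a StateDB change event on the backends table.
type backendChange struct {
	Deleted  bool
	Protocol string // "TCP" or "UDP"
	Address  string
}

// processChanges accumulates backend changes and handles them in batches,
// rate-limited to one pass per interval (e.g. 50ms). Only deleted UDP
// backends are acted upon; everything else is skipped cheaply.
func processChanges(ctx context.Context, interval time.Duration, events <-chan backendChange) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var pending []backendChange
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-events:
			pending = append(pending, ev)
		case <-ticker.C:
			for _, ev := range pending {
				// Cheap filter first: only deleted UDP backends matter.
				if ev.Protocol != "UDP" || !ev.Deleted {
					continue
				}
				// Placeholder for the actual socket termination
				// (SOCK_DESTROY via the sock_diag netlink interface).
				fmt.Println("would terminate UDP sockets connected to", ev.Address)
			}
			pending = pending[:0]
		}
	}
}

func main() {
	events := make(chan backendChange, 16)
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	events <- backendChange{Deleted: true, Protocol: "UDP", Address: "10.0.0.1:53"}
	processChanges(ctx, 50*time.Millisecond, events)
}
```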
0006d2f
to
df83588
Compare
Ok, I've added a benchmark for the iteration. We're spending 300ns per backend change, so for, say, 100 backend changes every 50ms that works out to roughly 600us of CPU time per second, which is about 0.06% of one CPU core (see the comment for more details). So even in a very, very busy cluster we'd spend very little on this, and it's unlikely that we'd have this sort of constant churn.

My argument here is that it's better to sacrifice a few CPU cycles to have an implementation that is decoupled: the other code has no control coupling to the UDP socket termination, and the more we implement in this style the easier the code is to work with. We strive to have each component in the system have a single concern, referential transparency and less control coupling. This is in fact the core design principle behind StateDB: we move away from tight coupling of components and imperative callbacks so that we don't end up with a big ball of mud (e.g. see pkg/service, pkg/endpoint, etc.) or reduce latency/throughput in unexpected ways, and also so that we avoid production incidents like deadlocks when one component directly calls into the black box of another (control coupling). That is why I find the trade-off of doing a little bit of extra work here much better than the alternatives (e.g. the BPF reconciler being directly coupled to this and triggering it when it deletes a backend).

EDIT: Ah, it's the backend instances lookup that IsAlive() does that takes most of the time. Adding a check for the protocol and skipping non-UDP backends brings the time down to 24ns/backend. So this can now process 41 million backend changes per second if the changes are all non-UDP. That seems fine?
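A quick back-of-the-envelope check of the figures quoted above (300ns per change before the protocol check, 24ns after), assuming 100 backend changes per 50ms batch; purely illustrative:

```go
package main

import "fmt"

func main() {
	const (
		changesPerBatch = 100.0
		batchesPerSec   = 1000.0 / 50.0 // one batch of changes every 50ms
	)
	for _, c := range []struct {
		name        string
		nsPerChange float64
	}{
		{"before the protocol check (300ns/change)", 300},
		{"after the protocol check (24ns/change)", 24},
	} {
		// CPU seconds spent per wall-clock second on processing changes.
		cpuSecondsPerSecond := c.nsPerChange * 1e-9 * changesPerBatch * batchesPerSec
		fmt.Printf("%s: %.0fus of CPU per second, %.4f%% of one core\n",
			c.name, cpuSecondsPerSecond*1e6, cpuSecondsPerSecond*100)
	}
}
```

This prints roughly 600us/s (0.06% of a core) for the original figure and 48us/s (0.0048%) after the protocol fast path, matching the numbers in the comment and the benchmark commit message.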
Force-pushed from df83588 to 0fa647a
/test |
The Current() and All() methods were missing, which made the linter unhappy. Signed-off-by: Jussi Maki <jussi@isovalent.com>
For implementing socket termination we'll need access to the SockRevNat BPF map. Add it to the LBMaps interface and expand implementations. Signed-off-by: Jussi Maki <jussi@isovalent.com>
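Purely as an illustration of the shape of such a change, this is roughly what exposing a BPF map through a maps interface can look like; the real LBMaps interface and the actual SockRevNat key/value layout in Cilium differ, and all names below are simplified stand-ins:

```go
// Package lbmapsketch is an illustrative sketch, not Cilium's real code.
package lbmapsketch

// sockRevNatKey and sockRevNatValue stand in for the real SockRevNat
// key/value types; the actual layout in Cilium is different.
type sockRevNatKey struct {
	Cookie uint64 // socket cookie of the connected socket
}

type sockRevNatValue struct {
	Address string
	Port    uint16
}

// lbMaps abstracts the load-balancer BPF maps. A real implementation wraps
// the kernel maps; a fake implementation can back them with in-memory maps
// for tests.
type lbMaps interface {
	// ...existing accessors elided...

	// IterateSockRevNat walks the socket reverse-NAT entries maintained by
	// the socket-LB.
	IterateSockRevNat(cb func(sockRevNatKey, sockRevNatValue)) error

	// DeleteSockRevNat removes a single entry.
	DeleteSockRevNat(key sockRevNatKey) error
}
```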
Add IsAlive() for checking whether a backend is "alive", that is, it has instances that are either active and healthy, or, if there are no active instances, a terminating one that is healthy. This will be used by a follow-up commit on socket termination to check whether a backend should be considered dead and the sockets connected to it terminated. Signed-off-by: Jussi Maki <jussi@isovalent.com>
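A minimal sketch of that liveness rule, read literally from the wording above; the instance type, states and field names are illustrative and the real Cilium implementation may differ in details:

```go
// Package livenesssketch is an illustrative sketch, not Cilium's real code.
package livenesssketch

type instanceState int

const (
	stateActive instanceState = iota
	stateTerminating
)

type instance struct {
	State   instanceState
	Healthy bool
}

// isAlive reports whether a backend should still be considered alive: it has
// at least one healthy active instance, or, when it has no active instances
// at all, at least one healthy terminating instance.
func isAlive(instances []instance) bool {
	hasActive := false
	for _, inst := range instances {
		if inst.State == stateActive {
			hasActive = true
			if inst.Healthy {
				return true
			}
		}
	}
	if hasActive {
		// Active instances exist but none are healthy; reading the rule
		// above literally, we do not fall back to terminating instances.
		return false
	}
	for _, inst := range instances {
		if inst.State == stateTerminating && inst.Healthy {
			return true
		}
	}
	return false
}
```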
This ports pkg/service/connections.go to the new control-plane to terminate connected UDP sockets when the backend is deleted or marked unhealthy. Signed-off-by: Jussi Maki <jussi@isovalent.com>
Port the 'TestTerminateUDPConnectionsToBackend' from pkg/loadbalancer/legacy/service/service_test.go to test the datapath side of the socket termination. Signed-off-by: Jussi Maki <jussi@isovalent.com>
Add a benchmark for validating the overhead of processing every backend change in the socket termination loop. Results on my machine:

goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/loadbalancer/reconciler
cpu: 13th Gen Intel(R) Core(TM) i9-13950HX
BenchmarkChangeIteration_TCP
BenchmarkChangeIteration_TCP-32    50   24211174 ns/op   41303242 backends/sec   24.21 ns/backend
BenchmarkChangeIteration_UDP
BenchmarkChangeIteration_UDP-32     5  247585791 ns/op    4039004 backends/sec   247.6 ns/backend
PASS

We're spending 24ns looking at each backend change. If we assume a busy cluster with, say, 100 backends changing every 50 milliseconds, we would be using 24ns * 100 * (1000/50) = 48000ns = 48us of CPU time every second, or in other words 100*(48000 / 10^9) = 0.0048% of a single CPU core. If all changes were for UDP backends we'd hit the IsAlive() check and we'd get 247ns * 100 * (1000/50) = 494us per second, which is 0.049% of a CPU core.

Increasing the rate limit interval to, say, 500ms or 1s would reduce the overhead of the revision and graveyard revision index tree traversals significantly, further reducing the CPU cost, but would increase the latency at which we react to dead backends. This seems like an acceptable trade-off to me, considering that the termination logic is now completely decoupled from everything else. The alternative would be to have the BPF reconciler queue up the terminations, but this would increase overall complexity.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
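For reference, a benchmark of this shape can be structured with the standard testing package roughly as in the sketch below (placed in a _test.go file); the change type and the filtering logic are stand-ins for the real StateDB change iteration in pkg/loadbalancer/reconciler:

```go
package sketch_test

import "testing"

// change is a stand-in for a backend change event.
type change struct {
	protocol string
	deleted  bool
}

// relevant applies the cheap protocol check first so that non-UDP churn
// stays on the fast path.
func relevant(c change) bool {
	return c.protocol == "UDP" && c.deleted
}

func BenchmarkChangeIteration(b *testing.B) {
	// All-TCP churn: the worst case for wasted work in the termination loop.
	changes := make([]change, 100_000)
	for i := range changes {
		changes[i] = change{protocol: "TCP"}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		n := 0
		for _, c := range changes {
			if relevant(c) {
				n++
			}
		}
		_ = n
	}
	// Report throughput in backends processed per second, similar to the
	// custom metric in the commit message above.
	b.ReportMetric(float64(len(changes))*float64(b.N)/b.Elapsed().Seconds(), "backends/sec")
}
```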
Force-pushed from 0fa647a to 332a720
@aditighag PTAL
/test
Looks good to me; any concerns were already brought up by other reviewers.
Missed following up -- mainly wanted to highlight the potential overhead of iterating kernel sockets unnecessarily. We can revisit this if users start reporting noticeable overhead.
This ports pkg/service/connections.go to the new control-plane.