loadbalancer: break the circular dependency with clustermesh #40786
Conversation
Force-pushed from ffee158 to b2bfd1a.
/ci-clustermesh
Force-pushed from b2bfd1a to b1d3d9b.
/ci-clustermesh
Force-pushed from b1d3d9b to d10b288.
/ci-clustermesh
All clustermesh tests passed three times in a row, so we should be good from that point of view. Performing a few final tests locally, before marking it ready for review.
Force-pushed from 8a3b99a to 1d01d80.
/test
In preparation for the subsequent introduction of a second resolver to short-circuit the service frontend to backend load-balancing process, let's slightly generalize the clustermesh common logic to allow configuring multiple resolvers.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's generalize the context dialer resolvers to allow mutating both the host and the port, rather than the host only, to support use cases such as automatic service backend selection. Additionally, let's always invoke all resolvers in sequential order, instead of exiting early once the first returns successfully, to support chained operations. Finally, the Resolver interface is changed to no longer return an error, given that the error was never actually used.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
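A minimal sketch of what such chained host/port resolvers could look like; the `Resolver` and `NewDialContext` names and signatures below are illustrative assumptions, not the actual Cilium API:

```go
// Illustrative sketch only: names and signatures are assumptions, not the
// actual Cilium implementation.
package dialer

import (
	"context"
	"net"
)

// Resolver may rewrite the host, the port, or both. It no longer returns an
// error, as callers simply keep the previous values when nothing matches.
type Resolver interface {
	Resolve(ctx context.Context, host, port string) (newHost, newPort string)
}

// NewDialContext chains all resolvers in order, feeding the output of one
// into the next, instead of stopping at the first successful match.
func NewDialContext(resolvers []Resolver) func(ctx context.Context, addr string) (net.Conn, error) {
	return func(ctx context.Context, addr string) (net.Conn, error) {
		host, port, err := net.SplitHostPort(addr)
		if err != nil {
			return nil, err
		}
		for _, r := range resolvers {
			host, port = r.Resolve(ctx, host, port)
		}
		var d net.Dialer
		return d.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
	}
}
```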
The new load balancer implementation is affected by a potential chicken-and-egg dependency when KVStoreMesh and KPR [^1] are enabled, because the reconciliation of services needs to wait for clustermesh synchronization (in addition to Kubernetes synchronization) to avoid incorrectly removing valid backends during bootstrap. However, Cilium agents connect to the KVStoreMesh sidecar etcd instance via a service, creating a circular dependency if KPR is enabled and KVStoreMesh got restarted while the Cilium agent was down, so that the service information in the BPF maps is now stale. While this issue eventually resolves automatically thanks to circuit-breaker timeouts, it is still problematic, as it can delay startup and disrupt existing cross-cluster connections.

This commit builds on top of the existing custom dialer and resolvers, already used to bypass DNS resolution, and introduces a new resolver that internally handles the load balancing by directly accessing the appropriate statedb table. This way, the connection directly targets one of the KVStoreMesh replicas, instead of going through the Kubernetes service, breaking the dependency. The resolver attempts to reuse the same backend for subsequent requests if still valid, to preserve session affinity.

[^1]: Strictly speaking, the issue occurs if Cilium is responsible for ClusterIP service load balancing for host-network pods, that is, when SocketLB is enabled (at least in the host-network namespace).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
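A rough sketch of such a backend-selecting resolver with best-effort session affinity, assuming a hypothetical `backendLookup` interface in place of the actual statedb table access:

```go
// Illustrative sketch only: backendLookup and serviceResolver are
// hypothetical stand-ins for the actual statedb-backed implementation.
package dialer

import (
	"context"
	"net"
	"sync"
)

// backendLookup abstracts the lookup of the currently known backends
// (host:port pairs) for a given service frontend.
type backendLookup interface {
	Backends(host, port string) []string
}

type serviceResolver struct {
	mu     sync.Mutex
	lookup backendLookup
	last   string // backend chosen for the previous connection, if any
}

func (r *serviceResolver) Resolve(ctx context.Context, host, port string) (string, string) {
	backends := r.lookup.Backends(host, port)
	if len(backends) == 0 {
		// Not a known service frontend: leave the address untouched.
		return host, port
	}

	r.mu.Lock()
	defer r.mu.Unlock()

	// Reuse the previously selected backend if it is still valid, to
	// preserve session affinity; otherwise pick the first one.
	chosen := backends[0]
	for _, be := range backends {
		if be == r.last {
			chosen = be
			break
		}
	}
	r.last = chosen

	h, p, err := net.SplitHostPort(chosen)
	if err != nil {
		// Malformed entry: fall back to the original frontend address.
		return host, port
	}
	return h, p
}
```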
So that we can short-circuit the load-balancing process when the target etcd instance is behind a local service (most notably, when KVStoreMesh is enabled). This way, we break a possible chicken-and-egg dependency caused by the load-balancing subsystem waiting on clustermesh synchronization before starting to update the BPF maps, and clustermesh failing to connect to the cluster (hence preventing synchronization) because the BPF maps are not up to date, e.g., if KVStoreMesh got restarted while the agent was down.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
There is no need to delay the reconciliation of the LB BPF maps if there was no previous state, as in that scenario we wouldn't be scaling down the number of backends (since there are none yet).

Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ NOTE: this reverts the revert 01f8d50. Since we now only delay when existing BPF state existed, we can have a higher timeout without impacting e.g. the initial ClusterMesh api-server connection. ]

Before, we were waiting for the initial synchronization of endpoints and services for only 10s. However, this is definitely too low, as the Kubernetes SLO for List calls at the 99th percentile is <= 30s.

ref https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#steady-state-slisslos

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
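A simplified sketch of the resulting behavior (delay only when previously restored BPF state could be pruned, and only up to a timeout); the function and parameter names are hypothetical, not the actual Cilium reconciler code:

```go
// Illustrative sketch only: waitBeforePruning and its parameters are
// hypothetical, not the actual Cilium reconciler API.
package loadbalancer

import (
	"context"
	"log/slog"
	"time"
)

// waitBeforePruning blocks until the initial Kubernetes/ClusterMesh
// synchronization completes, but only when BPF state restored from a
// previous agent run could otherwise be pruned incorrectly.
func waitBeforePruning(ctx context.Context, log *slog.Logger,
	hadPreviousState bool, synced <-chan struct{}, timeout time.Duration) {
	if !hadPreviousState {
		// Nothing to scale down: reconcile immediately.
		return
	}
	select {
	case <-synced:
	case <-time.After(timeout):
		// Proceed anyway, but warn: this may indicate an unexpected
		// circular dependency.
		log.Warn("Timed out waiting for initial synchronization before pruning")
	case <-ctx.Done():
	}
}
```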
The previous commit increased the timeout to 1 minute, which is suitable for production environments. However, it may still be too short for CI environments, especially with certain ClusterMesh configurations upon initial connection. While the timeout expiring would in these cases have no effect other than a warning log, that's still problematic, as it causes CI failures. On the other hand, we don't want to get rid of the warning altogether, as it is useful to catch possible unexpected circular dependencies. Hence, let's make the timeout configurable via a hidden flag, so that it can be overridden where appropriate.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
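A minimal sketch of how such a hidden flag could be registered with spf13/pflag; the flag name and default value below are assumptions for illustration, not the actual option added by this commit:

```go
// Illustrative sketch only: the flag name and default value are assumptions.
package loadbalancer

import (
	"time"

	"github.com/spf13/pflag"
)

func registerFlags(flags *pflag.FlagSet) {
	// Hidden flag: primarily meant to be overridden in CI environments.
	flags.Duration("lb-initial-sync-timeout", time.Minute,
		"Maximum time to wait for the initial Kubernetes/ClusterMesh synchronization before reconciling the LB BPF maps")
	_ = flags.MarkHidden("lb-initial-sync-timeout")
}
```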
This reverts commit d2aab1f.

We need to prevent triggering this circular dependency again, even though it is now caused for a slightly different reason. Specifically, the problem is that service reconciliation needs to wait on clustermesh synchronization, which at the same time requires the cmapisrv service to be reconciled to include the new backends, as otherwise remote clusters cannot connect to it. While the previous commits addressed this problem when connecting to the local KVStoreMesh instance, by bypassing service load-balancing, the same cannot be done when connecting to the remote cluster.

However, letting non-global services be reconciled immediately, without waiting on clustermesh synchronization, would involve a much more convoluted fix, which doesn't seem necessary at the moment. Indeed, this is a relatively extreme test, simulating a condition that is unlikely to happen in reality (especially with multiple replicas of the clustermesh-apiserver), and that would resolve automatically anyhow thanks to the circuit-breaker timeouts (even though possibly at the cost of breaking cross-cluster connections). The same issue already affected previous releases, even though for a different reason, so it is not even a regression from that point of view.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Force-pushed from 1d01d80 to 451a797.
/test
Thanks! I left a quick suggestion/question inline, but apart from that it looks great!
Also, do we have an existing issue that should be closed when this gets merged (not saying that we necessarily need one, btw)? IIUC this can be triggered by users in the last release, so I guess we would need a release note, right?
I'm not aware of any issue about this regression. But very good point about the release note, added.
Please refer to the individual commits for additional details.