
Conversation

giorio94
Member

@giorio94 giorio94 commented Jul 29, 2025

Please refer to the individual commits for additional details.

clustermesh: fix regression possibly causing cross-cluster connection disruption if the clustermesh-apiserver is restarted at the same time as Cilium agents.

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. area/loadbalancing Impacts load-balancing and Kubernetes service implementations labels Jul 29, 2025
@giorio94 giorio94 force-pushed the pr/giorio94/main/clustermesh-bypass-service branch from ffee158 to b2bfd1a Compare July 30, 2025 06:36
@giorio94
Member Author

/ci-clustermesh

@giorio94 giorio94 added the needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch label Jul 30, 2025
@giorio94 giorio94 force-pushed the pr/giorio94/main/clustermesh-bypass-service branch from b2bfd1a to b1d3d9b Compare July 30, 2025 10:33
@giorio94
Member Author

/ci-clustermesh

@giorio94 giorio94 force-pushed the pr/giorio94/main/clustermesh-bypass-service branch from b1d3d9b to d10b288 Compare July 30, 2025 12:43
@giorio94
Member Author

/ci-clustermesh

@giorio94 giorio94 requested a review from joamaki July 30, 2025 12:43
@giorio94
Member Author

All clustermesh tests passed three times in a row, so we should be good from that point of view. I'm performing a few final tests locally before marking it ready for review.

@giorio94 giorio94 force-pushed the pr/giorio94/main/clustermesh-bypass-service branch 2 times, most recently from 8a3b99a to 1d01d80 Compare July 31, 2025 13:06
@giorio94 giorio94 marked this pull request as ready for review July 31, 2025 13:06
@giorio94 giorio94 requested review from a team as code owners July 31, 2025 13:06
@giorio94
Member Author

/test

@giorio94 giorio94 requested review from aanm, christarazi and joamaki July 31, 2025 13:06
giorio94 and others added 8 commits July 31, 2025 16:09
In preparation for the subsequent introduction of a second resolver to
short-circuit the service frontend-to-backend load-balancing process,
let's slightly generalize the clustermesh common logic to allow
configuring multiple resolvers.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
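
For illustration, a minimal sketch of what configuring multiple resolvers could look like; the names below (Resolver, Options, WithResolvers) are hypothetical stand-ins, not the actual Cilium clustermesh API.

```go
package clustermesh

// Resolver rewrites a dial target; the next commit generalizes it to mutate
// both host and port.
type Resolver interface {
	Resolve(hostport string) string
}

// Options is a hypothetical configuration struct accumulating resolvers in
// registration order, so that they can later be applied sequentially.
type Options struct {
	resolvers []Resolver
}

// WithResolvers appends one or more resolvers to the configuration.
func (o *Options) WithResolvers(rs ...Resolver) *Options {
	o.resolvers = append(o.resolvers, rs...)
	return o
}
```
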
Let's generalize the context dialer resolvers to allow mutating both
the host and the port, rather than the host only, to support use cases
such as automatic service backend selection. Additionally, let's always
invoke all resolvers in sequential order, instead of exiting early once
the first returns successfully, to support chained operations. Finally,
the Resolver interface is changed to no longer return an error, given
that the error was never actually used.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
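
A minimal sketch of the generalized resolver shape described above: each resolver may rewrite both the host and the port, returns no error, and all resolvers are applied in order. Names are illustrative only, not Cilium's exact interface.

```go
package main

import (
	"fmt"
	"net"
)

// Resolver rewrites a (host, port) pair; returning the inputs unchanged
// means "no opinion".
type Resolver interface {
	Resolve(host, port string) (string, string)
}

// chain applies every resolver in sequence, rather than stopping at the
// first successful one, so that rewrites compose (e.g., DNS bypass followed
// by service backend selection).
func chain(host, port string, resolvers ...Resolver) (string, string) {
	for _, r := range resolvers {
		host, port = r.Resolve(host, port)
	}
	return host, port
}

// staticHost is a toy resolver mapping a well-known name to a fixed IP,
// loosely mimicking the existing DNS-bypass use case.
type staticHost struct{ name, ip string }

func (s staticHost) Resolve(host, port string) (string, string) {
	if host == s.name {
		return s.ip, port
	}
	return host, port
}

func main() {
	host, port := chain("clustermesh-apiserver.kube-system.svc", "2379",
		staticHost{name: "clustermesh-apiserver.kube-system.svc", ip: "10.0.0.12"})
	fmt.Println(net.JoinHostPort(host, port)) // 10.0.0.12:2379
}
```
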
The new load balancer implementation is affected by a potential
chicken-and-egg dependency when KVStoreMesh and KPR [^1] are enabled
because the reconciliation of services needs to wait for clustermesh
synchronization (in addition to Kubernetes synchronization), to
avoid incorrectly removing valid backends during bootstrap.

However, Cilium agents connect to the KVStoreMesh sidecar etcd
instance via a service, creating a circular dependency when KPR
is enabled and KVStoreMesh was restarted while the Cilium agent
was down, leaving the service information in the BPF maps stale.
While this issue eventually resolves automatically thanks to
circuit-breaker timeouts, it is still problematic, as it can delay
startup and disrupt existing cross-cluster connections.

This commit builds on top of the existing custom dialer and
resolvers, already used to bypass DNS resolution, and introduces
a new resolver that internally handles the load-balancing by
directly accessing the appropriate statedb table. This way,
the connection directly targets one of the KVStoreMesh replicas,
instead of going through the Kubernetes service, breaking the
dependency. The resolver attempts to reuse the same backend
for subsequent requests if still valid, to preserve session
affinity.

[^1] Strictly speaking, the issue occurs if Cilium is responsible
for ClusterIP service load-balancing for host-network pods, that is,
when SocketLB is enabled (at least in the host-network namespace).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
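
A rough sketch of the backend-selection idea described above: pick one KVStoreMesh backend directly from a table of backends (standing in for the statedb lookup) and keep reusing it while it remains present, to preserve session affinity. All types and names below are illustrative, not Cilium's actual statedb or load-balancer API.

```go
package main

import (
	"fmt"
	"sync"
)

// backend is a simplified stand-in for a load-balancer backend entry.
type backend struct{ host, port string }

// serviceResolver targets one backend of the KVStoreMesh service directly,
// bypassing the ClusterIP load-balancing path; backends() stands in for a
// statedb table lookup.
type serviceResolver struct {
	mu       sync.Mutex
	last     backend
	backends func() []backend
}

// Resolve returns the address of one backend, reusing the previously
// selected one if it is still present, to preserve session affinity.
func (r *serviceResolver) Resolve(host, port string) (string, string) {
	r.mu.Lock()
	defer r.mu.Unlock()

	bes := r.backends()
	if len(bes) == 0 {
		// No backend known: leave the target untouched and fall back to the
		// regular service path.
		return host, port
	}
	for _, be := range bes {
		if be == r.last {
			return be.host, be.port
		}
	}
	r.last = bes[0]
	return r.last.host, r.last.port
}

func main() {
	r := &serviceResolver{backends: func() []backend {
		return []backend{{"10.0.1.7", "2379"}, {"10.0.2.9", "2379"}}
	}}
	h, p := r.Resolve("clustermesh-apiserver.kube-system.svc", "2379")
	fmt.Println(h, p) // 10.0.1.7 2379
}
```
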
So that we can short-circuit the load-balancing process when the target
etcd instance is behind a local service (most notably, when KVStoreMesh
is enabled). This way, we break a possible chicken-and-egg dependency
caused by the load balancing subsystem waiting on clustermesh
synchronization before starting to update the BPF maps, and clustermesh
failing to connect to the cluster (hence preventing synchronization)
because the BPF maps are not up-to-date, e.g., if KVStoreMesh got
restarted while the agent was down.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
…xisted

There is no need to delay the reconciliation of the LB BPF maps if there
was no previous state, as in that scenario we wouldn't be scaling down the
number of backends (since there are none yet).

Signed-off-by: Jussi Maki <jussi@isovalent.com>
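
A hedged sketch of the condition this commit describes: the synchronization wait only applies when previously restored BPF state exists, since a fresh start has nothing to scale down. Names and structure are illustrative, not the actual load-balancer reconciler code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForSync blocks until the synchronization signal fires or the timeout
// expires, but only when previously restored BPF state exists; with a fresh
// start there is nothing that a premature reconciliation could scale down.
func waitForSync(ctx context.Context, synced <-chan struct{}, restoredBackends int, timeout time.Duration) {
	if restoredBackends == 0 {
		return // fresh start: reconcile right away
	}
	select {
	case <-synced:
	case <-time.After(timeout):
		fmt.Println("timeout waiting for synchronization; proceeding anyway")
	case <-ctx.Done():
	}
}

func main() {
	synced := make(chan struct{})
	waitForSync(context.Background(), synced, 0, time.Minute) // returns immediately
	fmt.Println("reconciling LB BPF maps")
}
```
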
[ NOTE: this reverts the revert 01f8d50.
  Since we now only delay when previous BPF state existed, we can use a higher timeout
  without impacting, e.g., the initial ClusterMesh api-server connection.
]

Before, we were waiting only 10s for the initial synchronization of
endpoints and services. However, this is definitely too low, as the
Kubernetes SLO for List calls at the 99th percentile is <= 30s.
ref https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#steady-state-slisslos

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
The previous commit increased the timeout to 1 minute, which is suitable
for production environments. However, it may still be too short for CI
environments, especially with certain ClusterMesh configurations
upon initial connection. While the timeout expiration would have no
effect in these cases other than a warning log, that's still problematic,
as it causes CI failures. On the other hand, we don't want to get rid
of the warning altogether, as it is useful to catch possible unexpected
circular dependencies. Hence, let's make the timeout configurable via
a hidden flag, so that it can be overridden where appropriate.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
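
A sketch of how such a hidden flag could be exposed using the spf13/pflag MarkHidden pattern; the flag name and default below are made up for illustration and are not necessarily the ones added by this PR.

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/pflag"
)

func main() {
	flags := pflag.NewFlagSet("agent", pflag.ContinueOnError)
	// Flag name and default are hypothetical.
	timeout := flags.Duration("lb-sync-wait-timeout", time.Minute,
		"Maximum time to wait for the initial service/endpoint synchronization")
	// Hidden flags are still parsed and can be overridden (e.g. in CI), but
	// do not show up in --help output.
	if err := flags.MarkHidden("lb-sync-wait-timeout"); err != nil {
		panic(err)
	}
	_ = flags.Parse([]string{"--lb-sync-wait-timeout=5m"})
	fmt.Println(*timeout) // 5m0s
}
```
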
This reverts commit d2aab1f.

We need to prevent triggering this circular dependency again, even
though it is now caused by a slightly different reason. Specifically, the
problem is that service reconciliation needs to wait on clustermesh
synchronization, which at the same time requires the cmapisrv
service to be reconciled to include the new backends, otherwise
remote clusters cannot connect to it.

While the previous commits addressed this problem when connecting
to the local KVStoreMesh instance, bypassing service load-balancing,
the same cannot be done when connecting to the remote cluster.
However, letting non-global services be reconciled immediately,
without waiting on clustermesh synchronization, would involve a much
more convoluted fix, which doesn't seem necessary at the moment.

Indeed, this is a relatively extreme test, simulating a condition
that is unlikely to happen in reality (especially with multiple
replicas of the clustermesh-apiserver), and that would resolve
automatically anyhow thanks to the circuit-breaker timeouts (even
though possibly at the cost of breaking cross-cluster connections).
And the same issue already affected previous releases, albeit for
a different reason, so it is not even a regression from that
point of view.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the pr/giorio94/main/clustermesh-bypass-service branch from 1d01d80 to 451a797 Compare July 31, 2025 14:17
@giorio94
Copy link
Member Author

/test

Member

@MrFreezeex MrFreezeex left a comment

Thanks! I left a quick suggestion/question inline, but apart from that it looks great!

Also, do we have an existing issue that should be closed when this gets merged (not saying that we necessarily need one, btw)? IIUC this can be triggered by users in the last release, so I guess we would need a release note, right?

@giorio94
Member Author

giorio94 commented Aug 1, 2025

Also, do we have an existing issue that should be closed when this gets merged (not saying that we necessarily need one, btw)? IIUC this can be triggered by users in the last release, so I guess we would need a release note, right?

I'm not aware of any issue about this regression. But very good point about the release note, added.

@giorio94 giorio94 enabled auto-merge August 1, 2025 12:15
@giorio94 giorio94 added this pull request to the merge queue Aug 5, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 5, 2025
Merged via the queue into main with commit 31d1408 Aug 5, 2025
306 of 307 checks passed
@giorio94 giorio94 deleted the pr/giorio94/main/clustermesh-bypass-service branch August 5, 2025 08:51
@rastislavs rastislavs mentioned this pull request Aug 6, 2025
17 tasks
@rastislavs rastislavs added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 6, 2025
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Aug 8, 2025