loadbalancer: break the circular dependency with clustermesh #40786
Conversation
Force-pushed from ffee158 to b2bfd1a.
/ci-clustermesh
Force-pushed from b2bfd1a to b1d3d9b.
/ci-clustermesh
Force-pushed from b1d3d9b to d10b288.
/ci-clustermesh
All clustermesh tests passed three times in a row, so we should be good from that point of view. Performing a few final tests locally, before marking it ready for review.
Force-pushed from 8a3b99a to 1d01d80.
/test
In preparation for the subsequent introduction of a second resolver to short-circuit the service frontend to backend load-balancing process, let's slightly generalize the clustermesh common logic to allow configuring multiple resolvers.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's generalize the context dialer resolvers to allow mutating both the host and the port, rather than the host only, to support use cases such as automatic service backend selection. Additionally, let's always invoke all resolvers in sequential order, instead of exiting early once the first returns successfully, to support chained operations. Finally, the Resolver interface is changed to no longer return an error, given that the error was never actually used.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
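A minimal sketch of what such chained host/port resolvers could look like; the `Resolver` and `NewDialContext` names and signatures below are illustrative assumptions, not the actual Cilium API:

```go
// Illustrative sketch only: names and signatures are assumptions, not the
// actual Cilium implementation.
package dialer

import (
	"context"
	"net"
)

// Resolver may rewrite the host, the port, or both. It no longer returns an
// error, as callers simply keep the previous values when nothing matches.
type Resolver interface {
	Resolve(ctx context.Context, host, port string) (newHost, newPort string)
}

// NewDialContext chains all resolvers in order, feeding the output of one
// into the next, instead of stopping at the first successful match.
func NewDialContext(resolvers []Resolver) func(ctx context.Context, addr string) (net.Conn, error) {
	return func(ctx context.Context, addr string) (net.Conn, error) {
		host, port, err := net.SplitHostPort(addr)
		if err != nil {
			return nil, err
		}
		for _, r := range resolvers {
			host, port = r.Resolve(ctx, host, port)
		}
		var d net.Dialer
		return d.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
	}
}
```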
The new load balancer implementation is affected by a potential chicken-and-egg dependency when KVStoreMesh and KPR [^1] are enabled, because the reconciliation of services needs to wait for clustermesh synchronization (in addition to Kubernetes synchronization) to avoid incorrectly removing valid backends during bootstrap. However, Cilium agents connect to the KVStoreMesh sidecar etcd instance via a service, creating a circular dependency if KPR is enabled and KVStoreMesh got restarted while the Cilium agent was down, so that the service information in the BPF maps is now stale. While this issue eventually resolves automatically thanks to circuit-breaker timeouts, it is still problematic, as it can delay startup and disrupt existing cross-cluster connections.

This commit builds on top of the existing custom dialer and resolvers, already used to bypass DNS resolution, and introduces a new resolver that internally handles the load balancing by directly accessing the appropriate statedb table. This way, the connection directly targets one of the KVStoreMesh replicas, instead of going through the Kubernetes service, breaking the dependency. The resolver attempts to reuse the same backend for subsequent requests if still valid, to preserve session affinity.

[^1]: Strictly speaking, the issue occurs if Cilium is responsible for ClusterIP service load balancing for host-network pods, that is, when SocketLB is enabled (at least in the host-network namespace).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
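A rough sketch of such a backend-selecting resolver with best-effort session affinity, assuming a hypothetical `backendLookup` interface in place of the actual statedb table access:

```go
// Illustrative sketch only: backendLookup and serviceResolver are
// hypothetical stand-ins for the actual statedb-backed implementation.
package dialer

import (
	"context"
	"net"
	"sync"
)

// backendLookup abstracts the lookup of the currently known backends
// (host:port pairs) for a given service frontend.
type backendLookup interface {
	Backends(host, port string) []string
}

type serviceResolver struct {
	mu     sync.Mutex
	lookup backendLookup
	last   string // backend chosen for the previous connection, if any
}

func (r *serviceResolver) Resolve(ctx context.Context, host, port string) (string, string) {
	backends := r.lookup.Backends(host, port)
	if len(backends) == 0 {
		// Not a known service frontend: leave the address untouched.
		return host, port
	}

	r.mu.Lock()
	defer r.mu.Unlock()

	// Reuse the previously selected backend if it is still valid, to
	// preserve session affinity; otherwise pick the first one.
	chosen := backends[0]
	for _, be := range backends {
		if be == r.last {
			chosen = be
			break
		}
	}
	r.last = chosen

	h, p, err := net.SplitHostPort(chosen)
	if err != nil {
		// Malformed entry: fall back to the original frontend address.
		return host, port
	}
	return h, p
}
```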
So that we can short-circuit the load-balancing process when the target etcd instance is behind a local service (most notably, when KVStoreMesh is enabled). This way, we break a possible chicken-and-egg dependency caused by the load-balancing subsystem waiting on clustermesh synchronization before starting to update the BPF maps, and clustermesh failing to connect to the cluster (hence preventing synchronization) because the BPF maps are not up to date, e.g., if KVStoreMesh got restarted while the agent was down.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
There is no need to delay the reconciliation of the LB BPF maps if there was no previous state, as in that scenario we wouldn't be scaling down the number of backends (since there are none yet).

Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ NOTE: this reverts the revert 01f8d50. Since we now only delay when existing BPF state existed, we can have a higher timeout without impacting e.g. the initial ClusterMesh api-server connection. ]

Before, we were waiting for the initial synchronization of endpoints and services for only 10s. However, this is definitely too low, as the Kubernetes SLO for List calls at the 99th percentile is <= 30s.

ref https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#steady-state-slisslos

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
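A simplified sketch of the resulting behavior (delay only when previously restored BPF state could be pruned, and only up to a timeout); the function and parameter names are hypothetical, not the actual Cilium reconciler code:

```go
// Illustrative sketch only: waitBeforePruning and its parameters are
// hypothetical, not the actual Cilium reconciler API.
package loadbalancer

import (
	"context"
	"log/slog"
	"time"
)

// waitBeforePruning blocks until the initial Kubernetes/ClusterMesh
// synchronization completes, but only when BPF state restored from a
// previous agent run could otherwise be pruned incorrectly.
func waitBeforePruning(ctx context.Context, log *slog.Logger,
	hadPreviousState bool, synced <-chan struct{}, timeout time.Duration) {
	if !hadPreviousState {
		// Nothing to scale down: reconcile immediately.
		return
	}
	select {
	case <-synced:
	case <-time.After(timeout):
		// Proceed anyway, but warn: this may indicate an unexpected
		// circular dependency.
		log.Warn("Timed out waiting for initial synchronization before pruning")
	case <-ctx.Done():
	}
}
```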
The previous commit increased the timeout to 1 minute, which is suitable for production environments. However, it may still be too short for CI environments, especially with certain ClusterMesh configurations upon initial connection. While the timeout expiring would in these cases have no effect other than a warning log, that's still problematic, as it causes CI failures. On the other hand, we don't want to get rid of the warning altogether, as it is useful to catch possible unexpected circular dependencies. Hence, let's make the timeout configurable via a hidden flag, so that it can be overridden where appropriate.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
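A minimal sketch of how such a hidden flag could be registered with spf13/pflag; the flag name and default value below are assumptions for illustration, not the actual option added by this commit:

```go
// Illustrative sketch only: the flag name and default value are assumptions.
package loadbalancer

import (
	"time"

	"github.com/spf13/pflag"
)

func registerFlags(flags *pflag.FlagSet) {
	// Hidden flag: primarily meant to be overridden in CI environments.
	flags.Duration("lb-initial-sync-timeout", time.Minute,
		"Maximum time to wait for the initial Kubernetes/ClusterMesh synchronization before reconciling the LB BPF maps")
	_ = flags.MarkHidden("lb-initial-sync-timeout")
}
```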
This reverts commit d2aab1f.

We need to prevent triggering this circular dependency again, even though it is now caused for a slightly different reason. Specifically, the problem is that service reconciliation needs to wait on clustermesh synchronization, which at the same time requires the cmapisrv service to be reconciled to include the new backends, as otherwise remote clusters cannot connect to it. While the previous commits addressed this problem when connecting to the local KVStoreMesh instance, by bypassing service load-balancing, the same cannot be done when connecting to the remote cluster.

However, letting non-global services be reconciled immediately, without waiting on clustermesh synchronization, would involve a much more convoluted fix, which doesn't seem necessary at the moment. Indeed, this is a relatively extreme test, simulating a condition that is unlikely to happen in reality (especially with multiple replicas of the clustermesh-apiserver), and that would resolve automatically anyhow thanks to the circuit-breaker timeouts (even though possibly at the cost of breaking cross-cluster connections). The same issue already affected previous releases, even though for a different reason, so it is not even a regression from that point of view.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Force-pushed from 1d01d80 to 451a797.
/test
Thanks! I left a quick suggestion/question inline, but apart from that it looks great!
Also, do we have an existing issue that should be closed when this gets merged (not saying that we necessarily need one, btw)? IIUC this can be triggered by users in the last release, so I guess we would need a release note, right?
I'm not aware of any issue about this regression. But very good point about the release note, added.
Please refer to the individual commits for additional details.