clustermesh: introduce circuit breaker in wait for synchronization operations #32671
Merged: julianwiedmann merged 2 commits into cilium:main from giorio94:mio/clustermesh-wait-circuit-breaker · May 28, 2024
The only reason for that function to return an error is that the parent context expired, which happens if the agent is being shut down while the synchronization has not yet completed. Hence, let's just return, rather than triggering a fatal error.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Upon agent and operator restart, we need to wait for full clustermesh synchronization in multiple subsystems, to avoid breaking existing cross-cluster connections, e.g., through garbage collection of entries for a given remote cluster that are valid but have not yet been retrieved. However, the absence of a timeout controlling this process is problematic as well: if a remote cluster cannot be reached (e.g., due to a misconfiguration), the blocked GC operations can cause issues for local communication.

Let's standardize the different wait-for-synchronization functions to automatically return after a user-configurable timeout elapses (tunable via the clustermesh-sync-timeout flag, and set to 1 minute by default). This mimics and replaces the already existing timeout used to unblock endpoint regeneration, generalizing it to all the other resources as well. The existing flag is deprecated, but it still takes precedence for consistency with the existing behavior.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
3726082 to 09a4124
/test
ghost
approved these changes
May 22, 2024
docs good
tommyp1ckles
approved these changes
May 28, 2024
endpoint lgtm
YutaroHayakawa
approved these changes
May 28, 2024
Labels
affects/v1.13
This issue affects v1.13 branch
affects/v1.14
This issue affects v1.14 branch
affects/v1.15
This issue affects v1.15 branch
area/clustermesh
Relates to multi-cluster routing functionality in Cilium.
backport/author
The backport will be carried out by the author of the PR.
backport-done/1.15
The backport for Cilium 1.15.x for this PR is done.
kind/bug
This is a bug in the Cilium logic.
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/bug
This PR fixes an issue in a previous release of Cilium.