Skip to content

Conversation

giorio94
Copy link
Member

Upon agent and operator restart, we need to wait for full clustermesh synchronization in multiple subsystems, to prevent breaking existing cross-cluster connections due to e.g., garbage collection of valid but not yet retrieved entries for a given remote cluster. However, the absence of a timeout controlling this process is problematic as well, as the impossibility of connecting to a remote cluster (e.g., due to a misconfiguration) can cause issues for local communication to the blocked GC operations.

Let's standardize the different wait for synchronization functions to automatically return after a user configurable timeout (tunable via the clustermesh-sync-timeout, and set to 1 minute by default) elapses. This mimics and replaces the already existing timeout used to unblock endpoint regeneration, generalizing it to all the other resources as well. The existing flag is deprecated, but it still takes precedence for consistency with the existing behavior.

Introduce timeout when waiting for the initial synchronization from remote clusters, to avoid blocking forever necessary GC operations in case of clustermesh misconfigurations.

The only reason for that function to return an error is that the parent
context expired, which happens if the agent is being shut down while the
synchronization has not yet completed. Hence, let's just return, rather
than triggering a fatal error.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. needs-backport/1.15 backport/author The backport will be carried out by the author of the PR. labels May 22, 2024
Upon agent and operator restart, we need to wait for full clustermesh
synchronization in multiple subsystems, to prevent breaking existing
cross-cluster connections due to e.g., garbage collection of valid
but not yet retrieved entries for a given remote cluster. However,
the absence of a timeout controlling this process is problematic as
well, as the impossibility of connecting to a remote cluster (e.g.,
due to a misconfiguration) can cause issues for local communication
to the blocked GC operations.

Let's standardize the different wait for synchronization functions
to automatically return after a user configurable timeout (tunable
via the clustermesh-sync-timeout, and set to 1 minute by default)
elapses. This mimics and replaces the already existing timeout used
to unblock endpoint regeneration, generalizing it to all the other
resources as well. The existing flag is deprecated, but it still
takes precedence for consistency with the existing behavior.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/clustermesh-wait-circuit-breaker branch from 3726082 to 09a4124 Compare May 22, 2024 14:29
@giorio94
Copy link
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review May 22, 2024 14:31
@giorio94 giorio94 requested review from a team as code owners May 22, 2024 14:31
@giorio94 giorio94 requested review from tommyp1ckles, YutaroHayakawa and a user May 22, 2024 14:31
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs good

Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endpoint lgtm

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 28, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue May 28, 2024
Merged via the queue into cilium:main with commit cc7c27d May 28, 2024
@giorio94 giorio94 mentioned this pull request May 30, 2024
1 task
@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch and removed needs-backport/1.15 labels May 30, 2024
@github-actions github-actions bot removed the backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. label Jun 3, 2024
@github-actions github-actions bot added the backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. label Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport/author The backport will be carried out by the author of the PR. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

4 participants