thorn3r
Contributor

@thorn3r thorn3r commented Mar 29, 2024

This adds support for running clustermesh-apiserver deployments with multiple replicas for high availability.

Each clustermesh-apiserver pod runs its own etcd cluster. Depending on configuration, either the Cilium Agent or KVStoreMesh instance watches etcd in a remote cluster. All responses from the remote etcd cluster are intercepted and the header is inspected to retrieve the etcd cluster ID. If a failover event occurs and the cluster ID has changed, the remote connection is restarted to ensure that no events are missed and that no invalid data is retained. See individual commit messages for additional details.
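To illustrate the failover detection described above, here is a minimal sketch in Python (not the Go code from this PR). It assumes that each response exposes the etcd cluster ID as an integer; `ClusterIDTracker`, `observe`, and `on_change` are hypothetical names used only for illustration, and the real interceptor may reset or handle state differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ClusterIDTracker:
    """Hypothetical helper: remembers the etcd cluster ID seen in response headers."""

    # Called with (old_id, new_id) when the cluster ID changes, e.g. to
    # restart the remote connection.
    on_change: Callable[[int, int], None]
    last_cluster_id: Optional[int] = None

    def observe(self, cluster_id: int) -> bool:
        """Record the cluster ID of one response; return True if it changed."""
        if self.last_cluster_id is None:
            # First response: nothing to compare against yet.
            self.last_cluster_id = cluster_id
            return False
        if cluster_id != self.last_cluster_id:
            old = self.last_cluster_id
            # Assumption: adopt the new ID so that only one restart is
            # triggered per failover.
            self.last_cluster_id = cluster_id
            self.on_change(old, cluster_id)
            return True
        return False


# Usage: two responses from one replica, then two from another after a failover.
restarts = []
tracker = ClusterIDTracker(on_change=lambda old, new: restarts.append((old, new)))
for cid in [0xA1, 0xA1, 0xB2, 0xB2]:
    tracker.observe(cid)
print(restarts)
```

In this sketch the callback fires exactly once, at the first response carrying the new ID, which is the point where the watcher would drop its state and reconnect.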

Add support for deploying clustermesh-apiserver with multiple replicas for high availability.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 29, 2024
@thorn3r
Contributor Author

thorn3r commented Mar 29, 2024

/test

@thorn3r thorn3r force-pushed the pr/thorn3r/clustermeshHA branch from 0ea2d4e to b855a05 Compare March 29, 2024 19:38
@thorn3r
Contributor Author

thorn3r commented Mar 29, 2024

/test

@thorn3r thorn3r force-pushed the pr/thorn3r/clustermeshHA branch from b855a05 to 660dc0f Compare April 2, 2024 13:54
@thorn3r thorn3r added the release-note/major This PR introduces major new functionality to Cilium. label Apr 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@thorn3r
Contributor Author

thorn3r commented Apr 2, 2024

/test

@thorn3r thorn3r marked this pull request as ready for review April 2, 2024 14:52
@thorn3r thorn3r requested review from a team as code owners April 2, 2024 14:52
@thorn3r thorn3r mentioned this pull request Apr 2, 2024
@thorn3r thorn3r added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/major This PR introduces major new functionality to Cilium. labels Apr 2, 2024
@thorn3r thorn3r changed the title ClusterMesh HA Add support for multiple clustermesh-apiserver replicas (ClusterMesh HA) Apr 2, 2024
@giorio94 giorio94 self-requested a review April 4, 2024 08:39
Member

@giorio94 giorio94 left a comment

Looks great to me! Just a bunch of minor comments and nits inline.

@thorn3r thorn3r force-pushed the pr/thorn3r/clustermeshHA branch from 660dc0f to ee4fb0a Compare April 4, 2024 20:34
@thorn3r thorn3r requested a review from a team as a code owner April 4, 2024 20:34
@thorn3r
Contributor Author

thorn3r commented Apr 4, 2024

/test

@giorio94
Member

giorio94 commented Apr 8, 2024

> TIL that comments in a multi-line command in bash are tricky

Yep, I typically just add them above the entire command, to avoid these kinds of problems.

Member

@nbusseneau nbusseneau left a comment

Thanks!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 9, 2024
@joamaki joamaki added this pull request to the merge queue Apr 9, 2024
Merged via the queue into main with commit 0426636 Apr 9, 2024
@joamaki joamaki deleted the pr/thorn3r/clustermeshHA branch April 9, 2024 09:14
@giorio94
Member

Marking for backport to v1.15 to address #30964. I'm going to backport a reduced version which only includes the configuration of the unique etcd Cluster ID and the interceptor logic, fixing a bug potentially causing Cilium agents to incorrectly restart an etcd watch against a different clustermesh-apiserver instance.

@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed needs-backport/1.15 backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Apr 16, 2024
@thorn3r thorn3r mentioned this pull request Apr 22, 2024
8 tasks
marseel added a commit to marseel/cilium that referenced this pull request Jun 10, 2024
With the introduction of Clustermesh support for HA deployment in cilium#31677,
let's change the upgrade strategy to make sure that the Clustermesh control
plane is always available.
This is also the configuration that we test against in CI: maxSurge=1
and maxUnavailable=0. On top of that, change the required antiAffinity to
preferred antiAffinity to cover the case of a single-node cluster.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
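
As a rough illustration of why maxSurge=1 with maxUnavailable=0 keeps the control plane available, the sketch below computes the pod-count bounds that a Kubernetes rolling update respects. It is written in Python for illustration only; `rolling_update_bounds` is a hypothetical helper, it handles absolute values only (not percentages), and the replica count of 2 is an assumed example rather than a value taken from this commit.

```python
from typing import Tuple


def rolling_update_bounds(replicas: int, max_surge: int, max_unavailable: int) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


# Assumed example: 2 replicas with the strategy named in the commit message.
min_available, max_total = rolling_update_bounds(replicas=2, max_surge=1, max_unavailable=0)
print(min_available, max_total)
```

With these settings the rollout first has to start one extra pod before it may remove an old one. On a single-node cluster a required anti-affinity rule would prevent that extra pod from being scheduled, which is why the commit relaxes it to a preferred rule.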
github-merge-queue bot pushed a commit that referenced this pull request Jun 13, 2024
With the introduction of Clustermesh support for HA deployment in #31677,
let's change the upgrade strategy to make sure that the Clustermesh control
plane is always available.
This is also the configuration that we test against in CI: maxSurge=1
and maxUnavailable=0. On top of that, change the required antiAffinity to
preferred antiAffinity to cover the case of a single-node cluster.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
youngnick pushed a commit to youngnick/cilium that referenced this pull request Jun 20, 2024
With the introduction of Clustermesh support for HA deployment in cilium#31677,
let's change the upgrade strategy to make sure that the Clustermesh control
plane is always available.
This is also the configuration that we test against in CI: maxSurge=1
and maxUnavailable=0. On top of that, change the required antiAffinity to
preferred antiAffinity to cover the case of a single-node cluster.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
@giorio94 giorio94 added the area/clustermesh Relates to multi-cluster routing functionality in Cilium. label Sep 6, 2024
Labels
area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
backport-done/1.15: The backport for Cilium 1.15.x for this PR is done.
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/minor: This PR changes functionality that users may find relevant to operating Cilium.
7 participants