Skip to content

Conversation

marseel
Copy link
Contributor

@marseel marseel commented Jun 10, 2024

With introduction of Clustermesh support for HA deployment in #31677 let's change upgrade strategy to make sure that Clustermesh control plane is always available.
This is also configuration that we test against in CI tests - maxSurge=1 and maxUnavailable=0. On top of that change required to preferred antiAffinity to cover case with a single node cluster.

Change default Clustermesh control plane upgrade strategy to use surge strategy

@marseel marseel requested a review from giorio94 June 10, 2024 08:15
@marseel marseel requested review from a team as code owners June 10, 2024 08:15
@marseel marseel requested review from nathanjsweet and squeed June 10, 2024 08:15
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 10, 2024
@marseel
Copy link
Contributor Author

marseel commented Jun 10, 2024

/test

@marseel marseel force-pushed the set_maxunavailable_clustermesh branch from 3bc9e67 to 7ac6d79 Compare June 10, 2024 08:30
@marseel
Copy link
Contributor Author

marseel commented Jun 10, 2024

/test

@marseel marseel added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jun 10, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 10, 2024
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The preferred anti-affinity definition needs to be updated to match the expected format.

@giorio94 giorio94 added area/clustermesh Relates to multi-cluster routing functionality in Cilium. area/helm Impacts helm charts and user deployment experience labels Jun 10, 2024
With introduction of Clustermesh support for HA deployment in cilium#31677
let's change upgrade strategy to make sure that Clustermesh control
plane is always available.
This is also configuration that we test against in CI tests - maxSurge=1
and maxUnavailable=0. On top of that change required to
preferred antiAffinity to cover case with a single node cluster.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
@marseel marseel force-pushed the set_maxunavailable_clustermesh branch from 7ac6d79 to c9bad71 Compare June 10, 2024 09:14
@marseel
Copy link
Contributor Author

marseel commented Jun 10, 2024

/test

@squeed
Copy link
Contributor

squeed commented Jun 11, 2024

Does it make sense to add a PodDisruptionBudget as well?

@marseel
Copy link
Contributor Author

marseel commented Jun 11, 2024

Does it make sense to add a PodDisruptionBudget as well?

yes, makes sense. I would probably do it together with switching the default number of replicas from 1 to 2 and setting maxUnavailable=1 on PodDisruptionBudget. There are also some nouances with evicting unhealthy pods:

It is recommended to set AlwaysAllow Unhealthy Pod Eviction Policy to your PodDisruptionBudgets to support eviction of misbehaving applications during a node drain. The default behavior is to wait for the application pods to become healthy before the drain can proceed.

but Unhealthy Pod Eviction Policy is beta in v1.27. So I would prefer to do it after considering all these edge cases and maybe waiting for Unhealthy Pod Eviction Policy to become stable.

@aanm aanm added this pull request to the merge queue Jun 13, 2024
Merged via the queue into cilium:main with commit dc52072 Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. area/helm Impacts helm charts and user deployment experience release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants