clustermesh: downtime/dropped packets due to network policy on upgrade #26462

@rcanderson23

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We are currently running Cilium on EKS in chaining mode with aws-cni. We use clustermesh across six clusters, with configuration and installation managed by FluxCD. Our application has an auth service that lives in one of the six clusters and a proxy service that lives in all of them. The proxies connect to each other in a mesh and also initiate connections to the auth service. The proxies rely on a global service to reach auth; they dial each other directly via a different discovery mechanism.

Single-cluster upgrades don't appear to be an issue, but if multiple clusters upgrade close enough together, we observe connection issues between proxies in those clusters. Specifically, we are watching the metric hubble_drop_total{reason="Policy denied"} as well as an application metric that reports the number of other proxies each proxy is connected to. Looking at the logs of the cilium agent on the nodes where affected proxies were running, it isn't uncommon to see the connection to etcd establish, disconnect, and reconnect. The problem is fairly reproducible by running kubectl rollout restart deployment clustermesh-apiserver in cluster1 and kubectl rollout restart ds cilium in cluster2.

I considered splitting the Cilium install and the clustermesh install into separate Helm installs, as that would give us control over when each restarts so they don't happen at the same time. Unfortunately, the cilium-ca secret is expected to be created by each install, which causes a conflict (there may be other roadblocks, but this was the first one I hit; it could potentially be worked around by allowing users to specify this secret themselves).
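To sketch the split-install idea: if the chart accepted a pre-generated CA instead of creating the cilium-ca secret itself, both releases could share one CA. The value names below (tls.ca.cert, tls.ca.key, clustermesh.useAPIServer) are assumptions based on the chart layout and may differ in 1.12.x; this is a sketch, not a verified configuration.

```yaml
# Hypothetical Helm values for the "agent" release: disable the bundled
# clustermesh-apiserver (managed by a separate release instead) and point
# both releases at one CA generated out of band.
tls:
  ca:
    cert: "<base64-encoded shared CA certificate>"  # pre-generated, shared
    key: "<base64-encoded shared CA key>"
clustermesh:
  useAPIServer: false  # apiserver deployed by its own Helm release
```

With a shared CA supplied to both releases, neither install would need to create cilium-ca, so the two could be upgraded on independent schedules.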

We are hoping to better understand the following:

  1. What to expect from upgrades: the upgrade guide states "Minimal to None" impact on L3/L4 traffic. Are we seeing that, or are we hitting something else?
  2. Is it expected that Cilium upgrades in a clustermesh happen one cluster at a time? Our FluxCD setup reconciles every 10 minutes and currently has no coordination between clusters, but we are looking at options for this.
  3. Would we be better off configuring our Cilium install to be backed by an external etcd? That seems like it would remove the possibility of etcd connections breaking on Cilium startup, since there wouldn't be a clustermesh-apiserver pod being recreated. Is there documentation for this, by chance?
  4. Could we separate the etcd container (and give it a PV) from the apiserver container so that clustermesh isn't rolled at the same time? Any gotchas with this approach?
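On the coordination point in question 2, one option under FluxCD (assuming a single management cluster reconciles the Cilium releases for all six clusters; the release names here are illustrative, not ours) would be to chain the releases with dependsOn so upgrades roll out one cluster at a time:

```yaml
# Hypothetical HelmRelease for cluster2's Cilium, gated on cluster1's
# release having reconciled successfully first.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cilium-cluster2
  namespace: flux-system
spec:
  dependsOn:
    - name: cilium-cluster1  # upgrade cluster1's release before this one
  interval: 10m
  chart:
    spec:
      chart: cilium
      version: 1.12.3
      sourceRef:
        kind: HelmRepository
        name: cilium
```

Note that dependsOn only waits for the upstream release to report Ready, so it serializes the Helm upgrades but does not by itself guarantee the remote agents have reconnected to clustermesh.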

Thanks for all the work on this awesome CNI!

Cilium Version

1.12.3

Kernel Version

Linux ip-10-16-112-106.us-west-2.compute.internal 5.4.242-155.348.amzn2.x86_64 #1 SMP Mon May 8 12:52:40 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: v1.23.17-eks-0a21954

Sysdump

Sysdump file was too large to upload

Relevant log output

level=info msg="Established connection to remote etcd" clusterName=redact config=/var/lib/cilium/clustermesh/redact kvstoreErr="<nil>" kvstoreStatus="etcd: 1/1 connected, lease-ID=0, lock lease-ID=0, has-quorum=timeout while waiting for initial connection, consecutive-errors=1: https://redact:2379 - 3.5.4 (Leader)" subsys=clustermesh

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • area/agent: Cilium agent related.
  • area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
