clustermesh: downtime/dropped packets due to network policy on upgrade #26462

@rcanderson23

Description

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We are currently running Cilium on EKS in chaining mode with aws-cni. We use clustermesh across six clusters, with configuration and installation managed by FluxCD. Our application has an auth service that lives in one of the six clusters and a proxy service that lives in all of them. The proxies connect to each other in a mesh and also initiate connections to the auth service. The proxies rely on a global service to reach auth; they dial each other directly via a different discovery mechanism.

Single-cluster upgrades don't appear to be an issue, but if multiple clusters upgrade close enough together, we observe connection issues between proxies in those clusters. Specifically, we are watching the metric hubble_drop_total{reason="Policy denied"} as well as an application metric that reports the number of other proxies each proxy is connected to. Looking at the logs of the cilium agent on the nodes where affected proxies were running, it isn't uncommon to see the connection to etcd establish, disconnect, and reconnect. The problem is fairly reproducible by running kubectl rollout restart deployment clustermesh-apiserver in cluster1 and kubectl rollout restart ds cilium in cluster2.

I considered splitting the Cilium install and the clustermesh install into separate Helm installs, as that would give us control over when each restarts so they don't happen at the same time. Unfortunately, the cilium-ca secret is expected to be created by each install, which causes a conflict (there may be other roadblocks, but this was the first one I hit; it could potentially be worked around by allowing users to specify this secret themselves).
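To sketch the split-install idea: if the chart accepted a pre-generated CA instead of creating the cilium-ca secret itself, both releases could share one CA. The value names below (tls.ca.cert, tls.ca.key, clustermesh.useAPIServer) are assumptions based on the chart layout and may differ in 1.12.x; this is a sketch, not a verified configuration.

```yaml
# Hypothetical Helm values for the "agent" release: disable the bundled
# clustermesh-apiserver (managed by a separate release instead) and point
# both releases at one CA generated out of band.
tls:
  ca:
    cert: "<base64-encoded shared CA certificate>"  # pre-generated, shared
    key: "<base64-encoded shared CA key>"
clustermesh:
  useAPIServer: false  # apiserver deployed by its own Helm release
```

With a shared CA supplied to both releases, neither install would need to create cilium-ca, so the two could be upgraded on independent schedules.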

We are hoping to better understand the following:

  1. What to expect from upgrades: the upgrade guide states "Minimal to None" impact on L3/L4 traffic. Are we seeing that, or are we hitting something else?
  2. Is it expected that Cilium upgrades in a clustermesh happen one cluster at a time? Our FluxCD setup reconciles every 10 minutes and currently has no coordination between clusters, but we are looking at options for this.
  3. Would we be better off configuring our Cilium install to be backed by an external etcd? That seems like it would remove the possibility of etcd connections breaking on Cilium startup, since there wouldn't be a clustermesh-apiserver pod being recreated. Is there documentation for this, by chance?
  4. Could we separate the etcd container (and give it a PV) from the apiserver container so that clustermesh isn't rolled at the same time? Any gotchas with this approach?
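On the coordination point in question 2, one option under FluxCD (assuming a single management cluster reconciles the Cilium releases for all six clusters; the release names here are illustrative, not ours) would be to chain the releases with dependsOn so upgrades roll out one cluster at a time:

```yaml
# Hypothetical HelmRelease for cluster2's Cilium, gated on cluster1's
# release having reconciled successfully first.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cilium-cluster2
  namespace: flux-system
spec:
  dependsOn:
    - name: cilium-cluster1  # upgrade cluster1's release before this one
  interval: 10m
  chart:
    spec:
      chart: cilium
      version: 1.12.3
      sourceRef:
        kind: HelmRepository
        name: cilium
```

Note that dependsOn only waits for the upstream release to report Ready, so it serializes the Helm upgrades but does not by itself guarantee the remote agents have reconnected to clustermesh.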

Thanks for all the work on this awesome CNI!

Cilium Version

1.12.3

Kernel Version

Linux ip-10-16-112-106.us-west-2.compute.internal 5.4.242-155.348.amzn2.x86_64 #1 SMP Mon May 8 12:52:40 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: v1.23.17-eks-0a21954

Sysdump

Sysdump file was too large to upload

Relevant log output

level=info msg="Established connection to remote etcd" clusterName=redact config=/var/lib/cilium/clustermesh/redact kvstoreErr="<nil>" kvstoreStatus="etcd: 1/1 connected, lease-ID=0, lock lease-ID=0, has-quorum=timeout while waiting for initial connection, consecutive-errors=1: https://redact:2379 - 3.5.4 (Leader)" subsys=clustermesh

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • area/agent: Cilium agent related.
  • area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
