Member

@giorio94 giorio94 commented May 3, 2024

Troubleshooting etcd connectivity issues, whether towards the Cilium kvstore or a remote cluster, is complex, as problems can stem from network connectivity, TLS certificate mismatches, authn/authz policies and so on.

To simplify this process, let's introduce new CLI commands, part of the Cilium agent, clustermesh-apiserver and kvstoremesh, which perform a set of sanity checks and output the results in a user-friendly way:

  • cilium troubleshoot kvstore: troubleshoot connectivity towards the Cilium etcd kvstore;
  • cilium troubleshoot clustermesh [clusters...]: troubleshoot connectivity towards remote clusters (the check can be optionally limited to the specified cluster names);
  • kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c apiserver -- clustermesh-apiserver clustermesh-dbg troubleshoot: troubleshoot the clustermesh-apiserver connectivity to the local etcd kvstore;
  • kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh -- clustermesh-apiserver kvstoremesh-dbg troubleshoot [--include-local] [clusters...]: troubleshoot the kvstoremesh connectivity to the local etcd kvstore and remote clusters (the check can be optionally limited to the specified cluster names);

Example output of cilium troubleshoot clustermesh in case of success is:

Found 1 remote cluster configurations

Remote cluster "cheerful-rhino":
📄 Configuration path: /var/lib/cilium/clustermesh/clustermesh1

🔌 Endpoints:
   - https://clustermesh-apiserver.kube-system.svc:2379
     ✅ Hostname resolved to: 172.20.2.218
     ✅ TCP connection successfully established to 172.20.2.218:2379
     ✅ TLS connection successfully established to 172.20.2.218:2379
     ℹ  Negotiated TLS version: TLS 1.3, ciphersuite TLS_AES_128_GCM_SHA256
     ℹ  Etcd server version: 3.5.13

🔑 Digital certificates:
   ✅ TLS Root CA certificates:
      - Serial number:       c0:ae:dd:fc:cd:20:b9:ca:c6:e6:be:ca:8e:b7:a1:03
        Subject:             CN=Cilium CA
        Issuer:              CN=Cilium CA
        Validity:
          Not before:  2024-05-03 07:15:23 +0000 UTC
          Not after:   2027-05-03 07:15:23 +0000 UTC
   ✅ TLS client certificates:
      - Serial number:       37:b3:23:16:b4:a0:61:b1:59:54:59:7a:e6:e2:54:13
        Subject:             CN=remote
        Issuer:              CN=Cilium CA
        Validity:
          Not before:  2024-05-03 07:15:29 +0000 UTC
          Not after:   2027-05-03 07:15:29 +0000 UTC

⚙ Etcd client:
   ✅ Etcd connection successfully established
   ℹ  Etcd cluster ID: 85e29469f06c55e1

Failure examples (providing only the relevant snippet for brevity) include:

  • The hostname doesn't resolve to any IP:
🔌 Endpoints:
   - https://bar.mesh.cilium.io:32380
     ❌ Cannot resolve hostname: lookup bar.mesh.cilium.io: no such host
     ❌ Cannot establish TCP connection to bar.mesh.cilium.io:32380: dial tcp: lookup bar.mesh.cilium.io: no such host
  • Failed to establish TCP connection:
🔌 Endpoints:
   - https://clustermesh2.mesh.cilium.io:9999
     ✅ Hostname resolved to: 172.19.0.4
     ❌ Cannot establish TCP connection to clustermesh2.mesh.cilium.io:9999: dial tcp 172.19.0.4:9999: connect: connection refused
  • The server certificate is not valid for the given IP/hostname:
🔌 Endpoints:
   - https://172.19.0.4:32380
     ✅ TCP connection successfully established to 172.19.0.4:32380
     ❌ Cannot establish TLS connection to 172.19.0.4:32380: tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, ::1, not 172.19.0.4
  - https://foo.cilium.io:32380
     ✅ Hostname resolved to: 172.19.0.4
     ✅ TCP connection successfully established to 172.19.0.4:32380
     ❌ Cannot establish TLS connection to foo.cilium.io:32380: tls: failed to verify certificate: x509: certificate is valid for clustermesh-apiserver.cilium.io, *.mesh.cilium.io, clustermesh-apiserver.kube-system.svc, not foo.cilium.io
  • The server certificate is signed by an unknown root CA:
🔌 Endpoints:
   - https://clustermesh2.mesh.cilium.io:32380
     ✅ Hostname resolved to: 172.19.0.4
     ✅ TCP connection successfully established to 172.19.0.4:32380
     ❌ Cannot establish TLS connection to clustermesh2.mesh.cilium.io:32380: tls: failed to verify certificate: x509: certificate signed by unknown authority
  • The client certificate is rejected by the peer as signed by an unknown CA:
🔌 Endpoints:
   - https://clustermesh2.mesh.cilium.io:32380
     ✅ Hostname resolved to: 172.19.0.4
     ✅ TCP connection successfully established to 172.19.0.4:32380
     ✅ TLS connection successfully established to 172.19.0.4:32380
     ℹ  Negotiated TLS version: TLS 1.3, ciphersuite TLS_AES_128_GCM_SHA256
     ❌ TLS client authentication failed: remote error: tls: unknown certificate authority

🔑 Digital certificates:
   ...
   ✅ TLS client certificates:
      - Serial number:       37:b3:23:16:b4:a0:61:b1:59:54:59:7a:e6:e2:54:13
        Subject:             CN=remote
        Issuer:              CN=Cilium CA
        Validity:
          Not before:  2024-05-03 07:15:29 +0000 UTC
          Not after:   2027-05-03 07:15:29 +0000 UTC
        ⚠ Cannot verify certificate with the configured root CAs
  • The etcd user is not authorized to retrieve the data:
⚙ Etcd client:
   ❌ Failed to retrieve key from etcd: etcdserver: permission denied

I've marked this PR for backport to all stable branches as it qualifies as Debug tool improvements, and it doesn't introduce risks considering that it includes CLI changes only.

Documentation changes and the collection of clustermesh-apiserver information as part of the sysdump will be addressed in follow-up PRs.

Fixes: #30937

Introduce CLI commands to troubleshoot connectivity issues to the etcd kvstore and clustermesh control plane

@giorio94 giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/bugtool Impacts gathering of data for debugging purposes. area/clustermesh Relates to multi-cluster routing functionality in Cilium. area/kvstore Impacts the KVStore package interactions. backport/author The backport will be carried out by the author of the PR. needs-backport/1.13 labels May 3, 2024
@giorio94 giorio94 force-pushed the mio/etcd-troubleshoot branch from 6ea0617 to ca28045 Compare May 3, 2024 09:22
Member Author

giorio94 commented May 3, 2024

/test

We shouldn't import testing code into production code, as it can lead to
unexpected side effects due to, e.g., init functions. Let's address this
by hard-coding the "PolicyEnforcement" constant rather than importing
it, consistently with what the "config" command already does.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/etcd-troubleshoot branch from ca28045 to c0ce4d3 Compare May 3, 2024 13:26
Member Author

giorio94 commented May 3, 2024

/test

Member

@aanm aanm left a comment

LGTM, just one small thing.

giorio94 added 6 commits May 6, 2024 09:55
It is intended to be used by CLI tools to retrieve the configuration
files of all remote clusters in a given directory, to be used, e.g.,
for troubleshooting purposes.

While at it, let's also replace the path package with the filepath
one, which is more appropriate in this context and would, in theory,
also allow handling Windows paths.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Troubleshooting etcd connectivity issues, whether towards the Cilium
kvstore or a remote cluster, is complex, as problems can stem from
network connectivity, TLS certificate mismatches, authn/authz policies
and so on.

To simplify this process, let's introduce a new utility that performs
a set of sanity checks and outputs the results in a user-friendly way.
It is intended to be leveraged by dedicated CLI commands integrated
with the various components. In more detail, the utility performs the
following operations:

* Asserts that the etcd configuration can be correctly parsed;
* For each endpoint:
  - Outputs the DNS resolution;
  - Asserts that the endpoint is reachable at the network level (i.e.,
    that a TCP connection can be successfully established);
  - When https is enabled, asserts that a TLS connection can be correctly
    established to the endpoint (i.e., that the provided certificates
    are valid); the check includes both server and client (if enabled)
    authentication; additionally outputs TLS specific information;
  - Outputs the version of the endpoint, as returned by GET /version;
* Outputs information regarding Root CAs and client certificates, if
  configured; additionally checks whether the client certificate is
  valid according to the root CAs;
* Asserts that the etcd client can correctly establish a connection;
* Asserts that the heartbeat key can be retrieved, as a basic
  authorization check.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Introduce two new cilium-dbg commands, namely "troubleshoot kvstore" and
"troubleshoot clustermesh", responsible for running a set of sanity
checks to help troubleshoot etcd connectivity issues, covering network
connectivity, TLS authentication, authn/authz policies and so on.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
As useful to troubleshoot kvstore and clustermesh issues.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Extend the clustermesh-apiserver binary with a new clustermesh-dbg
troubleshoot subcommand, responsible for running a set of sanity
checks to help troubleshoot etcd connectivity issues, covering network
connectivity, TLS authentication, authn/authz policies and so on.

The command can be invoked through something along the lines of:

$ kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c apiserver \
  -- clustermesh-apiserver clustermesh-dbg troubleshoot

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Introduce a new troubleshoot subcommand to "clustermesh-apiserver
kvstoremesh-dbg", responsible for running a set of sanity checks
to help troubleshoot etcd connectivity issues, covering network
connectivity, TLS authentication, authn/authz policies and so on.

The command can be invoked through something along the lines of:

$ kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh \
  -- clustermesh-apiserver kvstoremesh-dbg troubleshoot [--include-local]

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/etcd-troubleshoot branch from c0ce4d3 to 041321b Compare May 6, 2024 07:56
Member Author

giorio94 commented May 6, 2024

/test

Contributor

@derailed derailed left a comment

@giorio94 Very cool. Nice work Marco!

Contributor

@thorn3r thorn3r left a comment

This is awesome, great work!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 8, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue May 9, 2024
Merged via the queue into cilium:main with commit f575f94 May 9, 2024
@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels May 16, 2024
@giorio94 giorio94 mentioned this pull request May 16, 2024
2 tasks
@giorio94 giorio94 added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels May 16, 2024
@giorio94 giorio94 mentioned this pull request May 16, 2024
2 tasks
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. backport-pending/1.13 backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels May 16, 2024
Labels
area/bugtool Impacts gathering of data for debugging purposes. area/clustermesh Relates to multi-cluster routing functionality in Cilium. area/kvstore Impacts the KVStore package interactions. backport/author The backport will be carried out by the author of the PR. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Development

Successfully merging this pull request may close these issues.

CFP: Simplify troubleshooting connectivity issues towards control plane components
5 participants