Introduce CLI commands to troubleshoot connectivity issues to the etcd kvstore and clustermesh control plane #32336
6ea0617
to
ca28045
Compare
/test |
We shouldn't import testing code into production code, as it can lead to unexpected side effects, e.g., due to init functions. Let's address this by hard-coding the "PolicyEnforcement" constant rather than importing it, consistently with what the "config" command already does. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
ca28045
to
c0ce4d3
Compare
/test |
LGTM, just one small thing.
It is intended to be used by CLI tools to retrieve the configuration files of all remote clusters in a given directory, e.g., for troubleshooting purposes. While at it, let's also replace the path package with the filepath one, which is more appropriate in this context and would theoretically allow handling Windows paths as well. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
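To illustrate the kind of helper this commit describes, here is a minimal sketch that maps each remote cluster configuration file in a directory to its full path, built with filepath.Join. The names listClusterConfigs and isCandidate are hypothetical, and so is the selection rule (every regular, non-hidden file is treated as a cluster configuration named after the cluster); the actual function added by the commit may select files differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isCandidate reports whether a directory entry looks like a cluster
// configuration file. Assumption of this sketch: any non-directory entry
// whose name does not start with a dot qualifies.
func isCandidate(name string, isDir bool) bool {
	return !isDir && !strings.HasPrefix(name, ".")
}

// listClusterConfigs returns a map from file name (assumed to be the cluster
// name) to the full path of the file. Hypothetical helper, for illustration.
func listClusterConfigs(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, fmt.Errorf("reading %s: %w", dir, err)
	}

	configs := make(map[string]string)
	for _, e := range entries {
		if isCandidate(e.Name(), e.IsDir()) {
			// filepath.Join uses the OS-specific separator, unlike path.Join.
			configs[e.Name()] = filepath.Join(dir, e.Name())
		}
	}
	return configs, nil
}

func main() {
	// The directory is taken from the command line; "." is only a placeholder.
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	configs, err := listClusterConfigs(dir)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for name, path := range configs {
		fmt.Printf("%s -> %s\n", name, path)
	}
}
```

Using filepath rather than path matters only on platforms whose separator is not a forward slash, which is the theoretical Windows benefit mentioned in the commit message.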
Troubleshooting etcd connectivity issues, regardless of whether to the Cilium kvstore or to a remote cluster, is a complex activity, as issues can concern network connectivity, TLS certificate mismatches, authn/authz policies and so on.

As an effort to simplify this process, let's introduce a new utility responsible for performing a set of sanity checks, and outputting the result in a user-friendly way. This utility is intended to be then leveraged by dedicated CLI commands integrated with the various components.

More in detail, this utility performs the following operations:

* Asserts that the etcd configuration can be correctly parsed;
* For each endpoint:
  - Outputs the DNS resolution;
  - Asserts that the endpoint is reachable at the network level (i.e., that a TCP connection can be successfully established);
  - When https is enabled, asserts that a TLS connection can be correctly established to the endpoint (i.e., that the provided certificates are valid); the check includes both server and client (if enabled) authentication; additionally outputs TLS specific information;
  - Outputs the version of the endpoint, as returned by GET /version;
* Outputs information regarding Root CAs and client certificates, if configured; additionally checks whether the client certificate is valid according to the root CAs;
* Asserts that the etcd client can correctly establish a connection;
* Asserts that the heartbeat key can be retrieved, as a basic authorization check.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
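The per-endpoint checks listed in this commit message (DNS resolution, TCP reachability, TLS handshake) can be sketched with the Go standard library alone. This is a simplified illustration, not the utility introduced by the PR: parseEndpoint and checkEndpoint are hypothetical names, the 5 second timeout is arbitrary, and the version, client and heartbeat checks are left out.

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net"
	"net/url"
	"time"
)

// parseEndpoint extracts host, port and whether TLS is expected from an etcd
// endpoint URL such as "https://etcd.example:2379". Hypothetical helper.
func parseEndpoint(endpoint string) (host, port string, useTLS bool, err error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", "", false, fmt.Errorf("parsing endpoint: %w", err)
	}
	if u.Hostname() == "" || u.Port() == "" {
		return "", "", false, fmt.Errorf("endpoint %q must include host and port", endpoint)
	}
	return u.Hostname(), u.Port(), u.Scheme == "https", nil
}

// checkEndpoint resolves the host, opens a TCP connection and, for https
// endpoints, performs a TLS handshake using tlsCfg (which may be nil).
func checkEndpoint(endpoint string, tlsCfg *tls.Config) error {
	host, port, useTLS, err := parseEndpoint(endpoint)
	if err != nil {
		return err
	}

	// Arbitrary overall timeout for this sketch.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return fmt.Errorf("DNS resolution: %w", err)
	}
	fmt.Printf("%s resolves to %v\n", host, addrs)

	var dialer net.Dialer
	conn, err := dialer.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
	if err != nil {
		return fmt.Errorf("TCP connection: %w", err)
	}
	defer conn.Close()

	if !useTLS {
		return nil
	}

	cfg := &tls.Config{}
	if tlsCfg != nil {
		cfg = tlsCfg.Clone()
	}
	if cfg.ServerName == "" {
		cfg.ServerName = host // needed to verify the server certificate
	}
	tlsConn := tls.Client(conn, cfg)
	if err := tlsConn.HandshakeContext(ctx); err != nil {
		return fmt.Errorf("TLS handshake: %w", err)
	}
	fmt.Printf("TLS version: %s\n", tls.VersionName(tlsConn.ConnectionState().Version))
	return nil
}

func main() {
	// Placeholder endpoint: replace it with a real etcd endpoint.
	if err := checkEndpoint("http://127.0.0.1:2379", nil); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("endpoint reachable")
}
```

Running the checks in this order helps localize a failure: an error at the DNS step rules out certificate problems, while an error at the handshake step means the network path itself works.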
Introduce two new cilium-dbg commands, namely "troubleshoot kvstore" and "troubleshoot clustermesh", responsible for running a set of sanity checks to help troubleshoot etcd connectivity issues, covering network connectivity, TLS authentication, authn/authz policies and so on. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
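Among the TLS checks these commands rely on, the underlying utility is described as checking whether the client certificate is valid according to the root CAs. A minimal sketch of that kind of check with crypto/x509 follows; verifyClientCert, newDemoCert and newDemoChain are hypothetical names, and the throwaway certificates generated here only stand in for the real etcd client certificate and CA bundle.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// verifyClientCert checks that the PEM-encoded certificate chains up to one of
// the PEM-encoded root CAs and allows client authentication. Hypothetical helper.
func verifyClientCert(rootsPEM, certPEM []byte) error {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(rootsPEM) {
		return errors.New("no valid root CA certificate found")
	}
	block, _ := pem.Decode(certPEM)
	if block == nil || block.Type != "CERTIFICATE" {
		return errors.New("no PEM certificate found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return fmt.Errorf("parsing client certificate: %w", err)
	}
	_, err = cert.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}

// newDemoCert creates a throwaway certificate. With a nil parent it is a
// self-signed CA, otherwise a client certificate signed by the parent.
func newDemoCert(cn string, parent *x509.Certificate, signer *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	if parent == nil {
		tmpl.IsCA, tmpl.BasicConstraintsValid = true, true
		tmpl.KeyUsage, tmpl.ExtKeyUsage = x509.KeyUsageCertSign, nil
		parent, signer = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, signer)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newDemoChain returns the PEM encoding of a demo CA and of a client
// certificate signed by that CA.
func newDemoChain() (caPEM, certPEM []byte, err error) {
	ca, caKey, err := newDemoCert("demo-ca", nil, nil)
	if err != nil {
		return nil, nil, err
	}
	client, _, err := newDemoCert("demo-client", ca, caKey)
	if err != nil {
		return nil, nil, err
	}
	encode := func(c *x509.Certificate) []byte {
		return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: c.Raw})
	}
	return encode(ca), encode(client), nil
}

func main() {
	caPEM, certPEM, err := newDemoChain()
	if err != nil {
		fmt.Println("generating demo certificates:", err)
		return
	}
	if err := verifyClientCert(caPEM, certPEM); err != nil {
		fmt.Println("client certificate is NOT valid:", err)
		return
	}
	fmt.Println("client certificate is valid for the configured root CAs")
}
```

Checking the certificate offline in this way separates a mismatched CA bundle from failures that only show up during the handshake with the remote endpoint.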
As useful to troubleshoot kvstore and clustermesh issues. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Extend the clustermesh-apiserver binary with a new clustermesh-dbg troubleshoot subcommand, responsible for running a set of sanity checks to help troubleshoot etcd connectivity issues, covering network connectivity, TLS authentication, authn/authz policies and so on.

The command can be invoked through something along the lines of:

    $ kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c apiserver \
        -- clustermesh-apiserver clustermesh-dbg troubleshoot

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Introduce a new troubleshoot subcommand to "clustermesh-apiserver kvstoremesh-dbg", responsible for running a set of sanity checks to help troubleshoot etcd connectivity issues, covering network connectivity, TLS authentication, authn/authz policies and so on.

The command can be invoked through something along the lines of:

    $ kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh \
        -- clustermesh-apiserver kvstoremesh-dbg troubleshoot [--include-local]

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
c0ce4d3
to
041321b
Compare
/test |
@giorio94 Very cool. Nice work Marco!
This is awesome, great work!
Troubleshooting etcd connectivity issues, regardless of whether to the Cilium kvstore or to a remote cluster, is a complex activity, as issues can concern network connectivity, TLS certificate mismatches, authn/authz policies and so on.
As an effort to simplify this process, let's introduce new CLI commands, part of the Cilium agent, clustermesh-apiserver and kvstoremesh, responsible for performing a set of sanity checks and outputting the result in a user-friendly way:
* cilium troubleshoot kvstore: troubleshoot connectivity towards the Cilium etcd kvstore;
* cilium troubleshoot clustermesh [clusters...]: troubleshoot connectivity towards remote clusters (the check can be optionally limited to the specified cluster names);
* kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c apiserver -- clustermesh-apiserver clustermesh-dbg troubleshoot: troubleshoot the clustermesh-apiserver connectivity to the local etcd kvstore;
* kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh -- clustermesh-apiserver kvstoremesh-dbg troubleshoot [--include-local] [clusters...]: troubleshoot the kvstoremesh connectivity to the local etcd kvstore and remote clusters (the check can be optionally limited to the specified cluster names).

Example output of cilium troubleshoot clustermesh in case of success is:

Failure examples (providing only the relevant snippet for brevity) include:
I've marked this PR for backport to all stable branches as it qualifies as "Debug tool improvements", and it doesn't introduce risks considering that it includes CLI changes only.

Documentation changes and the collection of clustermesh-apiserver information as part of the sysdump will be addressed in follow-up PRs.
Fixes: #30937