@giorio94 giorio94 commented May 16, 2024

Once this PR is merged, a GitHub action will update the labels of these PRs:

 32156 32336 32552

[ upstream commit f437b70 ]

Let's ensure consistent ordering by sorting the slice of remote clusters
status information, as the order is otherwise undefined given that the
slice is generated by iterating over map values.
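
A minimal Go sketch of the idea, not the code from the upstream commit. The RemoteCluster type and the sortedStatuses helper are hypothetical names used only for illustration. Go does not guarantee map iteration order, so the values are collected into a slice and sorted by name before being returned:

```go
package main

import (
	"fmt"
	"sort"
)

// RemoteCluster is a hypothetical, trimmed-down stand-in for a remote
// cluster status entry.
type RemoteCluster struct {
	Name  string
	Ready bool
}

// sortedStatuses collects the map values into a slice and sorts them by
// name, so that repeated calls return the entries in the same order.
func sortedStatuses(clusters map[string]RemoteCluster) []RemoteCluster {
	out := make([]RemoteCluster, 0, len(clusters))
	for _, c := range clusters {
		out = append(out, c)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	clusters := map[string]RemoteCluster{
		"cluster-b": {Name: "cluster-b", Ready: true},
		"cluster-a": {Name: "cluster-a", Ready: false},
		"cluster-c": {Name: "cluster-c", Ready: true},
	}
	for _, c := range sortedStatuses(clusters) {
		fmt.Printf("%s ready=%t\n", c.Name, c.Ready)
	}
}
```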

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added the kind/backports and backport/1.15 labels May 16, 2024
giorio94 added 17 commits May 16, 2024 10:26
[ upstream commit 796bb18 ]

[ backporter's notes: skipped the Makefile.defs hunk, as the comment is
  not present. ]

Introduce a new KVStoreMesh API definition, which currently exposes a
/clusters path to provide information about the status of the connection
to remote clusters, mimicking the data exposed by Cilium agents.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit c8389d5 ]

[ backporter's notes: hit minor conflicts due to different surrounding
  contexts, solved accepting the combination of changes, and with
  trivial manual adaptations. ]

Let's mimic the logic already provided by the clustermesh subsystem
of the Cilium agent, which allows retrieving key information about
the connection to and data retrieval from each remote cluster. A subsequent
commit is going to wire it to the /clusters API, so that it can then be
accessed through a dedicated CLI.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit f7bd2b4 ]

Wire the API server logic to expose the kvstoremesh API, and register
the handler which returns the remote clusters status information. By
default, the API is served on http://localhost:9889, although the
address can be tuned through a dedicated parameter.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit e670ca6 ]

Extract this logic into a separate function, so that it can be reused
for the kvstoremesh-dbg command as well. Similarly, let's also slightly
refactor and export the NumReadyClusters helper function.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 7cf1f29 ]

Extend the remote clusters output logic to support an additional
verbosity level, to be leveraged by the kvstoremesh-dbg command.

Specifically, supported verbosity levels are:
* verbose: outputs the full information for all clusters;
* brief: outputs the full information for non-ready clusters, and
  a brief one-line summary for ready ones;
* non-ready-only: outputs the full information for non-ready clusters,
  and omits the ready ones.
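
The three levels can be sketched in Go as follows. This is an illustrative sketch under assumptions, not the upstream implementation: the Verbosity constants, the Cluster type, and the Render function are hypothetical names, and the output layout is invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Verbosity is a hypothetical enumeration of the three output levels.
type Verbosity int

const (
	Verbose Verbosity = iota
	Brief
	NonReadyOnly
)

// Cluster is a hypothetical, simplified status entry.
type Cluster struct {
	Name   string
	Ready  bool
	Detail string
}

// Render prints full information for non-ready clusters at every level;
// ready clusters get full output (Verbose), a one-line summary (Brief),
// or nothing at all (NonReadyOnly).
func Render(clusters []Cluster, v Verbosity) string {
	var sb strings.Builder
	for _, c := range clusters {
		switch {
		case !c.Ready || v == Verbose:
			fmt.Fprintf(&sb, "%s: ready=%t\n  %s\n", c.Name, c.Ready, c.Detail)
		case v == Brief:
			fmt.Fprintf(&sb, "%s: ready\n", c.Name)
		default:
			// NonReadyOnly: ready clusters are omitted.
		}
	}
	return sb.String()
}

func main() {
	clusters := []Cluster{
		{Name: "cluster-a", Ready: true, Detail: "all endpoints connected"},
		{Name: "cluster-b", Ready: false, Detail: "connection failing"},
	}
	fmt.Print(Render(clusters, Brief))
}
```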

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 4857bc7 ]

Extend the clustermesh-apiserver binary with a new kvstoremesh-dbg
subcommand to interact with the kvstoremesh API, and specifically
allow querying and outputting the status of the connection to remote
clusters. The command can be invoked through something along the
lines of:

$ kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh \
    -- clustermesh-apiserver kvstoremesh-dbg status

And outputs the status using the same format as the clustermesh section
reported by cilium-dbg status --all-clusters. By default, the output
includes a brief one-line report for ready clusters, and full information
for non-ready ones. Full information for all clusters can be retrieved
by specifying the --verbose flag.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit d0af3d7 ]

We shouldn't import testing code into production code, as it can lead to
unexpected side effects due to e.g., init functions. Let's address this
by hard-coding the "PolicyEnforcement" constant, rather than importing
it. This is consistent with the same usage as part of the "config" command.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit cfb3b8a ]

It is intended to be used by CLI tools to retrieve the configuration
files of all remote clusters in a given directory, to be used, e.g.,
for troubleshooting purposes.

While at it, let's also replace the path package with the filepath
one, which is more appropriate in this context, and would in theory
allow handling Windows paths as well.
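
A short Go sketch of the difference between the two standard library packages. The configPath helper and the directory and cluster names are hypothetical and not taken from the commit:

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// configPath builds the location of a cluster configuration file using the
// OS-specific separator. Both parameters are hypothetical examples.
func configPath(dir, cluster string) string {
	return filepath.Join(dir, cluster)
}

func main() {
	// path.Join always uses forward slashes; it targets slash-separated
	// paths such as URLs.
	fmt.Println(path.Join("clustermesh", "cluster-a"))
	// filepath.Join uses the separator of the operating system the
	// program is built for.
	fmt.Println(configPath("clustermesh", "cluster-a"))
}
```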

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 2d07cfc ]

[ backporter's notes: replaced cmp.Or usage, as not yet available
  in go 1.21. ]
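
For context, cmp.Or (added in Go 1.22) returns the first of its arguments that is not the zero value. The replacement actually used in the backport is not shown here; the sketch below is only one possible equivalent for older toolchains, and firstNonZero is a hypothetical name:

```go
package main

import "fmt"

// firstNonZero returns the first argument that differs from the zero value
// of T, or the zero value if there is none. It mirrors the behavior of
// cmp.Or for toolchains that predate Go 1.22.
func firstNonZero[T comparable](vals ...T) T {
	var zero T
	for _, v := range vals {
		if v != zero {
			return v
		}
	}
	return zero
}

func main() {
	fmt.Println(firstNonZero("", "fallback"))
	fmt.Println(firstNonZero(0, 0, 7))
}
```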

Troubleshooting etcd connectivity issues, regardless of whether to the
Cilium kvstore or to a remote cluster, is a complex activity, as issues
can concern network connectivity, TLS certificates mismatch, authn/authz
policies and so on.

As an effort to simplify this process, let's introduce a new utility
responsible for performing a set of sanity checks, and outputting the
result in a user-friendly way. This utility is intended to be then
leveraged by dedicated CLI commands integrated with the various
components. More in detail, this utility performs the following
operations:

* Asserts that the etcd configuration can be correctly parsed;
* For each endpoint:
  - Outputs the DNS resolution;
  - Asserts that the endpoint is reachable at the network level (i.e.,
    that a TCP connection can be successfully established);
  - When https is enabled, asserts that a TLS connection can be correctly
    established to the endpoint (i.e., that the provided certificates
    are valid); the check includes both server and client (if enabled)
    authentication; additionally outputs TLS specific information;
  - Outputs the version of the endpoint, as returned by GET /version;
* Outputs information regarding Root CAs and client certificates, if
  configured; additionally checks whether the client certificate is
  valid according to the root CAs;
* Asserts that the etcd client can correctly establish a connection;
* Asserts that the heartbeat key can be retrieved, as a basic
  authorization check.
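
As an illustration of the network-level reachability step only, here is a minimal Go sketch. It is not the utility introduced by this commit: checkTCP and demoLoopback are hypothetical helpers, and the demo targets a throwaway loopback listener rather than a real etcd endpoint.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkTCP reports whether a TCP connection to addr can be established
// within the given timeout.
func checkTCP(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("endpoint %s not reachable: %w", addr, err)
	}
	return conn.Close()
}

// demoLoopback probes a temporary loopback listener while it is open and
// again after it has been closed.
func demoLoopback() (open, closed bool) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()
	open = checkTCP(addr, time.Second) == nil
	ln.Close()
	closed = checkTCP(addr, time.Second) == nil
	return open, closed
}

func main() {
	open, closed := demoLoopback()
	fmt.Println("reachable while listening:", open)
	fmt.Println("reachable after close:", closed)
}
```

A fuller check would follow the same pattern for the remaining steps (DNS resolution, TLS handshake, version query), reporting each result separately so that the failing layer is easy to spot.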

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 9654576 ]

Introduce two new cilium-dbg commands, namely "troubleshoot kvstore" and
"troubleshoot clustermesh", responsible for running a set of sanity
checks to help troubleshoot etcd connectivity issues, covering network
connectivity, TLS authentication, authn/authz policies and so on.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 9156e23 ]

As useful to troubleshoot kvstore and clustermesh issues.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 6fae9e7 ]

Extend the clustermesh-apiserver binary with a new clustermesh-dbg
troubleshoot subcommand, responsible for running a set of sanity
checks to help troubleshoot etcd connectivity issues, covering network
connectivity, TLS authentication, authn/authz policies and so on.

The command can be invoked through something along the lines of:

$ kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c apiserver \
  -- clustermesh-apiserver clustermesh-dbg troubleshoot

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit f575f94 ]

Introduce a new troubleshoot subcommand to "clustermesh-apiserver
kvstoremesh-dbg", responsible for running a set of sanity checks
to help troubleshoot etcd connectivity issues, covering network
connectivity, TLS authentication, authn/authz policies and so on.

The command can be invoked through something along the lines of:

$ kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh \
  -- clustermesh-apiserver kvstoremesh-dbg troubleshoot [--include-local]

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 4172c62 ]

Document the usage of the newly introduced troubleshoot command to
investigate connectivity issues towards the clustermesh control plane
(i.e., etcd) in remote clusters.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 48b36f5 ]

When KVStoreMesh is enabled, this component is responsible for
connecting to the remote clusters. Document the command which
can be used to inspect its status and validate whether connections
are established correctly.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 189e8ba ]

Add a clarification note that the manual steps presented in the guide
are mostly an alternative to using the automatic tools described in the
previous section. Additionally, drop the example errors from the TLS
certificates step, as potentially misleading. Users should leverage
the troubleshoot command instead. Finally, let's fix a couple of typos.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
[ upstream commit 913e41b ]

They apply only when Cilium is configured in kvstore mode, which is
seldom the case these days. The lack of local information is also not
clustermesh specific, and would imply other serious issues. Moreover,
the given checks would not work, and lead to additional confusion when
Cilium operates in CRD mode. Hence, let's just replace them with the
suggestion of checking whether both Cilium agents and KVStoreMesh
(if enabled) are correctly connected to all remote clusters, and the
synchronization has completed.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the pr/v1.15-backport-2024-05-16-09-52 branch from f210127 to 196facb Compare May 16, 2024 08:28
@giorio94 giorio94 added the area/clustermesh label May 16, 2024
@giorio94

/test-backport-1.15

@giorio94 giorio94 marked this pull request as ready for review May 16, 2024 09:25
@giorio94 giorio94 requested review from a team as code owners May 16, 2024 09:25
@giorio94 giorio94 requested a review from nathanjsweet May 16, 2024 09:25
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge label May 16, 2024
@nathanjsweet nathanjsweet merged commit 963c406 into v1.15 May 16, 2024
@nathanjsweet nathanjsweet deleted the pr/v1.15-backport-2024-05-16-09-52 branch May 16, 2024 21:00