
Conversation

dctrwatson
Contributor

Guide for setting up a multi-node Kubernetes cluster using kind that passes the connectivity-check.yaml @ 3cc04e5

I also included a bonus Cluster Mesh sandbox environment that passes the same connectivity check and the x-wing/rebel mesh example.

Fixes: #10948
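For reference, a multi-node kind cluster of the shape this guide describes can be sketched with a config like the one below. The node counts, the cluster name, and disabling the default CNI are my assumptions based on typical Cilium-on-kind setups, not necessarily the guide's exact config:

```shell
# Hypothetical sketch: create a 1 control-plane + 3 worker kind cluster with
# the default CNI disabled so Cilium can be installed as the CNI instead.
cat <<EOF | kind create cluster --name cluster1 --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
EOF
```

(kind 0.7.0 and later use the `kind.x-k8s.io/v1alpha4` config API.)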

@dctrwatson dctrwatson requested a review from a team as a code owner April 25, 2020 05:43
@maintainer-s-little-helper

Please set the appropriate release note label.

@dctrwatson
Contributor Author

Sorry in advance about formatting/etc. It's been a long time since I've used sphinx/rST.

I'm not sure if it should be included in the doc or not, but here's the script I wrote to periodically clean up all the eBPF programs left behind by Cilium every time a kind node/cluster is restarted: https://gist.github.com/dctrwatson/6f2f86747eec29c802afd9afed6ef54e
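The linked gist is the authoritative version; purely as an illustration of the idea (this is not the gist's actual contents), a script that prunes stale Cilium-pinned objects from the bpf filesystem might look like:

```shell
#!/bin/sh
# Hypothetical sketch only -- see the linked gist for the real script.
# Removes Cilium-pinned bpf objects left behind after kind nodes restart.
# Takes the bpffs mount point as an optional argument (default /sys/fs/bpf).
BPF_ROOT="${1:-/sys/fs/bpf}"

# Find pinned objects whose names look Cilium-related and delete them.
find "$BPF_ROOT" -maxdepth 2 -name 'cilium*' 2>/dev/null |
while read -r pin; do
    echo "removing $pin"
    rm -rf "$pin"
done
```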

@aanm aanm requested a review from seanmwinn April 27, 2020 11:47
@aanm aanm added area/clustermesh Relates to multi-cluster routing functionality in Cilium. area/documentation Impacts the documentation, including textual changes, sphinx, or other doc generation code. release-note/misc This PR makes changes that have no direct user impact. labels Apr 27, 2020
@coveralls

coveralls commented Apr 27, 2020

Coverage Status

Coverage decreased (-0.01%) to 37.886% when pulling e864d89 on planetscale:kind-sandbox-doc into 3a4c386 on cilium:master.

Member

@joestringer joestringer left a comment

Awesome! I had tried this a couple of weeks back but I got the kind configuration wrong so nothing would bootstrap. With this guide I was able to set up a 3-node kind cluster in my normal development environment and got the connectivity checker to work (all except one of the pods, but I think that's probably a YAML issue rather than anything to do with this PR).

I didn't try the multi-cluster instructions but it looks pretty straightforward.

Nice work, LGTM. Minor nit below on the links / passing the CI and host-services.

Member

@pchaigno pchaigno left a comment

This looks awesome! I need to try it out next :-)

A few typos below.

@maintainer-s-little-helper

Commit a8ddc292345b731116fb3cb6569783fb0c676fd1 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 29, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 29, 2020
@dctrwatson
Contributor Author

Commit a8ddc29 does not contain "Signed-off-by".

Gah, sorry. Forgot about that when applying suggestions via GitHub.
Just pushed a signed-off version of the same commit.
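For anyone hitting the same bot, adding the missing sign-off to the most recent commit is a one-liner (the branch name below is the one from this PR):

```shell
# Add a Signed-off-by trailer to the most recent commit without changing
# its message, then update the PR branch.
git commit --amend --signoff --no-edit
git push --force-with-lease origin kind-sandbox-doc
```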

@pchaigno
Member

@dctrwatson There's a small build error due to a title underline being too short: https://github.com/cilium/cilium/pull/11157/checks?check_run_id=630614851.

@dctrwatson
Contributor Author

@dctrwatson There's a small build error due to a title underline being too short: #11157 (checks).

Fixed, and confirmed no more build errors locally.

Member

@pchaigno pchaigno left a comment

I just deployed Cilium & Hubble using this guide and it works great! I'll try the cluster mesh setup tomorrow morning.

One nit below, regarding dependencies.

got the connectivity checker to work (all except one of the pods, but I think that's probably a YAML issue rather than anything to do with this PR).

@joestringer I can confirm I had all pods working in the connectivity check.

@pchaigno
Member

pchaigno commented Apr 30, 2020

I've tried the cluster mesh setup, but a lot of different pods (cilium, cilium-operator, coredns, etc.) are failing to start when I deploy Cilium. I haven't fully debugged this, but it seems to be some issue with etcd.

On what version of each of the dependencies did you try this setup? I have tried with the following (+Ubuntu 18.04), but can try other combinations:

$ docker --version
Docker version 19.03.8, build afacb8b7f0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-28T05:35:31Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
$ kind --version
kind version 0.8.0-alpha
$ helm version
version.BuildInfo{Version:"v3.2.0", GitCommit:"e11b7ce3b12db2941e90399e874513fbd24bcb71", GitTreeState:"clean", GoVersion:"go1.13.10"}

@dctrwatson
Contributor Author

I've tried the cluster mesh setup, but a lot of different pods (cilium, cilium-operator, coredns, etc.) are failing to start when I deploy Cilium. I haven't fully debugged this, but it seems to be some issue with etcd.

Oh, this reminds me of another prerequisite. I don't think it will solve your issue though, since I think you would've run into it before.

Ubuntu and Debian enable kernel lockdown by default when Secure Boot is enabled, and their kernels from around 2019-09 until 2020-04 disallowed the use of eBPF under lockdown.

5.3.0-46 (and the latest patch release of each supported kernel series) has the patch applied that removes the restriction on eBPF under lockdown.
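A quick way to check whether lockdown is in effect (the sysfs path is the standard lockdown interface, but whether it exists depends on how the kernel was built):

```shell
#!/bin/sh
# Print the kernel lockdown mode if the kernel exposes it.
# Typical output: "none [integrity] confidentiality", where brackets
# mark the active mode.
if [ -r /sys/kernel/security/lockdown ]; then
    cat /sys/kernel/security/lockdown
else
    echo "no lockdown interface: kernel built without lockdown support"
fi
```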

On what version of each of the dependencies did you try this setup? I have tried with the following (+Ubuntu 18.04), but can try other combinations:

I've been using 18.04.

Current versions:

$ docker --version
Docker version 19.03.8, build afacb8b7f0
$ kubectl version --client
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.9", GitCommit:"500f5aba80d71253cc01ac6a8622b8377f4a7ef9", GitTreeState:"clean", BuildDate:"2019-11-13T11:21:43Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
$ kind --version
kind version 0.7.0
$ uname -r
5.3.0-46-generic

Docker was updated recently, so I had also used the previous version of that. For Helm I've used 3.1.x and 3.2.0. kubectl/kind I haven't updated in a while, and until recently I was using a 4.15 kernel.

Member

@joestringer joestringer left a comment

I was able to validate the multicluster side of this as well, great work 🎉

The only hiccup I hit was making sure I deployed the NodePort service for etcd into the correct namespace. I have a couple of other comments below; we could either address them and merge, or merge and follow up with another PR to improve those points.

It also took quite some time to spin up the kind clusters, particularly image pull.

$ kubectl -n kube-system exec -ti $(get_cilium_pod) -- cilium version
Client: 1.7.3 952090308 2020-04-29T15:29:53-07:00 go version go1.13.10 linux/amd64
Daemon: 1.7.3 952090308 2020-04-29T15:29:53-07:00 go version go1.13.10 linux/amd64
$ ks exec $(get_cilium_pod) -- cilium status --all-health
KVStore:                Ok   etcd: 1/1 connected, lease-ID=7b5b71eb9c74291a, lock lease-ID=7b5b71eb9c74291c, has-quorum=true: https://cilium-etcd-client.kube-system.svc:2379 
- 3.3.12 (Leader)
Kubernetes:             Ok   1.17 (v1.17.0) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace",
 "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Partial   [NodePort (SNAT, 30000-32767), ExternalIPs]
Cilium:                 Ok        OK
NodeMonitor:            Disabled
Cilium health daemon:   Ok   
IPAM:                   IPv4: 4/255 allocated from 10.2.3.0/24, 
ClusterMesh:            1/1 clusters ready, 1 global-services
Controller Status:      36/36 healthy
Proxy Status:           OK, ip 10.2.3.9, 0 redirects active on ports 10000-20000
Cluster health:                           8/8 reachable   (2020-05-06T21:06:31Z)
  Name                                    IP              Reachable   Endpoints reachable
  cluster2/cluster2-worker3 (localhost)   172.17.0.8      true        true
  cluster1/cluster1-control-plane         172.17.0.5      true        true
  cluster1/cluster1-worker                172.17.0.2      true        true
  cluster1/cluster1-worker2               172.17.0.3      true        true
  cluster1/cluster1-worker3               172.17.0.4      true        true
  cluster2/cluster2-control-plane         172.17.0.9      true        true
  cluster2/cluster2-worker                172.17.0.7      true        true
  cluster2/cluster2-worker2               172.17.0.6      true        true
$ kubectl exec -ti x-wing-5fd8bf8468-bzqzc -- curl rebel-base
{"Galaxy": "Alderaan", "Cluster": "Cluster-2"}
$ kubectl exec -ti x-wing-5fd8bf8468-bzqzc -- curl rebel-base
{"Galaxy": "Alderaan", "Cluster": "Cluster-1"}

@joestringer
Member

For reference, my versions:

$ docker --version
Docker version 19.03.6, build 369ce74a3c
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-21T01:25:41Z", GoVersion:"go1.13.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2020-01-14T00:09:19Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
$ kind --version
kind version 0.7.0
$ helm version
version.BuildInfo{Version:"v3.2.0", GitCommit:"e11b7ce3b12db2941e90399e874513fbd24bcb71", GitTreeState:"clean", GoVersion:"go1.13.10"}

Member

@pchaigno pchaigno left a comment

LGTM given @joestringer was able to validate the cluster mesh part. I'll try to debug my setup later.

It also took quite some time to spin up the kind clusters, particularly image pull.

Using the preload tip sped things up a bit for me.
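The preload tip presumably amounts to loading images into the kind nodes ahead of time, which kind supports via `kind load docker-image` (the image tag and cluster names below are illustrative, taken from the versions discussed in this thread):

```shell
# Pull the image once on the host, then copy it into every node of each
# kind cluster, so the nodes don't each pull it over the network.
docker pull cilium/cilium:v1.7.3
kind load docker-image cilium/cilium:v1.7.3 --name cluster1
kind load docker-image cilium/cilium:v1.7.3 --name cluster2
```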

Contributor Author

@dctrwatson dctrwatson left a comment

I was able to immediately address 2 of the 3 comments.

@rolinh
Member

rolinh commented May 8, 2020

@dctrwatson Would you mind squashing your commits together before we merge?

Signed-off-by: John Watson <johnw@planetscale.com>
@dctrwatson
Contributor Author

@dctrwatson Would you mind squashing your commits together before we merge?

No problem, all squashed up.
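For reference, one way to squash a branch's commits into one before merge (the branch and base names below come from this PR's coverage report; the commit message and `origin` remote are assumptions):

```shell
# Squash every commit on the PR branch into a single signed-off commit on
# top of master, keeping the working tree exactly as it is.
git fetch origin master
git reset --soft origin/master            # keep the tree, drop the commit history
git commit -s -m "Add kind multi-node cluster guide"   # message is illustrative
git push --force-with-lease origin kind-sandbox-doc
```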

Successfully merging this pull request may close these issues.

Request: Cilium multi-node "kind" guide