docs: Cilium multi-node (and mesh) Kind Guide #11157
Conversation
Please set the appropriate release note label.
Sorry in advance about the formatting; it's been a long time since I've used Sphinx/reST. I'm not sure whether it should be included in the doc or not, but here's the script I wrote to periodically clean up all the eBPF programs left behind by Cilium every time a kind node/cluster is restarted: https://gist.github.com/dctrwatson/6f2f86747eec29c802afd9afed6ef54e
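(For readers who only want the general idea, here is a rough, untested sketch of that kind of cleanup rather than the gist itself. It assumes Cilium pins its maps under /sys/fs/bpf/tc/globals with a cilium_ prefix and that kind labels its node containers with io.x-k8s.kind.cluster; adjust both if your setup differs.)
$ cat cleanup-cilium-bpf.sh
#!/usr/bin/env bash
# Sketch only -- see the gist above for the actual script.
# Remove pinned Cilium BPF objects on the host, but only when no kind
# node containers are running, so a live cluster is never touched.
if [ -z "$(docker ps -q --filter label=io.x-k8s.kind.cluster)" ]; then
  sudo rm -f /sys/fs/bpf/tc/globals/cilium_*
fi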
Awesome! I had tried this a couple of weeks back but I got the kind configuration wrong so nothing would bootstrap. With this guide I was able to set up a 3-node kind cluster in my normal development environment and got the connectivity checker to work (all except one of the pods, but I think that's probably a YAML issue rather than anything to do with this PR).
I didn't try the multi-cluster instructions but it looks pretty straightforward.
Nice work, LGTM. Minor nits below on the links, passing the CI, and host-services.
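(For anyone reproducing the connectivity check mentioned above, this is roughly what I ran; the manifest path is the in-tree location at the time of this PR, so double-check it against the current tree.)
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/master/examples/kubernetes/connectivity-check/connectivity-check.yaml
$ kubectl get pods    # all connectivity-check pods should eventually be Running and Ready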
This looks awesome! I need to try it out next :-)
A few typos below.
Commit a8ddc292345b731116fb3cb6569783fb0c676fd1 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Force-pushed from a8ddc29 to 1c06ae8.
Gah, sorry. Forgot about that when applying suggestions via GitHub.
@dctrwatson There's a small build error due to a title underline being too short: https://github.com/cilium/cilium/pull/11157/checks?check_run_id=630614851.
Fixed, and confirmed no more build errors locally. |
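(For reference, a quick way to catch these locally before pushing, assuming the Documentation tree still ships the standard Sphinx Makefile target:)
$ make -C Documentation html    # surfaces the same Sphinx warnings/errors the CI reports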
I just deployed Cilium & Hubble using this guide and it works great! I'll try the cluster mesh setup tomorrow morning.
One nit below, regarding dependencies.
got the connectivity checker to work (all except one of the pods, but I think that's probably a YAML issue rather than anything to do with this PR).
@joestringer I can confirm I had all pods working in the connectivity check.
I've tried the cluster mesh setup, but a lot of different pods (cilium, cilium-operator, coredns, etc.) are failing to start when I deploy Cilium. I haven't fully debugged this, but it seems to be some issue with etcd. Which versions of each dependency did you try this setup with? I have tried the following (on Ubuntu 18.04), but can try other combinations:
$ docker --version
Docker version 19.03.8, build afacb8b7f0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-28T05:35:31Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
$ kind --version
kind version 0.8.0-alpha
$ helm version
version.BuildInfo{Version:"v3.2.0", GitCommit:"e11b7ce3b12db2941e90399e874513fbd24bcb71", GitTreeState:"clean", GoVersion:"go1.13.10"}
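(In case it helps anyone narrowing down a similar failure, a reasonable first pass, assuming the default kube-system deployment from the guide:)
$ kubectl -n kube-system get pods -o wide
$ kubectl -n kube-system logs -l k8s-app=cilium --tail=50
$ kubectl -n kube-system get events --sort-by=.lastTimestamp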
Oh, this reminds me of another prerequisite. I don't think it will solve your issue, though, since you would probably have run into it before: Ubuntu and Debian enabled kernel lockdown by default when Secure Boot was enabled, and their kernels from around 2019-09 until 2020-04 disallowed the use of eBPF under lockdown.
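(A quick way to check whether this applies to a given machine:)
$ mokutil --sb-state                     # reports whether Secure Boot is enabled
$ cat /sys/kernel/security/lockdown      # only present on kernels with the lockdown LSM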
I've been using 18.04. Current versions:
$ docker --version
Docker version 19.03.8, build afacb8b7f0
$ kubectl version --client
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.9", GitCommit:"500f5aba80d71253cc01ac6a8622b8377f4a7ef9", GitTreeState:"clean", BuildDate:"2019-11-13T11:21:43Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
$ kind --version
kind version 0.7.0
$ uname -r
5.3.0-46-generic
I think Docker was updated recently, so the previous version of that too. For Helm I've used 3.1.x and 3.2.0. kubectl/kind I haven't updated in a while, and until recently I was using a 4.15 kernel.
I was able to validate the multicluster side of this as well, great work 🎉
The only hiccup I hit was ensuring that I deployed the NodePort service for etcd into the correct namespace. I have a couple of other comments below; we could either address these & merge, or we could merge & follow up with another PR to improve those points.
It also took quite some time to spin up the kind clusters, particularly the image pulls.
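(For anyone hitting the same namespace hiccup: the thing to watch is that the etcd NodePort service lands in the same namespace as the cilium-etcd pods, typically kube-system. The manifest name below is only a placeholder for whatever the guide uses.)
$ kubectl --context kind-cluster1 -n kube-system apply -f etcd-nodeport.yaml   # placeholder filename
$ kubectl --context kind-cluster1 -n kube-system get svc | grep -i etcd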
$ kubectl -n kube-system exec -ti $(get_cilium_pod) -- cilium version
Client: 1.7.3 952090308 2020-04-29T15:29:53-07:00 go version go1.13.10 linux/amd64
Daemon: 1.7.3 952090308 2020-04-29T15:29:53-07:00 go version go1.13.10 linux/amd64
$ ks exec $(get_cilium_pod) -- cilium status --all-health
KVStore: Ok etcd: 1/1 connected, lease-ID=7b5b71eb9c74291a, lock lease-ID=7b5b71eb9c74291c, has-quorum=true: https://cilium-etcd-client.kube-system.svc:2379
- 3.3.12 (Leader)
Kubernetes: Ok 1.17 (v1.17.0) [linux/amd64]
Kubernetes APIs: ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace",
"core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Partial [NodePort (SNAT, 30000-32767), ExternalIPs]
Cilium: Ok OK
NodeMonitor: Disabled
Cilium health daemon: Ok
IPAM: IPv4: 4/255 allocated from 10.2.3.0/24,
ClusterMesh: 1/1 clusters ready, 1 global-services
Controller Status: 36/36 healthy
Proxy Status: OK, ip 10.2.3.9, 0 redirects active on ports 10000-20000
Cluster health: 8/8 reachable (2020-05-06T21:06:31Z)
Name IP Reachable Endpoints reachable
cluster2/cluster2-worker3 (localhost) 172.17.0.8 true true
cluster1/cluster1-control-plane 172.17.0.5 true true
cluster1/cluster1-worker 172.17.0.2 true true
cluster1/cluster1-worker2 172.17.0.3 true true
cluster1/cluster1-worker3 172.17.0.4 true true
cluster2/cluster2-control-plane 172.17.0.9 true true
cluster2/cluster2-worker 172.17.0.7 true true
cluster2/cluster2-worker2 172.17.0.6 true true
$ kubectl exec -ti x-wing-5fd8bf8468-bzqzc -- curl rebel-base
{"Galaxy": "Alderaan", "Cluster": "Cluster-2"}
$ kubectl exec -ti x-wing-5fd8bf8468-bzqzc -- curl rebel-base
{"Galaxy": "Alderaan", "Cluster": "Cluster-1"}
For reference, my versions:
LGTM given @joestringer was able to validate the cluster mesh part. I'll try to debug my setup later.
It also took quite some time to spin up the kind clusters, particularly image pull.
Using the preload tip sped things up a bit for me.
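(If the preload tip in the guide is the usual kind side-load, it amounts to pulling the images once on the host and loading them into each cluster; the image tag below is just the one from this thread:)
$ docker pull cilium/cilium:v1.7.3
$ kind load docker-image cilium/cilium:v1.7.3 --name cluster1
$ kind load docker-image cilium/cilium:v1.7.3 --name cluster2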
Was able to immediately address 2 of the 3 comments.
@dctrwatson Would you mind squashing your commits together before we merge?
Force-pushed from 2ab2e3d to e864d89 (Signed-off-by: John Watson <johnw@planetscale.com>).
No problem, all squashed up.
Guide for setting up a multi-node Kubernetes cluster using kind that passes connectivity-check.yaml @ 3cc04e5.
I also included a bonus Cluster Mesh sandbox environment that passes the same connectivity check and the x-wing/rebel mesh example.
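(Not the guide's exact manifest, just a sketch of the kind of configuration involved: a named multi-node cluster with the default CNI disabled so Cilium can take over. Node counts and names here are illustrative.)
$ cat > kind-cluster1.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
EOF
$ kind create cluster --name cluster1 --config kind-cluster1.yaml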
Fixes: #10948