
Conversation

tklauser (Member)

If the agent liveness/readiness probe host is set to the IPv6 address
::1 instead of the default IPv4 127.0.0.1, Cilium never becomes ready in
an IPv6-only environment. This is because the daemon health endpoint
currently listens on localhost:9876, which does not bind to both IPv4
and IPv6.

To fix this, listen on both IPv4 and IPv6 explicitly (depending on the
daemon's enable-ipv{4,6} flags) and only fail with an error if both of
them fail or one was disabled and the other one fails.
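As a rough sketch of that dual-listen strategy, the logic could look like the following Go snippet. Names such as `startHealthListeners`, and the use of port 0 loopback addresses, are illustrative assumptions, not Cilium's actual code:

```go
package main

import (
	"fmt"
	"net"
)

// startHealthListeners tries to listen on the loopback address of each
// enabled protocol family and only reports an error if no listener at
// all could be started. Port 0 lets the OS pick a free port for this demo.
func startHealthListeners(ipv4, ipv6 bool) ([]net.Listener, error) {
	var listeners []net.Listener
	var errs []error
	if ipv4 {
		if ln, err := net.Listen("tcp4", "127.0.0.1:0"); err != nil {
			errs = append(errs, err)
		} else {
			listeners = append(listeners, ln)
		}
	}
	if ipv6 {
		if ln, err := net.Listen("tcp6", "[::1]:0"); err != nil {
			errs = append(errs, err)
		} else {
			listeners = append(listeners, ln)
		}
	}
	if len(listeners) == 0 {
		return nil, fmt.Errorf("no health listener could be started: %v", errs)
	}
	return listeners, nil
}

func main() {
	lns, err := startHealthListeners(true, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ln := range lns {
		fmt.Println("listening on", ln.Addr())
		ln.Close()
	}
}
```

The key point is that a bind failure for one address family is only collected, and the daemon errors out only when neither family could be bound (or an explicitly enabled one fails).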

Also change liveness and readiness probes to perform the requests on 127.0.0.1 or ::1 depending on the enable-ipv4 flag as suggested in #13165 (comment).

Fixes #13165

Fix agent liveness/readiness probes for IPv6-only environment.

@tklauser tklauser added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Sep 17, 2020
@tklauser tklauser requested review from aanm and a team September 17, 2020 11:19
@tklauser tklauser requested review from a team as code owners September 17, 2020 11:19
@tklauser tklauser requested a review from a team September 17, 2020 11:19
@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from 6be911b to 3ce4e11 Compare September 17, 2020 11:22
@tklauser (Member, Author)

test-me-please

@tklauser (Member, Author) commented Sep 17, 2020

Manually tested in the dev VM as follows:

  1. IPv4 and IPv6:
# check that lo interface has both IPv4 and IPv6 address
$ ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
$ sudo journalctl -u cilium | grep "healthz status API server"
Sep 17 10:25:10 runtime1 cilium-agent[16392]: level=info msg="Started healthz status API server on address 127.0.0.1:9876" subsys=daemon
Sep 17 10:25:10 runtime1 cilium-agent[16392]: level=info msg="Started healthz status API server on address [::1]:9876" subsys=daemon
$ curl -I 127.0.0.1:9876/healthz
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 10:44:13 GMT
$ curl -I [::1]:9876/healthz
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 10:44:34 GMT
  2. IPv4 only:
# disable IPv6 address on lo interface
$ sudo ip addr del ::1/128 dev lo
$ ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
$ sudo systemctl restart cilium
$ sudo journalctl -u cilium | grep "healthz status API server"
Sep 17 10:49:09 runtime1 cilium-agent[24453]: level=info msg="Started healthz status API server on address 127.0.0.1:9876" subsys=daemon
Sep 17 10:49:09 runtime1 cilium-agent[24453]: level=info msg="healthz status API server not available on [::1]:9876" subsys=daemon
$ curl -I 127.0.0.1:9876/healthz
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 10:49:46 GMT
$ curl -I [::1]:9876/healthz
curl: (7) Couldn't connect to server
  3. IPv6 only:

Currently, Cilium fails to come up properly in the dev VM if the lo interface is IPv6 only.

@aanm (Member) left a comment

💯

@sayboras (Member) left a comment

We have a conformance test with an IPv6-only cluster; just curious why the underlying issue happened 🤔

@tklauser (Member, Author)

Thanks for your review @sayboras

> We have conformance test with ipv6 only cluster, just curious why underlying issue happened 🤔

Just a guess: maybe the helm chart always hard-coded 127.0.0.1, and the lo interface in the ipv6 cluster likely still has IPv4 configured? I see the PR is currently failing on that GH action, so I will try to replicate and investigate further in a local kind cluster.

@sayboras (Member)

I have checked the log of the smoketest ipv6 failure and noticed the log line below; it seems the newly built docker image was not used. Fixed in #13204

2020-09-17T11:40:22.747698085Z level=info msg="Started healthz status API server on address localhost:9876" subsys=daemon

@tklauser (Member, Author)

> I have checked the log in smoketest ipv6 failure, and notice below log, seems like newly built docker image was not used. Fixed in #13204
>
> 2020-09-17T11:40:22.747698085Z level=info msg="Started healthz status API server on address localhost:9876" subsys=daemon

Thanks, will rebase this PR once #13204 is approved and merged.

@sayboras (Member)

> Just a guess, maybe because the helm chart was always hard-coding 127.0.0.1 and likely the lo interface in the ipv6 cluster still has IPv4 configured?

You are right, thanks 💯

$ ksysex cilium-gsws6 -- /bin/bash      
root@kind-worker:/home/cilium# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever

@vadorovsky (Member) left a comment

#13204 is merged, you can rebase now :)

@sayboras (Member) left a comment

LGTM, few questions for my understanding only.

@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from 3ce4e11 to d42b055 Compare September 17, 2020 13:22
@tklauser (Member, Author)

test-me-please

@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from d42b055 to e18b709 Compare September 17, 2020 13:30
@tklauser (Member, Author)

test-me-please

@tklauser (Member, Author)

retest-net-next

@joestringer (Member) left a comment

Docs LGTM.

Please avoid format variants on logging functions in favour of structured logging.

If the agent liveness/readiness probe host is set to the IPv6 address
::1 instead of the default IPv4 127.0.0.1, Cilium never becomes ready in
an IPv6-only environment. This is because the daemon health endpoint
currently listens on localhost:9876 which will not listen on both IPv4
and IPv6.

To fix this, listen on both IPv4 and IPv6 explicitly (depending on the
daemon's enable-ipv{4,6} flags) and only fail with an error if both of
them fail or one was disabled and the other one fails.

Fixes #13165

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
…is disabled

Change the liveness and readiness probes to perform the requests on
127.0.0.1 or ::1 depending on the enable-ipv4 flag. If that flag is
false, the probes perform requests to ::1; otherwise they default to
127.0.0.1 (as it works in both IPv4-only and dual-stack environments).

Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
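The probe-host selection rule described in the commit message above is small enough to sketch directly. `probeHost` is a hypothetical helper for illustration, not Cilium's actual function:

```go
package main

import "fmt"

// probeHost picks the loopback address the liveness/readiness probes
// should target: default to 127.0.0.1 unless IPv4 is disabled, in
// which case fall back to the IPv6 loopback ::1.
func probeHost(enableIPv4 bool) string {
	if !enableIPv4 {
		return "::1"
	}
	return "127.0.0.1"
}

func main() {
	fmt.Println(probeHost(true))  // dual-stack or IPv4-only: 127.0.0.1
	fmt.Println(probeHost(false)) // IPv6-only: ::1
}
```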
@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from e18b709 to b8cea9e Compare September 17, 2020 20:50
@tklauser (Member, Author)

test-me-please

Development

Successfully merging this pull request may close these issues.

Health endpoint does not bind to tcp6 on a IPv6 only environment
7 participants