
Conversation

tklauser (Member)

If the agent liveness/readiness probe host is set to the IPv6 address
::1 instead of the default IPv4 127.0.0.1, Cilium never becomes ready in
an IPv6-only environment. This is because the daemon health endpoint
currently listens on localhost:9876, which does not bind to both IPv4
and IPv6.

To fix this, listen on both IPv4 and IPv6 explicitly (depending on the
daemon's enable-ipv{4,6} flags) and only fail with an error if both of
them fail or one was disabled and the other one fails.
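As a rough sketch of that dual-listen strategy, the logic could look like the following Go snippet. Names such as `startHealthListeners`, and the use of port 0 loopback addresses, are illustrative assumptions, not Cilium's actual code:

```go
package main

import (
	"fmt"
	"net"
)

// startHealthListeners tries to listen on the loopback address of each
// enabled protocol family and only reports an error if no listener at
// all could be started. Port 0 lets the OS pick a free port for this demo.
func startHealthListeners(ipv4, ipv6 bool) ([]net.Listener, error) {
	var listeners []net.Listener
	var errs []error
	if ipv4 {
		if ln, err := net.Listen("tcp4", "127.0.0.1:0"); err != nil {
			errs = append(errs, err)
		} else {
			listeners = append(listeners, ln)
		}
	}
	if ipv6 {
		if ln, err := net.Listen("tcp6", "[::1]:0"); err != nil {
			errs = append(errs, err)
		} else {
			listeners = append(listeners, ln)
		}
	}
	if len(listeners) == 0 {
		return nil, fmt.Errorf("no health listener could be started: %v", errs)
	}
	return listeners, nil
}

func main() {
	lns, err := startHealthListeners(true, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ln := range lns {
		fmt.Println("listening on", ln.Addr())
		ln.Close()
	}
}
```

The key point is that a bind failure for one address family is only collected, and the daemon errors out only when neither family could be bound (or an explicitly enabled one fails).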

Also change liveness and readiness probes to perform the requests on 127.0.0.1 or ::1 depending on the enable-ipv4 flag as suggested in #13165 (comment).

Fixes #13165

Fix agent liveness/readiness probes for IPv6-only environment.

@tklauser tklauser added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Sep 17, 2020
@tklauser tklauser requested review from aanm and a team September 17, 2020 11:19
@tklauser tklauser requested review from a team as code owners September 17, 2020 11:19
@tklauser tklauser requested a review from a team September 17, 2020 11:19
@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from 6be911b to 3ce4e11 Compare September 17, 2020 11:22
@tklauser (Member, Author)

test-me-please

@tklauser (Member, Author) commented Sep 17, 2020

Manually tested in the dev VM as follows:

  1. IPv4 and IPv6:
# check that lo interface has both IPv4 and IPv6 address
$ ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
$ sudo journalctl -u cilium | grep "healthz status API server"
Sep 17 10:25:10 runtime1 cilium-agent[16392]: level=info msg="Started healthz status API server on address 127.0.0.1:9876" subsys=daemon
Sep 17 10:25:10 runtime1 cilium-agent[16392]: level=info msg="Started healthz status API server on address [::1]:9876" subsys=daemon
$ curl -I 127.0.0.1:9876/healthz
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 10:44:13 GMT
$ curl -I [::1]:9876/healthz
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 10:44:34 GMT
  2. IPv4 only:
# disable IPv6 address on lo interface
$ sudo ip addr del ::1/128 dev lo
$ ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
$ sudo systemctl restart cilium
$ sudo journalctl -u cilium | grep "healthz status API server"
Sep 17 10:49:09 runtime1 cilium-agent[24453]: level=info msg="Started healthz status API server on address 127.0.0.1:9876" subsys=daemon
Sep 17 10:49:09 runtime1 cilium-agent[24453]: level=info msg="healthz status API server not available on [::1]:9876" subsys=daemon
$ curl -I 127.0.0.1:9876/healthz
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 10:49:46 GMT
$ curl -I [::1]:9876/healthz
curl: (7) Couldn't connect to server
  3. IPv6 only:

Currently, Cilium fails to come up properly in the dev VM if the lo interface is IPv6 only.

@aanm (Member) left a comment

💯

@sayboras (Member) left a comment

We have a conformance test with an IPv6-only cluster; just curious why the underlying issue happened 🤔

@tklauser (Member, Author)

Thanks for your review @sayboras

> We have conformance test with ipv6 only cluster, just curious why underlying issue happened 🤔

Just a guess: maybe the helm chart always hard-coded 127.0.0.1, and the lo interface in the ipv6 cluster likely still has IPv4 configured? I see the PR is currently failing on that GH action, so I will try to replicate and investigate further in a local kind cluster.

@sayboras (Member)

I have checked the log of the smoketest ipv6 failure and noticed the log line below; it seems the newly built docker image was not used. Fixed in #13204

2020-09-17T11:40:22.747698085Z level=info msg="Started healthz status API server on address localhost:9876" subsys=daemon

@tklauser (Member, Author)

> I have checked the log in smoketest ipv6 failure, and notice below log, seems like newly built docker image was not used. Fixed in #13204
>
> 2020-09-17T11:40:22.747698085Z level=info msg="Started healthz status API server on address localhost:9876" subsys=daemon

Thanks, will rebase this PR once #13204 is approved and merged.

@sayboras (Member)

> Just a guess, maybe because the helm chart was always hard-coding 127.0.0.1 and likely the lo interface in the ipv6 cluster still has IPv4 configured?

You are right, thanks 💯

$ ksysex cilium-gsws6 -- /bin/bash      
root@kind-worker:/home/cilium# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever

@vadorovsky (Member) left a comment

#13204 is merged, you can rebase now :)

@sayboras (Member) left a comment

LGTM, few questions for my understanding only.

@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from 3ce4e11 to d42b055 Compare September 17, 2020 13:22
@tklauser (Member, Author)

test-me-please

@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from d42b055 to e18b709 Compare September 17, 2020 13:30
@tklauser (Member, Author)

test-me-please

@tklauser (Member, Author)

retest-net-next

@joestringer (Member) left a comment

Docs LGTM.

Please avoid format variants on logging functions in favour of structured logging.

If the agent liveness/readiness probe host is set to the IPv6 address
::1 instead of the default IPv4 127.0.0.1, Cilium never becomes ready in
an IPv6-only environment. This is because the daemon health endpoint
currently listens on localhost:9876 which will not listen on both IPv4
and IPv6.

To fix this, listen on both IPv4 and IPv6 explicitly (depending on the
daemon's enable-ipv{4,6} flags) and only fail with an error if both of
them fail or one was disabled and the other one fails.

Fixes #13165

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
…is disabled

Change the liveness and readiness probes to perform the requests on
127.0.0.1 or ::1 depending on the enable-ipv4 flag. If that flag is
false, the probes perform requests to ::1; otherwise they default to
127.0.0.1 (as it works in both IPv4-only and dual-stack environments).

Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
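The probe-host selection rule described in the commit message above is small enough to sketch directly. `probeHost` is a hypothetical helper for illustration, not Cilium's actual function:

```go
package main

import "fmt"

// probeHost picks the loopback address the liveness/readiness probes
// should target: default to 127.0.0.1 unless IPv4 is disabled, in
// which case fall back to the IPv6 loopback ::1.
func probeHost(enableIPv4 bool) string {
	if !enableIPv4 {
		return "::1"
	}
	return "127.0.0.1"
}

func main() {
	fmt.Println(probeHost(true))  // dual-stack or IPv4-only: 127.0.0.1
	fmt.Println(probeHost(false)) // IPv6-only: ::1
}
```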
@tklauser tklauser force-pushed the pr/tklauser/agent-health-ipv6 branch from e18b709 to b8cea9e Compare September 17, 2020 20:50
@tklauser (Member, Author)

test-me-please

Development

Successfully merging this pull request may close these issues.

Health endpoint does not bind to tcp6 on a IPv6 only environment
7 participants