Skip to content

Conversation

pippolo84
Copy link
Member

The first step of the connectivity check is to test the DNS.
Since the DNS service pod may be located either on the same node of the client or on any other node, the test may introduce randomness, depending on how the connectivity is actually broken (e.g. same-node connectivity only is broken or across-nodes connectivity only is broken).

To guarantee a reliable check, the connectivity test has been extended to perform the following step for each client:

  1. query a DNS server on a pod located on the same node
  2. query a DNS server on a pod located on a different node
  3. query the DNS service

To perform the first two steps, the already existing echo-same-node and echo-other-node deployments have been modified to deploy a multi-container pod containing a test CoreDNS server.
The test CoreDNS server is configured with the local plugin, to be able to answer a mock DNS request for localhost. This should be enough to validate that DNS is working correctly.

Fixes #880

Signed-off-by: Fabio Falzoi fabio.falzoi@isovalent.com

@pippolo84 pippolo84 requested a review from a team as a code owner June 17, 2022 09:22
@pippolo84 pippolo84 requested a review from twpayne June 17, 2022 09:22
@pippolo84 pippolo84 temporarily deployed to ci June 17, 2022 09:22 Inactive
@pippolo84 pippolo84 force-pushed the pr/pippolo84/extend-dns-connectivity-test branch from ceb3dfa to 962a6cd Compare June 17, 2022 09:33
@pippolo84 pippolo84 temporarily deployed to ci June 17, 2022 09:33 Inactive
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work, thanks! A couple very minor nits

@pippolo84 pippolo84 force-pushed the pr/pippolo84/extend-dns-connectivity-test branch from 962a6cd to 0fcc26b Compare June 21, 2022 14:49
@pippolo84 pippolo84 temporarily deployed to ci June 21, 2022 14:49 Inactive
@pippolo84 pippolo84 force-pushed the pr/pippolo84/extend-dns-connectivity-test branch from 0fcc26b to 233dbec Compare June 22, 2022 08:56
@pippolo84 pippolo84 temporarily deployed to ci June 22, 2022 08:56 Inactive
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks Fabio!

@tklauser tklauser removed the request for review from twpayne June 22, 2022 12:01
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuffs 💯

@tklauser
Copy link
Member

Opened #940 for the multicluster workflow failure and re-triggered the job.

@tklauser
Copy link
Member

The 2nd and 3rd runs of the multicluster workflow both failed with an issue other than #940 (that one happened on the 1st run):

2022-06-22T13:44:54.041264946Z 🐛 Validating Deployments...
2022-06-22T13:44:54.041342251Z ⌛ [gke_***_us-west2-a_cilium-cilium-cli-2541083393-mesh-1] Waiting for deployments [client client2 echo-same-node] to become ready...
2022-06-22T13:45:10.187792646Z ⌛ [gke_***_us-west2-a_cilium-cilium-cli-2541083393-mesh-2] Waiting for deployments [echo-other-node] to become ready...
2022-06-22T13:50:10.188433896Z connectivity test failed: waiting for deployment echo-other-node to become ready has been interrupted: context deadline exceeded

https://github.com/cilium/cilium-cli/runs/7005064156?check_suite_focus=true

Looks like this could be related to this PR itself. @pippolo84 could you take a look, please?

@pippolo84
Copy link
Member Author

Looks like this could be related to this PR itself. @pippolo84 could you take a look, please?

Sure, I'll look into it.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/extend-dns-connectivity-test branch from 233dbec to 2d70e9f Compare June 23, 2022 14:27
@pippolo84 pippolo84 temporarily deployed to ci June 23, 2022 14:27 Inactive
@pippolo84
Copy link
Member Author

The job was failing due to a missing ConfigMap in the second cluster used in the multicluster setup.
Unfortunately, it is still failing for another reason. Looking into it.

Validate DNS in connectivity test against a server located in the same
node, then against a server located on another node and finally against
the DNS service.

This eliminates any kind of randomness in the test result, possibly due
to the location of the DNS service pod.

The DNS server used is CoreDNS, with the "local" plugin. To activate
that plugin a Corefile is needed, so a ConfigMap is created in each
cluster where the server will be deployed.

Fixes #880

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/extend-dns-connectivity-test branch from 2d70e9f to c0cf53f Compare June 23, 2022 15:13
@pippolo84 pippolo84 temporarily deployed to ci June 23, 2022 15:13 Inactive
@pippolo84
Copy link
Member Author

This time it was an issue related to an incorrect use of the source cluster client instead of the destination cluster client, where the echo-other-node deployment is located in the multicluster scenario. Now it's fixed.

@tklauser
Copy link
Member

Awesome work @pippolo84!

@tklauser tklauser merged commit 54e6e18 into master Jun 24, 2022
@tklauser tklauser deleted the pr/pippolo84/extend-dns-connectivity-test branch June 24, 2022 06:56
@jinsiang2008
Copy link

Looks like this test failed when I use the cilium-cli v0.11+ while all the tests in cilium-cli v0.9.1 passed

Waiting for pod cilium-test/client-6488dcf5d4-mmszr to reach DNS server on cilium-test/echo-same-node-745bd5c77-p4ngv pod...
🐛 Error looking up localhost from pod cilium-test/client-6488dcf5d4-mmszr to server on pod cilium-test/echo-same-node-745bd5c77-p4ngv: command terminated with exit code 1: ;; connection timed out; no servers could be reached



🐛 Error looking up localhost from pod cilium-test/client-6488dcf5d4-mmszr to server on pod cilium-test/echo-same-node-745bd5c77-p4ngv: command terminated with exit code 1: 
connectivity test failed: timeout reached waiting lookup for localhost from pod cilium-test/client-6488dcf5d4-mmszr to server on pod cilium-test/echo-same-node-745bd5c77-p4ngv to succeed

@gandro
Copy link
Member

gandro commented Aug 2, 2022

Looks like this test failed when I use the cilium-cli v0.11+ while all the tests in cilium-cli v0.9.1 passed

Waiting for pod cilium-test/client-6488dcf5d4-mmszr to reach DNS server on cilium-test/echo-same-node-745bd5c77-p4ngv pod...
🐛 Error looking up localhost from pod cilium-test/client-6488dcf5d4-mmszr to server on pod cilium-test/echo-same-node-745bd5c77-p4ngv: command terminated with exit code 1: ;; connection timed out; no servers could be reached



🐛 Error looking up localhost from pod cilium-test/client-6488dcf5d4-mmszr to server on pod cilium-test/echo-same-node-745bd5c77-p4ngv: command terminated with exit code 1: 
connectivity test failed: timeout reached waiting lookup for localhost from pod cilium-test/client-6488dcf5d4-mmszr to server on pod cilium-test/echo-same-node-745bd5c77-p4ngv to succeed

Thanks for the report. Could you provide more details (e.g. Hubble output of the failing test) and maybe even a sysdump of the system?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

connectivity: Check each DNS backend separately
5 participants