cilium-cli: Improve tcpdump termination timeout handling #36021

Merged

liyihuang
Contributor

I ran into a consistent failure with cilium connectivity test --test pod-to-pod-encryption, which reports the error below. When I test manually in the container, I don't see the issue at all.

[=] [cilium-test-1] Test [pod-to-pod-encryption] [55/105]                                                                                                                                                                                                                                                                                                                                     
  🐛 Running sniffer in background on cilium-test-1/host-netns-7srs2 (masters), mode=assert: tcpdump -i ens5 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.67 and host 10.200.52.68) or (host 11.0.1.214 and host 11.0.0.189 and tcp)                                                               
  🐛 Running /bin/sh -c ip -o route get 10.200.52.67 | grep -oE 'dev [^ ]*' | cut -d' ' -f2                                                                                                                                                                                                                                                                                                   
  🐛 Running sniffer in background on cilium-test-1/host-netns-t7lrw (workers), mode=assert: tcpdump -i ens4 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.68 and host 10.200.52.67) or (host 11.0.0.189 and host 11.0.1.214 and tcp)                                                               
  [.] Action [pod-to-pod-encryption/pod-to-pod-encryption/curl-ipv4: cilium-test-1/client3-6c47f6fc46-flbkl (11.0.1.214) -> cilium-test-1/echo-same-node-b5ddc4d8f-kck5g (11.0.0.189:8080)]                                                                                                                                                                                                   
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://11.0.0.189:8080]                                                                                                                                                                         
.  🟥 Failed to execute tcpdump on cilium-test-1/host-netns-7srs2 (masters): failed to wait for tcpdump to terminate                                                                                                                                                                                                                                                                          
  🐛 Finalizing Test pod-to-pod-encryption                                                                                                                                                                                                                                                                                                                                                    
[=] [cilium-test-1] Skipping test [echo-ingress-knp] [21/105] (skipped by user)                                                                                                                                                                                                                                                                                                               
[=] [cilium-test-1] Skipping test [echo-ingress-from-outside] [20/105] (skipped by condition)                                                                                                                                                                                                                                                                                                 
[=] [cilium-test-1] Skipping test [client-ingress-icmp] [22/105] (skipped by user)                                                                                                                                                                                                                                                                                                            
[=] [cilium-test-1] Test [pod-to-pod-with-l7-policy-encryption] [56/105]                                                                                                                                                                                                                                                                                                                      
  🐛 Pod kube-system/cilium-thv6b's current policy revision 5                                                                                                                                  
  🐛 Pod kube-system/cilium-58pk9's current policy revision 5                                                                                                                                                                                                                                                                                                                                 
  ℹ️  📜 Applying CiliumNetworkPolicy 'client-egress-l7-http-from-any' to namespace 'cilium-test-1' on cluster cluster.local..                                                                                                                                                                                                                                                                 
  ℹ️  📜 Applying CiliumNetworkPolicy 'echo-ingress-l7-http-from-anywhere' to namespace 'cilium-test-1' on cluster cluster.local..                                                                                                                                                                                                                                                             
  🐛 Policy difference detected, waiting for Cilium agents to increment policy revisions..                                                                                                                                                                                                                                                                                                    
  🐛 Pod cluster.local/kube-system/cilium-thv6b revision > 5                                                                                                                                                                                                                                                                                                                                  
  🐛 Pod cluster.local/kube-system/cilium-58pk9 revision > 5                                                                                                                                   
  🐛 📜 Successfully applied 2 additional resources                                                                                                                                                                                                                                                                                                                                           
  [-] Scenario [pod-to-pod-with-l7-policy-encryption/pod-to-pod-encryption]                                                                                                                    
  🐛 Encapsulation before WG encryption                                                                                                                                                        
  🐛 Running /bin/sh -c ip -o route get 10.200.52.67 | grep -oE 'dev [^ ]*' | cut -d' ' -f2                                                                                                    
  🐛 Running sniffer in background on cilium-test-1/host-netns-t7lrw (workers), mode=assert: tcpdump -i ens4 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.68 and host 10.200.52.67) or (host 11.0.0.98 and host 11.0.1.145 and tcp)
  🐛 Running /bin/sh -c ip -o route get 10.200.52.68 | grep -oE 'dev [^ ]*' | cut -d' ' -f2                                                                                                    
  🐛 Running sniffer in background on cilium-test-1/host-netns-7srs2 (masters), mode=assert: tcpdump -i ens5 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.67 and host 10.200.52.68) or (host 11.0.1.145 and host 11.0.0.98 and tcp)                                                                
  [.] Action [pod-to-pod-with-l7-policy-encryption/pod-to-pod-encryption/curl-ipv4: cilium-test-1/client-7774757849-snfgp (11.0.0.98) -> cilium-test-1/echo-other-node-78db95fd8f-c6745 (11.0.1.145:8080)]                                                                                                                                                                                    
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://11.0.1.145:8080]                                                                                                                                                                         
.  🟥 Failed to execute tcpdump on cilium-test-1/host-netns-t7lrw (workers): failed to wait for tcpdump to terminate                                                                                                                                                                                                                                                                          
  🐛 Finalizing Test pod-to-pod-with-l7-policy-encryption                                                                                                                                                                                                                                                                                                                                     
  🐛 Pod kube-system/cilium-58pk9's current policy revision: 7                                                                                                                                 
  🐛 Pod kube-system/cilium-thv6b's current policy revision: 7                                                                                                                                 
  ℹ️  📜 Deleting CiliumNetworkPolicy 'client-egress-l7-http-from-any' in namespace 'cilium-test-1' on cluster cluster.local..                                                                                                                                                                                                                                                                 
  ℹ️  📜 Deleting CiliumNetworkPolicy 'echo-ingress-l7-http-from-anywhere' in namespace 'cilium-test-1' on cluster cluster.local..                                                              
  🐛 Pod cluster.local/kube-system/cilium-58pk9 revision > 7                                                                                                                                                                                                                                                                                                                                  
  🐛 Pod cluster.local/kube-system/cilium-thv6b revision > 7                                                                                                                                   
  🐛 📜 Successfully deleted 2 resources

After increasing the tcpdump termination timeout from 1s to 5s in the code, the issue is gone.

[=] [cilium-test-1] Test [pod-to-pod-encryption] [55/105]
  🐛 Running sniffer in background on cilium-test-1/host-netns-t7lrw (workers), mode=assert: tcpdump -i ens4 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.68 and host 10.200.52.67) or (host 11.0.0.98 and host 11.0.1.145 and tcp)
  🐛 Running /bin/sh -c ip -o route get 10.200.52.68 | grep -oE 'dev [^ ]*' | cut -d' ' -f2
  🐛 Running sniffer in background on cilium-test-1/host-netns-7srs2 (masters), mode=assert: tcpdump -i ens5 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.67 and host 10.200.52.68) or (host 11.0.1.145 and host 11.0.0.98 and tcp)
  [.] Action [pod-to-pod-encryption/pod-to-pod-encryption/curl-ipv4: cilium-test-1/client-7774757849-snfgp (11.0.0.98) -> cilium-test-1/echo-other-node-78db95fd8f-c6745 (11.0.1.145:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://11.0.1.145:8080]
.  🐛 Finalizing Test pod-to-pod-encryption                      
[=] [cilium-test-1] Test [pod-to-pod-with-l7-policy-encryption] [56/105]
  🐛 Pod kube-system/cilium-thv6b's current policy revision 17    
  🐛 Pod kube-system/cilium-58pk9's current policy revision 17
  ℹ️  📜 Applying CiliumNetworkPolicy 'client-egress-l7-http-from-any' to namespace 'cilium-test-1' on cluster cluster.local..
  ℹ️  📜 Applying CiliumNetworkPolicy 'echo-ingress-l7-http-from-anywhere' to namespace 'cilium-test-1' on cluster cluster.local..
  🐛 Policy difference detected, waiting for Cilium agents to increment policy revisions..
  🐛 Pod cluster.local/kube-system/cilium-58pk9 revision > 17
  🐛 Pod cluster.local/kube-system/cilium-thv6b revision > 17
  🐛 📜 Successfully applied 2 additional resources
  [-] Scenario [pod-to-pod-with-l7-policy-encryption/pod-to-pod-encryption]
  🐛 Encapsulation before WG encryption           
  🐛 Running /bin/sh -c ip -o route get 10.200.52.67 | grep -oE 'dev [^ ]*' | cut -d' ' -f2
  🐛 Running sniffer in background on cilium-test-1/host-netns-t7lrw (workers), mode=assert: tcpdump -i ens4 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.68 and host 10.200.52.67) or (host 11.0.0.98 and host 11.0.1.145 and tcp)
  🐛 Running /bin/sh -c ip -o route get 10.200.52.68 | grep -oE 'dev [^ ]*' | cut -d' ' -f2
  🐛 Running sniffer in background on cilium-test-1/host-netns-7srs2 (masters), mode=assert: tcpdump -i ens5 --immediate-mode -w /tmp/pod-to-pod-encryption.pcap ((udp and (udp[8:2] = 0x0800 or dst port 8472 or dst port 6081)) and host 10.200.52.67 and host 10.200.52.68) or (host 11.0.1.145 and host 11.0.0.98 and tcp)
  [.] Action [pod-to-pod-with-l7-policy-encryption/pod-to-pod-encryption/curl-ipv4: cilium-test-1/client-7774757849-snfgp (11.0.0.98) -> cilium-test-1/echo-other-node-78db95fd8f-c6745 (11.0.1.145:8080)]
  🐛 Executing command [curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 http://11.0.1.145:8080]
.  🐛 Finalizing Test pod-to-pod-with-l7-policy-encryption
  🐛 Pod kube-system/cilium-58pk9's current policy revision: 19
  🐛 Pod kube-system/cilium-thv6b's current policy revision: 19
  ℹ️  📜 Deleting CiliumNetworkPolicy 'client-egress-l7-http-from-any' in namespace 'cilium-test-1' on cluster cluster.local..
  ℹ️  📜 Deleting CiliumNetworkPolicy 'echo-ingress-l7-http-from-anywhere' in namespace 'cilium-test-1' on cluster cluster.local..
  🐛 Pod cluster.local/kube-system/cilium-58pk9 revision > 19
  🐛 Pod cluster.local/kube-system/cilium-thv6b revision > 19                                 
  🐛 📜 Successfully deleted 2 resources   

- Increase timeout for tcpdump termination from 1s to 5s
- Add more descriptive timeout error message

Signed-off-by: Liyi Huang <liyi.huang@isovalent.com>
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 18, 2024
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary labels Nov 18, 2024
@liyihuang liyihuang marked this pull request as ready for review November 18, 2024 20:57
@liyihuang liyihuang requested a review from a team as a code owner November 18, 2024 20:57
@liyihuang liyihuang requested a review from brlbil November 18, 2024 20:57
@liyihuang liyihuang marked this pull request as draft November 18, 2024 21:01
@liyihuang liyihuang marked this pull request as ready for review November 18, 2024 21:04
Member

@giorio94 giorio94 left a comment

Thanks!

@tklauser tklauser added the release-note/ci This PR makes changes to the CI. label Nov 19, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 19, 2024
@tklauser tklauser enabled auto-merge November 19, 2024 10:04
@tklauser
Member

/test

@tklauser tklauser added this pull request to the merge queue Nov 19, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 19, 2024
Merged via the queue into cilium:main with commit 0c7b6d4 Nov 19, 2024
68 of 69 checks passed
@liyihuang liyihuang deleted the pr/liyihuang/increase_sniffer_time_out branch November 19, 2024 14:19
Labels
cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
4 participants