cilium-cli: retry exec-in-pod requests in case of transient errors #35961
Merged
On AKS, we sometimes see EOF being returned for the request to retrieve pod logs, e.g.

    [=] Test [all-ingress-deny] [7/58]
    ℹ️ 📜 Applying CiliumNetworkPolicy 'all-ingress-deny' to namespace 'cilium-test'..
    🟥 Error reading Cilium logs: error getting cilium-agent logs for kube-system/cilium-8nxmm: Get "https://10.224.0.5:10250/containerLogs/kube-system/cilium-8nxmm/cilium-agent?sinceTime=2023-12-12T00%3A34%3A47Z&timestamps=true": EOF

Try to work around this transient (?) failure by retrying the request up to 3 times.

Suggested-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
On AKS we sometimes see command execution in pods failing due to a transient (?) error with the SPDY connection, e.g.

    ℹ️ unable to extract exit code from error: error with exec request (pod=cilium-test/client2-5bcdb85f5f-kqpbc, container=client2): error sending request: Post "https://cilium-cil-cilium-cilium-71-986ec5-08sang3r.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/cilium-test/pods/client2-5bcdb85f5f-kqpbc/exec?command=curl&command=-w&command=%25%7Blocal_ip%7D%3A%25%7Blocal_port%7D+-%3E+%25%7Bremote_ip%7D%3A%25%7Bremote_port%7D+%3D+%25%7Bresponse_code%7D&command=--silent&command=--fail&command=--show-error&command=--output&command=%2Fdev%2Fnull&command=--connect-timeout&command=2&command=--max-time&command=10&command=https%3A%2F%2Fbing.com%3A443&container=client2&stderr=true&stdout=true": write tcp 10.1.0.173:39130->51.124.76.20:443: write: connection reset by peer
    ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 https://bing.com:443" failed with unexpected exit code: error with exec request (pod=cilium-test/client2-5bcdb85f5f-kqpbc, container=client2): error sending request: Post "https://cilium-cil-cilium-cilium-71-986ec5-08sang3r.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/cilium-test/pods/client2-5bcdb85f5f-kqpbc/exec?command=curl&command=-w&command=%25%7Blocal_ip%7D%3A%25%7Blocal_port%7D+-%3E+%25%7Bremote_ip%7D%3A%25%7Bremote_port%7D+%3D+%25%7Bresponse_code%7D&command=--silent&command=--fail&command=--show-error&command=--output&command=%2Fdev%2Fnull&command=--connect-timeout&command=2&command=--max-time&command=10&command=https%3A%2F%2Fbing.com%3A443&container=client2&stderr=true&stdout=true": write tcp 10.1.0.173:39130->51.124.76.20:443: write: connection reset by peer (expected 28, found -2)

Note that the exit code is -2, while curl only returns positive error codes (see https://curl.se/docs/manpage.html#EXIT). Exit code -2 corresponds to ExitInvalidCode, which indicates that the command itself did not fail: some other error caused the request to fail, so no command exit code could be extracted.

Since we see these transient errors on AKS quite often, try to work around them by retrying the request up to 3 times.

Suggested-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
/ci-aks
/test
tommyp1ckles
approved these changes
Nov 15, 2024
joamaki
approved these changes
Nov 15, 2024
christarazi
approved these changes
Nov 15, 2024
Labels
area/CI
Continuous Integration testing issue or flake
cilium-cli
This PR contains changes related with cilium-cli
cilium-cli-exclusive
This PR only impacts cilium-cli binary
release-note/ci
This PR makes changes to the CI.
In some cases, connectivity tests fail due to transient errors with the underlying SPDY stream during command execution in pods. This is especially prevalent on AKS. The source of these errors is unknown but according to #29845 (comment) (and following discussion) there is not much that can be done to prevent them, except for treating the error as transient and retrying the respective request. Do that for the known cases documented in #29845.
Also see commit messages for details.
Fixes: #29845