@tklauser tklauser commented Nov 13, 2024

In some cases, connectivity tests fail due to transient errors on the underlying SPDY stream during command execution in pods. This is especially prevalent on AKS. The source of these errors is unknown, but according to #29845 (comment) and the discussion that follows, not much can be done to prevent them other than treating the error as transient and retrying the affected request. This PR does that for the known cases documented in #29845.

Also see commit messages for details.

Fixes: #29845

On AKS, we sometimes see EOF being returned for the request to retrieve
pod logs, e.g.

    [=] Test [all-ingress-deny] [7/58]

    ℹ️  📜 Applying CiliumNetworkPolicy 'all-ingress-deny' to namespace 'cilium-test'..
    🟥 Error reading Cilium logs: error getting cilium-agent logs for kube-system/cilium-8nxmm: Get "https://10.224.0.5:10250/containerLogs/kube-system/cilium-8nxmm/cilium-agent?sinceTime=2023-12-12T00%3A34%3A47Z&timestamps=true": EOF

Try to work around this transient (?) failure by retrying the request up
to 3 times.

Suggested-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Tobias Klauser <tobias@cilium.io>

On AKS, we sometimes see command execution in pods fail due to a
transient (?) error on the SPDY connection, e.g.

    ℹ️  unable to extract exit code from error: error with exec request (pod=cilium-test/client2-5bcdb85f5f-kqpbc, container=client2): error sending request: Post "https://cilium-cil-cilium-cilium-71-986ec5-08sang3r.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/cilium-test/pods/client2-5bcdb85f5f-kqpbc/exec?command=curl&command=-w&command=%25%7Blocal_ip%7D%3A%25%7Blocal_port%7D+-%3E+%25%7Bremote_ip%7D%3A%25%7Bremote_port%7D+%3D+%25%7Bresponse_code%7D&command=--silent&command=--fail&command=--show-error&command=--output&command=%2Fdev%2Fnull&command=--connect-timeout&command=2&command=--max-time&command=10&command=https%3A%2F%2Fbing.com%3A443&container=client2&stderr=true&stdout=true": write tcp 10.1.0.173:39130->51.124.76.20:443: write: connection reset by peer
    ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --output /dev/null --connect-timeout 2 --max-time 10 https://bing.com:443" failed with unexpected exit code: error with exec request (pod=cilium-test/client2-5bcdb85f5f-kqpbc, container=client2): error sending request: Post "https://cilium-cil-cilium-cilium-71-986ec5-08sang3r.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/cilium-test/pods/client2-5bcdb85f5f-kqpbc/exec?command=curl&command=-w&command=%25%7Blocal_ip%7D%3A%25%7Blocal_port%7D+-%3E+%25%7Bremote_ip%7D%3A%25%7Bremote_port%7D+%3D+%25%7Bresponse_code%7D&command=--silent&command=--fail&command=--show-error&command=--output&command=%2Fdev%2Fnull&command=--connect-timeout&command=2&command=--max-time&command=10&command=https%3A%2F%2Fbing.com%3A443&container=client2&stderr=true&stdout=true": write tcp 10.1.0.173:39130->51.124.76.20:443: write: connection reset by peer (expected 28, found -2)

Note that the exit code is -2, whereas curl only returns positive
error codes (see https://curl.se/docs/manpage.html#EXIT). Exit code -2
corresponds to ExitInvalidCode, which indicates that the command itself
did not fail; some other error caused the request to fail, so no
command exit code could be extracted.

Since we see these transient errors on AKS quite often, try to work
around them by retrying the request up to 3 times.

Suggested-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Tobias Klauser <tobias@cilium.io>
@tklauser tklauser added the following labels Nov 13, 2024: area/CI (Continuous Integration testing issue or flake), release-note/ci (This PR makes changes to the CI.), cilium-cli (This PR contains changes related with cilium-cli), cilium-cli-exclusive (This PR only impacts cilium-cli binary)
@tklauser tklauser requested review from a team as code owners November 13, 2024 13:48
@tklauser tklauser requested review from joamaki and brlbil November 13, 2024 13:48
@tklauser
Member Author

/ci-aks

@tklauser
Member Author

/test

@tklauser tklauser enabled auto-merge November 15, 2024 07:51
@tklauser tklauser added this pull request to the merge queue Nov 15, 2024
Merged via the queue into main with commit 96232f0 Nov 15, 2024
227 checks passed
@tklauser tklauser deleted the pr/tklauser/cilium-cli-aks-deflake branch November 15, 2024 17:12
Development

Successfully merging this pull request may close these issues.

CI: ConformanceAKS: write: connection reset by peer (expected 28, found -2)
4 participants