Skip to content

connectivity tests: keep tcpdump alive by printing to stdout #37984

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 6, 2025

Conversation

asauber
Copy link
Member

@asauber asauber commented Mar 4, 2025

This is a fix for #37784

The observed error was

Failed to execute tcpdump on cilium-test-4/host-netns-8bpsl (kind-worker): command failed (pod=cilium-test-4/host-netns-8bpsl, container=): error reading from error stream: next reader: websocket: close 1006 (abnormal closure): unexpected EOF

The context here is that we are running tcpdump using a cilium-cli initialized websocket executor, as provided by apimachinery. As explained in these commit messages, websocket for exec was made the default in Kubernetes 1.31, as SPDY is a protocol which has been deprecated since 2015. We made it the default in Cilium as users began to report seeing the following error when running Cilium sysdump.

error with exec request ... : Upgrade request required: ""

After making this change, we began to see the above CI flakes. These are most likely due to a hardcoded 60 second timeout in the client-go websocket exec implementation.

Working on this assumption, we can keep the websocket connection open by doing the following:

  1. Having tcpdump print to both stdout, and the output file which we have been using for the sake of concurrency.
  2. Waiting a short period of time before invoking tcpdump, as using the new invocation immediately upon opening the websocket connection resulted in the following error message on occasion during testing.
Failed to execute tcpdump on cilium-test-4/host-netns-qdxrm (kind-worker2): command failed (pod=cilium-test-4/host-netns-qdxrm, container=): command terminated with exit code 14

tcpdump exits with status 14 when it cannot write to one of its output files. Since we have a relatively uncommon use case for the websocket connection (the most common use of this websocket connection is the kubectl exec command, which has used websocket by default since Kubernetes 1.29), where we override all input and output files, we are likely hitting a race with the websocket dialer for which we need to wait for all of the io.Writers to be satisfied.

This PR makes both of the above changes.

The ci-e2e-upgrade workflow has been hitting this failure for at least one of the related jobs more than 50% of the time, for those initial runs of the job which run all 31 configurations of the test. With these changes, there were zero failures among all 31 related jobs after 5 complete runs. Thus, the odds of this PR being a stable fix for this issue are about 97%.

fixes #37784

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 4, 2025
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary labels Mar 4, 2025
@asauber asauber force-pushed the pr/asauber/tcpdump-timeout-fix branch from 05e8935 to b54fda6 Compare March 4, 2025 11:11
@asauber
Copy link
Member Author

asauber commented Mar 4, 2025

/ci-e2e-upgrade

@asauber
Copy link
Member Author

asauber commented Mar 5, 2025

/ci-e2e-upgrade

@asauber asauber force-pushed the pr/asauber/tcpdump-timeout-fix branch from 42589aa to 088a0a5 Compare March 5, 2025 22:39
@asauber
Copy link
Member Author

asauber commented Mar 5, 2025

/ci-e2e-upgrade

@asauber asauber added the release-note/ci This PR makes changes to the CI. label Mar 6, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 6, 2025
@asauber
Copy link
Member Author

asauber commented Mar 6, 2025

The flaky workflow ci-e2e-upgrade, has passed 5 times in a row with these changes

https://github.com/cilium/cilium/actions/runs/13687189843/attempts/5

Those runs were based on this commit which has since been rebased

088a0a5

@asauber asauber force-pushed the pr/asauber/tcpdump-timeout-fix branch from 088a0a5 to 5236387 Compare March 6, 2025 19:20
The observed error was

    Failed to execute tcpdump on cilium-test-4/host-netns-8bpsl
    (kind-worker): command failed (pod=cilium-test-4/host-netns-8bpsl,
    container=): error reading from error stream: next reader: websocket:
    close 1005 (abnormal closure): unexpected EOF

The context here is that we are running `tcpdump` using a `cilium-cli`
initialized websocket executor, as provided by apimachinery. Websocket
for exec was made the default in Kubernetes 1.31, as SPDY is a protocol
which has been deprecated since 2015. We made it the default in Cilium
as users began to report seeing the following error when running Cilium
sysdump.

    error with exec request ... : Upgrade request required: ""

After making this change, we began to see CI flakes. These are most
likely due to a hardcoded 60 second timeout in the client-go websocket
exec implementation.

Working on this assumption, we can keep the websocket connection open by
doing the following:

1. Having `tcpdump` print to _both_ stdout, and the output file which we
   have been using for the sake of concurrency.
2. Waiting a short period of time before invoking `tcpdump`, as using
   the new invocation immediately upon opening the websocket connection
   resulted in the following error message on occasion during testing.

    command terminated with exit code 14

`tcpdump` exits with status 14 when it cannot write to one of its output
files. Since we have a relatively uncommon use case for the websocket
connection (the most common use of this websocket connection is the
`kubectl exec` command, which has used websocket by default since
Kubernetes 1.29), where we override all input and output files, we are
likely hitting a race with the websocket dialer for which we need to
wait for all of the `io.Writers` to be satisfied.

This commit makes both of the above changes.

fixes #37784

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
@asauber asauber force-pushed the pr/asauber/tcpdump-timeout-fix branch from 5236387 to 5ca6354 Compare March 6, 2025 19:25
@asauber
Copy link
Member Author

asauber commented Mar 6, 2025

/test

@asauber
Copy link
Member Author

asauber commented Mar 6, 2025

@asauber
Copy link
Member Author

asauber commented Mar 6, 2025

@asauber
Copy link
Member Author

asauber commented Mar 6, 2025

@joestringer joestringer enabled auto-merge March 6, 2025 21:43
@joestringer joestringer added this pull request to the merge queue Mar 6, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 6, 2025
Merged via the queue into main with commit 9e26c9f Mar 6, 2025
220 checks passed
@joestringer joestringer deleted the pr/asauber/tcpdump-timeout-fix branch March 6, 2025 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
3 participants