-
Notifications
You must be signed in to change notification settings - Fork 3.4k
connectivity tests: keep tcpdump alive by printing to stdout #37984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
05e8935
to
b54fda6
Compare
/ci-e2e-upgrade |
/ci-e2e-upgrade |
42589aa
to
088a0a5
Compare
/ci-e2e-upgrade |
The flaky workflow ci-e2e-upgrade, has passed 5 times in a row with these changes https://github.com/cilium/cilium/actions/runs/13687189843/attempts/5 Those runs were based on this commit which has since been rebased |
088a0a5
to
5236387
Compare
The observed error was Failed to execute tcpdump on cilium-test-4/host-netns-8bpsl (kind-worker): command failed (pod=cilium-test-4/host-netns-8bpsl, container=): error reading from error stream: next reader: websocket: close 1005 (abnormal closure): unexpected EOF The context here is that we are running `tcpdump` using a `cilium-cli` initialized websocket executor, as provided by apimachinery. Websocket for exec was made the default in Kubernetes 1.31, as SPDY is a protocol which has been deprecated since 2015. We made it the default in Cilium as users began to report seeing the following error when running Cilium sysdump. error with exec request ... : Upgrade request required: "" After making this change, we began to see CI flakes. These are most likely due to a hardcoded 60 second timeout in the client-go websocket exec implementation. Working on this assumption, we can keep the websocket connection open by doing the following: 1. Having `tcpdump` print to _both_ stdout, and the output file which we have been using for the sake of concurrency. 2. Waiting a short period of time before invoking `tcpdump`, as using the new invocation immediately upon opening the websocket connection resulted in the following error message on occasion during testing. command terminated with exit code 14 `tcpdump` exits with status 14 when it cannot write to one of its output files. Since we have a relatively uncommon use case for the websocket connection (the most common use of this websocket connection is the `kubectl exec` command, which has used websocket by default since Kubernetes 1.29), where we override all input and output files, we are likely hitting a race with the websocket dialer for which we need to wait for all of the `io.Writers` to be satisfied. This commit makes both of the above changes. fixes #37784 Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
5236387
to
5ca6354
Compare
/test |
christarazi
approved these changes
Mar 6, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
cilium-cli
This PR contains changes related with cilium-cli
cilium-cli-exclusive
This PR only impacts cilium-cli binary
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/ci
This PR makes changes to the CI.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a fix for #37784
The observed error was
The context here is that we are running
tcpdump
using acilium-cli
initialized websocket executor, as provided by apimachinery. As explained in these commit messages, websocket for exec was made the default in Kubernetes 1.31, as SPDY is a protocol which has been deprecated since 2015. We made it the default in Cilium as users began to report seeing the following error when running Cilium sysdump.After making this change, we began to see the above CI flakes. These are most likely due to a hardcoded 60 second timeout in the client-go websocket exec implementation.
Working on this assumption, we can keep the websocket connection open by doing the following:
tcpdump
print to both stdout, and the output file which we have been using for the sake of concurrency.tcpdump
, as using the new invocation immediately upon opening the websocket connection resulted in the following error message on occasion during testing.tcpdump
exits with status 14 when it cannot write to one of its output files. Since we have a relatively uncommon use case for the websocket connection (the most common use of this websocket connection is thekubectl exec
command, which has used websocket by default since Kubernetes 1.29), where we override all input and output files, we are likely hitting a race with the websocket dialer for which we need to wait for all of theio.Writers
to be satisfied.This PR makes both of the above changes.
The
ci-e2e-upgrade
workflow has been hitting this failure for at least one of the related jobs more than 50% of the time, for those initial runs of the job which run all 31 configurations of the test. With these changes, there were zero failures among all 31 related jobs after 5 complete runs. Thus, the odds of this PR being a stable fix for this issue are about 97%.fixes #37784