Skip to content

Conversation

asauber
Copy link
Member

@asauber asauber commented Apr 16, 2025

The upstream websocket executor implementation has a variety of issues due to hardcoded timeouts, and continues to cause flakes in our test suites which use a remote executor.

See kubernetes/kubernetes@1d4be75 (upstream timeout bumped from 2 seconds to 30 seconds)
See kubernetes/kubernetes@b04d117 (upstream timeout bumped from 30 seconds to 60 seconds)
See also https://github.com/kubernetes/kubernetes/commits/master/staging/src/k8s.io/client-go/tools/remotecommand

We continue to use the fallback mechanism here to allow Websocket connections in cases where SPDY is not available. SPDY is not available when there is an HTTP-only proxy in front of the Kubernetes API server.

To properly fix this issue, we could make upstream contributions to improve the websocket executor to rely less on hardcoded timeouts and also expose a mechanism for indicating to both sides of the connection when the io.Writers for fd 0, 1, and 2 are ready for writes.

Additional context

We introduced the use of the Websocket executor with #37538 in order to address the "Upgrade request required" error which users see when they use the cilium sysdump subcommand and their Kubernetes API Server is behind a proxy which does not support SPDY. As described in the PR description there, SPDY as a protocol has been deprecated since 2015, and Kubernetes has used websocket by default since v1.29.

In this implementation, we included a fallback mechanism to SPDY, in the case that Websocket is not available.

After introducing websockets, we began to see at least two related errors crop up in our CI:

#37784

websocket: close 1006 (abnormal closure): unexpected EOF
command terminated with exit code 14

A change which resolved most, but not all of these errors was #37984. See the PR description there for additional details. Unfortunately, this change did not resolve the flakes that we see in CI for the cases where tcpdump writes to the stdout io.Writer before the websocket connection is fully established.

This PR should eliminate those flakes, and also allows users to continue using cilium sysdump in cases where SPDY is not available.

Fixes: #38643

As of 2025-04-16 the upstream websocket executor implementation has a
variety of issues due to hardcoded timeouts.
See kubernetes/kubernetes@1d4be75
See https://github.com/kubernetes/kubernetes/commits/master/staging/src/k8s.io/client-go/tools/remotecommand

We continue to use the fallback mechanism here to allow Websocket
connections in cases where SPDY is not available. SPDY is not available
when there is an HTTP-only proxy in front of the Kubernetes API server.

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
@asauber asauber added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Apr 16, 2025
@asauber asauber requested a review from a team as a code owner April 16, 2025 21:37
@asauber asauber requested a review from nathanjsweet April 16, 2025 21:37
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary labels Apr 16, 2025
@asauber
Copy link
Member Author

asauber commented Apr 16, 2025

/test

@tommyp1ckles
Copy link
Contributor

@asauber Do we know how long we have on the SPDY implementation before it's removed ?

@tommyp1ckles
Copy link
Contributor

tommyp1ckles commented Apr 16, 2025

@asauber Been seeing this "failed to flush ct entries" flake, created an issue here: #38989

@asauber
Copy link
Member Author

asauber commented Apr 16, 2025

@asauber Do we know how long we have on the SPDY implementation before it's removed ?

The use of websockets by default for all cases was released with Kubernetes v1.31 (PortForward, "streaming", in addition to RemoteCommand). The implementation history can be found in the KEP.

So, this indicates to me that the Kubernetes API Server will continue to support SPDY for a while, but the exact timeline isn't clear. Seems like a good question for SIG Network.

@tommyp1ckles tommyp1ckles added this pull request to the merge queue Apr 16, 2025
Merged via the queue into main with commit 2c89d3e Apr 16, 2025
263 of 265 checks passed
@tommyp1ckles tommyp1ckles deleted the pr/asauber/default-to-spdy branch April 16, 2025 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Cilium E2E Upgrade (ci-e2e-upgrade) - node-to-node-encryption: "Failed to execute tcpdump [...] command terminated with exit code 14"
2 participants