Skip to content

Conversation

smagnani96
Copy link
Contributor

@smagnani96 smagnani96 commented Jul 11, 2025

Please do refer to commit messages for a detailed explanation.
I'd love as many feedback as possible, given this ended up being a not so easy change 😅

1st and 2nd commit are the major updates. Sometimes while re-reading them they do feel quite hacky, but haven't found a better way to implement that.
3rd commit connectivity: limit tcpdump capture also for Assert mode is a Why not?, definitively can go in a different PR (or even cancel it if I missed the reason why we shouldn't do that).

Fixes: #38643

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 11, 2025
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary labels Jul 11, 2025
@smagnani96 smagnani96 force-pushed the pr/smagnani96/fix-tcpdump-error-14-v2 branch from 8fd6428 to 5e83713 Compare July 18, 2025 17:58
@joestringer joestringer added the release-note/ci This PR makes changes to the CI. label Jul 18, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 18, 2025
@smagnani96 smagnani96 force-pushed the pr/smagnani96/fix-tcpdump-error-14-v2 branch from 5e83713 to 61df29e Compare July 25, 2025 00:03
@smagnani96
Copy link
Contributor Author

smagnani96 commented Jul 25, 2025

/ci-e2e-upgrade

@smagnani96 smagnani96 self-assigned this Jul 25, 2025
@smagnani96 smagnani96 changed the title connectivity: remote tcpdump commands dump stream only to local file connectivity: rework sniffer to execute tcpdump in background Jul 25, 2025
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! A few drive-by comments.

@gentoo-root
Copy link
Contributor

ci-e2e-upgrade didn't fail with the tcpdump error in 22 attempts. Some tests are unsuccessful because of other flakes and quay outage.

I'd say it's stable: the debug pull request #40692 can reproduce it after 3–4 retries, while this one survived 22 in a row.

@gentoo-root
Copy link
Contributor

/test

Copy link
Contributor

@gentoo-root gentoo-root left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simone is on vacation, I'm taking it over for the while.

@gentoo-root gentoo-root force-pushed the pr/smagnani96/fix-tcpdump-error-14-v2 branch from 61df29e to 2e3e11a Compare July 28, 2025 12:34
@gentoo-root
Copy link
Contributor

/test

@gentoo-root gentoo-root force-pushed the pr/smagnani96/fix-tcpdump-error-14-v2 branch 2 times, most recently from ae8206e to 041bfb8 Compare July 28, 2025 14:27
Prior to this commit, we were running remote sniffers via ExecInPodWithWriters
and streaming stdout/stderr through the connection. We observed race conditions
with underlying libraries (both WebSocket and SPDY) in case one of the
stream is not available yet/anymore. In that case, the remote tcpdump command
that is printing packet/logs in the streams would fail with error code 14
(broken pipe). Such flake was reproducible only in CI. The test consisted of
the following instructions:

1. running an ExecInPodWithWriters connection with stdout/stderr stream for tcpdump
2. running an ExecInPod at the end of the test to retrieve packets count
3. running an ExecInPod at the end of the test to retrieve and print packets

With this commit, we run the remote sniffer in background, without keeping
open connections for streaming data. Tcpdump now logs both stdout/stderr
and potential captured packets to local files in the pod, which are then
retrieved/read by us at the end of each test. The difference is in the
way we start/stop such command: given we do not have anymore an open connection
to close, we need to carefully make sure that we call stop in all the possible
circumstances (e.g., context canceled, timeout hit, errors in between two
`Sniff()`, etc.). To do that, we introduce a close/finalizer function that
is being returned in the costructor, so that the caller can defer its execution.
The `Validate()` is also able to stop the sniffer, as well as the constructor
`Sniff()` itself in case of errors (e.g., context canceled or timeout hit).
We do now perform the following operations:

i.   run an ExecInPod to create the remote tcpdump process in background
ii.  run an ExecInPod to stop the remote process at the end of the test
iii. running an ExecInPod at the end of the test to retrieve packets count
iv.  running an ExecInPod at the end of the test to retrieve and print packets

While it is true that it might overload more the API server given we do need
to ExecInPod once more, there's no relevant stdout/stderr in (i) and (ii)
anymore, hopefully mitigating the tcpdump issue we were facing in CI flakes.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
Co-developed-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Prior to this patch series of 3 commits, we were:

1. running an ExecInPodWithWriters connection with stdout/stderr stream for tcpdump
2. running an ExecInPod at the end of the test to retrieve packets count
3. running an ExecInPod at the end of the test to retrieve and print packets

In previous commits, we reworked the way we create and stop such sniffers:
the tcpdump command is now left in background in the pod, and no active
stream connections are present during the related tests. The total amount
of remote commands executions were as follows:

i.   run an ExecInPod to create the remote tcpdump process in background
ii.  run an ExecInPod to stop the remote process at the end of the test
iii. running an ExecInPod at the end of the test to retrieve packets count
iv.  running an ExecInPod at the end of the test to retrieve and print packets

This is clearly inefficient, given we moved to background tcpdump in pods
to avoid error 14 (broken pipe) due to race conditions while setting up
the stream (the flake happened with both WebSockets and SPDY connections),
observed especially under heavy load of the API server.

This commit optimizes such behavior, unifying (ii) - (iii) - (iv).
Upon successfully stopping the remote sniffer, the stop command executed
in (ii) will additionally print to stdout (iii) and (iv) without having
to re-execute ExecInPod. If the sniffer has been successfully stopped,
this command will not fail, even if facing errors while printing out
the packet count (iii) or the packet themselves (iv). This is aligned to
what we're already doing right now, but with less connections.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
When running in Sanity mode, we do limit the number of captured packets
to one, just to make sure that we see traffic in the interface where the
sniffer is listening to. In Assert mode, we capture all the packets matching
the filter, and we signal the leak at the end of the test by printing all
of them. I'd suggest we limit also in Assert mode the amount of packets
captured. Historically, I've seen captured packets to always repeat
themselves, and I'm not sure it could be very helpful to have, for instance,
more than 1000 leaked packets being logged. In addition, limiting the
amount of captured packets would also help with keeping low memory
footprint, given we do not remove the dump files yet from the pods.

Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
@gentoo-root gentoo-root force-pushed the pr/smagnani96/fix-tcpdump-error-14-v2 branch from 041bfb8 to a47aa75 Compare July 28, 2025 14:32
Avoid reverse DNS resolution of IP addresses.

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
@gentoo-root gentoo-root force-pushed the pr/smagnani96/fix-tcpdump-error-14-v2 branch from a47aa75 to d5d7c91 Compare July 28, 2025 14:35
@gentoo-root
Copy link
Contributor

/test

@gentoo-root gentoo-root marked this pull request as ready for review July 28, 2025 15:50
@gentoo-root gentoo-root requested review from a team as code owners July 28, 2025 15:50
@gentoo-root gentoo-root requested review from rgo3 and christarazi July 28, 2025 15:50
@pchaigno pchaigno requested review from pchaigno and removed request for christarazi and rgo3 July 28, 2025 15:53
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice commit history! 🙏

// (tcpdump: Unable to write output: Broken pipe)
//
// Runs silently: succeeds if tcpdump is active ("listening on <iface>"),
// else fails with its error code. Includes `timeout 60` to prevent orphaned
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit in case of repush:

Suggested change
// else fails with its error code. Includes `timeout 60` to prevent orphaned
// else fails with its error code. Includes `timeout` to prevent orphaned

given the timeout is not hardcoded here.

case ModeSanity:
count = 1
case ModeAssert:
count = 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, even 100 would be largely enough. I wouldn't delay the PR on this though.

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jul 28, 2025
@pchaigno pchaigno enabled auto-merge July 28, 2025 16:39
@pchaigno pchaigno added this pull request to the merge queue Jul 28, 2025
Merged via the queue into main with commit 1c2807c Jul 28, 2025
238 checks passed
@pchaigno pchaigno deleted the pr/smagnani96/fix-tcpdump-error-14-v2 branch July 28, 2025 16:50
# Return an error if tcpdump timed out before capturing any packets.
if kill -0 $pid 2>/dev/null; then
kill -2 $pid 2>/dev/null
elif [ "$(tcpdump -r {{ .DumpPath }} --count 2>/dev/null)" = "0 packets" ]; then
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gentoo-root I believe this is incorrect for the ModeAssert where having 0 packets is good.
Isn't this checked later in Go anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is to error out if tcpdump exited prematurely. If it didn't collect any packets and exited, it means it may have missed some packets, so we better fail the test.

In ModeAssert, however, we expect that tcpdump survives until it's killed on line 99. We don't hit the check on line 100 if everything works properly. Only if tcpdump was running during the entire test, we can be sure that 0 packets is indeed 0 packets, and we return 0 on line 111 in this case. So it's all good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, given we now have also -c 1000 for ModeAssert.
Tcpdump might have already recorded 1000 packets and exited before our check here.

zocimek added a commit to zocimek/home-ops that referenced this pull request Aug 25, 2025
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
|
[aqua:cilium/cilium-cli](https://redirect.github.com/cilium/cilium-cli)
| patch | `0.18.5` -> `0.18.6` |

---

### Release Notes

<details>
<summary>cilium/cilium-cli (aqua:cilium/cilium-cli)</summary>

###
[`v0.18.6`](https://redirect.github.com/cilium/cilium-cli/releases/tag/v0.18.6)

[Compare
Source](https://redirect.github.com/cilium/cilium-cli/compare/v0.18.5...v0.18.6)

#### What's Changed

**Minor Changes:**

- Cilium uninstall now removes annotations from Kubernetes nodes when
clean-cilium-state: true
([cilium/cilium#39931](https://redirect.github.com/cilium/cilium/issues/39931),
[@&#8203;AritraDey-Dev](https://redirect.github.com/AritraDey-Dev))
- Deprecate `v2alpha1` version of `CiliumLoadBalancerIPPool` CRD in
favor of the `v2` version
([cilium/cilium#39134](https://redirect.github.com/cilium/cilium/issues/39134),
[@&#8203;pippolo84](https://redirect.github.com/pippolo84))

**Bugfixes:**

- Fix bug where we would display the Max Seq. Number for IPsec on
32bits.
([cilium/cilium#40622](https://redirect.github.com/cilium/cilium/issues/40622),
[@&#8203;pchaigno](https://redirect.github.com/pchaigno))

**CI Changes:**

- Add l7 proxy check for `to-fqdns` connectivity test
([cilium/cilium#40549](https://redirect.github.com/cilium/cilium/issues/40549),
[@&#8203;vipul-21](https://redirect.github.com/vipul-21))
- cli: switch coredns image to registry.k8s.io, and fix renovate
([cilium/cilium#40706](https://redirect.github.com/cilium/cilium/issues/40706),
[@&#8203;giorio94](https://redirect.github.com/giorio94))
- connectivity: Allow customization of tcpdump kill timeout
([cilium/cilium#40774](https://redirect.github.com/cilium/cilium/issues/40774),
[@&#8203;gentoo-root](https://redirect.github.com/gentoo-root))
- connectivity: rework sniffer to execute tcpdump in background
([cilium/cilium#40487](https://redirect.github.com/cilium/cilium/issues/40487),
[@&#8203;smagnani96](https://redirect.github.com/smagnani96))

**Misc Changes:**

- chore(deps): update docker.io/library/golang:1.24.4 docker digest to
[`20a022e`](https://redirect.github.com/cilium/cilium-cli/commit/20a022e)
(main)
([cilium/cilium#40379](https://redirect.github.com/cilium/cilium/issues/40379),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update docker.io/library/golang:1.24.5 docker digest to
[`ef5b4be`](https://redirect.github.com/cilium/cilium-cli/commit/ef5b4be)
(main)
([cilium/cilium#40738](https://redirect.github.com/cilium/cilium/issues/40738),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update go to v1.24.5 (main)
([cilium/cilium#40496](https://redirect.github.com/cilium/cilium/issues/40496),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- cilium-cli: Print partial output upon `bgp peers` errors
([cilium/cilium#40278](https://redirect.github.com/cilium/cilium/issues/40278),
[@&#8203;rastislavs](https://redirect.github.com/rastislavs))
- cilium-cli: Update default network-perf image
([cilium/cilium#40376](https://redirect.github.com/cilium/cilium/issues/40376),
[@&#8203;HadrienPatte](https://redirect.github.com/HadrienPatte))
- cilium-cli: Use slim k8s packages for connectivity tests
([cilium/cilium#40708](https://redirect.github.com/cilium/cilium/issues/40708),
[@&#8203;HadrienPatte](https://redirect.github.com/HadrienPatte))
- Fix misc typos
([cilium/cilium#40769](https://redirect.github.com/cilium/cilium/issues/40769),
[@&#8203;HadrienPatte](https://redirect.github.com/HadrienPatte))
- go.mod, vendor: pull in charts for Cilium 1.18.0 and Tetragon 1.5.0
([cilium/cilium#40823](https://redirect.github.com/cilium/cilium/issues/40823),
[@&#8203;tklauser](https://redirect.github.com/tklauser))
- Miscellaneous improvements to option.NewNamedMapOptions
([cilium/cilium#40529](https://redirect.github.com/cilium/cilium/issues/40529),
[@&#8203;giorio94](https://redirect.github.com/giorio94))
- The unableTranslateCIDRgroups variable is removed as it is not used
since the v1.17 release
([cilium/cilium#40267](https://redirect.github.com/cilium/cilium/issues/40267),
[@&#8203;Surya-7890](https://redirect.github.com/Surya-7890))
- vendor: Update github.com/google/go-github to v73
([cilium/cilium#40326](https://redirect.github.com/cilium/cilium/issues/40326),
[@&#8203;HadrienPatte](https://redirect.github.com/HadrienPatte))
- Update stable release to v0.18.5 by
[@&#8203;tklauser](https://redirect.github.com/tklauser) in
[https://github.com/cilium/cilium-cli/pull/3060](https://redirect.github.com/cilium/cilium-cli/pull/3060)
- chore(deps): update docker.io/library/golang:1.24.4 docker digest to
[`20a022e`](https://redirect.github.com/cilium/cilium-cli/commit/20a022e)
by [@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3061](https://redirect.github.com/cilium/cilium-cli/pull/3061)1
- Update RELEASE.md by
[@&#8203;michi-covalent](https://redirect.github.com/michi-covalent) in
[https://github.com/cilium/cilium-cli/pull/3062](https://redirect.github.com/cilium/cilium-cli/pull/3062)
- chore(deps): update golang docker tag to v1.24.5 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3063](https://redirect.github.com/cilium/cilium-cli/pull/3063)3
- chore(deps): update go to v1.24.5 (patch) by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3065](https://redirect.github.com/cilium/cilium-cli/pull/3065)5
- chore(deps): update golangci/golangci-lint docker tag to v2.2.2 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3066](https://redirect.github.com/cilium/cilium-cli/pull/3066)6
- chore(deps): update dependency cilium/cilium to v1.17.6 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3068](https://redirect.github.com/cilium/cilium-cli/pull/3068)8
- chore(deps): update golang:1.24.5-alpine3.21 docker digest to
[`3ebc008`](https://redirect.github.com/cilium/cilium-cli/commit/3ebc008)
by [@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3067](https://redirect.github.com/cilium/cilium-cli/pull/3067)7
- chore(deps): update golang:1.24.5-alpine3.21 docker digest to
[`72ff633`](https://redirect.github.com/cilium/cilium-cli/commit/72ff633)
by [@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3069](https://redirect.github.com/cilium/cilium-cli/pull/3069)9
- chore(deps): update golang:1.24.5-alpine3.21 docker digest to
[`6edc205`](https://redirect.github.com/cilium/cilium-cli/commit/6edc205)
by [@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3070](https://redirect.github.com/cilium/cilium-cli/pull/3070)0
- chore(deps): update golangci/golangci-lint docker tag to v2.3.0 -
autoclosed by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3071](https://redirect.github.com/cilium/cilium-cli/pull/3071)1
- chore(deps): update dependency cilium/cilium to v1.18.0 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3073](https://redirect.github.com/cilium/cilium-cli/pull/3073)3
- chore(deps): update docker.io/library/golang:1.24.5 docker digest to
[`ef5b4be`](https://redirect.github.com/cilium/cilium-cli/commit/ef5b4be)
by [@&#8203;renovate](https://redirect.github.com/renovate)\[bot]
in[https://github.com/cilium/cilium-cli/pull/3072](https://redirect.github.com/cilium/cilium-cli/pull/3072)2
- Prepare for v0.18.6 release by
[@&#8203;tklauser](https://redirect.github.com/tklauser) in
[https://github.com/cilium/cilium-cli/pull/3074](https://redirect.github.com/cilium/cilium-cli/pull/3074)

**Full Changelog**:
cilium/cilium-cli@v0.18.5...v0.18.6

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/).
View the [repository job
log](https://developer.mend.io/github/zocimek/home-ops).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0MS40My41IiwidXBkYXRlZEluVmVyIjoiNDEuNDYuMyIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsidHlwZS9wYXRjaCJdfQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Łukasz Pospiech <zocimek@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Cilium E2E Upgrade (ci-e2e-upgrade) - node-to-node-encryption: "Failed to execute tcpdump [...] command terminated with exit code 14"
5 participants