-
Notifications
You must be signed in to change notification settings - Fork 3.4k
connectivity: rework sniffer to execute tcpdump in background #40487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8fd6428
to
5e83713
Compare
5e83713
to
61df29e
Compare
/ci-e2e-upgrade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! A few drive-by comments.
ci-e2e-upgrade didn't fail with the tcpdump error in 22 attempts. Some tests are unsuccessful because of other flakes and quay outage. I'd say it's stable: the debug pull request #40692 can reproduce it after 3–4 retries, while this one survived 22 in a row. |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simone is on vacation, I'm taking it over for the while.
61df29e
to
2e3e11a
Compare
/test |
ae8206e
to
041bfb8
Compare
Prior to this commit, we were running remote sniffers via ExecInPodWithWriters and streaming stdout/stderr through the connection. We observed race conditions with underlying libraries (both WebSocket and SPDY) in case one of the stream is not available yet/anymore. In that case, the remote tcpdump command that is printing packet/logs in the streams would fail with error code 14 (broken pipe). Such flake was reproducible only in CI. The test consisted of the following instructions: 1. running an ExecInPodWithWriters connection with stdout/stderr stream for tcpdump 2. running an ExecInPod at the end of the test to retrieve packets count 3. running an ExecInPod at the end of the test to retrieve and print packets With this commit, we run the remote sniffer in background, without keeping open connections for streaming data. Tcpdump now logs both stdout/stderr and potential captured packets to local files in the pod, which are then retrieved/read by us at the end of each test. The difference is in the way we start/stop such command: given we do not have anymore an open connection to close, we need to carefully make sure that we call stop in all the possible circumstances (e.g., context canceled, timeout hit, errors in between two `Sniff()`, etc.). To do that, we introduce a close/finalizer function that is being returned in the costructor, so that the caller can defer its execution. The `Validate()` is also able to stop the sniffer, as well as the constructor `Sniff()` itself in case of errors (e.g., context canceled or timeout hit). We do now perform the following operations: i. run an ExecInPod to create the remote tcpdump process in background ii. run an ExecInPod to stop the remote process at the end of the test iii. running an ExecInPod at the end of the test to retrieve packets count iv. running an ExecInPod at the end of the test to retrieve and print packets While it is true that it might overload more the API server given we do need to ExecInPod once more, there's no relevant stdout/stderr in (i) and (ii) anymore, hopefully mitigating the tcpdump issue we were facing in CI flakes. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com> Co-developed-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Prior to this patch series of 3 commits, we were: 1. running an ExecInPodWithWriters connection with stdout/stderr stream for tcpdump 2. running an ExecInPod at the end of the test to retrieve packets count 3. running an ExecInPod at the end of the test to retrieve and print packets In previous commits, we reworked the way we create and stop such sniffers: the tcpdump command is now left in background in the pod, and no active stream connections are present during the related tests. The total amount of remote commands executions were as follows: i. run an ExecInPod to create the remote tcpdump process in background ii. run an ExecInPod to stop the remote process at the end of the test iii. running an ExecInPod at the end of the test to retrieve packets count iv. running an ExecInPod at the end of the test to retrieve and print packets This is clearly inefficient, given we moved to background tcpdump in pods to avoid error 14 (broken pipe) due to race conditions while setting up the stream (the flake happened with both WebSockets and SPDY connections), observed especially under heavy load of the API server. This commit optimizes such behavior, unifying (ii) - (iii) - (iv). Upon successfully stopping the remote sniffer, the stop command executed in (ii) will additionally print to stdout (iii) and (iv) without having to re-execute ExecInPod. If the sniffer has been successfully stopped, this command will not fail, even if facing errors while printing out the packet count (iii) or the packet themselves (iv). This is aligned to what we're already doing right now, but with less connections. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
When running in Sanity mode, we do limit the number of captured packets to one, just to make sure that we see traffic in the interface where the sniffer is listening to. In Assert mode, we capture all the packets matching the filter, and we signal the leak at the end of the test by printing all of them. I'd suggest we limit also in Assert mode the amount of packets captured. Historically, I've seen captured packets to always repeat themselves, and I'm not sure it could be very helpful to have, for instance, more than 1000 leaked packets being logged. In addition, limiting the amount of captured packets would also help with keeping low memory footprint, given we do not remove the dump files yet from the pods. Signed-off-by: Simone Magnani <simone.magnani@isovalent.com>
041bfb8
to
a47aa75
Compare
Avoid reverse DNS resolution of IP addresses. Suggested-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
a47aa75
to
d5d7c91
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice commit history! 🙏
// (tcpdump: Unable to write output: Broken pipe) | ||
// | ||
// Runs silently: succeeds if tcpdump is active ("listening on <iface>"), | ||
// else fails with its error code. Includes `timeout 60` to prevent orphaned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit in case of repush:
// else fails with its error code. Includes `timeout 60` to prevent orphaned | |
// else fails with its error code. Includes `timeout` to prevent orphaned |
given the timeout is not hardcoded here.
case ModeSanity: | ||
count = 1 | ||
case ModeAssert: | ||
count = 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, even 100 would be largely enough. I wouldn't delay the PR on this though.
# Return an error if tcpdump timed out before capturing any packets. | ||
if kill -0 $pid 2>/dev/null; then | ||
kill -2 $pid 2>/dev/null | ||
elif [ "$(tcpdump -r {{ .DumpPath }} --count 2>/dev/null)" = "0 packets" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gentoo-root I believe this is incorrect for the ModeAssert where having 0 packets
is good.
Isn't this checked later in Go anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is to error out if tcpdump exited prematurely. If it didn't collect any packets and exited, it means it may have missed some packets, so we better fail the test.
In ModeAssert, however, we expect that tcpdump survives until it's killed on line 99. We don't hit the check on line 100 if everything works properly. Only if tcpdump was running during the entire test, we can be sure that 0 packets is indeed 0 packets, and we return 0 on line 111 in this case. So it's all good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, given we now have also -c 1000
for ModeAssert.
Tcpdump might have already recorded 1000 packets and exited before our check here.
This PR contains the following updates: | Package | Update | Change | |---|---|---| | [aqua:cilium/cilium-cli](https://redirect.github.com/cilium/cilium-cli) | patch | `0.18.5` -> `0.18.6` | --- ### Release Notes <details> <summary>cilium/cilium-cli (aqua:cilium/cilium-cli)</summary> ### [`v0.18.6`](https://redirect.github.com/cilium/cilium-cli/releases/tag/v0.18.6) [Compare Source](https://redirect.github.com/cilium/cilium-cli/compare/v0.18.5...v0.18.6) #### What's Changed **Minor Changes:** - Cilium uninstall now removes annotations from Kubernetes nodes when clean-cilium-state: true ([cilium/cilium#39931](https://redirect.github.com/cilium/cilium/issues/39931), [@​AritraDey-Dev](https://redirect.github.com/AritraDey-Dev)) - Deprecate `v2alpha1` version of `CiliumLoadBalancerIPPool` CRD in favor of the `v2` version ([cilium/cilium#39134](https://redirect.github.com/cilium/cilium/issues/39134), [@​pippolo84](https://redirect.github.com/pippolo84)) **Bugfixes:** - Fix bug where we would display the Max Seq. Number for IPsec on 32bits. ([cilium/cilium#40622](https://redirect.github.com/cilium/cilium/issues/40622), [@​pchaigno](https://redirect.github.com/pchaigno)) **CI Changes:** - Add l7 proxy check for `to-fqdns` connectivity test ([cilium/cilium#40549](https://redirect.github.com/cilium/cilium/issues/40549), [@​vipul-21](https://redirect.github.com/vipul-21)) - cli: switch coredns image to registry.k8s.io, and fix renovate ([cilium/cilium#40706](https://redirect.github.com/cilium/cilium/issues/40706), [@​giorio94](https://redirect.github.com/giorio94)) - connectivity: Allow customization of tcpdump kill timeout ([cilium/cilium#40774](https://redirect.github.com/cilium/cilium/issues/40774), [@​gentoo-root](https://redirect.github.com/gentoo-root)) - connectivity: rework sniffer to execute tcpdump in background ([cilium/cilium#40487](https://redirect.github.com/cilium/cilium/issues/40487), [@​smagnani96](https://redirect.github.com/smagnani96)) **Misc Changes:** - chore(deps): update docker.io/library/golang:1.24.4 docker digest to [`20a022e`](https://redirect.github.com/cilium/cilium-cli/commit/20a022e) (main) ([cilium/cilium#40379](https://redirect.github.com/cilium/cilium/issues/40379), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update docker.io/library/golang:1.24.5 docker digest to [`ef5b4be`](https://redirect.github.com/cilium/cilium-cli/commit/ef5b4be) (main) ([cilium/cilium#40738](https://redirect.github.com/cilium/cilium/issues/40738), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update go to v1.24.5 (main) ([cilium/cilium#40496](https://redirect.github.com/cilium/cilium/issues/40496), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - cilium-cli: Print partial output upon `bgp peers` errors ([cilium/cilium#40278](https://redirect.github.com/cilium/cilium/issues/40278), [@​rastislavs](https://redirect.github.com/rastislavs)) - cilium-cli: Update default network-perf image ([cilium/cilium#40376](https://redirect.github.com/cilium/cilium/issues/40376), [@​HadrienPatte](https://redirect.github.com/HadrienPatte)) - cilium-cli: Use slim k8s packages for connectivity tests ([cilium/cilium#40708](https://redirect.github.com/cilium/cilium/issues/40708), [@​HadrienPatte](https://redirect.github.com/HadrienPatte)) - Fix misc typos ([cilium/cilium#40769](https://redirect.github.com/cilium/cilium/issues/40769), [@​HadrienPatte](https://redirect.github.com/HadrienPatte)) - go.mod, vendor: pull in charts for Cilium 1.18.0 and Tetragon 1.5.0 ([cilium/cilium#40823](https://redirect.github.com/cilium/cilium/issues/40823), [@​tklauser](https://redirect.github.com/tklauser)) - Miscellaneous improvements to option.NewNamedMapOptions ([cilium/cilium#40529](https://redirect.github.com/cilium/cilium/issues/40529), [@​giorio94](https://redirect.github.com/giorio94)) - The unableTranslateCIDRgroups variable is removed as it is not used since the v1.17 release ([cilium/cilium#40267](https://redirect.github.com/cilium/cilium/issues/40267), [@​Surya-7890](https://redirect.github.com/Surya-7890)) - vendor: Update github.com/google/go-github to v73 ([cilium/cilium#40326](https://redirect.github.com/cilium/cilium/issues/40326), [@​HadrienPatte](https://redirect.github.com/HadrienPatte)) - Update stable release to v0.18.5 by [@​tklauser](https://redirect.github.com/tklauser) in [https://github.com/cilium/cilium-cli/pull/3060](https://redirect.github.com/cilium/cilium-cli/pull/3060) - chore(deps): update docker.io/library/golang:1.24.4 docker digest to [`20a022e`](https://redirect.github.com/cilium/cilium-cli/commit/20a022e) by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3061](https://redirect.github.com/cilium/cilium-cli/pull/3061)1 - Update RELEASE.md by [@​michi-covalent](https://redirect.github.com/michi-covalent) in [https://github.com/cilium/cilium-cli/pull/3062](https://redirect.github.com/cilium/cilium-cli/pull/3062) - chore(deps): update golang docker tag to v1.24.5 by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3063](https://redirect.github.com/cilium/cilium-cli/pull/3063)3 - chore(deps): update go to v1.24.5 (patch) by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3065](https://redirect.github.com/cilium/cilium-cli/pull/3065)5 - chore(deps): update golangci/golangci-lint docker tag to v2.2.2 by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3066](https://redirect.github.com/cilium/cilium-cli/pull/3066)6 - chore(deps): update dependency cilium/cilium to v1.17.6 by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3068](https://redirect.github.com/cilium/cilium-cli/pull/3068)8 - chore(deps): update golang:1.24.5-alpine3.21 docker digest to [`3ebc008`](https://redirect.github.com/cilium/cilium-cli/commit/3ebc008) by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3067](https://redirect.github.com/cilium/cilium-cli/pull/3067)7 - chore(deps): update golang:1.24.5-alpine3.21 docker digest to [`72ff633`](https://redirect.github.com/cilium/cilium-cli/commit/72ff633) by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3069](https://redirect.github.com/cilium/cilium-cli/pull/3069)9 - chore(deps): update golang:1.24.5-alpine3.21 docker digest to [`6edc205`](https://redirect.github.com/cilium/cilium-cli/commit/6edc205) by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3070](https://redirect.github.com/cilium/cilium-cli/pull/3070)0 - chore(deps): update golangci/golangci-lint docker tag to v2.3.0 - autoclosed by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3071](https://redirect.github.com/cilium/cilium-cli/pull/3071)1 - chore(deps): update dependency cilium/cilium to v1.18.0 by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3073](https://redirect.github.com/cilium/cilium-cli/pull/3073)3 - chore(deps): update docker.io/library/golang:1.24.5 docker digest to [`ef5b4be`](https://redirect.github.com/cilium/cilium-cli/commit/ef5b4be) by [@​renovate](https://redirect.github.com/renovate)\[bot] in[https://github.com/cilium/cilium-cli/pull/3072](https://redirect.github.com/cilium/cilium-cli/pull/3072)2 - Prepare for v0.18.6 release by [@​tklauser](https://redirect.github.com/tklauser) in [https://github.com/cilium/cilium-cli/pull/3074](https://redirect.github.com/cilium/cilium-cli/pull/3074) **Full Changelog**: cilium/cilium-cli@v0.18.5...v0.18.6 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/zocimek/home-ops). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0MS40My41IiwidXBkYXRlZEluVmVyIjoiNDEuNDYuMyIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsidHlwZS9wYXRjaCJdfQ==--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Łukasz Pospiech <zocimek@users.noreply.github.com>
Please do refer to commit messages for a detailed explanation.
I'd love as many feedback as possible, given this ended up being a not so easy change 😅
1st and 2nd commit are the major updates. Sometimes while re-reading them they do feel quite hacky, but haven't found a better way to implement that.
3rd commit
connectivity: limit tcpdump capture also for Assert mode
is a Why not?, definitively can go in a different PR (or even cancel it if I missed the reason why we shouldn't do that).Fixes: #38643