
ipsec,ci: enable conn disrupt north-south for ipsec traffic #39061


Merged
merged 1 commit into main from ldelossa/ipsec-v6-conn-disrupt-flake on May 12, 2025

Conversation

ldelossa
Contributor

Enable the connection disruption tests for north-south traffic for IPsec upgrade tests.

This commit potentially introduces the flake outlined in #37540 again, for further debugging.

Fixes: #37540

ipsec: fix connection disruption issue for ipv6 ipsec upgrade scenarios. 

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 21, 2025
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary labels Apr 21, 2025
@ldelossa
Contributor Author

/test

@ldelossa ldelossa added the release-note/ci This PR makes changes to the CI. label Apr 22, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 22, 2025
@ldelossa ldelossa requested a review from ysksuzuki April 22, 2025 15:55
@ldelossa ldelossa marked this pull request as ready for review April 22, 2025 15:55
@ldelossa ldelossa requested a review from a team as a code owner April 22, 2025 15:55
@ldelossa ldelossa requested a review from christarazi April 22, 2025 15:55
@ldelossa
Contributor Author

@ysksuzuki I'm going to suggest we actually merge in this change.

I've been running the ipsec up/down tests for a bit now; 0 of 6 runs have failed so far: https://github.com/cilium/cilium/actions/runs/14576550343

It does not look like the original issue #37540 is occurring.

@julianwiedmann
Member

> @ysksuzuki I'm going to suggest we actually merge in this change.
>
> I've been running the ipsec up/down tests for a bit now; 0 of 6 runs have failed so far: https://github.com/cilium/cilium/actions/runs/14576550343
>
> It does not look like the original issue #37540 is occurring.

Hm I don't follow - isn't this exactly the symptom we're looking for?

(for context - #38757 moved the ipsec configs over into e2e-upgrade. I believe you've been trying to run ipsec-e2e for a repro instead? But that workflow doesn't even do any upgrade/downgrade ...)

@ysksuzuki
Member

ysksuzuki commented Apr 23, 2025

Right, this is the issue I saw in #37540. I wasn't aware that the ipsec configs were moved to e2e-upgrade, sorry.

  🟥 Pod test-conn-disrupt-client-non-backend-node-ipv6-internalip-xlwl5 flow was interrupted (restart count does not match 0 != 1)

Also, ci-ipsec-upgrade fails consistently with this change on v1.17: https://github.com/cilium/cilium/actions/runs/14606495152
I tested this on #39097. I think we need to take care of it too.

@ldelossa
Contributor Author

ldelossa commented Apr 23, 2025

Wait @julianwiedmann @ysksuzuki

Now I am a bit confused.

> I believe you've been trying to run ipsec-e2e for a repro instead? But that workflow doesn't even do any upgrade/downgrade ...

That's not the case: I've been testing https://github.com/cilium/cilium/actions/runs/14576550343, like the original issue, and it has 0/8 runs failing so far.

> Hm I don't follow - isn't this exactly the symptom we're looking for?

That is not the case. The e2e-upgrade tests you point to are failing due to a ctmap flush error that is occurring currently, independently of this PR (which turns the north-south tests back on): https://github.com/cilium/cilium/actions/runs/14614535990

That ctmap flush error is being tracked in an entirely separate pull request: #39018

@julianwiedmann
Member

> Wait @julianwiedmann @ysksuzuki
>
> Now I am a bit confused.
>
> > I believe you've been trying to run ipsec-e2e for a repro instead? But that workflow doesn't even do any upgrade/downgrade ...
>
> That's not the case: I've been testing https://github.com/cilium/cilium/actions/runs/14576550343, like the original issue, and it has 0/8 runs failing so far.

The original issue was for ci-ipsec-upgrade. You're testing ci-ipsec-e2e.

> > Hm I don't follow - isn't this exactly the symptom we're looking for?
>
> That is not the case. The e2e-upgrade tests you point to are failing due to a ctmap flush error that is occurring currently,

The ctmap error is a red herring. This is what's causing the workflow to fail:

📋 Test Report [cilium-test-1]
❌ 1/1 tests failed (0/0 actions), 120 tests skipped, 0 scenarios skipped:
Test [no-interrupted-connections]:
⛑️ The following owners are responsible for reliability of the testsuite:
- @cilium/sig-datapath (no-interrupted-connections)
- @cilium/ci-structure (.github/workflows/tests-e2e-upgrade.yaml)

And if you scroll up, that's

[-] Scenario [no-interrupted-connections/no-interrupted-connections]
🟥 Pod test-conn-disrupt-client-non-backend-node-ipv6-internalip-xlwl5 flow was interrupted (restart count does not match 0 != 1)
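For context on how this failure message comes about (hedged, since the exact mechanics live in cilium-cli): the conn-disrupt clients hold long-lived connections across the upgrade, a client whose connection is cut presumably exits and gets restarted, and the check then reduces to comparing each client pod's restart count against an expected value of 0. Below is a minimal Python sketch of that comparison; it is not cilium-cli's actual implementation, and the pod data is made up.

```python
# Hedged sketch of the restart-count check implied by the failure message above.
# The real check lives in cilium-cli's no-interrupted-connections test; this is
# only an illustration, and the pod data below is invented.

from dataclasses import dataclass

@dataclass
class PodStatus:
    name: str
    restart_count: int  # container restarts observed by the kubelet

EXPECTED_RESTARTS = 0  # a restarted client means its long-lived flow was cut

def find_interrupted(pods: list[PodStatus]) -> list[str]:
    """Return a failure message for every conn-disrupt client whose flow was interrupted."""
    return [
        f"{p.name} flow was interrupted "
        f"(restart count does not match {EXPECTED_RESTARTS} != {p.restart_count})"
        for p in pods
        if p.restart_count != EXPECTED_RESTARTS
    ]

if __name__ == "__main__":
    # Hypothetical data mirroring the failure quoted above.
    pods = [
        PodStatus("test-conn-disrupt-client-non-backend-node-ipv6-internalip-xlwl5", 1),
        PodStatus("test-conn-disrupt-client-backend-node-ipv4-abcde", 0),
    ]
    for msg in find_interrupted(pods):
        print("🟥", msg)
```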

@ldelossa
Contributor Author

Ahh I was looking at: https://github.com/cilium/cilium/actions/runs/14576551857/job/40950760283

Okay, I see it now in ipsec-5. And yeah, I'm just realizing now that ipsec-upgrade is no more; I mis-linked to the e2e tests while looking for ipsec-upgrade.

@ldelossa ldelossa force-pushed the ldelossa/ipsec-v6-conn-disrupt-flake branch from a54da95 to a466c01 Compare April 24, 2025 17:30
@ldelossa
Contributor Author

I am unable to reproduce the issue on my local machine.

This may be due to the specific kernel version that ipsec-5 is running on. I'm doing a test now that runs the ipsec-5 configuration with a 6.10 kernel, to see whether the error reproduces on that kernel as well.

@ldelossa ldelossa force-pushed the ldelossa/ipsec-v6-conn-disrupt-flake branch from a466c01 to 5fb78f9 Compare April 24, 2025 19:46
@ldelossa ldelossa requested review from a team as code owners April 24, 2025 19:46
@ldelossa ldelossa requested a review from pchaigno April 24, 2025 19:46
@ldelossa
Contributor Author

/test

@ldelossa ldelossa force-pushed the ldelossa/ipsec-v6-conn-disrupt-flake branch from 5fb78f9 to 23581c9 Compare April 25, 2025 11:39
@ldelossa ldelossa added the dont-merge/preview-only Only for preview or testing, don't merge it. label Apr 25, 2025
@ldelossa
Contributor Author

/test

@ldelossa
Contributor Author

ldelossa commented Apr 25, 2025

Interesting: when we change ipsec-5 to use the 6.10 kernel, the issue goes away. It's at least starting to look like this is specifically a 5.4 kernel issue.

Just to confirm, I'm going to add a few test configs that use the same params that trigger the issue, spread across our supported kernel range. If that's right, we should see only 5.4 fail.

@julianwiedmann
Member

Had a quick look at the sysdump:

  • "drop_reason_desc":"UNSUPPORTED_PROTOCOL_FOR_NAT_MASQUERADE","Summary":"ICMPv4 TimeExceeded(TTLExceeded)": addressed by bpf: nat: ICMP v4 improvements #36767, although it won't help with this specific problem. Just reducing the noise.
  • "drop_reason_desc":"UNSUPPORTED_PROTOCOL_FOR_NAT_MASQUERADE","Summary":"ICMPv6 TimeExceeded(HopLimitExceeded)": we have upstream support for this ICMP message as well, might also backport.
  • "traffic_direction":"INGRESS","file":{"name":"nodeport.h","line":1213},"drop_reason_desc":"FIB_LOOKUP_FAILED": these should be promising, and might also explain why we're only seeing problems on v5.4 (different behavior in fib.h).

@ldelossa ldelossa force-pushed the ldelossa/ipsec-v6-conn-disrupt-flake branch 2 times, most recently from 415ef13 to 6ae90fc Compare May 7, 2025 13:58
Enable the connection disruption tests for north-south traffic for IPsec
upgrade tests.

This commit potentially introduces the flake outlined in #37540 again,
for further debugging. However, it should no longer cause an issue after
f59a51a ("CI: stop testing 5.4 kernel") has been merged, because the
issue was only apparent on 5.4 kernels.

Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
@ldelossa ldelossa force-pushed the ldelossa/ipsec-v6-conn-disrupt-flake branch from 6ae90fc to c3f1d4c Compare May 8, 2025 14:16
@ldelossa ldelossa added dont-merge/preview-only Only for preview or testing, don't merge it. and removed dont-merge/preview-only Only for preview or testing, don't merge it. labels May 8, 2025
@ldelossa
Contributor Author

ldelossa commented May 8, 2025

/test

@ldelossa ldelossa force-pushed the ldelossa/ipsec-v6-conn-disrupt-flake branch from c3f1d4c to 05e3f7e Compare May 12, 2025 14:20
@ldelossa ldelossa removed the dont-merge/preview-only Only for preview or testing, don't merge it. label May 12, 2025
@ldelossa
Contributor Author

This is ready for review.

@julianwiedmann has removed the 5.4 kernel tests that were causing the error here.
Also, we have a v1.17 backport PR open that updates the stable workflows to skip the NS tests when the kernel version is 5.4: #39443
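For illustration only, and not taken from #39443 (whose actual gating lives in the workflow files): a hedged Python sketch of the kind of kernel-version gate being described, i.e. skip the north-south conn-disrupt tests on 5.4 kernels and run them everywhere else.

```python
# Hedged sketch of a kernel-version gate like the one described above: skip the
# north-south conn-disrupt tests on the 5.4 kernel series, where the flake was
# seen, and run them on every other kernel. Illustrative only.

def parse_kernel(version: str) -> tuple[int, int]:
    """Parse 'major.minor' out of a kernel version string such as '5.4.0-176-generic'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor.split("-")[0])

def include_ns_conn_disrupt(kernel_version: str) -> bool:
    """Return True unless the kernel is the 5.4 series."""
    return parse_kernel(kernel_version) != (5, 4)

assert include_ns_conn_disrupt("5.4.0-176-generic") is False
assert include_ns_conn_disrupt("6.10.0") is True
```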

@ldelossa
Contributor Author

/test

@ldelossa ldelossa added this pull request to the merge queue May 12, 2025
Merged via the queue into main with commit 460f6a7 May 12, 2025
254 of 255 checks passed
@ldelossa ldelossa deleted the ldelossa/ipsec-v6-conn-disrupt-flake branch May 12, 2025 18:10
@pchaigno pchaigno linked an issue Jul 8, 2025 that may be closed by this pull request