Skip to content

Conversation

ysksuzuki
Copy link
Member

@ysksuzuki ysksuzuki commented Jan 28, 2025

This PR extends the conn-disrupt-test to ensure that external connections through a NodePort are not disrupted.

The test deploys a conn-disrupt-test server with a single replica in the cluster and multiple clients on an external node.
Clients establish long-lived TCP connections with the server using a NodePort. Each client selects a node where the backend is running, and one where the backend is not running. The clients send requests to both nodes using IPv4 and IPv6 addresses.

After upgrade, the test checks the restart counters, and compares them with the counters stored before the upgrade. A mismatch indicates that a connection was interrupted.

Example output from kubectl -n cilium-test-1 get po -o wide:

NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE
test-conn-disrupt-client-backend-node-ipv4-internalip-7fc97pbp9   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-client-backend-node-ipv6-internalip-66bdtmg2z   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-client-non-backend-node-ipv4-internalip-d6bns   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-client-non-backend-node-ipv6-internalip-4x77p   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-server-ns-traffic-5cc46556fb-bm9lz              1/1     Running   0          37s     10.244.2.173   kind-worker2

The conn-disrupt-test for NS traffic runs in the following workflows

  • ci-e2e-upgrade
  • ci-ipsec-upgrade
  • ci-ipsec-e2

It will be skipped in the following because the feature node-without-cilium is disabled (The one for EW traffic will run)

  • conformance-eks
  • ci-clustermesh upgrade
  • ci-ces-migrate

Note that ci-e2e-upgrade and ci-ipsec-upgrade become unstable if we enable no-interrupted-connections for NS traffic test. Will skip these conditions (lb-acceleration and ipsec + KPR + ipv6) for now, and investigate the cause as follow-up.

ci-e2e-upgrade (lb-acceleration enabled)
https://github.com/cilium/cilium/actions/runs/12984833400/job/36208601353

  ❌ Found unexpected packet drops:
{
  "labels": {
    "direction": "EGRESS",
    "reason": "FIB lookup failed"
  },
  "name": "cilium_drop_count_total",
  "value": 1
}

ci-ipsec-upgrade (ipsec + KPR + ipv6)
https://github.com/cilium/cilium/actions/runs/12982469981/job/36203776562

  [-] Scenario [no-interrupted-connections/no-interrupted-connections]
  🟥 Pod test-conn-disrupt-client-non-backend-node-ipv6-internalip-9rp69 flow was interrupted (restart count does not match 0 != 1)

Fixes: #13530

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 28, 2025
@github-actions github-actions bot added the cilium-cli This PR contains changes related with cilium-cli label Jan 28, 2025
@ysksuzuki ysksuzuki added the release-note/ci This PR makes changes to the CI. label Jan 28, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 28, 2025
@ysksuzuki
Copy link
Member Author

/test

@ysksuzuki ysksuzuki force-pushed the pr/ysksuzuki/test-conn-disrupt-ns-traffic branch from b9785bf to 341684d Compare January 28, 2025 07:29
@ysksuzuki
Copy link
Member Author

/test

@joestringer
Copy link
Member

Fly-by comment, but I see this is making changes to test "from outside" which can often be a source of unreliable test results in CI runs (eg, passes sometimes but fails sometimes). I think it would be worthwhile to heavily stress these changes by running CI many times in order to demonstrate that the test changes have high reliability.

@ysksuzuki ysksuzuki force-pushed the pr/ysksuzuki/test-conn-disrupt-ns-traffic branch 2 times, most recently from 0c0ea09 to 620378f Compare January 29, 2025 04:09
@ysksuzuki
Copy link
Member Author

/test

@ysksuzuki
Copy link
Member Author

Looks like ci-e2e-upgrade failed due to a flake. opened #37325

@ysksuzuki ysksuzuki marked this pull request as ready for review January 29, 2025 11:18
@ysksuzuki ysksuzuki requested review from a team as code owners January 29, 2025 11:18
Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Some nits.

…utside

This commit extends the conn-disrupt-test to ensure that external
connections through a NodePort are not disrupted.

The test deploys a conn-disrupt-test server with a single replica
in the cluster and multiple clients on an external node.
Clients establish long-lived TCP connections with the server
using a NodePort. Each client selects a node where the backend is
running, and one where the backend is not running.
The clients send requests to both nodes using IPv4 and IPv6 addresses.

After upgrade, the test checks the restart counters, and compares them
with the counters stored before the upgrade. A mismatch indicates that
a connection was interrupted.

Example output from `kubectl -n cilium-test-1 get po -o wide`:

```
NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE
test-conn-disrupt-client-backend-node-ipv4-internalip-7fc97pbp9   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-client-backend-node-ipv6-internalip-66bdtmg2z   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-client-non-backend-node-ipv4-internalip-d6bns   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-client-non-backend-node-ipv6-internalip-4x77p   1/1     Running   0          34s     172.18.0.2     kind-worker3
test-conn-disrupt-server-ns-traffic-5cc46556fb-bm9lz              1/1     Running   0          37s     10.244.2.173   kind-worker2
```

Signed-off-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
include-conn-disrupt-test-ns-traffic is false by default, and
we need to explicitly set it to true to run the conn-disrupt-test
for NS traffic. This is to prevent CI from disrupting when updating
cilium-cli on the stable branch. Also, this is intended to provide
a way to opt out if this test causes instability in CI.

Signed-off-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
lb-acceleration is enabled or ipsec and KPR is enabled.
ci-e2e-upgrade and ci-ipsec-upgrade become unstable with
those settings. Skip it for now.

Signed-off-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
@ysksuzuki ysksuzuki force-pushed the pr/ysksuzuki/test-conn-disrupt-ns-traffic branch from 620378f to 24ffc76 Compare January 30, 2025 07:13
This commit enables conn-disrupt-test for NS traffic
in ci-e2e-upgrade, ci-ipsec-upgrade and ci-ipsec-e2e.

It will be skipped in conformance-eks, ci-clustermesh
upgrade and ci-ces-migrate because the feature node-without-cilium
is disabled. (The one for EW traffic will run)

Signed-off-by: Yusuke Suzuki <yusuke.suzuki@isovalent.com>
@ysksuzuki ysksuzuki force-pushed the pr/ysksuzuki/test-conn-disrupt-ns-traffic branch from 24ffc76 to 22fdd9a Compare January 30, 2025 07:21
@ysksuzuki
Copy link
Member Author

/test

@julianwiedmann
Copy link
Member

The conn-disrupt-test for NS traffic runs in the following workflows

* ci-e2e-upgrade

* ci-ipsec-upgrade

* ci-ipsec-e2

Just to emphasize - the IPsec workflows would then currently run this test in any configuration where KPR is switched off, correct?

@julianwiedmann julianwiedmann added area/loadbalancing Impacts load-balancing and Kubernetes service implementations upgrade-impact This PR has potential upgrade or downgrade impact. labels Jan 30, 2025
@julianwiedmann
Copy link
Member

Fly-by comment, but I see this is making changes to test "from outside" which can often be a source of unreliable test results in CI runs (eg, passes sometimes but fails sometimes). I think it would be worthwhile to heavily stress these changes by running CI many times in order to demonstrate that the test changes have high reliability.

@joestringer At least for the kind-based CI parts this should be not a big concern - the "outside" entity is really just another kind node that's not part of the cluster. We use the same approach in the EgressGW tests for instance, and I'm not aware that it's adding much flakiness to the results.

But ack on the overall sentiment - I like that the PR is enabling this new test very selectively, and for instance we're not exposing older Cilium releases to it when bumping the CLI version in CI.

@ysksuzuki
Copy link
Member Author

the IPsec workflows would then currently run this test in any configuration where KPR is switched off, correct?

Yes, that is correct.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 4, 2025
@joestringer joestringer enabled auto-merge February 4, 2025 15:44
@joestringer joestringer added this pull request to the merge queue Feb 5, 2025
Merged via the queue into main with commit 820a62d Feb 5, 2025
290 checks passed
@joestringer joestringer deleted the pr/ysksuzuki/test-conn-disrupt-ns-traffic branch February 5, 2025 09:21
@github-actions github-actions bot added the backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. label Mar 25, 2025
@julianwiedmann julianwiedmann added backport/author The backport will be carried out by the author of the PR. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Apr 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/loadbalancing Impacts load-balancing and Kubernetes service implementations backport/author The backport will be carried out by the author of the PR. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. cilium-cli This PR contains changes related with cilium-cli ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI. upgrade-impact This PR has potential upgrade or downgrade impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend test-conn-disrupt to test NodePort from outside
7 participants