Skip to content

Conversation

liuyuan10
Copy link
Contributor

Supports reverse selection in --devices flag with '!'

!excluded-prefix+: exclude devices with prefix "excluded-prefix"
!excluded: exclude the device named "excluded"

Supports device exclusion in --devices flag

@liuyuan10 liuyuan10 requested a review from a team as a code owner June 21, 2025 22:04
@liuyuan10 liuyuan10 requested a review from gentoo-root June 21, 2025 22:04
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 21, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jun 21, 2025
@aanm aanm added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels Jun 25, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 25, 2025
@aanm aanm enabled auto-merge June 25, 2025 09:00
@aanm aanm disabled auto-merge June 25, 2025 09:00
@joamaki
Copy link
Contributor

joamaki commented Jul 1, 2025

Please rebase (fixes the K8s Network E2E tests flake) and fix the typo:

  Error: pkg/datapath/linux/devices_controller.go:71:373: `criterias` is a misspelling of `criteria` (misspell)
  	flags.StringSlice(option.Devices, []string{}, "List of devices facing cluster/external network (used for BPF NodePort, BPF masquerading and host firewall); supports '+' as wildcard in device name, e.g. 'eth+'; support '!' to exclude devices, e.g. '!eth+' excludes any device with prefix 'eth'. Note '!' says nothing about which ones to include. A device must match other criterias to be selected; The filters are matched in order and whatever matched first wins.")

Then this is good to go!

@liuyuan10
Copy link
Contributor Author

Please rebase (fixes the K8s Network E2E tests flake) and fix the typo:

  Error: pkg/datapath/linux/devices_controller.go:71:373: `criterias` is a misspelling of `criteria` (misspell)
  	flags.StringSlice(option.Devices, []string{}, "List of devices facing cluster/external network (used for BPF NodePort, BPF masquerading and host firewall); supports '+' as wildcard in device name, e.g. 'eth+'; support '!' to exclude devices, e.g. '!eth+' excludes any device with prefix 'eth'. Note '!' says nothing about which ones to include. A device must match other criterias to be selected; The filters are matched in order and whatever matched first wins.")

Then this is good to go!

Done. Thanks

@joestringer joestringer removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jul 1, 2025
@liuyuan10
Copy link
Contributor Author

Is this change ready to merge? Please let me know if anything is missing

@dylandreimerink
Copy link
Member

/test

@joestringer joestringer enabled auto-merge July 23, 2025 15:30
@tommyp1ckles
Copy link
Contributor

https://github.com/cilium/cilium/actions/runs/16261006442/job/45906168035 Looks like some CI config issue, branch potentially needs a rebase @liuyuan10

@joestringer
Copy link
Member

Is this change ready to merge?

From a review perspective it looks good, I think the test results were just inconclusive. As Tom mentions, a rebase should likely help. We can retrigger the tests after and steer this in.

auto-merge was automatically disabled July 23, 2025 17:22

Head branch was pushed to by a user without write access

@liuyuan10
Copy link
Contributor Author

just rebased

@joestringer
Copy link
Member

/test

@liuyuan10
Copy link
Contributor Author

some are failing. any specific test failure that I should take a look?

@joestringer
Copy link
Member

joestringer commented Jul 23, 2025

@liuyuan10 All required tests must pass before merging the PR. I would suggest investigating each test, check whether the symptoms have been reported before under the "Issues" tab, and make a comment in the following form:

* clustermesh run `<here>` hit `<issue>`
* multi-pool run `<here>` hit a new issue, investigation below.
* ...

<investigation>

If you can't find an existing issue, that could either be due to the PR introducing the change, in which case you'd need to investigate further, or it could be a new failure that others didn't notice yet.

@jrife
Copy link
Contributor

jrife commented Jul 23, 2025

@joestringer At a glance, the failures look very similar to those I summarized here and here. The biggest culprit is #40682.

@liuyuan10
Copy link
Contributor Author

liuyuan10 commented Jul 23, 2025

Cilium Cluster Mesh upgrade run hit #39370
Conformance Cluster Mesh run hit #40682
Conformance IPsec E2E run hit #40682

@joestringer for "Conformance Kind Envoy Embedded" and "Conformance Multi Pool IPAM", I dont' find open issues. could you trigger the test again to see if it's flakes?

@jrife
Copy link
Contributor

jrife commented Jul 24, 2025

@liuyuan10 Rebase your changes. #40691 was merged today and should help stabilize some of the CI failures you are seeing. I can retrigger the CI tests if you do not have permissions to do so.

Supports reverse selection in --devices flag with '!'

!excluded-prefix+: exclude devices with prefix "excluded-prefix"
!excluded: exclude the device named "excluded"

Signed-off-by: Yuan Liu <liuyuan@google.com>
@liuyuan10
Copy link
Contributor Author

/test

1 similar comment
@jrife
Copy link
Contributor

jrife commented Jul 24, 2025

/test

@liuyuan10
Copy link
Contributor Author

@jrife rebased. but my "/test" is not making any process.

@jrife
Copy link
Contributor

jrife commented Jul 24, 2025

@liuyuan10 It looks like CI tests are running after my trigger.

@liuyuan10
Copy link
Contributor Author

@joestringer all tests have passed except Conformance Ginkgo. it was passing before so this is for sure a flake

@joestringer
Copy link
Member

@liuyuan10 would you mind triaging the failure and linking the corresponding issue?

@bowei
Copy link
Member

bowei commented Jul 25, 2025

Yuan -- can you dig into the Gingko failure a bit?

@bowei
Copy link
Member

bowei commented Jul 25, 2025

FWIW -- it looks like this is the failure:

• Failure [47.088 seconds]
K8sDatapathServicesTest
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc)
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    Checks in-cluster KPR
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
      with L7 policy
      /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
        Tests NodePort with L7 Policy [It]
        /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

        Request from kind-control-plane to service tftp://10.96.126.207:10069/hello failed
        Expected command: kubectl exec -n kube-system log-gatherer-29kwm -- /usr/bin/env bash -c 'fails=""; id=$RANDOM; for i in $(seq 1 10); do if curl -k --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 tftp://10.96.126.207:10069/hello -H "User-Agent: cilium-test-$id/$i"; then echo "Test round $id/$i exit code: $?"; else fails=$fails:$id/$i=$?; fi; done; if [ -n "$fails" ]; then echo "failed: $fails"; fi; cnt="${fails//[^:]}"; if [ ${#cnt} -gt 0 ]; then exit 42; fi' 
        To succeed, but it failed:
        Exitcode: 42 
        Err: exit status 42

@joestringer -- maybe for my edification -- what is the usual procedure for a test flake? Are people supposed to file an issue to track if it there isn't a corresponding issue already? I searched https://github.com/cilium/cilium/issues?q=is%3Aissue%20state%3Aopen%20test%20flake%20label%3Aci%2Fflake and there doesn't seem to be a corresponding issue for this.

@jrife
Copy link
Contributor

jrife commented Jul 25, 2025

@bowei This looks similar to #25467 but is a different test case. There's #21055 which matches this test case, but not the same failure. Looks like it might be worth creating a new CI issue to track this?

@liuyuan10
Copy link
Contributor Author

Conformance Ginkgo (ci-ginkgo) run hit a new issue, investigation below.

failed with error

• Failure [47.088 seconds]
K8sDatapathServicesTest
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc)
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    Checks in-cluster KPR
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
      with L7 policy
      /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
        Tests NodePort with L7 Policy [It]
        /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

        Request from kind-control-plane to service tftp://10.96.126.207:10069/hello failed
        Expected command: kubectl exec -n kube-system log-gatherer-29kwm -- /usr/bin/env bash -c 'fails=""; id=$RANDOM; for i in $(seq 1 10); do if curl -k --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 tftp://10.96.126.207:10069/hello -H "User-Agent: cilium-test-$id/$i"; then echo "Test round $id/$i exit code: $?"; else fails=$fails:$id/$i=$?; fi; done; if [ -n "$fails" ]; then echo "failed: $fails"; fi; cnt="${fails//[^:]}"; if [ ${#cnt} -gt 0 ]; then exit 42; fi' 
        To succeed, but it failed:
        Exitcode: 42 
        Err: exit status 42
        Stdout:
         	 
        	 Hostname: testds-45rvm
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=33456
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/1 exit code: 0
        	 
        	 Hostname: testds-4qh4x
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=55083
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/3 exit code: 0
        	 
        	 Hostname: testds-45rvm
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=55570
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/4 exit code: 0
        	 
        	 Hostname: testds-45rvm
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=56408
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/5 exit code: 0
        	 
        	 Hostname: testds-45rvm
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=41282
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/6 exit code: 0
        	 
        	 Hostname: testds-45rvm
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=32796
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/7 exit code: 0
        	 
        	 Hostname: testds-45rvm
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=50080
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/8 exit code: 0
        	 
        	 Hostname: testds-45rvm
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=60410
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/9 exit code: 0
        	 
        	 Hostname: testds-4qh4x
        	 
        	 Request Information:
        	 	client_address=10.0.0.183
        	 	client_port=35349
        	 	real path=/hello
        	 	request_scheme=tftp
        	 
        	 Test round 5733/10 exit code: 0
        	 failed: :5733/2=71
        	 
        Stderr:
         	 command terminated with exit code 42

Created #40724

@joestringer
Copy link
Member

@joestringer
Copy link
Member

Thanks for filing that. I think Jordan mentioned there was another similar symptom reported maybe in a slightly different test in the past couple of days, so there might be a regression in the tree around this.

From my understanding, the above test is basically saying "When connecting to a NodePort via L7 policy , despite ~10 repeated retries, I was unable to establish a connection to the target service".

I've retriggered the CI job to see if the failure reproduces on this PR or not.

@liuyuan10
Copy link
Contributor Author

Thanks. all the tests are now passing

@joestringer joestringer added this pull request to the merge queue Jul 25, 2025
Merged via the queue into cilium:main with commit 31748fc Jul 25, 2025
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants