fix aws-cni conformance test #20049

aanm · 2022-06-01T21:55:53Z

Commit 2d7af3a (".github: add support for cilium-cli in aws-cni
conformance tests") changed the AWS-CNI workflow to install Cilium using
the CLI instead of Helm. All of the Helm flags were passed through the
CLI using --helm-set.

However, the CLI also does its own magic and isn't aware that we want
to install Cilium in chaining mode. Therefore, it detects EKS and
incorrectly sets IPAM mode to ENI.

As a result, Cilium attempts to set up the IP rules and routes for new
pods. It seems to fail in most cases (maybe because AWS-CNI already set up
the state) but sometimes succeeds. When it succeeds, pods end up with an
egress rule pointing to a nonexistent IP route table.

This usually surfaces in the workflow runs as a DNS failure: the client
pod fails to egress to the DNS backend on a remote node. In rarer cases,
it surfaces as a failing connectivity test for some of the pods.

This commit fixes it by overriding some of the Helm flags set
automatically by cilium-cli, namely "eni.enabled=false" and
"ipam.mode=cluster-pool".

The longer-term fix would be to support chaining mode in the CLI.
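
For illustration, the override described above can be sketched as a small shell script. It only prints the command rather than running it. The flag values are the ones shown in this PR's workflow diff; the `print_install_cmd` helper is hypothetical and exists only for this sketch, and the real workflow passes additional flags that are omitted here.

```shell
#!/bin/sh
# Sketch only: assembles the chaining-mode overrides from this PR and
# prints the resulting command instead of executing it.
set -eu

# Hypothetical helper (not part of the workflow): builds the install
# command with the explicit overrides that stop cilium-cli from
# auto-selecting ENI IPAM when it detects EKS.
print_install_cmd() {
  set -- \
    --helm-set=cni.chainingMode=aws-cni \
    --helm-set=enableIPv4Masquerade=false \
    --helm-set=eni.enabled=false \
    --helm-set=ipam.mode=cluster-pool
  # "$*" joins the flags with single spaces.
  printf '%s\n' "cilium install $*"
}

print_install_cmd
```

Listing the two overrides after the chaining-mode flag mirrors the order in the diff, which makes it easier to see that they exist only because of chaining mode.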

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>

sayboras

pchaigno

Could we run this at least 5 times before re-enabling? Twice won't catch many flakes.

pchaigno

We also need to fix the v1.11 workflow.

pchaigno · 2022-06-02T13:19:46Z

.github/workflows/conformance-aws-cni.yaml

@@ -155,6 +155,8 @@ jobs:
            --helm-set=hubble.relay.image.tag=${SHA} \
            --helm-set=enableIPv4Masquerade=false \
            --helm-set=cni.chainingMode=aws-cni \
+            --helm-set=eni.enabled=false \
+            --helm-set=ipam.mode=cluster-pool \


Can we have a comment here to explain why setting what looks like the default value is necessary?

aanm · 2022-06-02T15:32:43Z

@pchaigno it failed here. Is this the same flake as before?

pchaigno · 2022-06-02T21:09:30Z

@aanm You know you can retrigger a run of the workflow in the top right corner of the previous run, without having to close and open the PR, right?

aanm · 2022-06-02T21:19:46Z

> @aanm You know you can retrigger a run of the workflow in the top right corner of the previous run, without having to close and open the PR, right?

Where's the fun in that? 😄

pchaigno · 2022-06-02T21:23:19Z

> @pchaigno it failed here. Is this the same flake as before?

That's a different flake. We should file an issue for it. I'm not sure how frequent it is, but if it is frequent enough, then maybe we shouldn't re-enable the workflow.

It's a very weird one: policy denied drops on egress for pod-to-pod traffic on the same node, with correct identities, and a correct policy map at the time we collect the sysdump.

aanm added the release-note/misc label Jun 1, 2022

christarazi approved these changes Jun 1, 2022

aanm closed this Jun 1, 2022

aanm reopened this Jun 1, 2022

aanm force-pushed the pr/fix-aws-cni branch from 9ecff7d to 11a66a4 Compare June 1, 2022 22:44

aanm closed this Jun 2, 2022

aanm reopened this Jun 2, 2022

aanm closed this Jun 2, 2022

aanm reopened this Jun 2, 2022

aanm force-pushed the pr/fix-aws-cni branch from 11a66a4 to 72edddf Compare June 2, 2022 11:23

aanm marked this pull request as ready for review June 2, 2022 11:23

aanm requested review from a team as code owners June 2, 2022 11:23

aanm requested a review from pchaigno June 2, 2022 11:23

sayboras approved these changes Jun 2, 2022

pchaigno requested changes Jun 2, 2022

aanm force-pushed the pr/fix-aws-cni branch from 72edddf to 139125d Compare June 2, 2022 14:42

aanm requested a review from pchaigno June 2, 2022 14:42

aanm closed this Jun 2, 2022

aanm reopened this Jun 2, 2022

aanm closed this Jun 2, 2022

aanm reopened this Jun 2, 2022

aanm closed this Jun 2, 2022

aanm reopened this Jun 2, 2022

christarazi mentioned this pull request Jun 2, 2022

Add metric on datapath update latency due to FQDN IP updates #19992

Merged

aanm force-pushed the pr/fix-aws-cni branch from 139125d to 61a3cda Compare June 2, 2022 21:20

christarazi mentioned this pull request Jun 2, 2022

Optimize CIDR label functions #19843

Merged

sayboras mentioned this pull request Jun 3, 2022

clustermesh: Add ownerReferences for CiliumNodes #19959

Merged

aanm merged commit 6c17330 into master Jun 3, 2022

aanm deleted the pr/fix-aws-cni branch June 3, 2022 20:30

This was referenced Jun 6, 2022

pkg/policy/api: Optimize Decision MarshalJSON() #19704

Merged

pkg/policy/rule: Optimize rule String() #19822

Merged

aanm mentioned this pull request Jun 22, 2022

Prepare for release v1.12.0-rc3 #20279

Merged
