Skip to content

Conversation

aanm
Copy link
Member

@aanm aanm commented Jun 1, 2022

Commit 2d7af3a (".github: add support for cilium-cli in aws-cni
conformance tests") changed the AWS-CNI workflow to install Cilium using
the CLI instead of Helm. All of the Helm flags were passed through the
CLI using --helm-set.

However, the CLI also does it's own magic and isn't aware that we want
to install Cilium in chaining mode. Therefore, it detects EKS and
incorrectly sets IPAM mode to ENI.

As a result, Cilium attempts to setup the IP rules and routes for new
pods. It seems to fail in most cases—maybe because AWS-CNI already setup
the state—but sometimes succeeds. When it succeeds, pods end up with an
egress rule pointing to a non-existing ip route table.

This usually surfaces in the workflow runs as a DNS failure: the client
pod fails to egress to the DNS backend on a remote node. In rarer cases,
it surfaces as a failing connectivity test for some of the pods.

This commit fixes it by overwriting some of the helm flags set
automatically by Cilium-cli namely "eni.enabled=false" and
"ipam=cluster-pool".

The longer-term fix would be to support chaining mode in the CLI.

Reported-by: Paul Chaignon paul@cilium.io
Signed-off-by: André Martins andre@cilium.io

@aanm aanm added the release-note/misc This PR makes changes that have no direct user impact. label Jun 1, 2022
@christarazi christarazi added area/CI Continuous Integration testing issue or flake area/eni Impacts ENI based IPAM. integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. sig/ipam release-note/ci This PR makes changes to the CI. and removed release-note/misc This PR makes changes that have no direct user impact. labels Jun 1, 2022
@aanm aanm closed this Jun 1, 2022
@aanm aanm reopened this Jun 1, 2022
@aanm aanm force-pushed the pr/fix-aws-cni branch from 9ecff7d to 11a66a4 Compare June 1, 2022 22:44
@aanm aanm closed this Jun 2, 2022
@aanm aanm reopened this Jun 2, 2022
@aanm aanm closed this Jun 2, 2022
@aanm aanm reopened this Jun 2, 2022
@aanm aanm force-pushed the pr/fix-aws-cni branch from 11a66a4 to 72edddf Compare June 2, 2022 11:23
@aanm aanm marked this pull request as ready for review June 2, 2022 11:23
@aanm aanm requested review from a team as code owners June 2, 2022 11:23
@aanm aanm requested a review from pchaigno June 2, 2022 11:23
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we run this at least 5 times before re-enabling? Twice won't catch many flakes.

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to fix the v1.11 workflow.

@@ -155,6 +155,8 @@ jobs:
--helm-set=hubble.relay.image.tag=${SHA} \
--helm-set=enableIPv4Masquerade=false \
--helm-set=cni.chainingMode=aws-cni \
--helm-set=eni.enabled=false \
--helm-set=ipam.mode=cluster-pool \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a comment here to explain why setting what looks like the default value is necessary?

Commit 2d7af3a (".github: add support for cilium-cli in aws-cni
conformance tests") changed the AWS-CNI workflow to install Cilium using
the CLI instead of Helm. All of the Helm flags were passed through the
CLI using --helm-set.

However, the CLI also does it's own magic and isn't aware that we want
to install Cilium in chaining mode. Therefore, it detects EKS and
incorrectly sets IPAM mode to ENI.

As a result, Cilium attempts to setup the IP rules and routes for new
pods. It seems to fail in most cases—maybe because AWS-CNI already setup
the state—but sometimes succeeds. When it succeeds, pods end up with an
egress rule pointing to a non-existing ip route table.

This usually surfaces in the workflow runs as a DNS failure: the client
pod fails to egress to the DNS backend on a remote node. In rarer cases,
it surfaces as a failing connectivity test for some of the pods.

This commit fixes it by overwriting some of the helm flags set
automatically by Cilium-cli namely "eni.enabled=false" and
"ipam=cluster-pool".

The longer-term fix would be to support chaining mode in the CLI.

Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
@aanm aanm force-pushed the pr/fix-aws-cni branch from 72edddf to 139125d Compare June 2, 2022 14:42
@aanm aanm requested a review from pchaigno June 2, 2022 14:42
@aanm aanm closed this Jun 2, 2022
@aanm aanm reopened this Jun 2, 2022
@aanm
Copy link
Member Author

aanm commented Jun 2, 2022

@pchaigno it failed here is this the same flake as before?

@aanm aanm closed this Jun 2, 2022
@pchaigno
Copy link
Member

pchaigno commented Jun 2, 2022

@aanm You know you can retrigger a run of the workflow in the top right corner of the previous run, without having to close and open the PR, right?

@aanm
Copy link
Member Author

aanm commented Jun 2, 2022

@aanm You know you can retrigger a run of the workflow in the top right corner of the previous run, without having to close and open the PR, right?

Where's the fun in that? 😄

@aanm aanm force-pushed the pr/fix-aws-cni branch from 139125d to 61a3cda Compare June 2, 2022 21:20
@pchaigno
Copy link
Member

pchaigno commented Jun 2, 2022

@pchaigno it failed here is this the same flake as before?

That's a different flake. We should file an issue for it. Not sure how frequent, but if it is frequent enough, then maybe we shouldn't reenable.

It's a very weird one: policy denied drops on egress for pod-to-pod traffic on the same node, with correct identities, and a correct policy map at the time we collect the sysdump.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake area/eni Impacts ENI based IPAM. integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants