fix aws-cni conformance test #20049
Conversation
Could we run this at least 5 times before re-enabling? Twice won't catch many flakes.
We also need to fix the v1.11 workflow.
@@ -155,6 +155,8 @@ jobs:
 --helm-set=hubble.relay.image.tag=${SHA} \
 --helm-set=enableIPv4Masquerade=false \
 --helm-set=cni.chainingMode=aws-cni \
+--helm-set=eni.enabled=false \
+--helm-set=ipam.mode=cluster-pool \
Can we have a comment here to explain why setting what looks like the default value is necessary?
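A minimal sketch of how the install step with such an inline comment could look; the exact cilium install invocation used by the workflow and the wording of the comment are assumptions based only on the diff above:

    # The Cilium CLI auto-detects EKS and defaults to ENI IPAM. In AWS-CNI
    # chaining mode, AWS-CNI owns IPAM and pod routing, so the chaining-safe
    # values have to be forced back explicitly.
    cilium install \
      --helm-set=cni.chainingMode=aws-cni \
      --helm-set=eni.enabled=false \
      --helm-set=ipam.mode=cluster-pool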
@aanm You know you can re-trigger a run of the workflow from the top right corner of the previous run's page, without having to close and reopen the PR, right?
Where's the fun in that? 😄
That's a different flake. We should file an issue for it. I'm not sure how frequent it is, but if it's frequent enough, then maybe we shouldn't re-enable. It's a very weird one: policy-denied drops on egress for pod-to-pod traffic on the same node, with correct identities and a correct policy map at the time we collect the sysdump.
Commit 2d7af3a (".github: add support for cilium-cli in aws-cni
conformance tests") changed the AWS-CNI workflow to install Cilium using
the CLI instead of Helm. All of the Helm flags were passed through the
CLI using --helm-set.
However, the CLI also does its own magic and isn't aware that we want
to install Cilium in chaining mode. Therefore, it detects EKS and
incorrectly sets IPAM mode to ENI.
As a result, Cilium attempts to set up the IP rules and routes for new
pods. It seems to fail in most cases, maybe because AWS-CNI already set
up the state, but sometimes succeeds. When it succeeds, pods end up with
an egress rule pointing to a non-existent IP route table.
This usually surfaces in the workflow runs as a DNS failure: the client
pod fails to egress to the DNS backend on a remote node. In rarer cases,
it surfaces as a failing connectivity test for some of the pods.
This commit fixes it by overriding some of the Helm flags set
automatically by the Cilium CLI, namely "eni.enabled=false" and
"ipam.mode=cluster-pool".
The longer-term fix would be to support chaining mode in the CLI.
Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
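For context, a rough way to spot the broken state described above on an affected node; ip rule and ip route are standard iproute2 commands, but the specific rule priority, pod IP, and table number below are hypothetical examples:

    # List the policy routing rules on the node; per-pod egress rules look
    # roughly like "from <pod IP> lookup <table>".
    ip rule list
    # For an egress rule such as "111: from 10.0.1.23 lookup 2005" (hypothetical),
    # check whether the referenced routing table actually has any routes:
    ip route show table 2005
    # An egress rule that points at an empty or missing table matches the
    # dangling state that surfaces as the DNS and connectivity failures above.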