-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Revert ".github: add support for cilium-cli in aws-cni conformance tests" #19994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
29645df
to
efbe0c4
Compare
…sts" This reverts commit 2d7af3a. Commit 2d7af3a (".github: add support for cilium-cli in aws-cni conformance tests") changed the AWS-CNI workflow to install Cilium using the CLI instead of Helm. All of the Helm flags were passed through the CLI using --helm-set. However, the CLI also does it's own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets IPAM mode to ENI. As a result, Cilium attempts to setup the IP rules and routes for new pods. It seems to fail in most cases—maybe because AWS-CNI already setup the state—but sometimes succeeds. When it succeeds, pods end up with an egress rule pointing to a non-existing ip route table. This usually surfaces in the workflow runs as a DNS failure: the client pod fails to egress to the DNS backend on a remote node. In rarer cases, it surfaces as a failing connectivity test for some of the pods. This commit fixes it by reverting 2d7af3a. We unfortunately can't simply set IPAM mode to cluster-pool (default) explicitly because the CLI also uses the wrong operator image (operator-aws instead of operator-generic). The longer-term fix would be to support chaining mode in the CLI. Signed-off-by: Paul Chaignon <paul@cilium.io>
efbe0c4
to
7276654
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the CLI also does it's own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets IPAM mode to ENI.
cilium-cli >= v1.11 allows to overwrite any default using the helm-set
command therefore we could set the right ipam mode that we wish with --helm-set ipam: kubernetes
I tried that. From commit message and PR description:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching to Helm based installation LGTM, just a couple of questions
@@ -66,7 +66,7 @@ concurrency: | |||
env: | |||
clusterName: ${{ github.repository_owner }}-${{ github.event.repository.name }}-${{ github.run_id }} | |||
region: us-east-2 | |||
cilium_cli_version: v0.11.7 | |||
cilium_cli_version: v0.10.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to downgrade the CLI to 0.10
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure later versions will work without the set-helm
flags. I'm sticking to the revert rather than risking breaking other things.
@@ -290,6 +281,13 @@ jobs: | |||
- name: Post-test information gathering | |||
if: ${{ !success() }} | |||
run: | | |||
echo "=== Install latest stable CLI ===" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed given that we are already installing the CLI in a previous step (which is also related to my other comment about downgrading it to 0.10
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point. Missed that part of the revert I guess.
I'd rather improve as a followup PR (which will be needed anyway) than to have to rerun the tests 10 times here, again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the CLI also does it's own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets IPAM mode to ENI.
cilium-cli >= v1.11 allows to overwrite any default using the
helm-set
command therefore we could set the right ipam mode that we wish with--helm-set ipam: kubernetes
I tried that. From commit message and PR description:
We unfortunately can't simply set IPAM mode to cluster-pool (default) explicitly because the CLI also uses the wrong operator image (operator-aws instead of operator-generic).
yes we can. --operator-image=quay.io/cilium/operator-generic:${SHA}
should set it to the operator generic
Commit 2d7af3a (".github: add support for cilium-cli in aws-cni conformance tests") changed the AWS-CNI worflow to install Cilium using the CLI instead of Helm. All of the Helm flags were passed through the CLI using
--helm-set
.However, the CLI also does it's own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets IPAM mode to ENI.
As a result, Cilium attempts to setup the IP rules and routes for new pods. It seems to fail in most cases—maybe because AWS-CNI already setup the state—but sometimes succeeds. When it succeeds, pods end up with an egress rule pointing to a non-existing ip route table.
This usually surfaces in the workflow runs as a DNS failure: the client pod fails to egress to the DNS backend on a remote node. In rarer cases, it surfaces as a failing connectivity test for some of the pods.
This pull request fixes it by reverting #19555. We unfortunately can't simply set IPAM mode to cluster-pool (default) explicitly because the CLI also uses the wrong operator image (operator-aws instead of operator-generic).
The longer-term fix would be to support chaining mode in the CLI (cf. cilium/cilium-cli#461).
I tested this revert by running the workflow with the test commit ten times at https://github.com/cilium/cilium/runs/6644709056?check_suite_focus=true.