Revert ".github: add support for cilium-cli in aws-cni conformance tests" #19994


Closed
wants to merge 1 commit

Conversation

pchaigno (Member)

@pchaigno pchaigno commented May 28, 2022

Commit 2d7af3a (".github: add support for cilium-cli in aws-cni conformance tests") changed the AWS-CNI workflow to install Cilium using the CLI instead of Helm. All of the Helm flags were passed through the CLI using --helm-set.

However, the CLI also does its own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets the IPAM mode to ENI.

As a result, Cilium attempts to set up the IP rules and routes for new pods. It seems to fail in most cases—maybe because AWS-CNI already set up the state—but sometimes succeeds. When it succeeds, pods end up with an egress rule pointing to a non-existent IP route table.
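For concreteness, this is roughly how the broken state could be inspected on an affected node. The rule priority, pod IP, and table number below are invented for illustration and not taken from an actual run:

```shell
# Illustrative only: a pod egress rule left behind by Cilium, pointing
# at a routing table that was never created. Priority (111), source IP,
# and table ID (11) are made-up example values.
ip rule show
#   111:  from 192.168.34.11 lookup 11
ip route show table 11
#   Error: ipv4: FIB table does not exist.
```

When the referenced table is missing, egress traffic matching the rule is dropped, which matches the DNS failures seen in the workflow runs.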

This usually surfaces in the workflow runs as a DNS failure: the client pod fails to egress to the DNS backend on a remote node. In rarer cases, it surfaces as a failing connectivity test for some of the pods.

This pull request fixes it by reverting #19555. We unfortunately can't simply set the IPAM mode to cluster-pool (the default) explicitly, because the CLI also uses the wrong operator image (operator-aws instead of operator-generic).
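For reference, the plain Helm chaining install that this revert restores looks roughly like the following. The chart values are taken from the general AWS-CNI chaining documentation for Cilium 1.11, not from this workflow file, so treat them as an approximation:

```shell
# Sketch of a Helm-based AWS-CNI chaining install (values assumed from
# the public chaining guide, not copied from the reverted workflow).
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled
```

With a plain Helm install there is no EKS auto-detection, so the IPAM mode stays at its chart default instead of being forced to ENI.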

The longer-term fix would be to support chaining mode in the CLI (cf. cilium/cilium-cli#461).

I tested this revert by running the workflow with the test commit ten times at https://github.com/cilium/cilium/runs/6644709056?check_suite_focus=true.

@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. labels May 28, 2022
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-aws-cni-workflow branch 2 times, most recently from 29645df to efbe0c4 on May 29, 2022 at 09:51
@pchaigno pchaigno changed the title workflows: Fix IPAM mode in AWS-CNI workflow Revert ".github: add support for cilium-cli in aws-cni conformance tests" May 29, 2022
Revert ".github: add support for cilium-cli in aws-cni conformance tests"

This reverts commit 2d7af3a.

Commit 2d7af3a (".github: add support for cilium-cli in aws-cni
conformance tests") changed the AWS-CNI workflow to install Cilium using
the CLI instead of Helm. All of the Helm flags were passed through the
CLI using --helm-set.

However, the CLI also does its own magic and isn't aware that we want
to install Cilium in chaining mode. Therefore, it detects EKS and
incorrectly sets the IPAM mode to ENI.

As a result, Cilium attempts to set up the IP rules and routes for new
pods. It seems to fail in most cases—maybe because AWS-CNI already set
up the state—but sometimes succeeds. When it succeeds, pods end up with
an egress rule pointing to a non-existent IP route table.

This usually surfaces in the workflow runs as a DNS failure: the client
pod fails to egress to the DNS backend on a remote node. In rarer cases,
it surfaces as a failing connectivity test for some of the pods.

This commit fixes it by reverting 2d7af3a. We unfortunately can't
simply set the IPAM mode to cluster-pool (the default) explicitly,
because the CLI also uses the wrong operator image (operator-aws
instead of operator-generic).

The longer-term fix would be to support chaining mode in the CLI.

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-aws-cni-workflow branch from efbe0c4 to 7276654 on May 29, 2022 at 17:35
@pchaigno pchaigno marked this pull request as ready for review May 30, 2022 09:16
@pchaigno pchaigno requested review from a team as code owners May 30, 2022 09:16
@pchaigno pchaigno requested review from joestringer and nebril May 30, 2022 09:16
aanm (Member) previously requested changes May 30, 2022

@aanm aanm left a comment

> However, the CLI also does its own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets IPAM mode to ENI.

cilium-cli >= v0.11 allows overwriting any default via --helm-set, so we could set the IPAM mode we want with --helm-set ipam.mode=kubernetes.

@pchaigno
Member Author

> However, the CLI also does its own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets IPAM mode to ENI.

> cilium-cli >= v0.11 allows overwriting any default via --helm-set, so we could set the IPAM mode we want with --helm-set ipam.mode=kubernetes.

I tried that. From the commit message and PR description:

> We unfortunately can't simply set IPAM mode to cluster-pool (default) explicitly because the CLI also uses the wrong operator image (operator-aws instead of operator-generic).

@pchaigno pchaigno dismissed aanm’s stale review May 30, 2022 16:35

Explained in the commit message already.

Member

@jibi jibi left a comment

Switching to a Helm-based installation LGTM; just a couple of questions.

@@ -66,7 +66,7 @@ concurrency:
 env:
   clusterName: ${{ github.repository_owner }}-${{ github.event.repository.name }}-${{ github.run_id }}
   region: us-east-2
-  cilium_cli_version: v0.11.7
+  cilium_cli_version: v0.10.4
Member

Why do we need to downgrade the CLI to 0.10?

Member Author

I'm not sure later versions will work without the --helm-set flags. I'm sticking to the revert rather than risking breaking other things.

@@ -290,6 +281,13 @@ jobs:
   - name: Post-test information gathering
     if: ${{ !success() }}
     run: |
+      echo "=== Install latest stable CLI ==="
Member

Why is this needed given that we are already installing the CLI in a previous step (which is also related to my other comment about downgrading it to 0.10)?

Member Author

Ah, good point. I missed that part of the revert, I guess.
I'd rather improve this in a follow-up PR (which will be needed anyway) than have to rerun the tests 10 times here, again.

Member

@aanm aanm left a comment

> > However, the CLI also does its own magic and isn't aware that we want to install Cilium in chaining mode. Therefore, it detects EKS and incorrectly sets IPAM mode to ENI.
>
> > cilium-cli >= v0.11 allows overwriting any default via --helm-set, so we could set the IPAM mode we want with --helm-set ipam.mode=kubernetes.
>
> I tried that. From the commit message and PR description:
>
> > We unfortunately can't simply set IPAM mode to cluster-pool (default) explicitly because the CLI also uses the wrong operator image (operator-aws instead of operator-generic).

Yes we can: --operator-image=quay.io/cilium/operator-generic:${SHA} should set it to the generic operator.
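Spelled out, the combination suggested here would look something like the following; the ipam.mode Helm path and the combined invocation are assumptions based on the two flags mentioned in review, and were not validated in this PR:

```shell
# Hypothetical cilium-cli invocation combining both review suggestions:
# override the auto-detected IPAM mode and force the generic operator
# image. Flag values are assumptions, not tested in this PR.
cilium install \
  --helm-set ipam.mode=kubernetes \
  --operator-image=quay.io/cilium/operator-generic:${SHA}
```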

@pchaigno pchaigno closed this May 31, 2022
@pchaigno pchaigno deleted the pr/pchaigno/fix-aws-cni-workflow branch May 31, 2022 16:30