
Conversation

joestringer
Member

@joestringer joestringer commented Aug 5, 2025

Requires #40935
Requires #41191

The goal here is to ensure that in the event of a failure condition that
may lead to the CLI timing out (such as every test failing for the same
reason, or some fundamental issue in the infrastructure), the CLI will
terminate early, gather a sysdump, and return before GitHub timeouts
kick in. This should ensure that even in such adverse situations,
developers will have test results they can consult to identify what went
wrong in a workflow.

Taking the e2e-upgrade workflow for example, this runs twice within the
setup-and-test job, along with a bunch of other logic to pull images,
provision a VM, install, upgrade, downgrade and so on. Even with all of
those other steps, it should be possible for two full runs of the CLI
connectivity tests to complete and report failures within the 55m timer.

Birol reports the following timeouts for upper quartile mean (min:sec)
of workflow step run time after removing the upper/lower 5% of results:

| workflow_name                                  | upper_quartile_mean_dur  |
|------------------------------------------------|--------------------------|
| Conformance GKE (ci-gke)                       | 25:00                    |
| Conformance EKS (ci-eks)                       | 14:48                    |
| CiliumEndpointSlice migration (ci-ces-migrate) | 14:34                    |
| Conformance AKS (ci-aks)                       | 12:25                    |
| Conformance Cluster Mesh (ci-clustermesh)      | 10:43                    |
| Conformance Kind Envoy Embedded                | 10:41                    |
| Conformance AWS-CNI (ci-awscni)                | 09:59                    |
| Cilium E2E Upgrade (ci-e2e-upgrade)            | 09:06                    |
| Conformance IPsec E2E (ci-ipsec-e2e)           | 08:38                    |
| Cilium IPsec upgrade (ci-ipsec-upgrade)        | 08:00                    |
| Conformance Kubespray (ci-kubespray)           | 01:57                    |
| Cilium Cluster Mesh upgrade (ci-clustermesh)   | 01:24                    |

There is another submission out for review that will lower the GKE
runtime to match the other run times by introducing concurrency of test
runs, so I have excluded that case from this calculation. Hence setting
a 20m timeout seems reasonable to catch the vast majority of test runs.
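For illustration, one way such a cap could be expressed is a step-level `timeout-minutes` nested under the job-level limit. This is only a hedged sketch, not necessarily the exact change in this PR; the job and step names are hypothetical, while the CLI flags shown are taken from the commands logged later in this thread:

```yaml
# Sketch only: cap the connectivity-test step well below the 55m job timeout so
# that a pathological failure mode still leaves time for sysdump collection and
# artifact upload before GitHub cancels the job.
jobs:
  setup-and-test:
    runs-on: ubuntu-latest
    timeout-minutes: 55            # overall GitHub job timeout
    steps:
      - name: Run connectivity test
        timeout-minutes: 20        # terminate the CLI early on fundamental failures
        run: |
          cilium connectivity test \
            --collect-sysdump-on-failure \
            --test-concurrency=5
```

An alternative with a similar effect is wrapping the command itself, e.g. `timeout -s INT 20m cilium connectivity test ...`, so the CLI receives an interrupt and can still gather sysdumps before exiting; which mechanism the workflows actually use is not shown here.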

Suggested-by: @brlbil 

@joestringer joestringer added the release-note/ci This PR makes changes to the CI. label Aug 5, 2025
@joestringer joestringer added the dont-merge/blocked Another PR must be merged before this one. label Aug 5, 2025
@joestringer joestringer changed the title workflows/e2e: Apply connectivity test timeout at 15m github: Apply connectivity test timeout at 15m Aug 5, 2025
@joestringer joestringer force-pushed the pr/joe/reusable-test-cfg branch from 58f83d3 to bb22d6e on August 5, 2025 17:48
@joestringer joestringer force-pushed the pr/joe/cli-timeouts branch from f450230 to 74afdc7 on August 5, 2025 17:48
@joestringer
Member Author

/test

@joestringer joestringer force-pushed the pr/joe/reusable-test-cfg branch from bb22d6e to d6049f8 on August 11, 2025 22:16
Base automatically changed from pr/joe/reusable-test-cfg to main August 13, 2025 12:30
@joestringer
Member Author

/test

@joestringer joestringer removed the dont-merge/blocked Another PR must be merged before this one. label Aug 13, 2025
@joestringer joestringer marked this pull request as ready for review August 13, 2025 14:54
@joestringer joestringer requested review from a team as code owners August 13, 2025 14:54
@joestringer joestringer requested a review from brlbil August 13, 2025 14:54
@brlbil
Contributor

brlbil commented Aug 13, 2025

At the moment, the full CLI testsuite takes around 8 minutes. Rounding
to 10 and adding a 50% margin brings the number up to 15m, so set this
as the CLI timeout.

@joestringer Just curious, where did you collect this information? I checked a couple of runs and it seems there are runs that take more than 15m. For example, https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122484

@joestringer
Member Author

I must admit this was mostly just eyeballing a bunch of workflows, so thanks for sharing that counter-example. Do you have any thoughts on a more rigorous approach to come up with a value?

I was hoping that 15m was far enough out, because I recall prior conversations where we expected that CI results should be returned in around that amount of time just to give developers feedback quickly. But evidently clouds can add a bit of variation to the reliability of this. Even in the workflow you shared, the other three test runs completed in ~10m, so I wonder whether there's a reason this particular run took 50% longer. Of course, the other issue with stricter timeouts is that they can themselves become a source of flakes when the tests take longer than expected but would otherwise pass 🤔

The naive solution is to bump this to 20m for now, which would at least catch the case where a serious issue causes the CLI to block for the full 55m GitHub workflow timeout.

@joestringer
Member Author

joestringer commented Aug 13, 2025

Even in the workflow you shared, the other three test runs completed in ~10m so I wonder whether there's a reason this particular run took 50% longer.

At a glance, it's the initial deployment time:

good: https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122501

Wed, 13 Aug 2025 08:11:57 GMT + docker run --network host -v /tmp/tmp.JBkgwCrG9t/control-plane-kubeconfig.yaml:/root/.kube/config -v /home/runner/work/cilium/cilium:/root/app -v /home/runner/.aws:/root/.aws -v /home/runner/.azure:/root/.azure -v /home/runner/.config/gcloud:/root/.config/gcloud -e GITHUB_WORKFLOW_REF=cilium/cilium/.github/workflows/conformance-aks.yaml@refs/heads/main quay.io/cilium/cilium-cli-ci:ae55131fe8c9e7fdb8c192983e6473c50d629fc9 cilium connectivity test --flow-validation=disabled --log-code-owners --code-owners=CODEOWNERS --exclude-code-owners=@cilium/github-sec --hubble=false --collect-sysdump-on-failure --external-target bing.com. --external-cidr 8.0.0.0/8 --external-ip 8.8.4.4 --external-other-ip 8.8.8.8 --test-concurrency=5 --test '!seq-.*' --junit-file 'cilium-junits/Installation and Connectivity Test-concurrent-clear (1.30, westus3, 1).xml' --junit-property 'github_job_step=Run connectivity test (1.30, westus3, 1)'
...
Wed, 13 Aug 2025 08:12:32 GMT
🏃[cilium-test-1] Running 20 tests ...

bad: https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122484#step:19:35

Wed, 13 Aug 2025 08:13:44 GMT + docker run --network host -v /tmp/tmp.2JVaYa17d7/control-plane-kubeconfig.yaml:/root/.kube/config -v /home/runner/work/cilium/cilium:/root/app -v /home/runner/.aws:/root/.aws -v /home/runner/.azure:/root/.azure -v /home/runner/.config/gcloud:/root/.config/gcloud -e GITHUB_WORKFLOW_REF=cilium/cilium/.github/workflows/conformance-aks.yaml@refs/heads/main quay.io/cilium/cilium-cli-ci:ae55131fe8c9e7fdb8c192983e6473c50d629fc9 cilium connectivity test --flow-validation=disabled --log-code-owners --code-owners=CODEOWNERS --exclude-code-owners=@cilium/github-sec --hubble=false --collect-sysdump-on-failure --external-target bing.com. --external-cidr 8.0.0.0/8 --external-ip 8.8.4.4 --external-other-ip 8.8.8.8 --test-concurrency=5 --test '!seq-.*' --junit-file 'cilium-junits/Installation and Connectivity Test-concurrent-clear (1.31, westus2, 2).xml' --junit-property 'github_job_step=Run connectivity test (1.31, westus2, 2)'
...
Wed, 13 Aug 2025 08:18:21 GMT
🏃[cilium-test-1] Running 20 tests ...

@brlbil
Contributor

brlbil commented Aug 13, 2025

I will collect run data and come up with a number.

@joestringer joestringer added the dont-merge/waiting-for-user-feedback Waiting for feedback from user before the PR should be merged. label Aug 13, 2025
@brlbil
Contributor

brlbil commented Aug 14, 2025

It took some doing. I filtered all the runs for the workflows that were changed in #40935, and came up with the longest-running connectivity run per workflow. I have to say it is not what I expected. These results are for the last 40 days.

| job_id | run_id | workflow_name | step_name | step_started_at | step_completed_at | duration_seconds | duration_human |
|--------|--------|---------------|-----------|-----------------|-------------------|------------------|----------------|
| 46460255964 | 16440446977 | Conformance EKS (ci-eks) | Run connectivity test (1.32, ca-central-1, true, true, true, true) | 2025-07-22 09:56:03 | 2025-07-22 11:09:49 | 4426 | 01:13:46 |
| 47006263165 | 16615191107 | Conformance IPsec E2E (ci-ipsec-e2e) | Run concurrent tests (ipsec-7, 5.15-20250721.013324, iptables, false, vxlan, ipv6, ipsec, rfc4106-gcm-aes, rfc4106-gcm-aes, true) | 2025-07-30 06:36:29 | 2025-07-30 07:45:52 | 4163 | 01:09:23 |
| 47268234634 | 16699372505 | Conformance GKE (ci-gke) | Run connectivity test (1.33, 3, ipsec) | 2025-08-03 01:03:10 | 2025-08-03 02:09:28 | 3978 | 01:06:18 |
| 47145783434 | 16657399678 | Conformance Cluster Mesh (ci-clustermesh) | Run connectivity test (5, disabled, dual, ipsec, iptables, external, helm, 255, disabled, true, disabled) | 2025-07-31 18:53:09 | 2025-07-31 19:50:23 | 3434 | 00:57:14 |
| 47163577282 | 16662850243 | Conformance Kind Envoy Embedded | Run connectivity test | 2025-08-01 00:23:00 | 2025-08-01 01:16:55 | 3235 | 00:53:55 |
| 46807821574 | 16551838600 | Cilium E2E Upgrade (ci-e2e-upgrade) | Run concurrent Cilium tests | 2025-07-27 13:56:50 | 2025-07-27 14:45:56 | 2946 | 00:49:06 |
| 47016754371 | 16618445014 | Conformance AKS (ci-aks) | Run concurrent connectivity test with IPSec (1.31, eastus, 3, true) | 2025-07-30 09:41:05 | 2025-07-30 10:19:07 | 2282 | 00:38:02 |
| 47383553552 | 16738965222 | Conformance AWS-CNI (ci-awscni) | Run connectivity test (1.32, ca-central-1, true, true) | 2025-08-05 02:48:08 | 2025-08-05 03:18:03 | 1795 | 00:29:55 |
| 46504462337 | 16453317892 | Cilium IPsec upgrade (ci-ipsec-upgrade) | Run sequential tests after upgrading (ipsec-9, 6.12-20250721.013324, none, true, {eth0,eth1}, true, vxlan, false, ipv6, ipsec, true, patch) | 2025-07-22 19:26:19 | 2025-07-22 19:54:11 | 1672 | 00:27:52 |
| 47376814689 | 16736590400 | Conformance Kubespray (ci-kubespray) | Run connectivity test | 2025-08-04 23:45:15 | 2025-08-05 00:02:13 | 1018 | 00:16:58 |
| 46617603991 | 16488356163 | CiliumEndpointSlice migration (ci-ces-migrate) | Run tests after migration | 2025-07-24 05:14:26 | 2025-07-24 05:29:08 | 882 | 00:14:42 |
| 48011332848 | 16941296306 | Cilium Cluster Mesh upgrade (ci-clustermesh) | Run connectivity test - post-upgrade (2, disabled, none, false, 511, migration) | 2025-08-13 15:19:42 | 2025-08-13 15:29:05 | 563 | 00:09:23 |

@joestringer
Member Author

@brlbil could you share the query as well? One thing in particular I'm curious about is which event triggered these. For instance, part of the motivation for this PR was a case where a PR run took around 55m because something was really broken in the PR; instead of the CLI terminating early, the GitHub workflow timeout terminated the CLI and we didn't end up with any sysdump at the end for further investigation.

Something else that might help is to transform this table into a set of links so we can dive a bit deeper. I feel like there's no legitimate reason for these tests to take more than 20m today, but it would be nice to qualify that feeling with some more data from these cases that do take longer.

Something else I'm wondering about is splitting the deployment vs. test into two steps in the CLI. As per my last comment, we can see that deploying the pods could take 1m or 5m (or maybe longer, who knows?) -- I'd be fine with adding a separate timeout for that provisioning step, and that might help to better analyze what the underlying problem is: an issue with the production code, with the test code, or with the underlying cloud infra taking a long time to provision due to the cloud control plane.
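As a rough illustration of that split, approximated at the workflow level rather than inside the CLI itself (the step names, the namespace, and the use of `kubectl wait` as the provisioning check are all hypothetical; the CLI flags come from the commands logged above):

```yaml
      # Hypothetical: give test-workload deployment its own budget, separate from
      # the test run, to distinguish slow cloud provisioning from real product or
      # test failures.
      - name: Wait for connectivity test workloads
        timeout-minutes: 10
        run: |
          kubectl -n cilium-test-1 wait --for=condition=Ready pods --all --timeout=10m

      - name: Run connectivity test
        timeout-minutes: 20
        run: |
          cilium connectivity test --collect-sysdump-on-failure --test-concurrency=5
```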

@joestringer
Member Author

joestringer commented Aug 14, 2025

OK, transforming the first row into a link, we get this workflow run, which is indeed for a PR; so these results may include runs where the CLI did not complete because of a change being made in the PR:

https://github.com/cilium/cilium/actions/runs/16440446977/job/46459839321#step:2:16

Looking deeper at the target job, here's the first failure at 10:05:

https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964#step:21:318

Many tests are failing, final one around 11:10:

https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964#step:21:6356

That is to say, at least this first example run in the table shared above is exactly the target case I am aiming to force to fail early: if the tests aren't successful within 10-20m, that has a high likelihood of being caused by unrecoverable errors (for instance a serious code bug). Something so serious is wrong that the failures in the first 20m should be fixed before attempting to proceed with testing; the additional test runs generate a lot more data about what's wrong, but there is probably already plenty to indicate the failure. Better in this case to fail out early and let the developer know they need to make a change than to let the full suite run.

@joestringer
Member Author

One small comment for the EKS workflow above is that it did in fact gather the sysdumps from the environment, so the case I was concerned about, where the sysdumps are not gathered because GitHub terminates the workflow early, might only affect some of the workflows. At a glance, EKS splits artifact gathering into a separate job, so that job is not affected by the timeout for the tests. That's probably a pattern we should repeat for some of the other workflows (a rough sketch follows below).

Regardless, I think there is still merit to this idea that we should have an upper expected bound on how long we wait for the tests to complete, and terminate early if something is so fundamentally broken that all tests are failing & causing those failure results to take 1h+ to report back.
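A minimal sketch of that separate artifact-gathering job pattern, assuming hypothetical job names, output paths, and elided cluster credentials; the relevant pieces are `needs:` plus `if: always()`, so the job runs even when the test job fails or times out, and since it runs on a fresh runner it gathers the sysdump from the still-live cluster itself rather than relying on files left behind by the test job:

```yaml
  # Hypothetical "separate sysdump job" sketch; kubeconfig/credential setup for
  # the target cluster is omitted.
  sysdump:
    needs: setup-and-test
    if: always()                   # run even if the test job failed or timed out
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Gather sysdump from the cluster
        run: |
          cilium sysdump --output-filename cilium-sysdump-final
      - name: Upload sysdump
        uses: actions/upload-artifact@v4
        with:
          name: cilium-sysdump
          path: cilium-sysdump-final.zip
```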

@brlbil
Contributor

brlbil commented Aug 14, 2025

Quite a few comments 😄

Here is the data.zip that has the related steps data and the SQL query that was used to calculate the longest-running test. You can use duckdb to make queries. I have the complete runs file that I extracted the steps data from, but it is a bit too big to upload here.

Regardless, I think there is still merit to this idea that we should have an upper expected bound on how long we wait for the tests to complete, and terminate early if something is so fundamentally broken that all tests are failing & causing those failure results to take 1h+ to report back.

I agree with you that tests that take too much time should fail to indicate a problem there.

@joestringer
Member Author

joestringer commented Aug 14, 2025

Just taking the results from the table above:

$ tail -n+3 top-times.txt | awk -F'|' '{ print "https://github.com/cilium/cilium/actions/runs/"$3"/job/"$2 }' | sed 's/ //g'

https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964
https://github.com/cilium/cilium/actions/runs/16615191107/job/47006263165
https://github.com/cilium/cilium/actions/runs/16699372505/job/47268234634
https://github.com/cilium/cilium/actions/runs/16657399678/job/47145783434
https://github.com/cilium/cilium/actions/runs/16662850243/job/47163577282
https://github.com/cilium/cilium/actions/runs/16551838600/job/46807821574
https://github.com/cilium/cilium/actions/runs/16618445014/job/47016754371
https://github.com/cilium/cilium/actions/runs/16738965222/job/47383553552
https://github.com/cilium/cilium/actions/runs/16453317892/job/46504462337
https://github.com/cilium/cilium/actions/runs/16736590400/job/47376814689
https://github.com/cilium/cilium/actions/runs/16488356163/job/46617603991
https://github.com/cilium/cilium/actions/runs/16941296306/job/48011332848

Of these, here's the anomaly:
https://github.com/cilium/cilium/actions/runs/16618445014/job/47016754371 - Passed after 40m. Bizarrely slow, run was on AKS. EDIT: Actually it looks like the concurrency argument is missing - this is a real bug. Fix: #41161

CES migration test could also benefit from enabling test concurrency. EDIT: Proposal - #41162

The rest of the workflows either took <10m, or had many failures such that if we set a 15m or 20m timeout I don't think we would lose fidelity.

@brlbil
Contributor

brlbil commented Aug 15, 2025

@joestringer I have done some filtering: removed outliers below 5% and above 95%, and took the upper quartile mean.
These are the results; they show the average of the larger values, and they confirm your observation.

We could keep the 15m default and maybe override the value for EKS (20m) and GKE (30m). What do you think?

| workflow_name                                  | upper_quartile_mean_dur (min:sec) |
|------------------------------------------------|-----------------------------------|
| Conformance GKE (ci-gke)                       | 25:00                             |
| Conformance EKS (ci-eks)                       | 14:48                             |
| CiliumEndpointSlice migration (ci-ces-migrate) | 14:34                             |
| Conformance AKS (ci-aks)                       | 12:25                             |
| Conformance Cluster Mesh (ci-clustermesh)      | 10:43                             |
| Conformance Kind Envoy Embedded                | 10:41                             |
| Conformance AWS-CNI (ci-awscni)                | 09:59                             |
| Cilium E2E Upgrade (ci-e2e-upgrade)            | 09:06                             |
| Conformance IPsec E2E (ci-ipsec-e2e)           | 08:38                             |
| Cilium IPsec upgrade (ci-ipsec-upgrade)        | 08:00                             |
| Conformance Kubespray (ci-kubespray)           | 01:57                             |
| Cilium Cluster Mesh upgrade (ci-clustermesh)   | 01:24                             |

@joestringer
Member Author

@brlbil I have a counter-proposal: Fix the reason why GKE is taking so long (#41191) and just pick a default that works for everything like 20m.

@joestringer joestringer changed the title github: Apply connectivity test timeout at 15m github: Apply connectivity test timeout at 20m Aug 15, 2025
@joestringer joestringer added dont-merge/blocked Another PR must be merged before this one. and removed dont-merge/waiting-for-user-feedback Waiting for feedback from user before the PR should be merged. labels Aug 15, 2025
@joestringer
Member Author

Incidentally, EKS might be taking longer because it also runs the tests inconsistently compared to the other workflows: it doesn't only run the concurrent tests, but all tests including the sequential ones.
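For context, the concurrent-only runs in the logs above exclude sequential tests with the `--test '!seq-.*'` filter; aligning EKS with that pattern might look roughly like the sketch below (step name hypothetical, flags taken from the logged commands):

```yaml
      - name: Run concurrent connectivity tests
        timeout-minutes: 20
        run: |
          # Exclude sequential tests, matching the filter used by the other
          # workflows' concurrent runs (see the AKS commands logged earlier).
          cilium connectivity test \
            --test-concurrency=5 \
            --test '!seq-.*' \
            --collect-sysdump-on-failure
```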

Contributor

@brlbil brlbil left a comment

@brlbil I have a counter-proposal: Fix the reason why GKE is taking so long (#41191) and just pick a default that works for everything like 20m.

Thanks, this is better.

@joestringer
Member Author

joestringer commented Aug 15, 2025

/test

EDIT: Woops, forgot to rebase. There's been a bunch of churn on the workflows so this is important to run with the latest workflow definitions.

@joestringer
Member Author

/test

@joestringer joestringer marked this pull request as draft August 27, 2025 00:42
@joestringer joestringer removed the dont-merge/blocked Another PR must be merged before this one. label Aug 27, 2025