
Conversation

joestringer
Member

@joestringer joestringer commented Aug 5, 2025

Requires #40935
Requires #41191

The goal here is to ensure that in the event of a failure condition that
may lead to the CLI timing out (such as every test failing for the same
reason, or some fundamental issue in the infrastructure), the CLI will
terminate early, gather a sysdump, and return before GitHub timeouts
kick in. This should ensure that even in such adverse situations,
developers will have test results they can consult to identify what went
wrong in a workflow.

Taking the e2e-upgrade workflow for example, this runs twice within the
setup-and-test job, along with a bunch of other logic to pull images,
provision a VM, install, upgrade, downgrade and so on. Even with all of
those other steps, it should be possible for two full runs of the CLI
connectivity tests to complete and report failures within the 55m timer.

Birol reports the following timeouts for upper quartile mean (min:sec)
of workflow step run time after removing the upper/lower 5% of results:

| workflow_name                                  | upper_quartile_mean_dur  |
|------------------------------------------------|--------------------------|
| Conformance GKE (ci-gke)                       | 25:00                    |
| Conformance EKS (ci-eks)                       | 14:48                    |
| CiliumEndpointSlice migration (ci-ces-migrate) | 14:34                    |
| Conformance AKS (ci-aks)                       | 12:25                    |
| Conformance Cluster Mesh (ci-clustermesh)      | 10:43                    |
| Conformance Kind Envoy Embedded                | 10:41                    |
| Conformance AWS-CNI (ci-awscni)                | 09:59                    |
| Cilium E2E Upgrade (ci-e2e-upgrade)            | 09:06                    |
| Conformance IPsec E2E (ci-ipsec-e2e)           | 08:38                    |
| Cilium IPsec upgrade (ci-ipsec-upgrade)        | 08:00                    |
| Conformance Kubespray (ci-kubespray)           | 01:57                    |
| Cilium Cluster Mesh upgrade (ci-clustermesh)   | 01:24                    |

There is another submission out for review that will lower the GKE
runtime to match the other run times by introducing concurrency of test
runs, so I have excluded that case from this calculation. Hence setting
a 20m timeout seems reasonable to catch the vast majority of test runs.
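For illustration, one way such a cap could be expressed is a step-level `timeout-minutes` nested under the job-level limit. This is only a hedged sketch, not necessarily the exact change in this PR; the job and step names are hypothetical, while the CLI flags shown are taken from the commands logged later in this thread:

```yaml
# Sketch only: cap the connectivity-test step well below the 55m job timeout so
# that a pathological failure mode still leaves time for sysdump collection and
# artifact upload before GitHub cancels the job.
jobs:
  setup-and-test:
    runs-on: ubuntu-latest
    timeout-minutes: 55            # overall GitHub job timeout
    steps:
      - name: Run connectivity test
        timeout-minutes: 20        # terminate the CLI early on fundamental failures
        run: |
          cilium connectivity test \
            --collect-sysdump-on-failure \
            --test-concurrency=5
```

An alternative with a similar effect is wrapping the command itself, e.g. `timeout -s INT 20m cilium connectivity test ...`, so the CLI receives an interrupt and can still gather sysdumps before exiting; which mechanism the workflows actually use is not shown here.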

Suggested-by: @brlbil 

@joestringer joestringer added the release-note/ci This PR makes changes to the CI. label Aug 5, 2025
@joestringer joestringer added the dont-merge/blocked Another PR must be merged before this one. label Aug 5, 2025
@joestringer joestringer changed the title workflows/e2e: Apply connectivity test timeout at 15m github: Apply connectivity test timeout at 15m Aug 5, 2025
@joestringer joestringer force-pushed the pr/joe/reusable-test-cfg branch from 58f83d3 to bb22d6e on August 5, 2025 17:48
@joestringer joestringer force-pushed the pr/joe/cli-timeouts branch from f450230 to 74afdc7 on August 5, 2025 17:48
@joestringer
Member Author

/test

@joestringer joestringer force-pushed the pr/joe/reusable-test-cfg branch from bb22d6e to d6049f8 on August 11, 2025 22:16
Base automatically changed from pr/joe/reusable-test-cfg to main August 13, 2025 12:30
@joestringer
Member Author

/test

@joestringer joestringer removed the dont-merge/blocked Another PR must be merged before this one. label Aug 13, 2025
@joestringer joestringer marked this pull request as ready for review August 13, 2025 14:54
@joestringer joestringer requested review from a team as code owners August 13, 2025 14:54
@joestringer joestringer requested a review from brlbil August 13, 2025 14:54
@brlbil
Contributor

brlbil commented Aug 13, 2025

At the moment, the full CLI testsuite takes around 8 minutes. Rounding
to 10 and adding a 50% margin brings the number up to 15m, so set this
as the CLI timeout.

@joestringer Just curious, where did you collect this information? I checked a couple of runs and it seems there are runs that take more than 15m. For example, https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122484

@joestringer
Member Author

I must admit this was mostly just eyeballing a bunch of workflows, so thanks for sharing that counter-example. Do you have any thoughts on a more rigorous approach to come up with a value?

I was hoping that 15m was far enough out, because I recall prior conversations where we expected that CI results should be returned in around that amount of time just to give developers feedback quickly. But evidently clouds can add a bit of variation to the reliability of this. Even in the workflow you shared, the other three test runs completed in ~10m, so I wonder whether there's a reason this particular run took 50% longer. Of course, the other issue with stricter timeouts is that they can themselves become a source of flakes when the tests take longer than expected but would otherwise pass 🤔

The naive solution is to bump this to 20m for now, which would at least catch the case where a serious issue causes the CLI to block for the full 55m GitHub workflow timeout.

@joestringer
Member Author

joestringer commented Aug 13, 2025

Even in the workflow you shared, the other three test runs completed in ~10m so I wonder whether there's a reason this particular run took 50% longer.

At a glance, it's the initial deployment time:

good: https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122501

Wed, 13 Aug 2025 08:11:57 GMT + docker run --network host -v /tmp/tmp.JBkgwCrG9t/control-plane-kubeconfig.yaml:/root/.kube/config -v /home/runner/work/cilium/cilium:/root/app -v /home/runner/.aws:/root/.aws -v /home/runner/.azure:/root/.azure -v /home/runner/.config/gcloud:/root/.config/gcloud -e GITHUB_WORKFLOW_REF=cilium/cilium/.github/workflows/conformance-aks.yaml@refs/heads/main quay.io/cilium/cilium-cli-ci:ae55131fe8c9e7fdb8c192983e6473c50d629fc9 cilium connectivity test --flow-validation=disabled --log-code-owners --code-owners=CODEOWNERS --exclude-code-owners=@cilium/github-sec --hubble=false --collect-sysdump-on-failure --external-target bing.com. --external-cidr 8.0.0.0/8 --external-ip 8.8.4.4 --external-other-ip 8.8.8.8 --test-concurrency=5 --test '!seq-.*' --junit-file 'cilium-junits/Installation and Connectivity Test-concurrent-clear (1.30, westus3, 1).xml' --junit-property 'github_job_step=Run connectivity test (1.30, westus3, 1)'
...
Wed, 13 Aug 2025 08:12:32 GMT
🏃[cilium-test-1] Running 20 tests ...

bad: https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122484#step:19:35

Wed, 13 Aug 2025 08:13:44 GMT + docker run --network host -v /tmp/tmp.2JVaYa17d7/control-plane-kubeconfig.yaml:/root/.kube/config -v /home/runner/work/cilium/cilium:/root/app -v /home/runner/.aws:/root/.aws -v /home/runner/.azure:/root/.azure -v /home/runner/.config/gcloud:/root/.config/gcloud -e GITHUB_WORKFLOW_REF=cilium/cilium/.github/workflows/conformance-aks.yaml@refs/heads/main quay.io/cilium/cilium-cli-ci:ae55131fe8c9e7fdb8c192983e6473c50d629fc9 cilium connectivity test --flow-validation=disabled --log-code-owners --code-owners=CODEOWNERS --exclude-code-owners=@cilium/github-sec --hubble=false --collect-sysdump-on-failure --external-target bing.com. --external-cidr 8.0.0.0/8 --external-ip 8.8.4.4 --external-other-ip 8.8.8.8 --test-concurrency=5 --test '!seq-.*' --junit-file 'cilium-junits/Installation and Connectivity Test-concurrent-clear (1.31, westus2, 2).xml' --junit-property 'github_job_step=Run connectivity test (1.31, westus2, 2)'
...
Wed, 13 Aug 2025 08:18:21 GMT
🏃[cilium-test-1] Running 20 tests ...

@brlbil
Contributor

brlbil commented Aug 13, 2025

I will collect run data and come up with a number.

@joestringer joestringer added the dont-merge/waiting-for-user-feedback Waiting for feedback from user before the PR should be merged. label Aug 13, 2025
@brlbil
Contributor

brlbil commented Aug 14, 2025

It took some doing. I filtered all the runs for the workflows that were changed in #40935, and came up with the longest-running connectivity run per workflow. I have to say it is not what I expected. These results are for the last 40 days.

| job_id | run_id | workflow_name | step_name | step_started_at | step_completed_at | duration_seconds | duration_human |
|--------|--------|---------------|-----------|-----------------|-------------------|------------------|----------------|
| 46460255964 | 16440446977 | Conformance EKS (ci-eks) | Run connectivity test (1.32, ca-central-1, true, true, true, true) | 2025-07-22 09:56:03 | 2025-07-22 11:09:49 | 4426 | 01:13:46 |
| 47006263165 | 16615191107 | Conformance IPsec E2E (ci-ipsec-e2e) | Run concurrent tests (ipsec-7, 5.15-20250721.013324, iptables, false, vxlan, ipv6, ipsec, rfc4106-gcm-aes, rfc4106-gcm-aes, true) | 2025-07-30 06:36:29 | 2025-07-30 07:45:52 | 4163 | 01:09:23 |
| 47268234634 | 16699372505 | Conformance GKE (ci-gke) | Run connectivity test (1.33, 3, ipsec) | 2025-08-03 01:03:10 | 2025-08-03 02:09:28 | 3978 | 01:06:18 |
| 47145783434 | 16657399678 | Conformance Cluster Mesh (ci-clustermesh) | Run connectivity test (5, disabled, dual, ipsec, iptables, external, helm, 255, disabled, true, disabled) | 2025-07-31 18:53:09 | 2025-07-31 19:50:23 | 3434 | 00:57:14 |
| 47163577282 | 16662850243 | Conformance Kind Envoy Embedded | Run connectivity test | 2025-08-01 00:23:00 | 2025-08-01 01:16:55 | 3235 | 00:53:55 |
| 46807821574 | 16551838600 | Cilium E2E Upgrade (ci-e2e-upgrade) | Run concurrent Cilium tests | 2025-07-27 13:56:50 | 2025-07-27 14:45:56 | 2946 | 00:49:06 |
| 47016754371 | 16618445014 | Conformance AKS (ci-aks) | Run concurrent connectivity test with IPSec (1.31, eastus, 3, true) | 2025-07-30 09:41:05 | 2025-07-30 10:19:07 | 2282 | 00:38:02 |
| 47383553552 | 16738965222 | Conformance AWS-CNI (ci-awscni) | Run connectivity test (1.32, ca-central-1, true, true) | 2025-08-05 02:48:08 | 2025-08-05 03:18:03 | 1795 | 00:29:55 |
| 46504462337 | 16453317892 | Cilium IPsec upgrade (ci-ipsec-upgrade) | Run sequential tests after upgrading (ipsec-9, 6.12-20250721.013324, none, true, {eth0,eth1}, true, vxlan, false, ipv6, ipsec, true, patch) | 2025-07-22 19:26:19 | 2025-07-22 19:54:11 | 1672 | 00:27:52 |
| 47376814689 | 16736590400 | Conformance Kubespray (ci-kubespray) | Run connectivity test | 2025-08-04 23:45:15 | 2025-08-05 00:02:13 | 1018 | 00:16:58 |
| 46617603991 | 16488356163 | CiliumEndpointSlice migration (ci-ces-migrate) | Run tests after migration | 2025-07-24 05:14:26 | 2025-07-24 05:29:08 | 882 | 00:14:42 |
| 48011332848 | 16941296306 | Cilium Cluster Mesh upgrade (ci-clustermesh) | Run connectivity test - post-upgrade (2, disabled, none, false, 511, migration) | 2025-08-13 15:19:42 | 2025-08-13 15:29:05 | 563 | 00:09:23 |

@joestringer
Member Author

@brlbil could you share the query as well? One thing in particular I'm curious about is which event triggered these. For instance, part of the motivation for this PR was a case where a PR run took around 55m because something was really broken in the PR; instead of the CLI terminating early, the GitHub workflow timeout terminated the CLI and we didn't end up with any sysdump at the end for further investigation.

Something else that might help is to transform this table into a set of links so we can dive a bit deeper. I feel like there's no legitimate reason for these tests to take more than 20m today, but it would be nice to qualify that feeling with some more data from these cases that do take longer.

Something else I'm wondering about is splitting the deployment vs. test into two steps in the CLI. As per my last comment, we can see that deploying the pods could take 1m or 5m (or maybe longer, who knows?) -- I'd be fine with adding a separate timeout for that provisioning step, and that might help to better analyze what the underlying problem is: an issue with the production code, with the test code, or with the underlying cloud infra taking a long time to provision due to the cloud control plane.
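As a rough illustration of that split, approximated at the workflow level rather than inside the CLI itself (the step names, the namespace, and the use of `kubectl wait` as the provisioning check are all hypothetical; the CLI flags come from the commands logged above):

```yaml
      # Hypothetical: give test-workload deployment its own budget, separate from
      # the test run, to distinguish slow cloud provisioning from real product or
      # test failures.
      - name: Wait for connectivity test workloads
        timeout-minutes: 10
        run: |
          kubectl -n cilium-test-1 wait --for=condition=Ready pods --all --timeout=10m

      - name: Run connectivity test
        timeout-minutes: 20
        run: |
          cilium connectivity test --collect-sysdump-on-failure --test-concurrency=5
```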

@joestringer
Member Author

joestringer commented Aug 14, 2025

OK, transforming the first row into a link, we get this workflow run, which is indeed for a PR; so these results may include runs where the CLI did not complete because of a change being made in the PR:

https://github.com/cilium/cilium/actions/runs/16440446977/job/46459839321#step:2:16

Looking deeper at the target job, here's the first failure at 10:05:

https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964#step:21:318

Many tests are failing, final one around 11:10:

https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964#step:21:6356

That is to say, at least this first example run in the table shared above is exactly the target case I am aiming to force to fail early: if the tests aren't successful within 10-20m, that has a high likelihood of being caused by unrecoverable errors (for instance a serious code bug). Something so serious is wrong that the failures in the first 20m should be fixed before attempting to proceed with testing; the additional test runs generate a lot more data about what's wrong, but there is probably already plenty to indicate the failure. Better in this case to fail out early and let the developer know they need to make a change than to let the full suite run.

@joestringer
Member Author

One small comment for the EKS workflow above is that it did in fact gather the sysdumps from the environment, so the case I was concerned about, where the sysdumps are not gathered because GitHub terminates the workflow early, might only affect some of the workflows. At a glance, EKS splits artifact gathering into a separate job, so that job is not affected by the timeout for the tests. That's probably a pattern we should repeat for some of the other workflows (a rough sketch follows below).

Regardless, I think there is still merit to this idea that we should have an upper expected bound on how long we wait for the tests to complete, and terminate early if something is so fundamentally broken that all tests are failing & causing those failure results to take 1h+ to report back.
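A minimal sketch of that separate artifact-gathering job pattern, assuming hypothetical job names, output paths, and elided cluster credentials; the relevant pieces are `needs:` plus `if: always()`, so the job runs even when the test job fails or times out, and since it runs on a fresh runner it gathers the sysdump from the still-live cluster itself rather than relying on files left behind by the test job:

```yaml
  # Hypothetical "separate sysdump job" sketch; kubeconfig/credential setup for
  # the target cluster is omitted.
  sysdump:
    needs: setup-and-test
    if: always()                   # run even if the test job failed or timed out
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Gather sysdump from the cluster
        run: |
          cilium sysdump --output-filename cilium-sysdump-final
      - name: Upload sysdump
        uses: actions/upload-artifact@v4
        with:
          name: cilium-sysdump
          path: cilium-sysdump-final.zip
```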

@brlbil
Contributor

brlbil commented Aug 14, 2025

Quite a few comments 😄

Here is the data.zip that has the related steps data and the SQL query that was used to calculate the longest-running test. You can use duckdb to make queries. I have the complete runs file that I extracted the steps data from, but it is a bit too big to upload here.

Regardless, I think there is still merit to this idea that we should have an upper expected bound on how long we wait for the tests to complete, and terminate early if something is so fundamentally broken that all tests are failing & causing those failure results to take 1h+ to report back.

I agree with you that tests that take too much time should fail to indicate a problem there.

@joestringer
Member Author

joestringer commented Aug 14, 2025

Just taking the results from the table above:

$ tail -n+3 top-times.txt | awk -F'|' '{ print "https://github.com/cilium/cilium/actions/runs/"$3"/job/"$2 }' | sed 's/ //g'

https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964
https://github.com/cilium/cilium/actions/runs/16615191107/job/47006263165
https://github.com/cilium/cilium/actions/runs/16699372505/job/47268234634
https://github.com/cilium/cilium/actions/runs/16657399678/job/47145783434
https://github.com/cilium/cilium/actions/runs/16662850243/job/47163577282
https://github.com/cilium/cilium/actions/runs/16551838600/job/46807821574
https://github.com/cilium/cilium/actions/runs/16618445014/job/47016754371
https://github.com/cilium/cilium/actions/runs/16738965222/job/47383553552
https://github.com/cilium/cilium/actions/runs/16453317892/job/46504462337
https://github.com/cilium/cilium/actions/runs/16736590400/job/47376814689
https://github.com/cilium/cilium/actions/runs/16488356163/job/46617603991
https://github.com/cilium/cilium/actions/runs/16941296306/job/48011332848

Of these, here's the anomaly:
https://github.com/cilium/cilium/actions/runs/16618445014/job/47016754371 - Passed after 40m. Bizarrely slow, run was on AKS. EDIT: Actually it looks like the concurrency argument is missing - this is a real bug. Fix: #41161

CES migration test could also benefit from enabling test concurrency. EDIT: Proposal - #41162

The rest of the workflows either took <10m, or had many failures such that if we set a 15m or 20m timeout I don't think we would lose fidelity.

@brlbil
Contributor

brlbil commented Aug 15, 2025

@joestringer I have done some filtering: removed outliers below 5% and above 95%, and took the upper quartile mean.
These are the results; they show the average of the larger values, and they confirm your observation.

We could keep the 15m default and maybe override the value for EKS (20m) and GKE (30m). What do you think?

| workflow_name                                  | upper_quartile_mean_dur (min:sec) |
|------------------------------------------------|-----------------------------------|
| Conformance GKE (ci-gke)                       | 25:00                             |
| Conformance EKS (ci-eks)                       | 14:48                             |
| CiliumEndpointSlice migration (ci-ces-migrate) | 14:34                             |
| Conformance AKS (ci-aks)                       | 12:25                             |
| Conformance Cluster Mesh (ci-clustermesh)      | 10:43                             |
| Conformance Kind Envoy Embedded                | 10:41                             |
| Conformance AWS-CNI (ci-awscni)                | 09:59                             |
| Cilium E2E Upgrade (ci-e2e-upgrade)            | 09:06                             |
| Conformance IPsec E2E (ci-ipsec-e2e)           | 08:38                             |
| Cilium IPsec upgrade (ci-ipsec-upgrade)        | 08:00                             |
| Conformance Kubespray (ci-kubespray)           | 01:57                             |
| Cilium Cluster Mesh upgrade (ci-clustermesh)   | 01:24                             |

@joestringer
Member Author

@brlbil I have a counter-proposal: Fix the reason why GKE is taking so long (#41191) and just pick a default that works for everything like 20m.

@joestringer joestringer changed the title github: Apply connectivity test timeout at 15m github: Apply connectivity test timeout at 20m Aug 15, 2025
@joestringer joestringer added dont-merge/blocked Another PR must be merged before this one. and removed dont-merge/waiting-for-user-feedback Waiting for feedback from user before the PR should be merged. labels Aug 15, 2025
@joestringer
Member Author

Incidentally, EKS might be taking longer because it also runs the tests inconsistently compared to the other workflows: it doesn't only run the concurrent tests, but all tests including the sequential ones.
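For context, the concurrent-only runs in the logs above exclude sequential tests with the `--test '!seq-.*'` filter; aligning EKS with that pattern might look roughly like the sketch below (step name hypothetical, flags taken from the logged commands):

```yaml
      - name: Run concurrent connectivity tests
        timeout-minutes: 20
        run: |
          # Exclude sequential tests, matching the filter used by the other
          # workflows' concurrent runs (see the AKS commands logged earlier).
          cilium connectivity test \
            --test-concurrency=5 \
            --test '!seq-.*' \
            --collect-sysdump-on-failure
```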

Contributor

@brlbil brlbil left a comment

@brlbil I have a counter-proposal: Fix the reason why GKE is taking so long (#41191) and just pick a default that works for everything like 20m.

Thanks, this is better.

@joestringer
Member Author

joestringer commented Aug 15, 2025

/test

EDIT: Woops, forgot to rebase. There's been a bunch of churn on the workflows so this is important to run with the latest workflow definitions.

@joestringer
Member Author

/test

@joestringer joestringer marked this pull request as draft August 27, 2025 00:42
@joestringer joestringer removed the dont-merge/blocked Another PR must be merged before this one. label Aug 27, 2025