github: Apply connectivity test timeout at 20m #40957
base: main
Conversation
Force-pushed 58f83d3 to bb22d6e
Force-pushed f450230 to 74afdc7
/test
Force-pushed bb22d6e to d6049f8
Force-pushed 74afdc7 to 966e128
/test
@joestringer Just curious, where did you collect this information? I checked a couple of runs and it seems there are runs that take more than 15m. For example, https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122484
I must admit this was mostly just eyeballing a bunch of workflows, so thanks for sharing that counter-example. Do you have any thoughts on a more rigorous approach to come up with a value?

I was hoping that 15m was far enough out because I recall prior conversations where we expected that CI results should be returned in around that amount of time, just to provide developers with feedback quickly. But evidently the clouds can add a bit of variation to the reliability of this. Even in the workflow you shared, the other three test runs completed in ~10m, so I wonder whether there's a reason this particular run took 50% longer.

Of course, the other issue with stricter timeouts is that this itself can be a source of flakes when the tests take longer than expected but would otherwise pass 🤔

The naive solution is to bump this to 20m for now, which would at least catch the case where a serious issue causes the CLI to block for the full 55m GitHub workflow timeout.
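For illustration only, a minimal sketch of how a step-level `timeout-minutes` in GitHub Actions can bound the connectivity test well under the 55m job limit while leaving time to gather a sysdump afterwards. The workflow, job, and step names here are hypothetical and this is not necessarily the exact change in this PR:

```yaml
# Hypothetical workflow excerpt; names are illustrative only.
name: connectivity-timeout-sketch
on: workflow_dispatch

jobs:
  setup-and-test:
    runs-on: ubuntu-latest
    timeout-minutes: 55            # overall job limit enforced by GitHub
    steps:
      # ... image pull / cluster provisioning / cilium install steps omitted ...
      - name: Run connectivity test
        # Fail this step after 20m so the sysdump step below still has time
        # to run before the job-level timeout terminates everything.
        timeout-minutes: 20
        run: cilium connectivity test

      - name: Gather sysdump on failure
        if: failure()
        run: cilium sysdump
```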
At a glance, it's the initial deployment time: good: https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122501
bad: https://github.com/cilium/cilium/actions/runs/16931249833/job/47977122484#step:19:35
I will collect run data and come up with a number.
It took some doing. I filtered all the runs for the workflows changed in #40935, and came up with the longest-running connectivity run per workflow. I have to say it is not what I expected. These results are for the last 40 days.
@brlbil could you share the query as well? One thing in particular I'm curious about is which event triggered these. For instance, part of what motivated this PR was a PR run that took around 55m because something was really broken in the PR; instead of the CLI terminating early, the GitHub workflow timeout terminated the CLI and we didn't end up with any sysdump at the end for further investigation.

Something else that might help is to transform this table into a set of links so we can dive a bit deeper. I feel like there's no legitimate reason for these tests to take more than 20m today, but it would be nice to qualify that feeling with some more data from the cases that do take longer.

Something else I'm wondering about is splitting the deployment vs. the test into two steps in the CLI. As per my last comment, we can see that deploying the pods could take 1m or 5m (or maybe longer, who knows?) -- I'd be fine with adding a separate additional timeout for that provisioning step, and that might help to better analyze what the underlying problem is: an issue with the production code, the test code, or the underlying cloud infra taking a long time to provision due to the cloud control plane.
OK, transforming the first row into a link we get this workflow run, which is indeed for a PR; so these results may include runs where the CLI did not complete because of changes being made in the PR itself: https://github.com/cilium/cilium/actions/runs/16440446977/job/46459839321#step:2:16

Looking deeper at the target job, here's the first failure at 10:05: https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964#step:21:318

Many tests are failing, with the final one around 11:10: https://github.com/cilium/cilium/actions/runs/16440446977/job/46460255964#step:21:6356

That is to say, at least this first example run in the table shared above is exactly the target case I am aiming to force to fail early: if the tests aren't successful in 10-20m, that has a high likelihood of being caused by unrecoverable errors (for instance a serious code bug). Something so serious is wrong that the failures in the first 20m should be fixed before attempting to proceed with testing; the additional test runs generate a lot more data about what's wrong, but there is probably already plenty to indicate the failure. Better in this case to fail out early and let the developer know they need to make a change than to let the full suite run.
One small comment on the EKS workflow above: it did in fact gather the sysdumps from the environment, so the case I was concerned about, where sysdumps are not gathered because GitHub terminates the workflow early, might only affect some of the workflows. At a glance, EKS splits artifact gathering into a separate job, so that job is not affected by the timeout for the tests. That's probably a pattern we should repeat for some of the other workflows.

Regardless, I think there is still merit to the idea that we should have an upper expected bound on how long we wait for the tests to complete, and terminate early if something is so fundamentally broken that all tests are failing and causing those failure results to take 1h+ to report back.
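A rough sketch of that "separate sysdump job" pattern, assuming an externally provisioned cluster that a follow-up job can still reach; the job names, timeouts, and the omitted CLI/credential setup steps are all hypothetical, not the actual EKS workflow:

```yaml
# Hypothetical sketch of gathering artifacts in a job that outlives the test timeout.
name: separate-sysdump-job-sketch
on: workflow_dispatch

jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      # ... CLI install, kubeconfig setup, cilium install omitted ...
      - name: Run connectivity test
        timeout-minutes: 20
        run: cilium connectivity test

  gather-sysdump:
    # Runs even if the test job failed or was terminated by its timeout,
    # so a sysdump is still collected from the external cluster.
    needs: test
    if: always()
    runs-on: ubuntu-latest
    steps:
      # ... CLI install and kubeconfig setup omitted ...
      - name: Collect sysdump
        run: cilium sysdump
      - name: Upload sysdump
        uses: actions/upload-artifact@v4
        with:
          name: cilium-sysdump
          path: cilium-sysdump-*.zip
```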
Quite a few comments 😄 Here is the data.zip that has the related steps data and the SQL query that was used to calculate the longest-running test. You can use duckdb to make queries. I have the complete runs file that I extracted the steps data from, but it is a bit too big to upload here.
I agree with you that tests that take too much time should fail, to indicate that there is a problem there.
@joestringer I have done some filtering: removed outliers below the 5th and above the 95th percentile, and took the upper quartile mean. We could keep the 15m default and maybe override the value for EKS (20m) and GKE (30m). What do you think?
Force-pushed 966e128 to 4bced4a
Incidentally, EKS might be taking longer because it also runs the tests inconsistently compared to the other workflows: it doesn't only run the concurrent tests, but all tests, including the sequential ones.
/test

EDIT: Whoops, forgot to rebase. There's been a bunch of churn on the workflows, so it is important to run this with the latest workflow definitions.
The goal here is to ensure that in the event of a failure condition that may lead to the CLI timing out (such as every test failing for the same reason, or some fundamental issue in the infrastructure), the CLI will terminate early, gather a sysdump, and return before GitHub timeouts kick in. This should ensure that even in such adverse situations, developers will have test results they can consult to identify what went wrong in a workflow.

Taking the e2e-upgrade workflow for example, this runs twice within the setup-and-test job, along with a bunch of other logic to pull images, provision a VM, install, upgrade, downgrade and so on. Even with all of those other steps, it should be possible for two full runs of the CLI connectivity tests to complete and report failures within the 55m timer.

Birol reports the following upper quartile mean (min:sec) of workflow step run time, after removing the upper/lower 5% of results:

| workflow_name                                  | upper_quartile_mean_dur |
|------------------------------------------------|-------------------------|
| Conformance GKE (ci-gke)                       | 25:00                   |
| Conformance EKS (ci-eks)                       | 14:48                   |
| CiliumEndpointSlice migration (ci-ces-migrate) | 14:34                   |
| Conformance AKS (ci-aks)                       | 12:25                   |
| Conformance Cluster Mesh (ci-clustermesh)      | 10:43                   |
| Conformance Kind Envoy Embedded                | 10:41                   |
| Conformance AWS-CNI (ci-awscni)                | 09:59                   |
| Cilium E2E Upgrade (ci-e2e-upgrade)            | 09:06                   |
| Conformance IPsec E2E (ci-ipsec-e2e)           | 08:38                   |
| Cilium IPsec upgrade (ci-ipsec-upgrade)        | 08:00                   |
| Conformance Kubespray (ci-kubespray)           | 01:57                   |
| Cilium Cluster Mesh upgrade (ci-clustermesh)   | 01:24                   |

There is another submission out for review that will lower the GKE runtime to match the other run times by introducing concurrency of test runs, so I have excluded that case from this calculation. Hence setting a 20m timeout seems reasonable to catch the vast majority of test runs.

Suggested-by: Birol Bilgin <birol@cilium.io>
Signed-off-by: Joe Stringer <joe@cilium.io>
Force-pushed 4bced4a to ef0aa35
/test
Requires #40935
Requires #41191