-
Notifications
You must be signed in to change notification settings - Fork 3.4k
test: Increase timeout on privileged unit tests #13944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: Increase timeout on privileged unit tests #13944
Conversation
Flakes on privileged unit tests with 'context deadline exceeded' are becoming a bit more common. There seem to be a lot of variations [1] in actual duration of those tests, with a long tail of high durations. This commit doubles the timeout value in an effort to reduce flakes. 1 - #12862 (comment) Signed-off-by: Paul Chaignon <paul@cilium.io>
retest-runtime |
@@ -29,7 +29,7 @@ import ( | |||
const ( | |||
// The privileged unit tests can take more than 4 minutes, the default | |||
// timeout for helper commands. | |||
privilegedUnitTestTimeout = 8 * time.Minute | |||
privilegedUnitTestTimeout = 16 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😱
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 It unlikely to increase the test duration on average since we never get "stuck" in unit tests and outliers with high duration are rare (although annoying).
I wonder what would be the best way to track these spikes in execution, so we don't just increase timeouts ad infinitum.
@nebril I agree, but I'm unsure what to track next (it's not due to CPU frequency issues). Definitely not planning to increase the timeout beyond 16min 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK with merging that, but I wonder what would be the best way to track these spikes in execution, so we don't just increase timeouts ad infinitum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flakes on privileged unit tests with context deadline exceeded are becoming a bit more common.
Can we have a metric on that? This way we can go back and check once this PR is applied whether the metric was improved or not.
This CI dashboard report has a comparison of number of failures biweekly. PR builds have more numbers but need to be interpreted with a bit more care (some failures are legitimate due to in-progress PRs). |
Can these failures be filtered to count only those with context deadline exceeded? |
Not via the current dashboard, but it's maybe possible to extend the dashboard to have filters on the stacktrace text. It's on my TODO list to check once I have cycles. |
Overall, I'm worried by the "a bit more common" part which is not clear to me if this solves an actual problem or not, both because it is not easy to show the problem but also because we cannot verify that it indeed solves it. Having said that, I'm OK with merging this patch as a quality-of-life improvement for people working on CI so that we can tackle bigger issues first. |
Do we get timestamps for each subtest that runs? Then maybe we could evaluate whether there are specific tests that end up taking a long time in the outlier cases? |
Flakes on privileged unit tests with
context deadline exceeded
are becoming a bit more common. There seem to be a lot of variations in actual duration of those tests, with a long tail of high durations.This pull request doubles the timeout value in an effort to reduce flakes.
Fixes: #12862