
Conversation

pchaigno
Member

@pchaigno pchaigno commented Nov 9, 2020

Flakes on privileged unit tests with context deadline exceeded are becoming a bit more common. There seems to be a lot of variation in the actual duration of those tests, with a long tail of high durations.

This pull request doubles the timeout value in an effort to reduce flakes.

Fixes: #12862

Flakes on privileged unit tests with 'context deadline exceeded' are
becoming a bit more common. There seems to be a lot of variation [1] in
the actual duration of those tests, with a long tail of high durations.

This commit doubles the timeout value in an effort to reduce flakes.

1 - #12862 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
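
For context, the error in these flakes is Go's context.DeadlineExceeded, raised when the context bound to the helper command that runs the privileged tests expires. The call site is not part of this diff, so the snippet below is only a minimal sketch of the pattern, assuming the tests are launched through exec.CommandContext; the make target name is hypothetical.

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// privilegedUnitTestTimeout mirrors the constant changed in this PR.
const privilegedUnitTestTimeout = 16 * time.Minute

func main() {
	// Bind the helper command to a context with a deadline. If the tests run
	// longer than the timeout, the command is killed and ctx.Err() reports
	// context.DeadlineExceeded, the error seen in the flaky CI runs.
	ctx, cancel := context.WithTimeout(context.Background(), privilegedUnitTestTimeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "make", "tests-privileged") // hypothetical target
	out, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Printf("privileged unit tests failed: %v (ctx: %v)\n%s", err, ctx.Err(), out)
		return
	}
	fmt.Printf("%s", out)
}
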
@pchaigno pchaigno added the area/CI, release-note/ci, ci/flake, and needs-backport/1.8 labels Nov 9, 2020
@pchaigno pchaigno requested a review from a team as a code owner November 9, 2020 09:43
@pchaigno pchaigno requested a review from kkourt November 9, 2020 09:43
@pchaigno
Member Author

pchaigno commented Nov 9, 2020

retest-runtime

@@ -29,7 +29,7 @@ import (
 const (
 	// The privileged unit tests can take more than 4 minutes, the default
 	// timeout for helper commands.
-	privilegedUnitTestTimeout = 8 * time.Minute
+	privilegedUnitTestTimeout = 16 * time.Minute
Member

😱

Member Author

🤷 It's unlikely to increase the average test duration since we never get "stuck" in unit tests, and outliers with high durations are rare (although annoying).

I wonder what would be the best way to track these spikes in execution, so we don't just increase timeouts ad infinitum.

@nebril I agree, but I'm unsure what to track next (it's not due to CPU frequency issues). Definitely not planning to increase the timeout beyond 16min 😆

Member

@nebril nebril left a comment


I am OK with merging that, but I wonder what would be the best way to track these spikes in execution, so we don't just increase timeouts ad infinitum.

Contributor

@kkourt kkourt left a comment


Flakes on privileged unit tests with context deadline exceeded are becoming a bit more common.

Can we have a metric on that? That way, once this PR is applied, we can go back and check whether the metric improved.

@pchaigno
Member Author

pchaigno commented Nov 9, 2020

Can we have a metric on that? That way, once this PR is applied, we can go back and check whether the metric improved.

This CI dashboard report has a biweekly comparison of the number of failures. PR builds provide more data points but need to be interpreted with a bit more care (some failures are legitimate, due to in-progress PRs).

@kkourt
Contributor

kkourt commented Nov 9, 2020

Can these failures be filtered to count only those with context deadline exceeded?

@pchaigno
Member Author

pchaigno commented Nov 9, 2020

Can these failures be filtered to count only those with context deadline exceeded?

Not via the current dashboard, but it may be possible to extend the dashboard with filters on the stacktrace text. It's on my TODO list to check once I have cycles.

@kkourt
Contributor

kkourt commented Nov 9, 2020

Overall, I'm worried by the "a bit more common" part: it's not clear to me whether this solves an actual problem, both because the problem is not easy to demonstrate and because we cannot verify that this change indeed solves it.

Having said that, I'm OK with merging this patch as a quality-of-life improvement for people working on CI so that we can tackle bigger issues first.

@nebril nebril added the ready-to-merge label Nov 9, 2020
@joestringer
Member

Do we get timestamps for each subtest that runs? Then maybe we could evaluate whether there are specific tests that end up taking a long time in the outlier cases?
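
One data point here: go test -v already prints the elapsed time of each test in its --- PASS / --- FAIL lines, so the verbose CI output should contain per-test durations. If something more explicit is wanted, a tiny helper could log them; below is a minimal sketch using the standard testing package (the privileged tests may be driven by a different framework, and the timed helper and test name are hypothetical).

package example

import (
	"testing"
	"time"
)

// timed returns a function that, when deferred, logs how long the calling
// test took, making duration outliers easy to grep out of CI logs.
func timed(t *testing.T) func() {
	start := time.Now()
	return func() {
		t.Logf("%s took %s", t.Name(), time.Since(start))
	}
}

func TestPrivilegedExample(t *testing.T) {
	defer timed(t)()
	// ... test body ...
}
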

Successfully merging this pull request may close these issues.

CI: RuntimePrivilegedUnitTests Run Tests: Context deadline exceeded