
Conversation

pchaigno
Member

@pchaigno pchaigno commented Nov 9, 2020

Flakes on privileged unit tests with context deadline exceeded are becoming a bit more common. There seems to be a lot of variation in the actual duration of those tests, with a long tail of high durations.

This pull request doubles the timeout value in an effort to reduce flakes.

Fixes: #12862

Flakes on privileged unit tests with 'context deadline exceeded' are
becoming a bit more common. There seems to be a lot of variation [1] in
the actual duration of those tests, with a long tail of high durations.

This commit doubles the timeout value in an effort to reduce flakes.

1 - #12862 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
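
For context, the error in these flakes is Go's context.DeadlineExceeded, raised when the context bound to the helper command that runs the privileged tests expires. The call site is not part of this diff, so the snippet below is only a minimal sketch of the pattern, assuming the tests are launched through exec.CommandContext; the make target name is hypothetical.

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// privilegedUnitTestTimeout mirrors the constant changed in this PR.
const privilegedUnitTestTimeout = 16 * time.Minute

func main() {
	// Bind the helper command to a context with a deadline. If the tests run
	// longer than the timeout, the command is killed and ctx.Err() reports
	// context.DeadlineExceeded, the error seen in the flaky CI runs.
	ctx, cancel := context.WithTimeout(context.Background(), privilegedUnitTestTimeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "make", "tests-privileged") // hypothetical target
	out, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Printf("privileged unit tests failed: %v (ctx: %v)\n%s", err, ctx.Err(), out)
		return
	}
	fmt.Printf("%s", out)
}
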
@pchaigno pchaigno added the area/CI, release-note/ci, ci/flake, and needs-backport/1.8 labels Nov 9, 2020
@pchaigno pchaigno requested a review from a team as a code owner November 9, 2020 09:43
@pchaigno pchaigno requested a review from kkourt November 9, 2020 09:43
@pchaigno
Member Author

pchaigno commented Nov 9, 2020

retest-runtime

@@ -29,7 +29,7 @@ import (
 const (
 	// The privileged unit tests can take more than 4 minutes, the default
 	// timeout for helper commands.
-	privilegedUnitTestTimeout = 8 * time.Minute
+	privilegedUnitTestTimeout = 16 * time.Minute
Member

😱

Member Author

🤷 It's unlikely to increase the average test duration since we never get "stuck" in unit tests, and outliers with high durations are rare (although annoying).

I wonder what would be the best way to track these spikes in execution, so we don't just increase timeouts ad infinitum.

@nebril I agree, but I'm unsure what to track next (it's not due to CPU frequency issues). Definitely not planning to increase the timeout beyond 16min 😆

Member

@nebril nebril left a comment


I am OK with merging that, but I wonder what would be the best way to track these spikes in execution, so we don't just increase timeouts ad infinitum.

Contributor

@kkourt kkourt left a comment


Flakes on privileged unit tests with context deadline exceeded are becoming a bit more common.

Can we have a metric on that? That way, once this PR is applied, we can go back and check whether the metric improved.

@pchaigno
Member Author

pchaigno commented Nov 9, 2020

Can we have a metric on that? That way, once this PR is applied, we can go back and check whether the metric improved.

This CI dashboard report has a biweekly comparison of the number of failures. PR builds provide more data points but need to be interpreted with a bit more care (some failures are legitimate, due to in-progress PRs).

@kkourt
Contributor

kkourt commented Nov 9, 2020

Can these failures be filtered to count only those with context deadline exceeded?

@pchaigno
Member Author

pchaigno commented Nov 9, 2020

Can these failures be filtered to count only those with context deadline exceeded?

Not via the current dashboard, but it may be possible to extend the dashboard with filters on the stacktrace text. It's on my TODO list to check once I have cycles.

@kkourt
Contributor

kkourt commented Nov 9, 2020

Overall, I'm worried by the "a bit more common" part: it's not clear to me whether this solves an actual problem, both because the problem is not easy to demonstrate and because we cannot verify that this change indeed solves it.

Having said that, I'm OK with merging this patch as a quality-of-life improvement for people working on CI so that we can tackle bigger issues first.

@nebril nebril added the ready-to-merge label Nov 9, 2020
@joestringer
Member

Do we get timestamps for each subtest that runs? Then maybe we could evaluate whether there are specific tests that end up taking a long time in the outlier cases?
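
One data point here: go test -v already prints the elapsed time of each test in its --- PASS / --- FAIL lines, so the verbose CI output should contain per-test durations. If something more explicit is wanted, a tiny helper could log them; below is a minimal sketch using the standard testing package (the privileged tests may be driven by a different framework, and the timed helper and test name are hypothetical).

package example

import (
	"testing"
	"time"
)

// timed returns a function that, when deferred, logs how long the calling
// test took, making duration outliers easy to grep out of CI logs.
func timed(t *testing.T) func() {
	start := time.Now()
	return func() {
		t.Logf("%s took %s", t.Name(), time.Since(start))
	}
}

func TestPrivilegedExample(t *testing.T) {
	defer timed(t)()
	// ... test body ...
}
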

Successfully merging this pull request may close these issues.

CI: RuntimePrivilegedUnitTests Run Tests: Context deadline exceeded