
boquan-fang
Contributor

Release Summary:

Resolved issues:

Description of changes:

The daily scheduled CI run detected flakiness in udp::fuzz_test. I modified the test in my forked repo to replay the failing input. The test passed locally on my EC2 instance but failed in GitHub Actions. The changed code can be found in this snippet. I believe the failure in GitHub Actions occurs because the EC2 instance is faster than the runner that GitHub Actions uses. I then increased the test duration from 120 seconds to 300 seconds, and the test succeeded:

test stream::tests::request_response::udp::fuzz_test has been running for over 60 seconds
...
test stream::tests::request_response::udp::fuzz_test ... ok
test result: ok. 327 passed; 0 failed; 2 ignored; 0 measured; 0 filtered out; finished in 253.08s

I then ran the same test with the test duration set to 180 seconds, and it succeeded as well. Hence, I conclude that the udp::fuzz_test flakiness is due to the test duration. Bolero sometimes generates a client and server that need a long time to complete the test, which causes the flakiness. Increasing the test duration will mitigate that.

Call-outs:

  • I intentionally chose 180 seconds over 300 seconds because we should increase the duration by the smallest amount that makes the test pass. If that proves insufficient, we can increase it again.
  • I also fixed a typo that I noticed in a comment.

Testing:

The test method is described in the Description of changes section above.

Analysis:

The failed input looks like:

(
    Client {
        delays: Delays {
            read: 1.058095ms,
            write: 1.560203ms,
            shutdown_write: 949.615µs,
            shutdown_read: 1.532273ms,
            drop: 1.61057ms,
        },
        count: 3,
        concurrency: 2,
        max_read_len: 29444,
        max_mtu: Some(
            3153,
        ),
    },
    Server {
        delays: Delays {
            read: 569.478µs,
            write: 1.385789ms,
            shutdown_write: 1.391925ms,
            shutdown_read: 746.348µs,
            drop: 800.292µs,
        },
        count: 4,
        max_read_len: 1,
        max_mtu: Some(
            2252,
        ),
    },
    [
        Request {
            count: 5,
            request_size: 98312,
            response_size: 95544,
        },
    ],
)

The server's max_read_len is set to 1 due to fuzz test randomness, which makes the server read extremely slowly. I believe that's why this test takes a long time.
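To get a rough feel for how much max_read_len matters, here is a back-of-envelope sketch in Rust. It is not code from the test harness: `reads_needed` and `estimated_read_time` are hypothetical helpers, and the cost model (every read call returns at most max_read_len bytes and pays the configured read delay once) is an assumption made only for illustration.

```rust
use std::time::Duration;

/// Number of read calls needed to drain `payload_len` bytes when each call
/// returns at most `max_read_len` bytes (ceiling division).
fn reads_needed(payload_len: u32, max_read_len: u32) -> u32 {
    assert!(max_read_len > 0, "max_read_len must be non-zero");
    payload_len / max_read_len + u32::from(payload_len % max_read_len != 0)
}

/// Estimated time spent in read delays for one payload.
/// ASSUMPTION: exactly one `read_delay` pause per read call; the real
/// harness may schedule its delays differently.
fn estimated_read_time(payload_len: u32, max_read_len: u32, read_delay: Duration) -> Duration {
    read_delay * reads_needed(payload_len, max_read_len)
}
```

Plugging in the failing input, `estimated_read_time(98_312, 1, Duration::from_nanos(569_478))` comes to about 56 seconds of read delay for a single 98312-byte request under this model, whereas the client's max_read_len of 29444 would need only 4 reads. The real test may overlap or schedule work differently, so treat this as an order-of-magnitude illustration only.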

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@maddeleine
Contributor

maddeleine commented Jul 1, 2025

Are you able to reliably reproduce this failure using that bolero input in a GitHub Actions environment?

Edit: Sounds like the answer is yes.

@boquan-fang
Contributor Author

This PR is pending a quic-attack run. We can merge it once the quic-attack run on this PR has been checked.

@boquan-fang
Contributor Author

This is the quic-attack job running against this PR. The run was successful.

@boquan-fang boquan-fang merged commit 1ff931d into aws:main Jul 2, 2025
117 checks passed
@boquan-fang boquan-fang deleted the upd-fuzz-test-flakiness branch July 10, 2025 00:17