ci: add interop retries #2769

lrstewart · 2025-08-21T18:49:08Z

Release Summary:

Description of changes:

As an alternative to f217c85, we can retry the interop tests when they fail. As long as attempts remain, the failure is not reported via qns-status-report.

I modeled this solution on https://github.com/orgs/community/discussions/67654#discussioncomment-8038649.
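The retry rule described above can be sketched as a small shell function. This is only an illustration of the decision, not the contents of qns.yml: the function name `next_action`, the `INTEROP_RESULT` variable, and the cap of 3 attempts are assumptions (3 matches the number of attempts seen in the failure test, but the configured value is not quoted here). `GITHUB_RUN_ATTEMPT` is the counter GitHub Actions sets for each run attempt.

```shell
#!/usr/bin/env bash
# Sketch only: MAX_ATTEMPTS=3 is an assumption based on the three
# attempts observed during testing, not a value quoted from qns.yml.
MAX_ATTEMPTS="${MAX_ATTEMPTS:-3}"

# Decide what to do after the interop job finishes.
#   $1: interop result ("success" or "failure")
#   $2: current attempt number (1-based)
#   $3: maximum number of attempts
# Prints "retry" if a failed run still has attempts left, else "report".
next_action() {
  local result="$1" attempt="$2" max="$3"
  if [ "$result" = "failure" ] && [ "$attempt" -lt "$max" ]; then
    echo "retry"
  else
    echo "report"
  fi
}

# GITHUB_RUN_ATTEMPT is set by GitHub Actions: 1 for the first run,
# incremented on every rerun. INTEROP_RESULT is a hypothetical variable.
next_action "${INTEROP_RESULT:-failure}" "${GITHUB_RUN_ATTEMPT:-1}" "$MAX_ATTEMPTS"
```

Under this sketch, a failure on attempt 1 or 2 leads to a rerun with no status report, while a success on any attempt or a failure on the final attempt is reported.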

Call-outs:

Currently, I'm retrying the entire "qns" workflow. I could probably limit the retry to just the interop tests by introducing a new "interop-trigger" job that "interop" depends on, and then rerunning just that job. However, GitHub counts "attempts" per workflow, not per job, so retrying a subset of jobs still increments the counter we rely on for every job. I think that could be confusing, so retrying the entire workflow seemed more straightforward and clear.
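For comparison, the two options map onto two forms of `gh run rerun`. The sketch below is a dry run that only prints the commands; the function names and the placeholder IDs are illustrative and not taken from this PR.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints each option's command instead of running it.

# Option taken in this PR: rerun every job in the workflow run.
rerun_whole_run() {
  echo "gh run rerun $1"
}

# Alternative from the call-out: rerun a single job (for example the
# proposed "interop-trigger" job) together with the jobs that depend on it.
rerun_single_job() {
  echo "gh run rerun --job $1"
}

rerun_whole_run "<run-id>"
rerun_single_job "<job-id>"
```

Either form starts a new attempt of the same workflow run, so the attempt counter increases in both cases.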

Testing:

How was this change tested? With difficulty.

Running the workflow at all

You can't use "gh workflow run" on workflows that don't exist on the default branch yet, meaning workflows that haven't been committed to main. So this change can't actually be tested as-is. Here's an issue about that problem: www.github.com/cli/cli/issues/9781

I worked around this by hijacking the tshark workflow, which already exists on main and even has a single workflow_dispatch input (the workflow_dispatch inputs of your spec also have to match what's on main). I overwrote the tshark workflow with my new retry workflow, and then tested with that. Comparing the two files:

<       run_id:
---
>       version:
16c12
<       - name: rerun ${{ inputs.run_id }}
---
>       - name: rerun ${{ inputs.version }}
21,22c17,18
<           gh run watch ${{ inputs.run_id }}
<           gh run rerun ${{ inputs.run_id }}
\ No newline at end of file
---
>           gh run watch ${{ inputs.version }}
>           gh run rerun ${{ inputs.version }}
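The workaround can be sketched as a dry run that prints the commands involved rather than executing them. The workflow file name `tshark.yml`, the branch name, and the run ID below are placeholders and assumptions. The `version` input and the `gh run watch` / `gh run rerun` steps come from the diff above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of executing them.
# "tshark.yml", the branch name, and the run ID are placeholders.

# Command that dispatches the hijacked workflow from a test branch,
# passing the run ID through the pre-existing "version" input.
dispatch_cmd() {
  local workflow="$1" branch="$2" run_id="$3"
  echo "gh workflow run ${workflow} --ref ${branch} -f version=${run_id}"
}

# Commands the dispatched workflow then executes, per the diff above:
# wait for the run to finish, then rerun it.
rerun_cmds() {
  local run_id="$1"
  echo "gh run watch ${run_id}"
  echo "gh run rerun ${run_id}"
}

dispatch_cmd "tshark.yml" "my-test-branch" "1234567890"
rerun_cmds "1234567890"
```

The watch step comes first presumably because the retry workflow is dispatched while the original run is still in progress, and a run has to finish before it can be rerun.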

Failure testing

Here's the PR I opened to test: #2762

I made a few other changes to 1) let me test with a PR and 2) ensure the interop tests fail. See the "enable testing" commit: 6cb209a.

Note that qns has 3 attempts (https://github.com/aws/s2n-quic/actions/runs/17124819438/job/48579132466). The reruns were triggered automatically, and "qns-status-report" was skipped on all but the last attempt.

Success testing

I then modified my forced failure to succeed on the second attempt. See the "enable testing success" commit: b902d1a

Note that qns then has only 2 attempts and succeeds on the second (https://github.com/aws/s2n-quic/actions/runs/17133219460/job/48602891524?pr=2762). The first attempt skipped qns-status-report.

Accidental real failure testing

While I was doing the success testing, the handshakeloss interop test also failed for real :) (https://github.com/aws/s2n-quic/actions/runs/17133219460/job/48607253793?pr=2762). That attempt was supposed to succeed, but it failed instead, and a third attempt was made.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

.github/workflows/qns.yml

ci: add interop retries

4fadfc2

lrstewart mentioned this pull request Aug 21, 2025

Testing interop retries #2762

Closed

lrstewart marked this pull request as ready for review August 21, 2025 19:14

lrstewart requested a review from boquan-fang August 21, 2025 19:14

boquan-fang reviewed Aug 21, 2025

.github/workflows/qns.yml (resolved)

boquan-fang approved these changes Aug 21, 2025

lrstewart merged commit 625be88 into aws:main Aug 22, 2025
121 checks passed

lrstewart deleted the ci branch August 22, 2025 05:34
