@lrstewart lrstewart commented Aug 21, 2025

Release Summary:

Description of changes:

As an alternative to f217c85, we could instead retry the interop tests when they fail. As long as we have remaining attempts, we will not report the failure via qns-status-report.

I modeled this solution off of https://github.com/orgs/community/discussions/67654#discussioncomment-8038649.

Call-outs:

Currently, I'm retrying the entire "qns" workflow. I could probably limit the retry to just the interop tests by introducing a new "interop-trigger" job that "interop" depends on, and then rerunning just that job. However, GitHub counts "attempts" per workflow, not per job, so retrying a subset of jobs still increments the counter we rely on to check for retries, for every job. I think that could be confusing, so retrying the entire workflow seemed more straightforward and clear.
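
To make the attempt-counting logic concrete, here is a minimal sketch of the decision described above, not the actual workflow code. `next_action` is a hypothetical helper; in the real workflow the attempt number would come from the `github.run_attempt` context value, and the limit of 3 is an assumption based on the three attempts seen in the failure test.

```shell
#!/bin/sh
# Hypothetical helper: decide what to do after a failed attempt.
#   $1 = current attempt number (in a workflow: ${{ github.run_attempt }})
#   $2 = maximum number of attempts (assumed to be 3 if omitted)
next_action() {
    attempt="$1"
    max_attempts="${2:-3}"
    if [ "$attempt" -lt "$max_attempts" ]; then
        # Attempts remain: trigger a rerun and skip qns-status-report.
        echo "retry"
    else
        # Final attempt: let qns-status-report publish the failure.
        echo "report"
    fi
}

next_action 1 3   # prints "retry"
next_action 3 3   # prints "report"
```

Because the counter is per workflow, the same check gives the same answer in every job of a given attempt, which is part of why rerunning the whole workflow is easier to reason about.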

Testing:

How was this change tested? With difficulty.

Running the workflow at all

You can't use "gh workflow run" on workflows that don't exist on the default branch yet, meaning workflows that haven't been committed to main. So this change can't actually be tested as-is. Here's an issue about that problem: www.github.com/cli/cli/issues/9781

I worked around this by hijacking the tshark workflow, which already exists on main and even has a single workflow_dispatch input (the workflow_dispatch inputs of your spec also have to match what's on main). I overwrote the tshark workflow with my new retry workflow, and then tested with that. Comparing the two files:

<       run_id:
---
>       version:
16c12
<       - name: rerun ${{ inputs.run_id }}
---
>       - name: rerun ${{ inputs.version }}
21,22c17,18
<           gh run watch ${{ inputs.run_id }}
<           gh run rerun ${{ inputs.run_id }}
\ No newline at end of file
---
>           gh run watch ${{ inputs.version }}
>           gh run rerun ${{ inputs.version }}
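
Read together, the `<` side of the diff suggests that the core of the retry workflow is a watch followed by a rerun. A rough sketch of that step as a shell function is below. The function name and the `GH` override are hypothetical and exist only so the sequence can be dry-run without touching a real workflow run; the likely reason for the ordering is that a run has to finish before it can be rerun.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands shown in the diff.
#   $1 = id of the workflow run to retry (the run_id input)
# GH is an assumed override for dry runs; it defaults to the real GitHub CLI.
rerun_when_finished() {
    run_id="$1"
    gh_cmd="${GH:-gh}"
    # Block until the original run has completed.
    "$gh_cmd" run watch "$run_id" || return 1
    # Rerun the entire workflow run, not just the failed jobs.
    "$gh_cmd" run rerun "$run_id"
}

# Dry run: substitute `echo` for `gh` to see the commands that would be issued.
GH=echo rerun_when_finished 12345
# prints:
#   run watch 12345
#   run rerun 12345
```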

Failure testing

Here's the PR I opened to test: #2762

I made a few other changes to 1) let me test with a PR and 2) ensure the interop tests fail. See the "enable testing" commit: 6cb209a.

Note that qns has 3 attempts: https://github.com/aws/s2n-quic/actions/runs/17124819438/job/48579132466 Those were automatically triggered. "qns-status-report" was also skipped on all but the last attempt.

Success testing

I then modified my forced failure to succeed on the second attempt. See the "enable testing success" commit: b902d1a

Note that qns then has only 2 attempts, and succeeds on the second attempt: https://github.com/aws/s2n-quic/actions/runs/17133219460/job/48602891524?pr=2762 The first attempt skipped qns-status-report.

Accidental real failure testing

While I was doing "Success Testing", the handshakeloss interop test also failed for real :) https://github.com/aws/s2n-quic/actions/runs/17133219460/job/48607253793?pr=2762 That attempt was supposed to succeed, but it instead failed and a third attempt was made.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lrstewart lrstewart marked this pull request as ready for review August 21, 2025 19:14
@lrstewart lrstewart requested a review from boquan-fang August 21, 2025 19:14
@lrstewart lrstewart merged commit 625be88 into aws:main Aug 22, 2025
121 checks passed
@lrstewart lrstewart deleted the ci branch August 22, 2025 05:34