Conversation

@brandtkeller (Member) commented Aug 8, 2025

Description

Adds retry support to the package publish operations (package, skeleton, OCI). This makes publish operations, which may take time on larger packages, more robust and resilient by allowing failed pushes to be retried.

Proposing that the retry logic live within the zoci package.

For local package/skeleton publishes, opting to include all remote interactions within the PushPackage() logic, as this makes the operation more resilient to network flakiness, something that may very well be a persistent Zarf constraint.

For OCI-to-OCI copies, updating the logic to use oras directly and mirror the Push operation. Willing to discuss this further and revert if we see a reason.

Operations that push an artifact to a remote registry should return the descriptor to enable follow-on operations.

Related Issue

Fixes #3787

Checklist before merging

Signed-off-by: Brandt Keller <brandt.keller@defenseunicorns.com>
netlify bot commented Aug 8, 2025

Deploy Preview for zarf-docs ready!

Latest commit: 3a29caa
Latest deploy log: https://app.netlify.com/projects/zarf-docs/deploys/68a83d18f76d8b0008da6d92
Deploy Preview: https://deploy-preview-4064--zarf-docs.netlify.app


codecov bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 44.37870% with 94 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pkg/zoci/copier.go | 0.00% | 46 Missing ⚠️ |
| src/pkg/zoci/push.go | 54.92% | 27 Missing and 5 partials ⚠️ |
| src/pkg/packager/publish.go | 67.50% | 8 Missing and 5 partials ⚠️ |
| src/cmd/package.go | 25.00% | 3 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/cmd/viper.go | 54.76% <100.00%> (+1.67%) ⬆️ |
| src/pkg/packager/create.go | 56.79% <100.00%> (+2.24%) ⬆️ |
| src/pkg/zoci/common.go | 56.00% <ø> (ø) |
| src/cmd/package.go | 38.20% <25.00%> (-0.05%) ⬇️ |
| src/pkg/packager/publish.go | 63.46% <67.50%> (+0.34%) ⬆️ |
| src/pkg/zoci/push.go | 66.12% <54.92%> (-3.67%) ⬇️ |
| src/pkg/zoci/copier.go | 0.00% <0.00%> (ø) |

@brandtkeller brandtkeller changed the title feat(publish): add retry logic to publish operations feat!(publish): add retry logic to publish operations Aug 11, 2025
@brandtkeller brandtkeller marked this pull request as ready for review August 11, 2025 18:02
@brandtkeller brandtkeller requested review from a team as code owners August 11, 2025 18:02
@brandtkeller brandtkeller self-assigned this Aug 11, 2025
@mkcp (Contributor) left a comment

Great stuff! Some design considerations and nits.

@AustinAbro321 (Contributor) left a comment

A few small items. Also, before merging this I would like to ask the user if they still run into this with --oci-concurrency=1. I'll make a comment in the issue.

@brandtkeller (Member, Author) commented

> A few small items. Also before merging this I would like to ask the user if they still run into this with --oci-concurrency=1. I'll make a comment in the issue.

User feedback is always good. I'd probably wager we're solving for a separate constraint here than what oci-concurrency offers.

@AustinAbro321 (Contributor) commented Aug 11, 2025

I think there's a fair likelihood it's related; the majority of these are network flakes. I know that when we switched from crane to oras we enabled the oci-concurrency setting for image pushes. Setting it to one completely solved the problem for some users. Later on we set oci-concurrency to one automatically on retry, and we've heard far fewer reports of image push issues since then.

Setting `ociConcurrency = 1` is something we should consider here, if it solves the problem for users or makes the flake less likely.

There are differences between the usual flows though, so oci-concurrency could be unrelated. Usually image pushes go through a Kubernetes port-forward, whereas package publishes usually do not. The registry pod is also occasionally rolled, so multiple attempts by default are especially important in that case.

@brandtkeller (Member, Author) commented

> I think there's a fair likelihood it's related; the majority of these are network flakes. I know that when we switched from crane to oras we enabled the oci-concurrency setting for image pushes. Setting it to one completely solved the problem for some users. Later on we set oci-concurrency to one automatically on retry, and we've heard far fewer reports of image push issues since then.

I agree! I interpret this request as more of a robustness feature as opposed to a single-root-cause fix. Zarf should expect to run in network-constrained environments. Having the ability to retry operations that rely on the network should likely be a pattern we replicate more broadly.

@AustinAbro321 (Contributor) commented

@brandtkeller That's fair, and I might be too stringent on the request that we must know what the issue is in the user's case, but I do think it helps inform how we design the fallback mechanism. I'm a bit torn on whether this should be a feature, but I do see the argument, especially if there is only one attempt by default.

@AustinAbro321 (Contributor) left a comment

A few more comments based on the UX for this.

@brandtkeller brandtkeller requested a review from soltysh August 21, 2025 14:30
@soltysh (Contributor) left a comment

lgtm

@soltysh soltysh enabled auto-merge August 22, 2025 09:52
@soltysh soltysh dismissed AustinAbro321’s stale review August 22, 2025 11:37

Comments have been addressed

@soltysh soltysh added this pull request to the merge queue Aug 22, 2025
Merged via the queue into main with commit ea5dad1 Aug 22, 2025
40 checks passed
@soltysh soltysh deleted the 3787_publish_retries branch August 22, 2025 12:02
Ansible-man pushed a commit to Ansible-man/zarf that referenced this pull request Sep 6, 2025
Signed-off-by: Brandt Keller <brandt.keller@defenseunicorns.com>
Co-authored-by: Maciej Szulik <maciej@defenseunicorns.com>
Signed-off-by: Cade Thomas <cadethomas23@gmail.com>
Successfully merging this pull request may close these issues.

Support retries on package publishing
4 participants