Conversation

@IvanHunters (Collaborator) commented Sep 1, 2025

What this PR does

Release note

Fix race conditions for SeaweedFS and fix test preparation

Summary by CodeRabbit

  • Chores
    • Increased deployment timeouts to 10 minutes and set install/upgrade remediation to unlimited retries for SeaweedFS, ingress, and monitoring components to improve deployment resilience.
  • Tests
    • Extended end-to-end readiness waits for alerting components from 5 to 15 minutes for more stable test runs.

@coderabbitai bot (Contributor) commented Sep 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Added/updated HelmRelease remediation retry settings to -1 and increased HelmRelease timeouts to 10m for several charts; extended VictoriaMetrics readiness waits to 15m in the e2e install script.

Changes

  • Tenant SeaweedFS HelmRelease (packages/apps/tenant/templates/seaweedfs.yaml): set spec.install.remediation.retries: -1 and spec.upgrade.remediation.retries: -1; increased spec.timeout from 5m0s to 10m0s.
  • Extra SeaweedFS HelmRelease (packages/extra/seaweedfs/templates/seaweedfs.yaml): set spec.install.remediation.retries: -1 and spec.upgrade.remediation.retries: -1; increased spec.timeout from 5m0s to 10m0s.
  • Nginx Ingress HelmRelease (packages/extra/ingress/templates/nginx-ingress.yaml): added spec.install.remediation.retries: -1 and spec.upgrade.remediation.retries: -1; increased spec.timeout from 5m0s to 10m0s.
  • Monitoring HelmRelease (packages/apps/tenant/templates/monitoring.yaml): changed spec.install.remediation.retries and spec.upgrade.remediation.retries from 10 to -1.
  • E2E install script (hack/e2e-install-cozystack.bats): increased VictoriaMetrics readiness waits (vmalert/vmalert-shortterm and vmalertmanager/alertmanager) from 5m to 15m.
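The pattern applied across these HelmReleases can be sketched as follows (illustrative fragment only; field names follow the Flux HelmRelease API, and the exact apiVersion may be v2beta1/v2beta2/v2 depending on the Flux version in use):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: seaweedfs
spec:
  interval: 1m0s
  timeout: 10m0s          # raised from 5m0s
  install:
    remediation:
      retries: -1         # negative value = retry indefinitely
  upgrade:
    remediation:
      retries: -1
```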

Sequence Diagram(s)

sequenceDiagram
  participant CI as CI / Git
  participant Repo as Repo (manifests)
  participant Flux as Flux / Controller
  participant HR as HelmRelease
  participant K8s as Kubernetes
  participant Script as e2e-install script

  rect rgba(232,246,255,0.6)
    CI->>Repo: push manifests
    Repo->>Flux: apply/reconcile
    Flux->>HR: create/update HelmRelease
    note right of HR: spec.timeout = 10m\ninstall/upgrade remediation.retries = -1
    HR->>K8s: install/upgrade chart (longer timeout, unlimited retries)
  end

  rect rgba(232,255,232,0.6)
    Script->>K8s: wait for resources readiness
    note right of Script: VictoriaMetrics waits increased to 15m
    Script->>K8s: wait vmalert/vmalert-shortterm (15m)
    Script->>K8s: wait vmalertmanager/alertmanager (15m)
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • Release v0.35.2 #1359 — touches the e2e install script and readiness/reconcile handling; likely related to readiness timeout adjustments.

Suggested reviewers

  • lllamnyp
  • klinch0

Poem

Thump-thump, I hop where manifests flow,
Retries set endless, timeouts grow slow.
Tests wait patient, metrics stretch their time,
Helm rolls steady — carrots taste sublime. 🥕

@gemini-code-assist bot left a comment
Summary of Changes

Hello @IvanHunters, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving the stability and reliability of SeaweedFS deployments by making its Helm chart more robust against intermittent issues and extending the time allowed for its operations. It also encompasses preparatory work for related tests.

Highlights

  • SeaweedFS Helm Chart Robustness: Introduced remediation retries (10 attempts) for both install and upgrade operations within the SeaweedFS Helm chart to enhance resilience against transient failures and race conditions.
  • Increased Helm Chart Timeout: Extended the timeout for SeaweedFS Helm chart operations from 5 minutes to 10 minutes, providing more time for deployments to complete successfully.
  • Test Preparation: Includes preparations for tests, as indicated by the pull request title, to ensure the stability of the system.

@gemini-code-assist bot left a comment
Code Review

This pull request aims to improve the robustness of SeaweedFS Helm release deployments by adding remediation retries and increasing the timeout. These are positive changes. However, I've found a critical configuration issue in the HelmRelease resource where the reconciliation interval is shorter than the operation timeout. This violates FluxCD's requirements and will cause the deployment to fail. I have provided a specific comment and code suggestion to resolve this.

Comment on lines 26 to 27

   interval: 1m0s
 - timeout: 5m0s
 + timeout: 10m0s


critical

According to the FluxCD HelmRelease specification, the spec.interval must be greater than or equal to spec.timeout. With the current values (interval: 1m0s, timeout: 10m0s), the Helm release reconciliation will fail. The interval should be increased to be at least equal to the timeout.

  interval: 10m0s
  timeout: 10m0s

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 0

🧹 Nitpick comments (1)
packages/apps/tenant/templates/seaweedfs.yaml (1)

20-25: Consider making remediation explicit and double-check uninstall safety for a stateful chart.

  • The retries fields are valid. However:
    • On upgrade, explicitly set the remediation strategy (defaults to rollback) and consider enabling remediateLastFailure for clarity. (v2-5.docs.fluxcd.io, v2-4.docs.fluxcd.io)
    • On install, remediation performs an uninstall between attempts; verify this won’t purge SeaweedFS PVCs/data before enabling remediateLastFailure on the last failure. (v2-4.docs.fluxcd.io)

Apply if safe:

  install:
    remediation:
      retries: 10
+     remediateLastFailure: true   # enable only if uninstall is safe for data
  upgrade:
    remediation:
      retries: 10
+     strategy: rollback
+     remediateLastFailure: true
📜 Review details

Commits: reviewing files that changed from the base of the PR and between 33691c2 and 3b2d319.
Files selected for processing (1):
  • packages/apps/tenant/templates/seaweedfs.yaml (1 hunks)
🔇 Additional comments (1)
packages/apps/tenant/templates/seaweedfs.yaml (1)

27-27: Timeout bump looks good; just note the cumulative effect with retries.

Default is 5m; 10m extends per-operation wait for install/upgrade/rollback. With 10 retries this can prolong convergence; ensure this matches your CI expectations. (fluxcd.io)
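To illustrate that cumulative effect numerically (plain arithmetic, under the simplifying assumption that each attempt consumes the full per-action timeout and remediation runs back-to-back):

```shell
# Rough worst-case convergence bound: one initial attempt plus
# N remediation retries, each bounded by the per-action timeout.
timeout_min=10   # spec.timeout, in minutes
retries=10       # remediation.retries (the finite case)
worst_case_min=$(( (retries + 1) * timeout_min ))
echo "worst case: ${worst_case_min}m"   # prints "worst case: 110m"
```

With retries set to -1 there is no such bound at all, which is exactly why the later comments ask for alerting on repeated failures.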

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
packages/apps/tenant/templates/monitoring.yaml (1)

21-25: Unlimited retries (-1): add safeguards to avoid endless thrash on persistent failures

  • -1 is valid and means “retry forever.” This can mask broken charts/values and keep reconciling indefinitely.
  • Recommend enabling rollback on upgrade failures and surfacing alerts so humans can intervene.

Apply inside the existing blocks:

   install:
     remediation:
       retries: -1
+      # Optional: also persist last failure for visibility
+      remediateLastFailure: true

   upgrade:
     remediation:
       retries: -1
+      remediateLastFailure: true
+      strategy: rollback

Operational follow-ups:

  • Ensure alerting on repeated HelmRelease failures (e.g., Alertmanager/Slack) so unlimited retries don’t go unnoticed.
  • Confirm this change replaces any previously scripted reconcile workarounds in CI.
packages/apps/tenant/templates/seaweedfs.yaml (1)

20-25: Good call on remediation; suggest enabling rollback on upgrade for quicker recovery

Unlimited retries should reduce flakiness from transient SeaweedFS starts. To prevent repeated bad upgrades from looping, enable rollback and keep last failure for visibility.

   upgrade:
     remediation:
-      retries: -1
+      retries: -1
+      remediateLastFailure: true
+      strategy: rollback

Ops note:

  • Add/confirm alerting on HelmRelease failures for seaweedfs so infinite retries don’t silently burn cycles.
📜 Review details

Commits: reviewing files that changed from the base of the PR and between 235e17a and fecf6db.
Files selected for processing (2):
  • packages/apps/tenant/templates/monitoring.yaml (1 hunks)
  • packages/apps/tenant/templates/seaweedfs.yaml (1 hunks)
🔇 Additional comments (1)
packages/apps/tenant/templates/seaweedfs.yaml (1)

26-27: Approve unchanged interval/timeout settings
Flux supports a shorter reconcile interval than the per-action timeout; no constraint requiring interval >= timeout.

@kvaps (Member) left a comment
Tests failed; please also patch these two HelmReleases:

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 0

🧹 Nitpick comments (2)
packages/extra/ingress/templates/nginx-ingress.yaml (1)

18-23: Unlimited remediation retries (-1) are supported; consider adding guardrails.

Flux Helm Controller v2 allows negative integers for install/upgrade remediation.retries to mean unlimited retries, so this is valid. (v2-0.docs.fluxcd.io, fluxcd.io, github.com)

Optional hardening:

  • Explicitly set uninstall.keepHistory: false to avoid history bloat if remediation loops.
  • Add alerting on frequent remediations to catch persistent failures.

Example (place under spec):

uninstall:
  keepHistory: false
rollback:
  cleanupOnFail: true
packages/extra/seaweedfs/templates/seaweedfs.yaml (1)

54-59: Unlimited remediation retries are valid; add safeguards for stateful upgrades.

  • Using -1 for retries is supported by Flux and aligns with removing bespoke retry scripts. (v2-0.docs.fluxcd.io, v2-4.docs.fluxcd.io)
  • For a stateful system like SeaweedFS, consider:
    • Setting uninstall.keepHistory: false to prevent release history growth during repeated remediations.
    • Monitoring/alerts for repeated remediations to avoid silent churn/downtime.

Example (under spec):

uninstall:
  keepHistory: false
# Optional: be explicit, even though default strategy is rollback
upgrade:
  remediation:
    strategy: rollback
📜 Review details

Commits: reviewing files that changed from the base of the PR and between dcbf88d and f020341.
Files selected for processing (2):
  • packages/extra/ingress/templates/nginx-ingress.yaml (1 hunks)
  • packages/extra/seaweedfs/templates/seaweedfs.yaml (1 hunks)
🔇 Additional comments (2)
packages/extra/ingress/templates/nginx-ingress.yaml (1)

25-25: Timeout 10m0s: verify it’s sufficient under cluster load.

Ingress controller pods sometimes take longer on fresh clusters (image pulls, admission webhook startup). Please confirm 10m covers worst-case on CI and low-bandwidth nodes; adjust if flakes persist.

packages/extra/seaweedfs/templates/seaweedfs.yaml (1)

61-61: Timeout 10m0s: double-check against SeaweedFS first-boot timings.

Initial master/volume format, PVC binding, and CNPG readiness can exceed 10m on constrained clusters. If e2e still flakes, consider 12–15m here to match test waits.

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
hack/e2e-install-cozystack.bats (4)

128-130: Wait long enough for tenant HelmReleases (HRs) given 10m chart timeouts

HRs now have 10m timeouts/remediation; the current 4m can fail prematurely.

-  kubectl wait hr/etcd hr/ingress hr/tenant-root hr/seaweedfs -n tenant-root --timeout=4m --for=condition=ready
+  kubectl wait hr/etcd hr/ingress hr/tenant-root hr/seaweedfs -n tenant-root --timeout=10m --for=condition=ready

132-135: Monitoring HR wait is too short vs. chart timeout/remediation

2m is below the chart/Flux timelines; increase to 10m to match.

-  if ! kubectl wait hr/monitoring -n tenant-root --timeout=2m --for=condition=ready; then
+  if ! kubectl wait hr/monitoring -n tenant-root --timeout=10m --for=condition=ready; then
     flux reconcile hr monitoring -n tenant-root --force
-    kubectl wait hr/monitoring -n tenant-root --timeout=2m --for=condition=ready
+    kubectl wait hr/monitoring -n tenant-root --timeout=10m --for=condition=ready
   fi

137-140: SeaweedFS HR wait is too short; raises race risk with your SeaweedFS fixes

Mirror the chart timeout (10m) to eliminate flakes.

-  if ! kubectl wait hr/seaweedfs-system -n tenant-root --timeout=2m --for=condition=ready; then
+  if ! kubectl wait hr/seaweedfs-system -n tenant-root --timeout=10m --for=condition=ready; then
     flux reconcile hr seaweedfs-system -n tenant-root --force
-    kubectl wait hr/seaweedfs-system -n tenant-root --timeout=2m --for=condition=ready
+    kubectl wait hr/seaweedfs-system -n tenant-root --timeout=10m --for=condition=ready
   fi

34-38: Test does not fail when HRs aren’t Ready

The “Some HelmReleases failed to reconcile” branch doesn’t exit non‑zero, so the test passes despite failures.

   if kubectl get hr -A | grep -v " True " | grep -v NAME; then
     kubectl get hr -A
     echo "Some HelmReleases failed to reconcile" >&2
+    exit 1
   fi
🧹 Nitpick comments (1)
hack/e2e-install-cozystack.bats (1)

163-167: Make the Grafana ingress check deterministic and fail fast if IP missing

“Found” string matching can be brittle. Check HTTP status and guard against empty LB IP for clearer failures.

-  ingress_ip=$(kubectl get svc root-ingress-controller -n tenant-root -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
-  if ! curl -sS -k "https://${ingress_ip}" -H 'Host: grafana.example.org' --max-time 30 | grep -q Found; then
+  ingress_ip=$(kubectl get svc root-ingress-controller -n tenant-root -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
+  if [ -z "$ingress_ip" ]; then
+    echo "Ingress IP not assigned yet" >&2
+    kubectl get svc root-ingress-controller -n tenant-root -o yaml >&2
+    exit 1
+  fi
+  if [ "$(curl -sS -k -o /dev/null -w "%{http_code}" "https://${ingress_ip}" -H 'Host: grafana.example.org' --max-time 30)" != "302" ]; then
     echo "Failed to access Grafana via ingress at ${ingress_ip}" >&2
     exit 1
   fi
📜 Review details

Commits: reviewing files that changed from the base of the PR and between f020341 and 75ca80d.
Files selected for processing (1):
  • hack/e2e-install-cozystack.bats (2 hunks)
🔇 Additional comments (1)
hack/e2e-install-cozystack.bats (1)

142-142: No-op whitespace change

Nothing to review here.

Comment on lines +154 to 157
kubectl wait vmalert/vmalert-shortterm vmalertmanager/alertmanager -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=15m
kubectl wait vlogs/generic -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=5m
kubectl wait vmcluster/shortterm vmcluster/longterm -n tenant-root --for=jsonpath='{.status.clusterStatus}'=operational --timeout=5m

🛠️ Refactor suggestion

VM alert waits: good bump; align the rest to avoid flakiness

15m for vmalert/vmalertmanager is sensible. To prevent partial readiness causing false negatives, bump the other VictoriaMetrics waits too (they can legitimately exceed 5m on cold clusters).

-  kubectl wait vlogs/generic -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=5m
-  kubectl wait vmcluster/shortterm vmcluster/longterm -n tenant-root --for=jsonpath='{.status.clusterStatus}'=operational --timeout=5m
+  kubectl wait vlogs/generic -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=10m
+  kubectl wait vmcluster/shortterm vmcluster/longterm -n tenant-root --for=jsonpath='{.status.clusterStatus}'=operational --timeout=10m
🤖 Prompt for AI Agents
In hack/e2e-install-cozystack.bats around lines 154 to 157 the kubectl wait
timeouts for vlogs/generic and vmcluster/shortterm vmcluster/longterm are set to
5m which can cause flakiness on cold clusters; change their --timeout values to
15m (matching the vmalert/vmalertmanager wait) so all VictoriaMetrics-related
waits use 15m to avoid false negatives.

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>

@IvanHunters force-pushed the fix-race-conditions-seaweedfs branch from 98d7e20 to 8e906be on September 1, 2025 at 13:34
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
packages/apps/tenant/templates/monitoring.yaml (1)

21-23: Limit HelmRelease install retries to a finite number
Setting remediation.retries: -1 triggers unlimited retries, which can mask persistent chart failures and lead to continuous churn. Use a high but bounded value (e.g., 60–120) or leverage spec.dependsOn if startup ordering is the root cause.

-    remediation:
-      retries: -1
+    remediation:
+      retries: 120  # ~2h at 1m intervals
📜 Review details

Commits: reviewing files that changed from the base of the PR and between 75ca80d and 8e906be.
Files selected for processing (5):
  • hack/e2e-install-cozystack.bats (2 hunks)
  • packages/apps/tenant/templates/monitoring.yaml (1 hunks)
  • packages/apps/tenant/templates/seaweedfs.yaml (1 hunks)
  • packages/extra/ingress/templates/nginx-ingress.yaml (1 hunks)
  • packages/extra/seaweedfs/templates/seaweedfs.yaml (1 hunks)
Files skipped from review as similar to previous changes (4):
  • hack/e2e-install-cozystack.bats
  • packages/apps/tenant/templates/seaweedfs.yaml
  • packages/extra/ingress/templates/nginx-ingress.yaml
  • packages/extra/seaweedfs/templates/seaweedfs.yaml
🔇 Additional comments (1)
packages/apps/tenant/templates/monitoring.yaml (1)

24-25: Ignore the remediateLastFailure suggestion
With retries: -1 (infinite retries), remediation never exhausts and remediateLastFailure is redundant—no changes needed.

Likely an incorrect or invalid review comment.

@kvaps (Member) left a comment
LGTM

@kvaps merged commit c4e048b into main on Sep 1, 2025; 36 of 37 checks passed.
@kvaps deleted the fix-race-conditions-seaweedfs branch on September 1, 2025 at 17:46.