fix race conditions for seaweedfs and fix tests preparing #1371
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
Added/updated HelmRelease remediation retry settings to -1 and increased HelmRelease timeouts to 10m for several charts; extended VictoriaMetrics readiness waits to 15m in the e2e install script.
Sequence Diagram(s)
sequenceDiagram
participant CI as CI / Git
participant Repo as Repo (manifests)
participant Flux as Flux / Controller
participant HR as HelmRelease
participant K8s as Kubernetes
participant Script as e2e-install script
rect rgba(232,246,255,0.6)
CI->>Repo: push manifests
Repo->>Flux: apply/reconcile
Flux->>HR: create/update HelmRelease
note right of HR: spec.timeout = 10m\ninstall.upgrade.remediation.retries = -1
HR->>K8s: install/upgrade chart (longer timeout, unlimited retries)
end
rect rgba(232,255,232,0.6)
Script->>K8s: wait for resources readiness
note right of Script: VictoriaMetrics waits increased to 15m
Script->>K8s: wait vmalert/vmalert-shortterm (15m)
Script->>K8s: wait vmalertmanager/alertmanager (15m)
end
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Summary of Changes
Hello @IvanHunters, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on improving the stability and reliability of SeaweedFS deployments by making its Helm chart more robust against intermittent issues and extending the time allowed for its operations. It also encompasses preparatory work for related tests.
Highlights
- SeaweedFS Helm Chart Robustness: Introduced remediation retries (10 attempts) for both install and upgrade operations within the SeaweedFS Helm chart to enhance resilience against transient failures and race conditions.
- Increased Helm Chart Timeout: Extended the timeout for SeaweedFS Helm chart operations from 5 minutes to 10 minutes, providing more time for deployments to complete successfully.
- Test Preparation: Includes preparations for tests, as indicated by the pull request title, to ensure the stability of the system.
Code Review
This pull request aims to improve the robustness of SeaweedFS Helm release deployments by adding remediation retries and increasing the timeout. These are positive changes. However, I've found a critical configuration issue in the HelmRelease resource where the reconciliation interval is shorter than the operation timeout. This violates FluxCD's requirements and will cause the deployment to fail. I have provided a specific comment and code suggestion to resolve this.
  interval: 1m0s
- timeout: 5m0s
+ timeout: 10m0s
According to the FluxCD HelmRelease specification, the spec.interval must be greater than or equal to spec.timeout. With the current values (interval: 1m0s, timeout: 10m0s), the Helm release reconciliation will fail. The interval should be increased to be at least equal to the timeout.
interval: 10m0s
timeout: 10m0s
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/apps/tenant/templates/seaweedfs.yaml (1)
20-25: Consider making remediation explicit and double-check uninstall safety for a stateful chart.
- The retries fields are valid. However:
- On upgrade, explicitly set the remediation strategy (defaults to rollback) and consider enabling remediateLastFailure for clarity. (v2-5.docs.fluxcd.io, v2-4.docs.fluxcd.io)
- On install, remediation performs an uninstall between attempts; verify this won’t purge SeaweedFS PVCs/data before enabling remediateLastFailure on the last failure. (v2-4.docs.fluxcd.io)
Apply if safe:
  install:
    remediation:
      retries: 10
+     remediateLastFailure: true # enable only if uninstall is safe for data
  upgrade:
    remediation:
      retries: 10
+     strategy: rollback
+     remediateLastFailure: true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- packages/apps/tenant/templates/seaweedfs.yaml (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build
🔇 Additional comments (1)
packages/apps/tenant/templates/seaweedfs.yaml (1)
27: Timeout bump looks good; just note the cumulative effect with retries.
Default is 5m; 10m extends per-operation wait for install/upgrade/rollback. With 10 retries this can prolong convergence; ensure this matches your CI expectations. (fluxcd.io)
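To make the cumulative effect concrete, here is a back-of-the-envelope sketch. It assumes one initial attempt plus `retries` further attempts, each bounded by the chart timeout, and ignores the time spent on the remediation actions themselves (uninstall or rollback). The function name `worst_case_minutes` is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Rough upper bound, in minutes, on how long a HelmRelease may keep trying.
# Assumption: total attempts = 1 initial + `retries`, each capped at the timeout.
worst_case_minutes() {
  local timeout_min=$1 retries=$2
  if [ "$retries" -lt 0 ]; then
    # A negative retries value means "unlimited", so no bound exists.
    echo "unbounded"
    return 0
  fi
  echo $(( timeout_min * (retries + 1) ))
}

worst_case_minutes 10 10   # 10m timeout with 10 retries
worst_case_minutes 10 -1   # 10m timeout with unlimited retries
```

Under these assumptions, a 10m timeout with 10 retries could take close to two hours before the release is reported as failed, which is worth comparing against the CI job limit.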
Actionable comments posted: 0
🧹 Nitpick comments (2)
packages/apps/tenant/templates/monitoring.yaml (1)
21-25: Unlimited retries (-1): add safeguards to avoid endless thrash on persistent failures
- -1 is valid and means “retry forever.” This can mask broken charts/values and keep reconciling indefinitely.
- Recommend enabling rollback on upgrade failures and surfacing alerts so humans can intervene.
Apply inside the existing blocks:
  install:
    remediation:
      retries: -1
+     # Optional: also persist last failure for visibility
+     remediateLastFailure: true
  upgrade:
    remediation:
      retries: -1
+     remediateLastFailure: true
+     strategy: rollback
Operational follow-ups:
- Ensure alerting on repeated HelmRelease failures (e.g., Alertmanager/Slack) so unlimited retries don’t go unnoticed.
- Confirm this change replaces any previously scripted reconcile workarounds in CI.
packages/apps/tenant/templates/seaweedfs.yaml (1)
20-25: Good call on remediation; suggest enabling rollback on upgrade for quicker recovery
Unlimited retries should reduce flakiness from transient SeaweedFS starts. To prevent repeated bad upgrades from looping, enable rollback and keep last failure for visibility.
  upgrade:
    remediation:
-     retries: -1
+     retries: -1
+     remediateLastFailure: true
+     strategy: rollback
Ops note:
- Add/confirm alerting on HelmRelease failures for seaweedfs so infinite retries don’t silently burn cycles.
📒 Files selected for processing (2)
- packages/apps/tenant/templates/monitoring.yaml (1 hunk)
- packages/apps/tenant/templates/seaweedfs.yaml (1 hunk)
🔇 Additional comments (1)
packages/apps/tenant/templates/seaweedfs.yaml (1)
26-27: Approve unchanged interval/timeout settings
Flux supports a shorter reconcile interval than the per-action timeout; no constraint requiring interval >= timeout.
Tests are failing; please also patch these two HelmReleases:
kind: HelmRelease
kind: HelmRelease
Actionable comments posted: 0
🧹 Nitpick comments (2)
packages/extra/ingress/templates/nginx-ingress.yaml (1)
18-23: Unlimited remediation retries (-1) are supported; consider adding guardrails.
Flux Helm Controller v2 allows negative integers for install/upgrade remediation.retries to mean unlimited retries, so this is valid. (v2-0.docs.fluxcd.io, fluxcd.io, github.com)
Optional hardening:
- Explicitly set uninstall.keepHistory: false to avoid history bloat if remediation loops.
- Add alerting on frequent remediations to catch persistent failures.
Example (place under spec):
  uninstall:
    keepHistory: false
  rollback:
    cleanupOnFail: true
packages/extra/seaweedfs/templates/seaweedfs.yaml (1)
54-59: Unlimited remediation retries are valid; add safeguards for stateful upgrades.
- Using -1 for retries is supported by Flux and aligns with removing bespoke retry scripts. (v2-0.docs.fluxcd.io, v2-4.docs.fluxcd.io)
- For a stateful system like SeaweedFS, consider:
- Setting uninstall.keepHistory: false to prevent release history growth during repeated remediations.
- Monitoring/alerts for repeated remediations to avoid silent churn/downtime.
Example (under spec):
  uninstall:
    keepHistory: false
  # Optional: be explicit, even though default strategy is rollback
  upgrade:
    remediation:
      strategy: rollback
📒 Files selected for processing (2)
- packages/extra/ingress/templates/nginx-ingress.yaml (1 hunk)
- packages/extra/seaweedfs/templates/seaweedfs.yaml (1 hunk)
🔇 Additional comments (2)
packages/extra/ingress/templates/nginx-ingress.yaml (1)
25: Timeout 10m0s: verify it’s sufficient under cluster load.
Ingress controller pods sometimes take longer on fresh clusters (image pulls, admission webhook startup). Please confirm 10m covers worst-case on CI and low-bandwidth nodes; adjust if flakes persist.
packages/extra/seaweedfs/templates/seaweedfs.yaml (1)
61: Timeout 10m0s: double-check against SeaweedFS first-boot timings.
Initial master/volume format, PVC binding, and CNPG readiness can exceed 10m on constrained clusters. If e2e still flakes, consider 12–15m here to match test waits.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
hack/e2e-install-cozystack.bats (4)
128-130: Wait long enough for tenant HelmReleases (HRs) given 10m chart timeouts
HRs now have 10m timeouts/remediation; the current 4m can fail prematurely.
- kubectl wait hr/etcd hr/ingress hr/tenant-root hr/seaweedfs -n tenant-root --timeout=4m --for=condition=ready
+ kubectl wait hr/etcd hr/ingress hr/tenant-root hr/seaweedfs -n tenant-root --timeout=10m --for=condition=ready
132-135: Monitoring HR wait is too short vs. chart timeout/remediation
2m is below the chart/Flux timelines; increase to 10m to match.
- if ! kubectl wait hr/monitoring -n tenant-root --timeout=2m --for=condition=ready; then
+ if ! kubectl wait hr/monitoring -n tenant-root --timeout=10m --for=condition=ready; then
    flux reconcile hr monitoring -n tenant-root --force
-   kubectl wait hr/monitoring -n tenant-root --timeout=2m --for=condition=ready
+   kubectl wait hr/monitoring -n tenant-root --timeout=10m --for=condition=ready
  fi
137-140: SeaweedFS HR wait is too short; raises race risk with your SeaweedFS fixes
Mirror the chart timeout (10m) to eliminate flakes.
- if ! kubectl wait hr/seaweedfs-system -n tenant-root --timeout=2m --for=condition=ready; then
+ if ! kubectl wait hr/seaweedfs-system -n tenant-root --timeout=10m --for=condition=ready; then
    flux reconcile hr seaweedfs-system -n tenant-root --force
-   kubectl wait hr/seaweedfs-system -n tenant-root --timeout=2m --for=condition=ready
+   kubectl wait hr/seaweedfs-system -n tenant-root --timeout=10m --for=condition=ready
  fi
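The monitoring and SeaweedFS blocks share the same wait, reconcile, wait-again shape, so the timeout could live in one place. The following is only a sketch: the function name `wait_hr_with_reconcile` and the `KUBECTL`/`FLUX` override variables are hypothetical (they are not in the test script) and exist so the helper can be exercised without a cluster. The `kubectl wait` and `flux reconcile` invocations themselves are the ones from the diffs above.

```shell
#!/usr/bin/env bash
# Wait for a HelmRelease to become Ready; if the wait fails, force one
# reconcile and wait once more. Returns the status of the last wait.
# KUBECTL and FLUX are hypothetical hooks that default to the real binaries.
wait_hr_with_reconcile() {
  local name=$1 namespace=$2 timeout=$3
  if ! "${KUBECTL:-kubectl}" wait "hr/${name}" -n "$namespace" \
        --timeout="$timeout" --for=condition=ready; then
    "${FLUX:-flux}" reconcile hr "$name" -n "$namespace" --force
    "${KUBECTL:-kubectl}" wait "hr/${name}" -n "$namespace" \
        --timeout="$timeout" --for=condition=ready
  fi
}

# Possible usage inside the test (not executed here):
#   wait_hr_with_reconcile monitoring tenant-root 10m
#   wait_hr_with_reconcile seaweedfs-system tenant-root 10m
```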
34-38: Test does not fail when HRs aren’t Ready
The “Some HelmReleases failed to reconcile” branch doesn’t exit non‑zero, so the test passes despite failures.
  if kubectl get hr -A | grep -v " True " | grep -v NAME; then
    kubectl get hr -A
    echo "Some HelmReleases failed to reconcile" >&2
+   exit 1
  fi
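An alternative to the double `grep -v` is to check the READY column explicitly and return a non-zero status. This is a sketch under a stated assumption: that `kubectl get hr -A` prints READY as the fourth whitespace-separated column (NAMESPACE, NAME, AGE, READY, STATUS). The function name `check_hr_ready` and the sample rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Reads `kubectl get hr -A`-style output on stdin, lists every row that is
# not Ready, and exits non-zero if there is at least one.
# Assumption: READY is column 4; adjust if your output differs.
check_hr_ready() {
  awk 'NR > 1 && $4 != "True" { print "not ready: " $1 "/" $2; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Demo with made-up sample rows instead of a live cluster.
printf '%s\n' \
  'NAMESPACE NAME AGE READY STATUS' \
  'tenant-root etcd 5m True ok' \
  'tenant-root seaweedfs 5m False failed' \
  | check_hr_ready || echo "test would fail here"
```

In the test itself this might be wired up as `kubectl get hr -A | check_hr_ready || exit 1`.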
🧹 Nitpick comments (1)
hack/e2e-install-cozystack.bats (1)
163-167: Make the Grafana ingress check deterministic and fail fast if IP missing
“Found” string matching can be brittle. Check HTTP status and guard against empty LB IP for clearer failures.
- ingress_ip=$(kubectl get svc root-ingress-controller -n tenant-root -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
- if ! curl -sS -k "https://${ingress_ip}" -H 'Host: grafana.example.org' --max-time 30 | grep -q Found; then
+ ingress_ip=$(kubectl get svc root-ingress-controller -n tenant-root -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
+ if [ -z "$ingress_ip" ]; then
+   echo "Ingress IP not assigned yet" >&2
+   kubectl get svc root-ingress-controller -n tenant-root -o yaml >&2
+   exit 1
+ fi
+ if [ "$(curl -sS -k -o /dev/null -w "%{http_code}" "https://${ingress_ip}" -H 'Host: grafana.example.org' --max-time 30)" != "302" ]; then
    echo "Failed to access Grafana via ingress at ${ingress_ip}" >&2
    exit 1
  fi
📒 Files selected for processing (1)
- hack/e2e-install-cozystack.bats (2 hunks)
🔇 Additional comments (1)
hack/e2e-install-cozystack.bats (1)
142: No-op whitespace change
Nothing to review here.
kubectl wait vmalert/vmalert-shortterm vmalertmanager/alertmanager -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=15m
kubectl wait vlogs/generic -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=5m
kubectl wait vmcluster/shortterm vmcluster/longterm -n tenant-root --for=jsonpath='{.status.clusterStatus}'=operational --timeout=5m
🛠️ Refactor suggestion
VM alert waits: good bump; align the rest to avoid flakiness
15m for vmalert/vmalertmanager is sensible. To prevent partial readiness causing false negatives, bump the other VictoriaMetrics waits too (they can legitimately exceed 5m on cold clusters).
- kubectl wait vlogs/generic -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=5m
- kubectl wait vmcluster/shortterm vmcluster/longterm -n tenant-root --for=jsonpath='{.status.clusterStatus}'=operational --timeout=5m
+ kubectl wait vlogs/generic -n tenant-root --for=jsonpath='{.status.updateStatus}'=operational --timeout=10m
+ kubectl wait vmcluster/shortterm vmcluster/longterm -n tenant-root --for=jsonpath='{.status.clusterStatus}'=operational --timeout=10m
🤖 Prompt for AI Agents
In hack/e2e-install-cozystack.bats around lines 154 to 157 the kubectl wait
timeouts for vlogs/generic and vmcluster/shortterm vmcluster/longterm are set to
5m which can cause flakiness on cold clusters; change their --timeout values to
15m (matching the vmalert/vmalertmanager wait) so all VictoriaMetrics-related
waits use 15m to avoid false negatives.
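One way to keep these waits from drifting apart again is to route them through a single helper with one shared timeout. This is a minimal sketch, not the script's actual structure: `VM_WAIT_TIMEOUT`, `KUBECTL`, `wait_vm_operational`, and `run_vm_waits` are hypothetical names, while the resources, namespace, and jsonpath fields are taken from the lines under review.

```shell
#!/usr/bin/env bash
# Shared timeout for every VictoriaMetrics readiness wait (hypothetical knob).
VM_WAIT_TIMEOUT=${VM_WAIT_TIMEOUT:-15m}

# $1 is the status field to check; the remaining arguments are the resources.
# KUBECTL is a hypothetical override that defaults to the real binary.
wait_vm_operational() {
  local field=$1
  shift
  "${KUBECTL:-kubectl}" wait "$@" -n tenant-root \
    --for=jsonpath="{.status.${field}}"=operational --timeout="$VM_WAIT_TIMEOUT"
}

# The three waits from the test, now sharing one timeout.
run_vm_waits() {
  wait_vm_operational updateStatus vmalert/vmalert-shortterm vmalertmanager/alertmanager
  wait_vm_operational updateStatus vlogs/generic
  wait_vm_operational clusterStatus vmcluster/shortterm vmcluster/longterm
}
```

Setting `KUBECTL=echo` prints the commands instead of running them, which is a cheap way to confirm that all three waits pick up the same timeout.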
Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
98d7e20 to 8e906be
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/apps/tenant/templates/monitoring.yaml (1)
21-23: Limit HelmRelease install retries to a finite number
Setting remediation.retries: -1 triggers unlimited retries, which can mask persistent chart failures and lead to continuous churn. Use a high but bounded value (e.g., 60–120) or leverage spec.dependsOn if startup ordering is the root cause.
-   remediation:
-     retries: -1
+   remediation:
+     retries: 120 # ~2h at 1m intervals
📒 Files selected for processing (5)
- hack/e2e-install-cozystack.bats (2 hunks)
- packages/apps/tenant/templates/monitoring.yaml (1 hunk)
- packages/apps/tenant/templates/seaweedfs.yaml (1 hunk)
- packages/extra/ingress/templates/nginx-ingress.yaml (1 hunk)
- packages/extra/seaweedfs/templates/seaweedfs.yaml (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (4)
- hack/e2e-install-cozystack.bats
- packages/apps/tenant/templates/seaweedfs.yaml
- packages/extra/ingress/templates/nginx-ingress.yaml
- packages/extra/seaweedfs/templates/seaweedfs.yaml
🔇 Additional comments (1)
packages/apps/tenant/templates/monitoring.yaml (1)
24-25: Ignore the remediateLastFailure suggestion
With retries: -1 (infinite retries), remediation never exhausts and remediateLastFailure is redundant; no changes needed.
Likely an incorrect or invalid review comment.
LGTM