Skip to content
This repository was archived by the owner on Jul 11, 2023. It is now read-only.

Add more robust CRD conversion patching #5300

Merged
merged 1 commit into from
Mar 30, 2023

Conversation

keithmattix
Copy link
Contributor

@keithmattix keithmattix commented Mar 27, 2023

Change the reoncile label to refer to the version of OSM that is in charge of reconciling the CRDs. This ensures that on upgrade, the new version is the only version managing the CRDs

Description:
Fixes the upgrade bug from pre-v1.x.3 patches

Testing done:
E2E tests

Affected area:

Functional Area
Control Plane [x]
Install [x]
Upgrade [x]

Please answer the following questions with yes/no.

  1. Does this change contain code from or inspired by another project? no

    • Did you notify the maintainers and provide attribution? N/A
  2. Is this a breaking change? no

  3. Has documentation corresponding to this change been updated in the osm-docs repo (if applicable)? N/A

@codecov-commenter
Copy link

codecov-commenter commented Mar 27, 2023

Codecov Report

Merging #5300 (4e6da88) into main (dc3f841) will decrease coverage by 0.02%.
The diff coverage is 0.00%.

❗ Current head 4e6da88 differs from pull request most recent head b3b0f9a. Consider uploading reports for the commit b3b0f9a to get more accurate results

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main    #5300      +/-   ##
==========================================
- Coverage   69.53%   69.52%   -0.02%     
==========================================
  Files         197      197              
  Lines       16070    16071       +1     
==========================================
- Hits        11174    11173       -1     
- Misses       4839     4841       +2     
  Partials       57       57              
Flag Coverage Δ
unittests 69.52% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
cmd/cli/version.go 43.96% <0.00%> (ø)
cmd/osm-bootstrap/osm-bootstrap.go 45.42% <0.00%> (ø)
pkg/reconciler/client.go 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@steeling
Copy link
Contributor

Can we also set rollout strategy on the bootstrap deployment to replace?

@keithmattix
Copy link
Contributor Author

Can we also set rollout strategy on the bootstrap deployment to replace?

I'm not a huge fan of that change honestly. Replace effectively guarantees downtime in the gap between the old replicaset going down and the new one becoming Ready

@steeling
Copy link
Contributor

Can we also set rollout strategy on the bootstrap deployment to replace?

I'm not a huge fan of that change honestly. Replace effectively guarantees downtime in the gap between the old replicaset going down and the new one becoming Ready

The only thing the bootstrap pod does after it starts up (after it creates/updates CRD's on startup), is the reconciler. The bootstrap pod should really be a job anyways. IMO. Not blocking on this though

Technically it has a webhook for:

  1. metrics
  2. version reporting
  3. health checks

@keithmattix
Copy link
Contributor Author

@steeling it's those webhooks I'm worried about, especially during upgrades. Any client or operator of an OSM resource that's using an older CRD version will end up triggering a conversion request to OSM bootstrap which, for a time, wouldn't have any replicas able to serve traffic

@steeling
Copy link
Contributor

@steeling it's those webhooks I'm worried about, especially during upgrades. Any client or operator of an OSM resource that's using an older CRD version will end up triggering a conversion request to OSM bootstrap which, for a time, wouldn't have any replicas able to serve traffic

I'm not sure a couple seconds of downtime on a small subset of operations is necessarily a bad thing during an upgrade, given the issues we've had here. Adding further complexity to the code, and banking on allowing an old reconciler to flip/flop back to prior states is undesirable IMO

@keithmattix keithmattix marked this pull request as draft March 28, 2023 14:44
@keithmattix keithmattix marked this pull request as ready for review March 28, 2023 16:16
@keithmattix
Copy link
Contributor Author

I'm not sure a couple seconds of downtime on a small subset of operations is necessarily a bad thing during an upgrade, given the issues we've had here. Adding further complexity to the code, and banking on allowing an old reconciler to flip/flop back to prior states is undesirable IMO

If anyone builds on top of OSM, we've now introduced errors/downtime in that upgrade path. Downtime during upgrade can be ok if all parties are aware and plan for it. To me, making that kind of change in a patch release is unexpected and error prone

@jaellio
Copy link
Contributor

jaellio commented Mar 29, 2023

Hey @keithmattix, could you update the PR description to describe the osm version label approach? Thanks!

@keithmattix
Copy link
Contributor Author

Good catch @jaellio! Changed it

Copy link
Contributor

@jaellio jaellio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@keithmattix keithmattix requested a review from steeling March 29, 2023 23:13
@@ -243,7 +242,7 @@ func applyOrUpdateCRDs(crdClient *apiclient.ApiextensionsV1Client) {
log.Fatal().Err(err).Msgf("Error decoding CRD file %s", file)
}

crd.Labels[constants.ReconcileLabel] = strconv.FormatBool(enableReconciler)
crd.Labels[constants.ReconcileLabel] = osmVersion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe conditionally add the label based on enableReconciler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense in general, but I'm wary of changing more behavior than I need to. The label wouldn't actually be used until the reconciler is turned on, so I think it's safe to keep this logic as is

Watch CRDs over a span of 15s to make sure that the CRD conversion patch
is not overwritten by another reconciler

Signed-off-by: Keith Mattix II <keithmattix2@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants