Skip to content
This repository was archived by the owner on Jul 11, 2023. It is now read-only.

fix(certs): update checkAndRotate to use current durations #4800

Merged
merged 3 commits into from
Jun 22, 2022

Conversation

jaellio
Copy link
Contributor

@jaellio jaellio commented Jun 8, 2022

Description:

Updates the checkAndRotate method to update certs with the current
validity duration specified in the MeshConfig for their respective
cert type. Adds a certType to the certificate struct that is used
to lookup validity duration when calling IssuingCertificate.
Updates shouldRotate to check if the cert needs to be rotated due
to an in-progress root certificate rotation. GetFromCache will
return nil if the cert is not found in the cache, expiring, or
needs to be rotated because of a root cert rotation.

Resolves #4810

Testing done:

  • Verified expiring certs are re-issued with the current serviceCertValidityDuration as defined in the OSM MeshConfig. Currently, expiring service certs are only re-issued with the original serviceCertValidityDuration.
    • Verified the bug on the main branch by running the osm demo with the serviceCertValidityDuration set to 5m. After OSM install, I patched the MeshConfig and updated the validity duration to 4m. Expiring certs were reissued using the original 5m validity duration.
    • Verified the bug is no longer present with this change by following the same steps as above
  • CI

Affected area:

Functional Area
New Functionality [x]
Certificate Management [x]

Please answer the following questions with yes/no.

  1. Does this change contain code from or inspired by another project? no

    • Did you notify the maintainers and provide attribution?
  2. Is this a breaking change? no

  3. Has documentation corresponding to this change been updated in the osm-docs repo (if applicable)?

@jaellio jaellio force-pushed the addCertTypes branch 7 times, most recently from 1e833a9 to 76ad808 Compare June 9, 2022 20:51
@jaellio jaellio marked this pull request as ready for review June 9, 2022 20:51
@codecov-commenter
Copy link

codecov-commenter commented Jun 9, 2022

Codecov Report

Merging #4800 (ee0d641) into main (8da8732) will decrease coverage by 0.06%.
The diff coverage is 73.68%.

@@            Coverage Diff             @@
##             main    #4800      +/-   ##
==========================================
- Coverage   69.47%   69.41%   -0.07%     
==========================================
  Files         218      218              
  Lines       15766    15819      +53     
==========================================
+ Hits        10954    10980      +26     
- Misses       4760     4787      +27     
  Partials       52       52              
Flag Coverage Δ
unittests 69.41% <73.68%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
cmd/osm-controller/osm-controller.go 14.46% <0.00%> (ø)
pkg/certificate/certificate.go 94.87% <ø> (-0.97%) ⬇️
pkg/certificate/fake_manager.go 69.11% <0.00%> (-2.10%) ⬇️
pkg/certificate/options.go 33.33% <ø> (-31.38%) ⬇️
pkg/certificate/providers/types.go 0.00% <ø> (ø)
pkg/certificate/types.go 100.00% <ø> (ø)
pkg/configurator/methods.go 65.38% <0.00%> (-6.65%) ⬇️
pkg/configurator/mock_client_generated.go 0.00% <0.00%> (ø)
pkg/certificate/manager.go 79.25% <89.10%> (+0.79%) ⬆️
pkg/certificate/providers/config.go 76.82% <100.00%> (ø)
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8da8732...ee0d641. Read the comment docs.

@jaellio jaellio changed the title fix(certs): updates checkAndRotate to use current durations fix(certs): update checkAndRotate to use current durations Jun 9, 2022
@jaellio jaellio force-pushed the addCertTypes branch 3 times, most recently from 39990d4 to 378ce46 Compare June 14, 2022 17:43
continue
}
if newCert != cert {
// Certificate was rotated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there's a race condition here where IssueCertificate could have triggered a rotation between the call to list, and the check here.

One thought is we can add the logic to issuecertificate for messaging on a rotation if we need to rotate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved publishing to the cert pub-sub to IssueCertificate to ensure a message is published whenever a certificate is rotated (including outside of checkAndRotate).

cert = cert.newMergedWithRoot(validatingIssuer.CertificateAuthority)
}
validityDuration := m.getValidityDurationForCertType(ct)
cert, err = signingIssuer.IssueCertificate(cn, validityDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with my above comment, there's still a race condition if issue gets called twice in a row.

singleflight is a package meant to fix exactly that issue, so we may want to leverage that here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated IssueCertificate so all calls to issue certificates are included in group.Do.

@jaellio jaellio force-pushed the addCertTypes branch 4 times, most recently from 0730a64 to 637fd7a Compare June 14, 2022 23:39
@@ -162,6 +165,22 @@ func (c *Client) GetServiceCertValidityPeriod() time.Duration {
return validityDuration
}

// GetIngressGatewayCertValidityPeriod returns the validity duration for ingress gateway certificates, and a default in case of unspecified or invalid duration
func (c *Client) GetIngressGatewayCertValidityPeriod() time.Duration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case to use a default if unspecified? I'd rather make this field required in the CRD spec vs assuming user intent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If GetIngressGatewaCertValidityPeriod() is somehow called when the ingressGateway field isn't specified should it return an error? Currently, If the ingressGateway ValidityDuration field is updated and the duration is invalid the ingressGatewayCert won't be updated. On rotation of the ingressGateway cert we'll attempt to get the duration from the meshConfig, but it will be invalid. Do you think it makes sense to return a default duration in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an ingressGateway is specified in the MeshConfig then the duration is required. We could specify a default instead of making it required? I guess ideally we would validate the MeshConfig and prevent invalid durations before they are applied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a required property, why do we need a default? Is there a default to simplify the code path?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ingressGateway field is removed/unset, the behavior as implemented in pkg/ingress/gateway.go is to call ReleaseCertificate(). So rotation should not happen for such a scenario?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Additionally,

On rotation of the ingressGateway cert we'll attempt to get the duration from the meshConfig, but it will be invalid.

I am wondering if we ever want to use a default during rotation for the ingress cert. I don't see a need to rotate the ingress gateway cert when the state of the gateway cert spec is deemed invalid. Would it be easier to prevent a misconfig?

This is not a blocking comment. I can understand if there are a few corner cases being guarded against through defensive programming here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I would imagine, if the updated value is invalid, we should continue to use the older value encoded in the previous cert and not a default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ingress gateway cert's validity duration is set in the meshConfig to an invalid duration the cert will not be reissued but the invalid duration will remain set in the meshConfig. When the ingress gateway cert expires should the previous validity duration be used? Prior to my change, the service cert validity duration would have been used for the ingress gateway cert on expiration. I could add a validity duration field to the Certificate struct or I could add validation for the MeshConfig, but I think that is beyond the scope of this PR.

Copy link
Contributor Author

@jaellio jaellio Jun 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shashankram Could this be addressed in a follow-up PR? I think this could be addressed when we add validation for the MeshConfig. I am also happy to make the changes to include the validity duration as a field in Certificate or make those changes in a follow-up PR as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sure. This comment is not blocking in nature.

@jaellio jaellio force-pushed the addCertTypes branch 2 times, most recently from e84a5fe to 85e1c60 Compare June 17, 2022 22:38
}

// WithValidityPeriod modifies the expiration on a newly issued certificate.
func WithValidityPeriod(validityPeriod time.Duration) IssueOption {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steeling I removed this because I didn't want to make the cert type/ validity duration optional and assume a default. With my current design for cert rotation and issuing it is required that I have a cert type or something similar to determine whether to call getServiceCertValidityDuration, getIngressGatewayCertValidityDuration, or use the OSM internal cert duration. Do you have any suggestions for how to incorporate the validity period options with my changes or take a different approach? Thanks!!

@@ -15,10 +17,13 @@ import (
"github.com/openservicemesh/osm/pkg/messaging"
)

var group singleflight.Group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd consider scoping this to the certmanager, in case there are ever multiple

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I think I might leave this as-is for this PR unless this is a blocking comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say it’s a bug not to, since if 2 Cert manager objects are created the cache would not be populated correctly.

It’s a 1-2 line fix, so I would push for this in this pr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

func (m *Manager) issueCertificate(prefix string, ct CertType, opts ...IssueOption) (*Certificate, error) {
var rotate bool
cert := m.getFromCache(prefix) // Don't call this while holding the lock
if cert != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put the nil check in should rotate to reduce nesting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the nesting the call to shouldRotate. It makes sense to only call shouldRotate when the cert is not nil. Happy to make the change if you think moving the nil check to shouldRotate improves readability.

jaellio added 2 commits June 21, 2022 13:21
Updates the checkAndRotate method to update certs with the current
validity duration specified in the MeshConfig for their respective
cert type. Adds a certType to the certificate struct that is used
to lookup validity duration when calling IssuingCertificates.
Updates shouldRotate to check if the cert needs to be rotated due
to an in progress root certificate rotation. GetFromCache will
return nil if the cert is not found in the cache, expiring, or
needs to be rotated because of a root cert rotation.

Signed-off-by: jaellio <jaellio@microsoft.com>
Signed-off-by: jaellio <jaellio@microsoft.com>
Signed-off-by: jaellio <jaellio@microsoft.com>
@jaellio jaellio merged commit 28b3238 into openservicemesh:main Jun 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preserve validity duration during cert rotation
5 participants