Skip to content

Fix AWS Route53 error detection for not-found errors during deletion #7548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

alex-berger
Copy link
Contributor

Pull Request Motivation

Fixes #7547

Kind

/kind bug

Release Note

Fix AWS Route53 error detection for not-found errors during deletion of DNS records.

@cert-manager-prow cert-manager-prow bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. labels Feb 3, 2025
@cert-manager-prow
Copy link
Contributor

Hi @alex-berger. Thanks for your PR.

I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cert-manager-prow cert-manager-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/acme Indicates a PR directly modifies the ACME Issuer code area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code labels Feb 3, 2025
@alex-berger
Copy link
Contributor Author

/cc

@cert-manager-prow
Copy link
Contributor

@alex-berger: GitHub didn't allow me to request PR reviews from the following users: alex-berger.

Note that only cert-manager members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@alex-berger
Copy link
Contributor Author

@munnerz @euank pinging you for review/approval as your names appear in the OWNERS file 😇.

Copy link
Contributor

@euank euank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

It would be nice if we could also add a unit test for that error path as well, but even without it this change makes sense to me!

@cert-manager-prow cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2025
@SgtCoDFish SgtCoDFish added this to the 1.18 milestone Feb 20, 2025
@wallrj wallrj self-assigned this Feb 20, 2025
@wallrj
Copy link
Member

wallrj commented Feb 20, 2025

/ok-to-test

@cert-manager-prow cert-manager-prow bot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2025
@wallrj wallrj self-requested a review February 20, 2025 18:07
@wallrj-cyberark
Copy link
Contributor

WIP: I've been testing this change by following the AWS tutorial, with cert-manager deployed from the master branch and then again when deployed from this branch.

I observed the bug before and then, after switching to this branch and increasing the log verbosity, I could see the new improved behaviour.

# values.yaml
crds:
  enabled: true
global:
  logLevel: 4
export KO_REGISTRY=ttl.sh/$(uuidgen)
export KO_HELM_VALUES_FILES=$PWD/values.yaml
make -j4 ko-deploy-certmanager

Before:

E0226 16:28:35.993546 1 sync.go:275] "error cleaning up challenge" err="failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: , InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.xyz.cert-manager-aws-tutorial.richard-gcp.jetstacker.net.', type='TXT', set-identifier='"5CiRHXrp9tvpLNX8F9M8qbi8u9kwb3xnHrKdLNDlRQA"'] but it was not found]" logger="cert-manager.controller.finalizer" resource_name="www-5-1428445754-1332908481" resource_namespace="default" resource_kind="Challenge" resource_version="v1" dnsName="xyz.cert-manager-aws-tutorial.richard-gcp.jetstacker.net" type="DNS-01"

After:

I0226 16:48:06.834638 1 route53.go:289] "ignoring InvalidChangeBatch error" logger="cert-manager.controller.CleanUp" resource_name="www-7-2741477351-1963851674" resource_namespace="default" resource_kind="Challenge" resource_version="v1" dnsName="def.cert-manager-aws-tutorial.richard-gcp.jetstacker.net" type="DNS-01" resource_name="www-7-2741477351-1963851674" resource_namespace="default" resource_kind="Challenge" resource_version="v1" domain="def.cert-manager-aws-tutorial.richard-gcp.jetstacker.net" error="operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: 92d95ac1-3cd9-4418-ba49-422a59c0e399, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.def.cert-manager-aws-tutorial.richard-gcp.jetstacker.net.', type='TXT', set-identifier='"GuLvhn6cdGlVSjPZAOcrW36aUPRTdTzPulnFN3X6nJk"'] but it was not found]"

// If we try to delete something and get a 'InvalidChangeBatch' that
// means it's already deleted, no need to consider it an error.
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alex-berger

I tested this change yesterday and yes, this does work and apologies for causing this bug with my careless code review remarks. I have learned that errors.Is should only be used with sentinel errors.

The AWS-SDK-GO V2 docs say this about error handling:

Service operations can return modeled error types to indicate specific errors. These modeled types can be used with errors.As to unwrap and determine if the operation failure was due to a specific error.

In this case, there is a modelled error, route53types.InvalidChangeBatch, which we can use with errors.As,
so I suggest we use that instead of unwrapping the underlying smithy error and string matching "InvalidChangeBatch".
This seems to be the code where the modelled error is created:

It would be nice to have a unit-test for this. It looks like it should be possible to inject a fake route53 client with a ChangeResourceRecordSets and a canned error response.
If you have time, please add a test, else let me know and I'll try and write one.

Copy link
Contributor Author

@alex-berger alex-berger Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I first tried something like this:

var invalidChangeBatch route53types.InvalidChangeBatch
if errors.As(err, &invalidChangeBatch) {
  ...
}

But, it did not work (errors.As(err, &invalidChangeBatch) always evaluated to false). In the end I ended up with the current solution, which actually seems to work:

var apiErr smithy.APIError
if errors.As(err, &apiErr) {
   if apiErr.ErrorCode() == "InvalidChangeBatch" && action == route53types.ChangeActionDelete {
      ...
   }
}

However, I am neither an expert in Go nor in AWS SDK for Go, so there might well be a better approach to this :-).

It would be nice to have a unit-test for this. It looks like it should be possible to inject a fake route53 client with a ChangeResourceRecordSets and a canned error response.
If you have time, please add a test, else let me know and I'll try and write one.

I agree, having a unit test would be great. However, mocking route53 client is not that easy and we would also have to make sure we return the exact same error (type). And still, if they change error handling again in the AWS SDK, the test will be pretty useless, as we test against our mock (which will not reflect the change in the SDK). So, the test would still succeed, while at runtime it will fail again.

Those days, I have not much spare capacity. So, unless we have a better idea for unit testing (one that really tests against the AWS SDK behavior and not against a mock), I will not invest more time for now. Of course, if you have some spare time, be my guest :-).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I first tried something like this:

var invalidChangeBatch route53types.InvalidChangeBatch
if errors.As(err, &invalidChangeBatch) {
  ...
}

I think it might be because the returned error in this case is a pointer, so you need to do:

var invalidChangeBatch *route53types.InvalidChangeBatch

I'll play around with it and push to your branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wallrj Any news? If not can we merge this PR and consider addressing potential improvements to error handling with a future PR. I would really love to see this fix to land in the next cert-manager release, to make sure automatic certificate renewal works seamless (without human intervention).

@alex-berger
Copy link
Contributor Author

@wallrj any feedback/update on #7548 (comment)?

wallrj added 2 commits April 9, 2025 15:26
Signed-off-by: Richard Wall <richard.wall@venafi.com>
And use the modelled error type, as recommended by the AWS SDK documentation

Signed-off-by: Richard Wall <richard.wall@venafi.com>
@cert-manager-prow cert-manager-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 9, 2025
Signed-off-by: Richard Wall <richard.wall@venafi.com>
@cert-manager-prow
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: euank
Once this PR has been reviewed and has the lgtm label, please ask for approval from wallrj. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 9, 2025
@cert-manager-prow cert-manager-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 9, 2025
@wallrj
Copy link
Member

wallrj commented Apr 9, 2025

I've started adding some unit-tests for this, but not quite finished and had to make some changes to the test server and to the retry code, to get timely failures....

$ go test ./pkg/issuer/acme/dns/route53/...  -run TestRoute53Cleanup -v
=== RUN   TestRoute53Cleanup
=== RUN   TestRoute53Cleanup/success
    wait.go:464: I0409 18:02:57.694787] Returning discovered zone record zoneRecord="example.com." fqdn="_acme-challenge.example.com."
    testutil_test.go:30: GET: /2013-04-01/hostedzonesbyname?dnsname=example.com
    testutil_test.go:30: POST: /2013-04-01/hostedzone/ABCDEFG/rrset
    testutil_test.go:30: GET: /2013-04-01/change/123456
=== RUN   TestRoute53Cleanup/ignore-not-found
    wait.go:410: I0409 18:02:57.803752] Returning cached zone record zoneRecord="example.com." fqdn="_acme-challenge.example.com."
    testutil_test.go:30: GET: /2013-04-01/hostedzonesbyname?dnsname=example.com
    testutil_test.go:30: POST: /2013-04-01/hostedzone/ABCDEFG/rrset
    route53.go:268: I0409 18:02:57.806232] Got InvalidChangeBatch error when attempting to delete the TXT record. Ignoring the error and asssuming that the TXT record has already been deleted. error="failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: SOMEREQUESTID, InvalidChangeBatch: ChangeBatch errors occurred" error-messages=["Tried to delete resource record set [name='_acme-challenge.example.com.', type='TXT', set-identifier='\"5CiRHXrp9tvpLNX8F9M8qbi8u9kwb3xnHrKdLNDlRQA\"'] but it was not found"]
--- PASS: TestRoute53Cleanup (0.24s)
    --- PASS: TestRoute53Cleanup/success (0.13s)
    --- PASS: TestRoute53Cleanup/ignore-not-found (0.10s)
PASS
ok      github.com/cert-manager/cert-manager/pkg/issuer/acme/dns/route53        0.248s

Signed-off-by: Richard Wall <richard.wall@venafi.com>
@wallrj
Copy link
Member

wallrj commented Apr 15, 2025

@alex-berger I pushed too many unrelated changes to your branch and then wasn't able to force-push to remove them....so I moved everything to another PR which has now been merged:

Sorry for the slow turn around.
I'll try and do an alpha release soon so that you and others can verify the fix more easily.

@wallrj wallrj closed this Apr 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/acme/dns01 Indicates a PR modifies ACME DNS01 provider code area/acme Indicates a PR directly modifies the ACME Issuer code dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. kind/bug Categorizes issue or PR as related to a bug. ok-to-test release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS Route53 - certificate renewal stuck due clean-up failure caused by no longer existing TXT record
5 participants