Fix AWS Route53 error detection for not-found errors during deletion #7548

alex-berger · 2025-02-03T15:57:58Z

Pull Request Motivation

Kind

/kind bug

Release Note

Fix AWS Route53 error detection for not-found errors during deletion of DNS records.

Fixes cert-manager#7547 Signed-off-by: alex-berger <alex-berger@gmx.ch>

cert-manager-prow · 2025-02-03T15:58:09Z

Hi @alex-berger. Thanks for your PR.

I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

alex-berger · 2025-02-07T07:19:55Z

/cc

cert-manager-prow · 2025-02-07T07:19:58Z

@alex-berger: GitHub didn't allow me to request PR reviews from the following users: alex-berger.

Note that only cert-manager members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

alex-berger · 2025-02-11T09:23:44Z

@munnerz @euank pinging you for review/approval as your names appear in the OWNERS file 😇.

euank

LGTM!

It would be nice if we could also add a unit test for that error path as well, but even without it this change makes sense to me!

wallrj · 2025-02-20T18:06:53Z

/ok-to-test

wallrj-cyberark · 2025-02-26T17:02:14Z

WIP: I've been testing this change by following the AWS tutorial, with cert-manager deployed from the master branch and then again when deployed from this branch.

I observed the bug before and then, after switching to this branch and increasing the log verbosity, I could see the new improved behaviour.

# values.yaml
crds:
  enabled: true
global:
  logLevel: 4

export KO_REGISTRY=ttl.sh/$(uuidgen)
export KO_HELM_VALUES_FILES=$PWD/values.yaml
make -j4 ko-deploy-certmanager

Before:

E0226 16:28:35.993546 1 sync.go:275] "error cleaning up challenge" err="failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: , InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.xyz.cert-manager-aws-tutorial.richard-gcp.jetstacker.net.', type='TXT', set-identifier='"5CiRHXrp9tvpLNX8F9M8qbi8u9kwb3xnHrKdLNDlRQA"'] but it was not found]" logger="cert-manager.controller.finalizer" resource_name="www-5-1428445754-1332908481" resource_namespace="default" resource_kind="Challenge" resource_version="v1" dnsName="xyz.cert-manager-aws-tutorial.richard-gcp.jetstacker.net" type="DNS-01"

After:

I0226 16:48:06.834638 1 route53.go:289] "ignoring InvalidChangeBatch error" logger="cert-manager.controller.CleanUp" resource_name="www-7-2741477351-1963851674" resource_namespace="default" resource_kind="Challenge" resource_version="v1" dnsName="def.cert-manager-aws-tutorial.richard-gcp.jetstacker.net" type="DNS-01" resource_name="www-7-2741477351-1963851674" resource_namespace="default" resource_kind="Challenge" resource_version="v1" domain="def.cert-manager-aws-tutorial.richard-gcp.jetstacker.net" error="operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: 92d95ac1-3cd9-4418-ba49-422a59c0e399, InvalidChangeBatch: [Tried to delete resource record set [name='_acme-challenge.def.cert-manager-aws-tutorial.richard-gcp.jetstacker.net.', type='TXT', set-identifier='"GuLvhn6cdGlVSjPZAOcrW36aUPRTdTzPulnFN3X6nJk"'] but it was not found]"

wallrj · 2025-02-27T10:00:31Z

pkg/issuer/acme/dns/route53/route53.go

+				// If we try to delete something and get a 'InvalidChangeBatch' that
+				// means it's already deleted, no need to consider it an error.
+				return nil
+			}


Thanks @alex-berger

I tested this change yesterday and yes, this does work and apologies for causing this bug with my careless code review remarks. I have learned that errors.Is should only be used with sentinel errors.

The AWS-SDK-GO V2 docs say this about error handling:

Service operations can return modeled error types to indicate specific errors. These modeled types can be used with errors.As to unwrap and determine if the operation failure was due to a specific error.

In this case, there is a modelled error, route53types.InvalidChangeBatch, which we can use with errors.As,
so I suggest we use that instead of unwrapping the underlying smithy error and string matching "InvalidChangeBatch".
This seems to be the code where the modelled error is created:

https://github.com/aws/aws-sdk-go-v2/blob/main/service/route53/internal/customizations/doc.go#L8-L37

https://github.com/aws/aws-sdk-go-v2/blob/ef40d0c7f18167e851e939d28357a541102a6a94/service/route53/internal/customizations/custom_error_deser.go#L76

It would be nice to have a unit-test for this. It looks like it should be possible to inject a fake route53 client with a ChangeResourceRecordSets and a canned error response.
If you have time, please add a test, else let me know and I'll try and write one.

Actually, I first tried something like this:

var invalidChangeBatch route53types.InvalidChangeBatch if errors.As(err, &invalidChangeBatch) { ... }

But, it did not work (errors.As(err, &invalidChangeBatch) always evaluated to false). In the end I ended up with the current solution, which actually seems to work:

var apiErr smithy.APIError if errors.As(err, &apiErr) { if apiErr.ErrorCode() == "InvalidChangeBatch" && action == route53types.ChangeActionDelete { ... } }

However, I am neither an expert in Go nor in AWS SDK for Go, so there might well be a better approach to this :-).

It would be nice to have a unit-test for this. It looks like it should be possible to inject a fake route53 client with a ChangeResourceRecordSets and a canned error response.
If you have time, please add a test, else let me know and I'll try and write one.

I agree, having a unit test would be great. However, mocking route53 client is not that easy and we would also have to make sure we return the exact same error (type). And still, if they change error handling again in the AWS SDK, the test will be pretty useless, as we test against our mock (which will not reflect the change in the SDK). So, the test would still succeed, while at runtime it will fail again.

Those days, I have not much spare capacity. So, unless we have a better idea for unit testing (one that really tests against the AWS SDK behavior and not against a mock), I will not invest more time for now. Of course, if you have some spare time, be my guest :-).

Actually, I first tried something like this:

var invalidChangeBatch route53types.InvalidChangeBatch if errors.As(err, &invalidChangeBatch) { ... }

I think it might be because the returned error in this case is a pointer, so you need to do:

var invalidChangeBatch *route53types.InvalidChangeBatch

I'll play around with it and push to your branch.

@wallrj Any news? If not can we merge this PR and consider addressing potential improvements to error handling with a future PR. I would really love to see this fix to land in the next cert-manager release, to make sure automatic certificate renewal works seamless (without human intervention).

alex-berger · 2025-03-26T06:32:08Z

@wallrj any feedback/update on #7548 (comment)?

Signed-off-by: Richard Wall <richard.wall@venafi.com>

And use the modelled error type, as recommended by the AWS SDK documentation Signed-off-by: Richard Wall <richard.wall@venafi.com>

Signed-off-by: Richard Wall <richard.wall@venafi.com>

cert-manager-prow · 2025-04-09T17:04:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: euank
Once this PR has been reviewed and has the lgtm label, please ask for approval from wallrj. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wallrj · 2025-04-09T17:07:48Z

I've started adding some unit-tests for this, but not quite finished and had to make some changes to the test server and to the retry code, to get timely failures....

$ go test ./pkg/issuer/acme/dns/route53/...  -run TestRoute53Cleanup -v
=== RUN   TestRoute53Cleanup
=== RUN   TestRoute53Cleanup/success
    wait.go:464: I0409 18:02:57.694787] Returning discovered zone record zoneRecord="example.com." fqdn="_acme-challenge.example.com."
    testutil_test.go:30: GET: /2013-04-01/hostedzonesbyname?dnsname=example.com
    testutil_test.go:30: POST: /2013-04-01/hostedzone/ABCDEFG/rrset
    testutil_test.go:30: GET: /2013-04-01/change/123456
=== RUN   TestRoute53Cleanup/ignore-not-found
    wait.go:410: I0409 18:02:57.803752] Returning cached zone record zoneRecord="example.com." fqdn="_acme-challenge.example.com."
    testutil_test.go:30: GET: /2013-04-01/hostedzonesbyname?dnsname=example.com
    testutil_test.go:30: POST: /2013-04-01/hostedzone/ABCDEFG/rrset
    route53.go:268: I0409 18:02:57.806232] Got InvalidChangeBatch error when attempting to delete the TXT record. Ignoring the error and asssuming that the TXT record has already been deleted. error="failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, https response error StatusCode: 400, RequestID: SOMEREQUESTID, InvalidChangeBatch: ChangeBatch errors occurred" error-messages=["Tried to delete resource record set [name='_acme-challenge.example.com.', type='TXT', set-identifier='\"5CiRHXrp9tvpLNX8F9M8qbi8u9kwb3xnHrKdLNDlRQA\"'] but it was not found"]
--- PASS: TestRoute53Cleanup (0.24s)
    --- PASS: TestRoute53Cleanup/success (0.13s)
    --- PASS: TestRoute53Cleanup/ignore-not-found (0.10s)
PASS
ok      github.com/cert-manager/cert-manager/pkg/issuer/acme/dns/route53        0.248s

Signed-off-by: Richard Wall <richard.wall@venafi.com>

wallrj · 2025-04-15T16:31:40Z

@alex-berger I pushed too many unrelated changes to your branch and then wasn't able to force-push to remove them....so I moved everything to another PR which has now been merged:

Fix AWS Route53 error detection for not-found errors during deletion #7690

Sorry for the slow turn around.
I'll try and do an alpha release soon so that you and others can verify the fix more easily.

Fix AWS Route53 error detection for not-found errors during deletion

26b867b

Fixes cert-manager#7547 Signed-off-by: alex-berger <alex-berger@gmx.ch>

cert-manager-prow bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. labels Feb 3, 2025

euank approved these changes Feb 14, 2025

View reviewed changes

cert-manager-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2025

SgtCoDFish added this to the 1.18 milestone Feb 20, 2025

wallrj self-assigned this Feb 20, 2025

cert-manager-prow bot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2025

wallrj self-requested a review February 20, 2025 18:07

wallrj requested changes Feb 27, 2025

View reviewed changes

wallrj added 2 commits April 9, 2025 15:26

Unit test

33b5109

Signed-off-by: Richard Wall <richard.wall@venafi.com>

Move the InvalidChangeBatch check to the Cleanup function

d6cfe1b

And use the modelled error type, as recommended by the AWS SDK documentation Signed-off-by: Richard Wall <richard.wall@venafi.com>

cert-manager-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 9, 2025

WIP

5b7e67f

Signed-off-by: Richard Wall <richard.wall@venafi.com>

cert-manager-prow bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 9, 2025

cert-manager-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 9, 2025

More tests for the route53 cleanup function

a855d81

Signed-off-by: Richard Wall <richard.wall@venafi.com>

wallrj mentioned this pull request Apr 15, 2025

Fix AWS Route53 error detection for not-found errors during deletion #7690

Merged

wallrj closed this Apr 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix AWS Route53 error detection for not-found errors during deletion #7548

Fix AWS Route53 error detection for not-found errors during deletion #7548

Uh oh!

alex-berger commented Feb 3, 2025

Uh oh!

cert-manager-prow bot commented Feb 3, 2025

Uh oh!

alex-berger commented Feb 7, 2025

Uh oh!

cert-manager-prow bot commented Feb 7, 2025

Uh oh!

alex-berger commented Feb 11, 2025

Uh oh!

euank left a comment

Uh oh!

wallrj commented Feb 20, 2025

Uh oh!

wallrj-cyberark commented Feb 26, 2025

Uh oh!

wallrj Feb 27, 2025

Uh oh!

alex-berger Feb 27, 2025 •

edited

Loading

Uh oh!

wallrj Feb 28, 2025

Uh oh!

alex-berger Mar 10, 2025

Uh oh!

alex-berger commented Mar 26, 2025

Uh oh!

cert-manager-prow bot commented Apr 9, 2025

Uh oh!

wallrj commented Apr 9, 2025

Uh oh!

wallrj commented Apr 15, 2025

Uh oh!

Uh oh!

Fix AWS Route53 error detection for not-found errors during deletion #7548

Fix AWS Route53 error detection for not-found errors during deletion #7548

Uh oh!

Conversation

alex-berger commented Feb 3, 2025

Pull Request Motivation

Kind

Release Note

Uh oh!

cert-manager-prow bot commented Feb 3, 2025

Uh oh!

alex-berger commented Feb 7, 2025

Uh oh!

cert-manager-prow bot commented Feb 7, 2025

Uh oh!

alex-berger commented Feb 11, 2025

Uh oh!

euank left a comment

Choose a reason for hiding this comment

Uh oh!

wallrj commented Feb 20, 2025

Uh oh!

wallrj-cyberark commented Feb 26, 2025

Uh oh!

wallrj Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

alex-berger Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wallrj Feb 28, 2025

Choose a reason for hiding this comment

Uh oh!

alex-berger Mar 10, 2025

Choose a reason for hiding this comment

Uh oh!

alex-berger commented Mar 26, 2025

Uh oh!

cert-manager-prow bot commented Apr 9, 2025

Uh oh!

wallrj commented Apr 9, 2025

Uh oh!

wallrj commented Apr 15, 2025

Uh oh!

Uh oh!

alex-berger Feb 27, 2025 •

edited

Loading