FindZoneByFqdn does not honor SOA TTL #7459

@trevorackerman

Describe the bug:
cert-manager does not remove entries from the fqdnToZone cache when the corresponding SOA record expires.
As a result, the check for the propagated TXT record of the DNS01 ACME challenge fails.

This happened while I was using a delegated DNS01 ACME challenge.
I had made the mistake of not adding NS records for Google Cloud DNS to my Cloudflare domain.

The SOA record from Cloudflare had a TTL of 1800 seconds (30 minutes), but cert-manager never removed the cached entry, so it kept returning the cached Cloudflare nameservers.
As a result, cert-manager always asked the authoritative Cloudflare nameservers for the TXT record. Those servers are not recursive, so the TXT record could never be returned correctly.
(I acknowledge that I could work around the issue by using the recursive DNS server flags.)
Only after the cert-manager controller pod was restarted did it correctly retrieve the SOA record for the subdomain and then ask the Google Cloud DNS server for the TXT record.

Expected behaviour:
When cert-manager runs the FindZoneByFqdn function, it should invalidate cache entries that are older than the TTL of the corresponding SOA record.

Steps to reproduce the bug:
1. Add the -v 4 flag to the cert-manager controller to enable debug logging.
2. Keep the controller running throughout the following steps (do not restart the pod).
3. Set up delegated DNS between two providers such as Cloudflare and Google Cloud DNS, but do not add NS entries in Cloudflare for Google Cloud DNS; add only the CNAME record for the TXT challenge.
4. Monitor the logs and verify that the controller discovers and caches the higher-level Cloudflare domain as the zone for the FQDN of the DNS01 ACME challenge.
5. Look up the TTL of the SOA record for the Cloudflare domain.
6. Add the proper NS records in Cloudflare to point to Google Cloud DNS.
7. Wait longer than the TTL.
8. Verify from the logs that the controller is still using the stale cached data.
9. Restart the controller pod and verify that it now gets the SOA record from the delegated domain in Google Cloud DNS.

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.28
  • Cloud-provider/provisioner: GKE, Cloudflare, Google Cloud DNS
  • cert-manager version: 1.11
  • Install method: static manifests

/kind bug

Labels: area/acme/dns01, good first issue, kind/feature, priority/backlog