Skip to content

Fix resolving network gateway names with multiple and/or misbehaving DNS upstreams #38781

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 13, 2022

Conversation

sangwa
Copy link
Contributor

@sangwa sangwa commented May 6, 2022

Please provide a description of this PR:

  • Fixes an issue with resolving network gateway names when multiple DNS upstreams are provided to the control plane (e.g. via resolv.conf) but only the first one is ever tried
  • Fixes an issue when network gateway names cannot be resolved if DNS upstreams don't support ANY queries (e.g. DigitalOcean - returns REFUSED or Cloudflare - returns NOTIMPL)

Fixes #38689

@sangwa sangwa requested review from a team as code owners May 6, 2022 23:47
@istio-policy-bot
Copy link

😊 Welcome @sangwa! This is either your first contribution to the Istio istio repo, or it's been
awhile since you've been here.

You can learn more about the Istio working groups, code of conduct, and contributing guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test labels May 6, 2022
@istio-testing
Copy link
Collaborator

Hi @sangwa. Thanks for your PR.

I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hzxuzhonghu
Copy link
Member

/ok-to-test

@istio-testing istio-testing added ok-to-test Set this label allow normal testing to take place for a PR not submitted by an Istio org member. and removed needs-ok-to-test labels May 7, 2022
@sangwa sangwa force-pushed the fix-network-gw-resolution branch from d687a4d to 3193a84 Compare May 7, 2022 03:06
var wg sync.WaitGroup
var resA, resAAAA *dns.Msg

wg.Add(2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we see what other clients do? parallel DNS resolution is tricky concept that has caused many bugs in other products I have seen. I don't feel comfortable doing much other than directly copying something established like coredns,consul,tailscale, etc. Or even stdlib

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch. I introduced the parallel resolution trying to avoid an overhead when we only have one kind of records (either A or AAAA) but have to resolve both sequentially. However I haven't considered that we have two independent instances of the DNS client here and they can receive each other's datagrams mixed up, in this case the client library drops them. stdlib sends both datagrams sequentially and then waits for the both results specifically within the same handler function using request IDs to discern them. That is not possible to implement with the client library in use here, so I revert to sequential resolution.

@sangwa sangwa force-pushed the fix-network-gw-resolution branch from 3193a84 to 763cdfb Compare May 10, 2022 00:11
}
codeString := dns.RcodeToString[code]
// TODO make Debug as to not spam the logs on retries?
log.Infof("upstream dns error: %v: %v: %v", upstream, getReqNames(req), codeString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an error when only one type of DNS record supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-behaved DNS servers should return NOERROR with an empty record set, as per RFC, but that's not guaranteed. If an upstream DNS server has no proper IPv6 support then each AAAA lookup can lead to a logged NXDOMAIN error.

I think we can omit logging here if the status is NXDOMAIN. If there is either A or AAAA record the error for another one is redundant, and if both are missing the error will be logged in the downstream function NetworkManager.resolveHostnameGateways():

if len(addrs) == 0 {
log.Warnf("could not resolve hostname %q for %d gateways", host, len(gwsForHost))
}

However if there is a concern for other unexpected errors from the upstream I think it should be indeed downgraded to log.Debugf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's my concern

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to log.Debugf. Not skipping on NXDOMAIN then as it may indicate a misbehaving upstream and it's on the debug log level now.

@sangwa sangwa requested review from howardjohn and hzxuzhonghu May 12, 2022 22:34
@sangwa
Copy link
Contributor Author

sangwa commented May 12, 2022

@howardjohn @stevenctl PTAL
Would be really nice to have this shipped in 1.14

@sangwa sangwa force-pushed the fix-network-gw-resolution branch from 5511d24 to ddc1003 Compare May 19, 2022 23:42
@sangwa sangwa force-pushed the fix-network-gw-resolution branch from ddc1003 to 42f3818 Compare May 23, 2022 23:45
@sangwa
Copy link
Contributor Author

sangwa commented May 24, 2022

Added another small fix in 42f3818: the TTL of a DNS record is returned in seconds but was treated as nanosecords because the derived time.Duration was not multiplied by time.Second and so the TTL was never actually used (always overridden by the default of 30 seconds).

Also I took a closer look at the DNS client library in use. It creates a new net.Dialer for every query unless a specific net.Dialer is pre-set directly by the calling code, so every new query uses a different local port binding and therefore parallel DNS resolution is safe, at least while the library's logic stays unmodified. I can revert the parallel resolution code back in place if it seems appropriate.

@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Jun 12, 2022
@stevenctl
Copy link
Contributor

LGTM

@istio-policy-bot istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Jun 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking ok-to-test Set this label allow normal testing to take place for a PR not submitted by an Istio org member. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Network gateway hostnames are not correctly resolved when DNS upstreams don't support ANY queries
6 participants