-
Notifications
You must be signed in to change notification settings - Fork 8k
Fix resolving network gateway names with multiple and/or misbehaving DNS upstreams #38781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
😊 Welcome @sangwa! This is either your first contribution to the Istio istio repo, or it's been You can learn more about the Istio working groups, code of conduct, and contributing guidelines Thanks for contributing! Courtesy of your friendly welcome wagon. |
Hi @sangwa. Thanks for your PR. I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
d687a4d
to
3193a84
Compare
pilot/pkg/model/network.go
Outdated
var wg sync.WaitGroup | ||
var resA, resAAAA *dns.Msg | ||
|
||
wg.Add(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we see what other clients do? parallel DNS resolution is tricky concept that has caused many bugs in other products I have seen. I don't feel comfortable doing much other than directly copying something established like coredns,consul,tailscale, etc. Or even stdlib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the catch. I introduced the parallel resolution trying to avoid an overhead when we only have one kind of records (either A or AAAA) but have to resolve both sequentially. However I haven't considered that we have two independent instances of the DNS client here and they can receive each other's datagrams mixed up, in this case the client library drops them. stdlib sends both datagrams sequentially and then waits for the both results specifically within the same handler function using request IDs to discern them. That is not possible to implement with the client library in use here, so I revert to sequential resolution.
3193a84
to
763cdfb
Compare
pilot/pkg/model/network.go
Outdated
} | ||
codeString := dns.RcodeToString[code] | ||
// TODO make Debug as to not spam the logs on retries? | ||
log.Infof("upstream dns error: %v: %v: %v", upstream, getReqNames(req), codeString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an error when only one type of DNS record supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well-behaved DNS servers should return NOERROR with an empty record set, as per RFC, but that's not guaranteed. If an upstream DNS server has no proper IPv6 support then each AAAA lookup can lead to a logged NXDOMAIN error.
I think we can omit logging here if the status is NXDOMAIN. If there is either A or AAAA record the error for another one is redundant, and if both are missing the error will be logged in the downstream function NetworkManager.resolveHostnameGateways():
istio/pilot/pkg/model/network.go
Lines 191 to 193 in e67c34b
if len(addrs) == 0 { | |
log.Warnf("could not resolve hostname %q for %d gateways", host, len(gwsForHost)) | |
} |
However if there is a concern for other unexpected errors from the upstream I think it should be indeed downgraded to log.Debugf
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's my concern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to log.Debugf
. Not skipping on NXDOMAIN then as it may indicate a misbehaving upstream and it's on the debug log level now.
@howardjohn @stevenctl PTAL |
5511d24
to
ddc1003
Compare
…ANY queries and improve related logging
ddc1003
to
42f3818
Compare
Added another small fix in 42f3818: the TTL of a DNS record is returned in seconds but was treated as nanosecords because the derived time.Duration was not multiplied by time.Second and so the TTL was never actually used (always overridden by the default of 30 seconds). Also I took a closer look at the DNS client library in use. It creates a new net.Dialer for every query unless a specific net.Dialer is pre-set directly by the calling code, so every new query uses a different local port binding and therefore parallel DNS resolution is safe, at least while the library's logic stays unmodified. I can revert the parallel resolution code back in place if it seems appropriate. |
LGTM |
Please provide a description of this PR:
Fixes #38689