fix: improve TCP DNS client interactions with TCP actor #32635

Closed · wants to merge 2 commits

Conversation

leviramsey (Contributor):

The TCP DNS client responds to a backpressure signal (a `CommandFailed(Write(...))`) from the TCP connection actor by shutting itself down (which is, in its own way, a form of backpressure). This means that two requests requiring TCP in fast enough succession (the second arriving before the connection actor has been able to hand the first write off to the socket) will cause the DNS client to stop before it can receive a response to the first. DNS resolutions (e.g. through Discovery) will typically retry after some time, but if two resolutions failed sufficiently close to each other, the scheduler resolution (default 10 ms) will tend to make them retry at effectively the same time: it's thus unlikely that this retry loop will ever be broken.

This change implements ack-based throttling: the client will not issue a subsequent TCP write until the connection actor has written the previous data to the socket and sent back an ack (to be clear, this does not require any response from the remote end). Since there typically should not be that many in-flight DNS resolutions happening (the main use is service discovery), the impact on applications should be minimal.
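
For context, this follows the ack-based write back-pressure pattern from the Akka IO TCP documentation; a minimal sketch of that pattern is below (the `WriteAck` event and the `AckThrottledWriter` name are illustrative, not the identifiers used in this PR):

```scala
import akka.actor.{ Actor, ActorRef }
import akka.io.Tcp
import akka.util.ByteString

// Illustrative ack event; the PR uses its own identifiers.
case object WriteAck extends Tcp.Event

// At most one Tcp.Write is in flight; everything else is buffered until the
// connection actor acks the previous write.
class AckThrottledWriter(connection: ActorRef) extends Actor {
  def receive: Receive = idle

  def idle: Receive = {
    case data: ByteString =>
      connection ! Tcp.Write(data, WriteAck)
      context.become(awaitingAck(Vector.empty))
  }

  def awaitingAck(buffered: Vector[ByteString]): Receive = {
    case data: ByteString =>
      context.become(awaitingAck(buffered :+ data))
    case WriteAck =>
      buffered match {
        case next +: rest =>
          connection ! Tcp.Write(next, WriteAck)
          context.become(awaitingAck(rest))
        case _ =>
          context.become(idle)
      }
  }
}
```

The ack is exchanged only with the local connection actor: it confirms the previous write was handed off to the socket, not that the DNS server has responded.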

Also improves the logging within the TCP DNS client.

johanandren (Contributor) left a comment:

Left some comments but will probably address them myself, to not have to wait for American TZ morning.

```diff
@@ -186,7 +189,9 @@ import akka.pattern.{ BackoffOpts, BackoffSupervisor }
       log.debug("DNS response truncated, falling back to TCP")
       inflightRequests.get(msg.id) match {
         case Some((_, msg)) =>
+          tcpRequests = tcpRequests.incl(msg.id)
```
johanandren (Contributor):

We only clear this on failure, so for the happy path the set will just accumulate ids; it needs a remove on success as well. But, to avoid that, could we make it a boolean flag on the value in inflightRequests instead and avoid the extra set?
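
A possible shape of that suggestion, purely as a sketch (the actual value type of `inflightRequests` is not visible in this excerpt, so the tuple layout and names here are hypothetical):

```scala
// Hypothetical: carry a "fell back to TCP" flag in the in-flight entry itself,
// e.g. Map[Short, (ActorRef, Message, Boolean)], instead of a separate tcpRequests set.
inflightRequests.get(msg.id) match {
  case Some((originalSender, originalMsg, _)) =>
    // mark this request instead of adding msg.id to a tcpRequests set
    inflightRequests = inflightRequests.updated(msg.id, (originalSender, originalMsg, true))
  case None => // unknown id, nothing to do
}
```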

```diff
@@ -38,41 +41,50 @@ import akka.util.ByteString
     case _: Tcp.Connected =>
       log.debug("Connected to TCP address [{}]", ns)
       val connection = sender()
-      context.become(ready(connection))
+      writer = new Writing(connection, log)
+      context.become(ready())
```
johanandren (Contributor):

Mixing a mutable field with the become-style already present here is not great for readability. Let's just use `context.become` closing over state instead.
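
A minimal sketch of what that could look like, with the current Writer carried as a behavior parameter instead of a `var` (Writer, Writing, Message and maybeWriteMessage are the names from this diff; everything else is illustrative):

```scala
// Hypothetical sketch: all state transitions go through context.become,
// closing over the current Writer instead of mutating a writer field.
def uninitialized: Receive = {
  case _: Tcp.Connected =>
    log.debug("Connected to TCP address [{}]", ns)
    val connection = sender()
    context.become(ready(new Writing(connection, log)))
}

def ready(writer: Writer): Receive = {
  case msg: Message =>
    // the Writer returned by maybeWriteMessage becomes part of the next behavior
    context.become(ready(writer.maybeWriteMessage(msg)))
}
```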

```scala
    def maybeWriteMessage(msg: Message): Writer = {
      writeMessage(msg)

      new Buffering(connection, log)
```
johanandren (Contributor):

I'm not sure we really want to allow just the one in-flight write message between this actor and the TCP manager.

johanandren (Contributor):

OK, that's actually how we show the ack-based protocol in the docs, so it might be fine.

leviramsey (Contributor, Author):

Yeah the TCP manager only allows one write at a time. I originally went for the "NACK-based with write-suspending" approach (assume the write will be accepted until you get the backpressure signal, then resend the ones that were backpressured), but the complexity there got a bit out of hand and I wasn't sure that actually gave us that much.

johanandren (Contributor):

Superseded by #32636

johanandren closed this on Feb 3, 2025