
Conversation


@esmet esmet commented Feb 17, 2021

Towards max lifetime duration for upstream connections

Commit Message: conn_pool: support max stream duration for upstream connections
Additional Description:
Risk Level: low
Testing: TODO
Docs Changes: TODO
Release Notes: TODO
Fixes #15107

Signed-off-by: John Esmet <john.esmet@gmail.com>

esmet commented Feb 18, 2021

@snowp / @alyssawilk do you mind providing a quick analysis and sanity check on this diff? Still needs docs but I want to make sure this is pointing in the right direction

@alyssawilk

I'm still interested in second opinions on whether filtering should happen before or after.
Maybe @snowp or @mattklein123 could comment there?

@mattklein123

I'm still interested in second opinions on whether filtering should happen before or after.

Sorry, is this comment meant for a different issue?


esmet commented Feb 19, 2021

I'm still interested in second opinions on whether filtering should happen before or after.

Sorry, is this comment meant for a different issue?

Alyssa may have been referring to #14926


@alyssawilk alyssawilk left a comment


Sorry - I'd thought you were pinging me on the rewrite PR! Here are my thoughts on this one ;-)

```diff
@@ -728,6 +729,11 @@ class ClusterInfo {
    */
   virtual const absl::optional<std::chrono::milliseconds> idleTimeout() const PURE;
 
+  /**
+   * @return the max duration for upstream connection pool connections.
```
Contributor


I assume all the API plumbing to actually set this is TODO?
I'd rephrase the comments to make it clear that this is the max time before drain, not a hard max cutoff. There's a huge difference between the two and I'm not actually clear on which one is desired (or if both are)


@esmet esmet Feb 22, 2021


The missing API plumbing pieces might be some sloppiness on my part from managing two branches of this code - let me check that out.

I will rephrase the comment to mention that it is the max time before drain. I want the behavior to enforce a maximum time after which new requests will no longer use this connection. It's ok for existing requests to drain and finish, as long as new ones no longer attempt to use this connection. The idea is to provide a guarantee that after the configured time, all new traffic is flowing through newly created connections, which in practice will be (well, could be) newer upstream backends that all happen to share an IP (think Kubernetes Cluster IPs backed by pods via label selectors). Let me know if that makes sense or if my thinking is misguided.

Contributor


I think having this cause draining is the most flexible, since it can be coupled with max_stream_duration to ensure that even long lived streams eventually move to the new connection.

Re your example I would imagine that TCP connections wouldn't survive the IP moving from one pod to another since the new upstream would probably respond with a RST if Envoy tried to send it data on a connection established with a previous pod, causing Envoy to establish a new connection? I still think this feature is useful (e.g. when going through a L4 proxy), but not clear to me why this would be required for the IP reuse case.
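A hedged sketch of how the two knobs discussed above might sit together in a cluster config (the placement under `common_http_protocol_options` is an assumption, since this PR's API plumbing was still in flux):

```yaml
clusters:
- name: upstream_service
  common_http_protocol_options:
    # Stop assigning new streams to a connection after 5 minutes;
    # the connection then drains and closes.
    max_connection_duration: 300s
    # Bound individual streams so even long-lived ones eventually
    # move to a newly created connection.
    max_stream_duration: 60s
```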


@esmet esmet Feb 22, 2021


I should have clarified that the use case is a single IP whose upstream backends change without any of them going unhealthy. Consider a case where there is a Service IP with a label selector that targets healthy pods with version=old. Then, the Service manifest is updated to select on labels version=new instead. Now, any new connection through this IP will reach pods with version=new, but the existing connections to healthy pods with version=old still remain. For the sake of this example, pods with version=old and version=new may coexist for an arbitrary amount of time. The max connection duration feature would help ensure that after a configured amount of time, those old connections would be drained so that all new requests eventually go through new connections.
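The scenario above can be made concrete with a minimal Kubernetes Service manifest (names and addresses are illustrative): the `clusterIP` stays fixed while the selector is edited from `version: old` to `version: new`, so existing connections keep reaching old pods until something drains them.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service            # illustrative
spec:
  clusterIP: 10.96.0.10       # stays fixed across the rollout
  selector:
    app: my-app
    version: old              # later edited to: version: new
  ports:
  - port: 80
    targetPort: 8080
```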


@alyssawilk alyssawilk Feb 22, 2021


curious, any reason the upstream can't do its own drain (version=old doing GOAWAY + HTTP/1.1 connection close, nack health)? It'd be a useful feature either way for things like internet based load balancing having limited TTL when DNS can expire, so no objection just curious :-)

Contributor Author


The upstream could do its own drain, but I think it's extra useful for the solution to be upstream-agnostic and implemented purely in Envoy.

The exact use case is for canary deployments using two Service IPs, one that represents the canary and one that represents the stable service. We use Envoy's traffic splitting capabilities to slowly move traffic over to the canary IP, and, finally, update the stable IP to point to the canary pods. I think the real solution is to just discover Endpoint objects and push those into Envoy instead of using the abstract Service IP. However, and to your point, it's useful to be able to solve this use case with Service IPs alone.
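The canary flow described above would typically use Envoy's weighted-cluster routing; a sketch, with illustrative cluster names and weights:

```yaml
route_config:
  virtual_hosts:
  - name: service
    domains: ["*"]
    routes:
    - match: { prefix: "/" }
      route:
        weighted_clusters:
          clusters:
          - name: service_stable   # Service IP for the stable pods
            weight: 90
          - name: service_canary   # Service IP for the canary pods
            weight: 10
```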

```diff
@@ -559,5 +564,14 @@ void ActiveClient::onConnectTimeout() {
   close();
 }
 
+void ActiveClient::onLifetimeTimeout() {
+  if (state_ != ActiveClient::State::CLOSED) {
```
Contributor


Please make sure capacity stats are handled correctly. This should decrease the local capacity and possibly trigger prefetching.

Contributor Author


Good call, I'll look into this.


esmet commented Mar 29, 2021

Pulling this out of "draft" although I still need to address Alyssa's feedback on prefetching

@esmet esmet marked this pull request as ready for review March 29, 2021 20:38
@sschepens

@esmet what's the status on this? We might be needing this.


esmet commented Apr 28, 2021

@esmet what's the status on this? We might be needing this.

Hey! I think this code functionally works; however, I have some other things on my plate, so I've had to put this down temporarily.

This PR still needs tests, docs, and a changelog entry in order to be complete. But if you need something like this, I definitely encourage you to try it out, let me know if it works for you, and provide whatever feedback you have.

@github-actions

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label May 28, 2021
@github-actions

github-actions bot commented Jun 4, 2021

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Jun 4, 2021

esmet commented Aug 31, 2021

I'm reviving this. @sschepens is this still something you could try out for your use case?

Either way I need to work on writing unit tests. I'll do that soon.


esmet commented Aug 31, 2021

Revived as #17932

Labels
stale stalebot believes this issue/PR has not been touched recently waiting

Successfully merging this pull request may close these issues.

Upstream max connection duration should be configurable.
6 participants