conn_pool: support max stream duration for upstream connections #15068
Conversation
Signed-off-by: John Esmet <john.esmet@gmail.com>
@snowp / @alyssawilk do you mind providing a quick analysis and sanity check on this diff? It still needs docs, but I want to make sure this is pointing in the right direction.
I'm still interested in second opinions on whether filtering should happen before or after.
Sorry, is this comment meant for a different issue?
Alyssa may have been referring to #14926.
Sorry - I'd thought you were pinging me on the rewrite PR! Here are thoughts on this one ;-)
```cpp
@@ -728,6 +729,11 @@ class ClusterInfo {
   */
  virtual const absl::optional<std::chrono::milliseconds> idleTimeout() const PURE;

  /**
   * @return the max duration for upstream connection pool connections.
```
I assume all the API plumbing to actually set this is TODO?
I'd rephrase the comments to make it clear that this is the max time before drain, not a hard max cutoff. There's a huge difference between the two, and I'm not actually clear on which one is desired (or if both are).
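If drain is the intended semantics (as the author confirms below), the rephrased comment might read something like the sketch here. Note the accessor name `maxConnectionDuration` is my assumption, since the hunk above truncates before the declaration:

```cpp
// Hypothetical sketch only; the accessor name is assumed, not taken from the PR.

/**
 * @return the max lifetime of upstream connection pool connections. Once a
 *         connection reaches this age it is drained: no new streams are
 *         assigned to it, but streams already in flight may run to completion.
 *         This is a time-before-drain, not a hard cutoff that resets streams.
 */
virtual const absl::optional<std::chrono::milliseconds> maxConnectionDuration() const PURE;
```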
The missing API plumbing pieces might be some sloppiness on my part from managing two branches of this code - let me check that out.
I will rephrase the comment to mention that it is the max time before drain. I want the behavior to enforce a maximum time after which new requests will no longer use this connection. It's ok for existing requests to drain and finish, as long as new ones no longer attempt to use this connection. The idea is to provide a guarantee that after the configured time, all new traffic is flowing through newly created connections, which in practice will be (well, could be) newer upstream backends that all happen to share an IP (think Kubernetes Cluster IPs backed by pods via label selectors). Let me know if that makes sense or if my thinking is misguided.
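As a minimal standalone model of the semantics described above (types and names are my own illustration, not Envoy's actual pool code): once the configured duration elapses, the connection is no longer offered to new streams and closes after the last in-flight stream finishes.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <thread>

// Standalone model of drain-on-max-duration; illustrative only.
class PooledConnection {
public:
  explicit PooledConnection(std::chrono::steady_clock::duration max_duration)
      : created_(std::chrono::steady_clock::now()), max_duration_(max_duration) {}

  // New streams are only assigned to connections under their max duration.
  bool availableForNewStreams() const {
    return !closed_ && std::chrono::steady_clock::now() - created_ < max_duration_;
  }

  void startStream() {
    assert(availableForNewStreams());
    ++active_streams_;
  }

  // In-flight streams drain naturally; once the last one finishes on an
  // over-age connection, the connection is closed.
  void finishStream() {
    --active_streams_;
    if (active_streams_ == 0 && !availableForNewStreams()) {
      closed_ = true;
      std::cout << "connection drained and closed\n";
    }
  }

private:
  const std::chrono::steady_clock::time_point created_;
  const std::chrono::steady_clock::duration max_duration_;
  int active_streams_{0};
  bool closed_{false};
};

int main() {
  PooledConnection conn(std::chrono::milliseconds(50));
  conn.startStream(); // admitted while the connection is under its max duration
  std::this_thread::sleep_for(std::chrono::milliseconds(60));
  assert(!conn.availableForNewStreams()); // over-age: no new streams assigned
  conn.finishStream(); // the in-flight stream drains, then the connection closes
}
```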
I think having this cause draining is the most flexible, since it can be coupled with max_stream_duration to ensure that even long lived streams eventually move to the new connection.
Re your example: I would imagine that TCP connections wouldn't survive the IP moving from one pod to another, since the new upstream would probably respond with an RST if Envoy tried to send data on a connection established with a previous pod, causing Envoy to establish a new connection. I still think this feature is useful (e.g. when going through an L4 proxy), but it's not clear to me why it would be required for the IP reuse case.
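To make the coupling with `max_stream_duration` concrete, here is a small worked example (values assumed for illustration): a stream admitted just before the connection begins draining is itself bounded by `max_stream_duration`, so the old connection's total lifetime is bounded by the sum of the two settings.

```cpp
#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  // Assumed example values, not from the PR.
  const minutes max_connection_duration{10}; // time before the pool drains the connection
  const minutes max_stream_duration{5};      // existing per-stream cap

  // Worst case: a stream is admitted just before the drain timer fires and
  // then runs for its full allowed duration, so all traffic is guaranteed to
  // be on freshly created connections within this bound.
  const auto worst_case = max_connection_duration + max_stream_duration;
  std::cout << "old connection fully retired within " << worst_case.count()
            << " minutes\n";
}
```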
I should have clarified that the use case is a single IP whose upstream backends change without any of them going unhealthy. Consider a case where there is a Service IP with a label selector that targets healthy pods with `version=old`. Then, the Service manifest is updated to select on labels `version=new` instead. Now, any new connection through this IP will reach pods with `version=new`, but the existing connections to healthy pods with `version=old` still remain. For the sake of this example, pods with `version=old` and `version=new` may coexist for an arbitrary amount of time. The max connection duration feature would help ensure that after a configured amount of time, those old connections would be drained so that all new requests eventually go through new connections.
Curious, any reason the upstream can't do its own drain (`version=old` sending GOAWAY + HTTP/1.1 connection close, NACKing health checks)? It'd be a useful feature either way, e.g. for internet-based load balancing with limited TTLs where DNS can expire, so no objection, just curious :-)
The upstream could do its own drain, but I think it's extra useful for the solution to be upstream-agnostic and implemented purely in Envoy.
The exact use case is for canary deployments using two Service IPs, one that represents the canary and one that represents the stable service. We use Envoy's traffic splitting capabilities to slowly move traffic over to the canary IP, and, finally, update the stable IP to point to the canary pods. I think the real solution is to just discover Endpoint objects and push those into Envoy instead of using the abstract Service IP. However, and to your point, it's useful to be able to solve this use case with Service IPs alone.
```cpp
@@ -559,5 +564,14 @@ void ActiveClient::onConnectTimeout() {
  close();
}

void ActiveClient::onLifetimeTimeout() {
  if (state_ != ActiveClient::State::CLOSED) {
```
Please make sure capacity stats are handled correctly. This should decrease the local capacity and possibly trigger prefetching.
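For illustration, the bookkeeping being asked for might look roughly like this standalone sketch (names and structure are invented here, not Envoy's actual pool code): the draining client's unused stream slots stop counting as capacity, which may put the pool below its prefetch target.

```cpp
#include <cstdint>
#include <iostream>

// Invented names for illustration; not Envoy's actual pool code.
struct PoolStats {
  int64_t available_stream_capacity; // streams the pool could still serve
  int64_t prefetch_target;           // spare capacity the pool tries to keep
};

// When a pooled connection hits its max duration and starts draining, its
// unused stream slots no longer count as available capacity; dropping below
// the prefetch target should trigger establishing replacement connections.
bool onLifetimeTimeout(PoolStats& stats, int64_t draining_client_capacity) {
  stats.available_stream_capacity -= draining_client_capacity;
  return stats.available_stream_capacity < stats.prefetch_target;
}

int main() {
  PoolStats stats{/*available_stream_capacity=*/100, /*prefetch_target=*/50};
  if (onLifetimeTimeout(stats, /*draining_client_capacity=*/75)) {
    std::cout << "below prefetch target; create a replacement connection\n";
  }
}
```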
Good call, I'll look into this.
Pulling this out of "draft", although I still need to address Alyssa's feedback on prefetching.
@esmet what's the status on this? We might be needing this.
Hey! I think this code functionally works, but I have some other things on my plate, so I've had to put this down temporarily. This PR really needs tests, docs, and a changelog entry in order to be complete. But if you need something like this, I definitely encourage you to try it out, let me know if it works for you, and provide whatever feedback you have.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
I'm reviving this. @sschepens is this still something you could try out for your use case? Either way, I need to work on writing unit tests. I'll do that soon.
Revived as #17932 |
Towards max lifetime duration for upstream connections
Commit Message: conn_pool: support max stream duration for upstream connections
Additional Description:
Risk Level: low
Testing: TODO
Docs Changes: TODO
Release Notes: TODO
Fixes #15107