Conversation

Member

@ggreenway ggreenway commented Jun 11, 2021

Delete connection pools when they no longer have any connections. This fixes unbounded memory use for cases where a new connection pool is needed for each downstream connection, such as when using upstream PROXY protocol.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Co-authored-by: Craig Radcliffe <craig.radcliffe@broadcom.com>

Fixes #16682
Continues @chradcliffe's work from #13061

Risk Level: High! Connection pools used to live until the cluster was destroyed. If anything is caching a pool that now gets deleted sooner, this could crash, corrupt memory, etc.
Testing: Existing and new tests pass
Docs Changes: Updated
Release Notes: Added
Platform Specific Features: None
Runtime guard: envoy.reloadable_features.conn_pool_delete_when_idle
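For reference, reloadable-feature guards like this one are typically overridden through a static runtime layer in the bootstrap config. A sketch (assuming the flag defaults to enabling the new behavior) that would switch back to the old lifetime behavior:

```yaml
layered_runtime:
  layers:
  - name: static_layer
    static_layer:
      envoy.reloadable_features.conn_pool_delete_when_idle: false
```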

Craig Radcliffe and others added 30 commits September 11, 2020 13:27
* Add pool idle timeout timer and callbacks to
ConnectionPool::ConnPoolImplBase

* Add logic to set the idle timer when there are no pending or active
connections and disable the timer when new activity occurs

* Add logic to ClusterManagerImpl to remove pools after their idle
timeout expires and to remove hosts from the pool maps when the map is
empty after an idle timer expiry

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
Originally the idle check used the lack of active streams as the start
of the idle timer; however, this proved too complicated. Instead, we now
start the timeout after the last connection has been drained from the pool.

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
…/clean-up-connection-pools

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
…/clean-up-connection-pools

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
…/clean-up-connection-pools

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
…/clean-up-connection-pools

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
…/clean-up-connection-pools

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
* Fix original conn pool logic and add test

* Ensure that the idle timer is disabled when draining

* Drain connections after the idle timer fires but before erasing the
pool

* Make logic consistent between conn_pool_base and original_conn_pool

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
* Added note about connection pool lifetime

* Added missing note about pools per downstream connection

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
)

Clarify that the server may skip resending resources upon stream restart only if it has some way to know that the client is not subscribing to resources that it was not previously subscribed to.

Risk Level: Low
Testing: N/A
Docs Changes: Included in PR

Clarifies text added in envoyproxy#12580.

Signed-off-by: Mark D. Roth <roth@google.com>
Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
…x a typo (envoyproxy#13696)

The v3 version of the comment was missing the "If specified..." clause from the v2 version of that comment

Risk Level: low
Testing: Ran ./ci/run_envoy_docker.sh './ci/do_ci.sh fix_format'
Docs Changes: comment in a proto file changed

Signed-off-by: Sanjay Pujare <sanjaypujare@google.com>
Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
This updates the default container config to:
- use the envoyproxy.io website as the default upstream proxy
- listen for the admin interface on the 0.0.0.0 address

Fixes envoyproxy#13497

Signed-off-by: Ryan Northey <ryan@synca.io>
Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
* test: Check in all TLS test certs

- Will prevent openssl fork-emulation issues on Windows/msys2 that cause
  test flakiness
- modifies context_impl_test to no longer require a cert that is
  generated on the fly to expire in 15 days

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
* Added more documentation

* Added additional test

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
* Replaced addIdleTimeoutCallback/addDrainedCallback with
  addIdleCallback with a parameter to indicate whether we should start
  draining

* Removed the pool idle timer, instead relying on the connection idle
  timeouts to empty the pool

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
…/clean-up-connection-pools

Signed-off-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
@ggreenway
Member Author

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a comment on #16948 created by @ggreenway.


Member

@mattklein123 mattklein123 left a comment


LGTM. 🤞

Per offline discussion we can merge to get a bit of bake time before release and @alyssawilk can follow up on Monday with any post merge comments.

@ggreenway ggreenway merged commit 3876d7c into envoyproxy:main Jul 9, 2021
@rgs1
Member

rgs1 commented Jul 9, 2021

I'll try to canary this today so we can provide some signal/feedback.

@ggreenway
Member Author

> I'll try to canary this today so we can provide some signal/feedback.

Thanks!

ggreenway added a commit to ggreenway/envoy that referenced this pull request Jul 12, 2021
If a host is transitioned to unhealthy, and closing its idle
connections results in the pool becoming idle, it would trigger this crash.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
ggreenway added a commit that referenced this pull request Jul 13, 2021
If a host is transitioned to unhealthy, and closing its idle
connections results in the pool becoming idle, it would trigger this crash.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
dubious90 pushed a commit to envoyproxy/nighthawk that referenced this pull request Jul 13, 2021
Summary of actions/changes:
- Update bazel/repositories
- Other files already in sync
- Update AddDrainedCallback calls to instead be AddIdleCallback, due to [envoy PR#16948](envoyproxy/envoy#16948).  AddIdleCallback serves an equivalent, if not exactly the same, purpose.

Signed-off-by: Nathan Perry <nbperry@google.com>
@bianpengyuan
Contributor

We are getting a crash at checkForIdleAndCloseIdleConnsIfDraining after upgrading Istio to the latest Envoy, including #17302:

2021-07-13T19:21:36.345453Z	critical	envoy backtrace	Caught Segmentation fault, suspect faulting address 0x0
2021-07-13T19:21:36.345502Z	critical	envoy backtrace	Backtrace (use tools/stack_decode.py to get line numbers):
2021-07-13T19:21:36.345506Z	critical	envoy backtrace	Envoy version: 45fb150aa722b5c4bbbf58af5d59910ca08010f0/1.19.0-dev/Clean/RELEASE/BoringSSL
2021-07-13T19:21:36.345764Z	critical	envoy backtrace	#0: __restore_rt [0x7f0819a3f3c0]
2021-07-13T19:21:36.360093Z	critical	envoy backtrace	#1: std::__1::__function::__func<>::operator()() [0x55d76850c7ba]
2021-07-13T19:21:36.373821Z	critical	envoy backtrace	#2: Envoy::Tcp::OriginalConnPoolImpl::checkForIdleAndCloseIdleConnsIfDraining() [0x55d7686e3323]
2021-07-13T19:21:36.387629Z	critical	envoy backtrace	#3: Envoy::Tcp::OriginalConnPoolImpl::onConnectionEvent() [0x55d7686e4b05]
2021-07-13T19:21:36.401213Z	critical	envoy backtrace	#4: Envoy::Tcp::OriginalConnPoolImpl::ActiveConn::onEvent() [0x55d7686e7cfa]
2021-07-13T19:21:36.415301Z	critical	envoy backtrace	#5: Envoy::Network::ConnectionImplBase::raiseConnectionEvent() [0x55d768aa942b]
2021-07-13T19:21:36.428448Z	critical	envoy backtrace	#6: Envoy::Network::ConnectionImpl::raiseEvent() [0x55d768aa015f]
2021-07-13T19:21:36.444259Z	critical	envoy backtrace	#7: Envoy::Network::ConnectionImpl::closeSocket() [0x55d768a9fdfb]
2021-07-13T19:21:36.457692Z	critical	envoy backtrace	#8: Envoy::Network::ConnectionImpl::onReadReady() [0x55d768aa4452]
2021-07-13T19:21:36.470884Z	critical	envoy backtrace	#9: Envoy::Network::ConnectionImpl::onFileEvent() [0x55d768aa203f]
2021-07-13T19:21:36.484227Z	critical	envoy backtrace	#10: std::__1::__function::__func<>::operator()() [0x55d768a89181]
2021-07-13T19:21:36.499012Z	critical	envoy backtrace	#11: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x55d768a8a42c]

Link to the full log: https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/33974/integ-security-multicluster-tests_istio/1415021532173832192/artifacts/security-930b7bd830c741e097d857/TestReachability/_test_context/test-ns1-1-54234-state395228267/primary/a-v1-754f5f947c-kq82l_istio-proxy.previous.log
It is likely because of this change.

@ggreenway
Member Author

@bianpengyuan I'm looking into this. Out of curiosity, why are you using the original TCP conn pool? You must be setting a runtime flag to get that, because it is no longer the default. Did you have issues with the new one?

@rgs1
Member

rgs1 commented Jul 13, 2021

Was about to ask the same thing Greg asked. We've been on the new connection pool for a while now (even before it was on by default upstream), so I'm curious whether there are outstanding issues we may have missed that are causing you to use the old pool...

@bianpengyuan
Contributor

Yeah, we used to have some issues with the new TCP pool, but I think they have been resolved. Let us turn off that runtime flag and see if this goes away.

ggreenway added a commit to ggreenway/envoy that referenced this pull request Jul 13, 2021
…xy#17302)"

This reverts commit 3c266bb.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
ggreenway added a commit to ggreenway/envoy that referenced this pull request Jul 13, 2021
This reverts commit 3876d7c.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
@ggreenway
Member Author

@bianpengyuan I think we're going to revert the change for now, but it'll go back in after the next release, so any information you can gather about the crash would be helpful so that I can hopefully fix any issues before it is re-submitted.

ggreenway added a commit to ggreenway/envoy that referenced this pull request Jul 19, 2021
Delete connection pools when they have no connections anymore. This
fixes unbounded memory use for cases where a new connection pool is
needed for each downstream connection, such as when using upstream
PROXY protocol.

Fixes envoyproxy#16682

This reverts commit b7bc539.
This reverts PR envoyproxy#17319, by re-adding envoyproxy#17302 and envoyproxy#16948.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Co-authored-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
ggreenway added a commit that referenced this pull request Jul 26, 2021
Delete connection pools when they have no connections anymore. This
fixes unbounded memory use for cases where a new connection pool is
needed for each downstream connection, such as when using upstream
PROXY protocol.

This reverts commit b7bc539.
This reverts PR #17319, by re-adding #17302 and #16948.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Co-authored-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
Delete connection pools when they have no connections anymore. This fixes unbounded memory use for cases where a new connection pool is needed for each downstream connection, such as when using upstream PROXY protocol.

Fixes envoyproxy#16682

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Co-authored-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
If a host is transitioned to unhealthy, and closing its idle
connections results in the pool becoming idle, it would trigger this crash.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
Delete connection pools when they have no connections anymore. This
fixes unbounded memory use for cases where a new connection pool is
needed for each downstream connection, such as when using upstream
PROXY protocol.

This reverts commit b7bc539.
This reverts PR envoyproxy#17319, by re-adding envoyproxy#17302 and envoyproxy#16948.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Co-authored-by: Craig Radcliffe <craig.radcliffe@broadcom.com>
Successfully merging this pull request may close these issues.

Upstream PROXY protocol results in unbound number of connection pools