Skip to content

Conversation

qiwzhang
Copy link
Contributor

@qiwzhang qiwzhang commented Sep 6, 2018

Signed-off-by: Wayne Zhang qiwzhang@google.com

Description:

This change tries to fix #4352

Problem:
When GrpcSubscriptionImpl is deleted, it tries to send one more request update. Newly added sds_api is using GrpcSubscriptionImpl which is owned by a ListenerImpl. When server is killed, GoogleAsyncClientThreadLocal is removed before ListenerManager is deleted. It will cause sds_api to send a google grpc request with a deleted GoogleAsyncClientThreadLocal. Hence the crash.

Fix:
When GoogleAsyncClientThreadLocal is removed, all its active streams will be marked as draining_cq_ = true. If GoogleAsyncClientImpl::writeQueued is called with that flag, bail out.

Risk Level: Low

Testing:
bazel test --runs_per_test=1000 //test/integration:sds_dynamic_integration_test

Docs Changes: None

Release Notes: None

Signed-off-by: Wayne Zhang <qiwzhang@google.com>
@qiwzhang
Copy link
Contributor Author

qiwzhang commented Sep 6, 2018

@htuch Please help to review this.

@htuch htuch self-assigned this Sep 6, 2018
@@ -205,7 +205,8 @@ void GoogleAsyncStreamImpl::resetStream() {
}

void GoogleAsyncStreamImpl::writeQueued() {
if (!call_initialized_ || finish_pending_ || write_pending_ || write_pending_queue_.empty()) {
if (!call_initialized_ || finish_pending_ || write_pending_ || write_pending_queue_.empty() ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a legitimate change; we shouldn't be queueing new writes if we're draining for sure. Does this solve your teardown race for the Envoy gRPC client as well?

Trying to recall our earlier conversation, it seems we should get a clear teardown ordering. We should ensure that all the gRPC consumers are properly removed before we begin the TLS teardown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it solves the test flakiness. tested with runs_per_test=1000.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that teardown ordering in a separate PR. We need to get this in to unblock CI tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will merge given that you folks have ownership on the underlying issue. Thanks.

@htuch htuch merged commit c4211b3 into envoyproxy:master Sep 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sporadic timeout in sds_dynamic_integration_test
2 participants