Conversation

markatou
Contributor

Description:
Follow up on #4149.

I increased all unused timers to 100 seconds. Additionally, instead of waiting for a timer to go off, all but one test now wait for cluster stats to change.
I also decreased the timer wait times, reducing the test runtime from 25 seconds to 5 seconds.
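As a rough sketch, the stat-wait pattern most of these tests now use looks like the following (the counter name is taken from one of the diffs below; the exact stat varies per test):

// Instead of sleeping on a timer, block until the cluster has recorded the health-check result.
test_server_->waitForCounterGe("cluster.anna.health_check.success", 1);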

Exceptions:

  • During TCP health checking, if an endpoint responds with the wrong phrase, the cluster failure stats do not change, so the one test that examines this case still depends on timers.
  • Two tests cover hosts timing out. I set those timeouts to 0.1 seconds (vs. 100), but both tests still wait for the cluster stats to change.

Risk Level:
Low

Testing:
I ran the test locally, and it passed 3000/3000 times.

This work is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>
@lizan
Member

lizan commented Aug 15, 2018

re: reducing timeouts, can you run the tests locally with TSAN and ASAN? Those builds are much slower than a regular debug build, so make sure your reduced timeouts don't fail under them.

Member

@mattklein123 mattklein123 left a comment

Thanks, this is exactly what I was asking for. Will defer to @htuch and @lizan for further review.

@@ -118,7 +118,7 @@ class HdsIntegrationTest : public HttpIntegrationTest,
   // one HTTP health_check
   envoy::service::discovery::v2::HealthCheckSpecifier makeHttpHealthCheckSpecifier() {
     envoy::service::discovery::v2::HealthCheckSpecifier server_health_check_specifier_;
-    server_health_check_specifier_.mutable_interval()->set_seconds(1);
+    server_health_check_specifier_.mutable_interval()->set_nanos(100000000);
Member

Why not still use seconds here for readability? Same elsewhere? Or perhaps a constant such as MaxTimeout to make it clear that we don't care about the timeout?

Contributor Author

This interval specifies how often the HdsDelegate sends a message with health checking information back to the management server. Before these changes, I relied on timing to make sure the cluster updated before a response was sent. Now I check the stats to make sure the cluster has processed the change before I look at what the HdsDelegate sends to the server.

In order to make the interval less than a second, I need to use nanos.

Member

Ah, ok. Some type of constant or comment might still be nice here, since it's hard to read how much time this is.
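For illustration, one way the suggestion could look, with a hypothetical constant name (this is not what the PR committed):

// Hypothetical constant to make the sub-second interval readable: 0.1s between
// HdsDelegate reports to the management server.
static constexpr int HdsIntervalNanos = 100000000;

server_health_check_specifier_.mutable_interval()->set_nanos(HdsIntervalNanos);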

Signed-off-by: Lilika Markatou <lilika@google.com>
@markatou
Contributor Author

I ran the test locally with TSAN and ASAN, and it passed 1000/1000 times under each.

Signed-off-by: Lilika Markatou <lilika@google.com>
Member

@htuch htuch left a comment

Nice to fix the HDS flakes here. I agree with generally shortening the timeouts for timeout tests to force immediate timeout behavior, and lengthening any timeout that doesn't slow down the test. I'm not sure about some of the "wait for cluster stats to change" steps, though.

@@ -201,6 +201,7 @@ class HdsIntegrationTest : public HttpIntegrationTest,
   FakeHttpConnectionPtr host2_fake_connection_;
   FakeRawConnectionPtr host_fake_raw_connection_;

+  int MaxTimeout = 100;
Member

Nit: static constexpr int
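A sketch of the nit applied, reusing the name and value from the diff above (the trailing comment is illustrative):

// Large "don't care" value, in seconds, for timers the tests never actually wait on.
static constexpr int MaxTimeout = 100;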

@@ -396,6 +412,9 @@ TEST_P(HdsIntegrationTest, SingleEndpointHealthyTcp) {
   RELEASE_ASSERT(result, result.message());
   host_upstream_->set_allow_unexpected_disconnects(true);

+  // Make sure the cluster processed the endpoint's response
+  test_server_->waitForCounterGe("cluster.anna.health_check.success", 1);
Member

It's a bit unclear to me what the win with some of the waits is. We will need to wait for the gRPC response from Envoy to the HDS server below, and those messages are sent independently of what we're doing here wait-wise.

Member

@htuch htuch left a comment

Great, looks robust now. Just two last nits if you have time tomorrow.

ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
while (!(checkEndpointHealthResponse(response_.endpoint_health_response().endpoints_health(0),
                                     envoy::api::v2::core::HealthStatus::UNHEALTHY,
                                     host_upstream_->localAddress()) &
Member

Nit: prefer logical && over bitwise & when expressing a logical condition, i.e. it's more declarative; it's also safer when one of the conjuncts might not be only 0 or 1.
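A tiny standalone illustration of the hazard (not from the test code):

int a = 2;            // logically "true", but not 0 or 1
int b = 1;
bool wrong = a & b;   // bitwise: 2 & 1 == 0, so this is false
bool right = a && b;  // logical: true; also short-circuits evaluation of b when a is 0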

while (!checkEndpointHealthResponse(response_.endpoint_health_response().endpoints_health(0),
                                    envoy::api::v2::core::HealthStatus::HEALTHY,
                                    host_upstream_->localAddress())) {
  ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
Member

This while pattern appears a lot; can you factor it into a waitForEndpointHealthResponse method? You can skip doing this for the more complex case of waiting for two conditions further below, but I think most of the cases are simple.
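One possible shape for such a helper, assuming the checkEndpointHealthResponse, hds_stream_, dispatcher_, and response_ members seen in the excerpts above (the name matches the reviewer's suggestion; the parameter types are illustrative):

// Keep reading HDS responses until the first endpoint reports the expected status.
void waitForEndpointHealthResponse(envoy::api::v2::core::HealthStatus expected_status,
                                   const Network::Address::InstanceConstSharedPtr& address) {
  ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
  while (!checkEndpointHealthResponse(response_.endpoint_health_response().endpoints_health(0),
                                      expected_status, address)) {
    ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
  }
}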

Signed-off-by: markatou <lilika@google.com>
Member

@htuch htuch left a comment

Thanks!

@htuch htuch merged commit 4acd243 into envoyproxy:master Aug 22, 2018