Making //test/integration:hds_integration_test test faster and more robust #4164
Conversation
Signed-off-by: Lilika Markatou <lilika@google.com>
Re: reducing timeouts, can you run the test locally with TSAN and ASAN? These are much slower than a regular debug build, so make sure your reduced timeouts don't fail them.
@@ -118,7 +118,7 @@ class HdsIntegrationTest : public HttpIntegrationTest,
   // one HTTP health_check
   envoy::service::discovery::v2::HealthCheckSpecifier makeHttpHealthCheckSpecifier() {
     envoy::service::discovery::v2::HealthCheckSpecifier server_health_check_specifier_;
-    server_health_check_specifier_.mutable_interval()->set_seconds(1);
+    server_health_check_specifier_.mutable_interval()->set_nanos(100000000);
Why not still use seconds here for readability? Same elsewhere? Or perhaps a constant such as MaxTimeout to make it clear that we don't care about the timeout?
This interval specifies how often the HdsDelegate sends a message with health checking information back to the management server. Before these changes, I was using time to make sure that the cluster updated before a response was sent. Now, I check the stats to make sure that the cluster processed the change before I look at what the HdsDelegate sends to the server.
In order to make the interval less than a second, I need to use nanos.
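As a concrete illustration of the nanos requirement, the reviewer's naming suggestion could look like the sketch below (not code from the PR; the constant name is hypothetical):

// Hypothetical named constant; a protobuf Duration splits time into whole
// seconds plus nanoseconds, so a 100ms interval has to go through
// set_nanos() rather than set_seconds().
static constexpr int HdsIntervalNanos = 100000000; // 100ms
server_health_check_specifier_.mutable_interval()->set_nanos(HdsIntervalNanos);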
Ah, OK. Alright, some type of constant or comment might still be nice here, since it's hard to tell at a glance how much time this is.
Signed-off-by: Lilika Markatou <lilika@google.com>
I ran the test locally with TSAN and ASAN, and it passed 1000/1000 times for each one.
Signed-off-by: Lilika Markatou <lilika@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to fix the HDS flakes here. I agree with generally shortening the timeouts for timeout tests to force immediate timeout behavior, and lengthening any timeout that doesn't slow down the test. I'm not sure about some of the wait-for-cluster-stats-to-change steps, though.
@@ -201,6 +201,7 @@ class HdsIntegrationTest : public HttpIntegrationTest,
   FakeHttpConnectionPtr host2_fake_connection_;
   FakeRawConnectionPtr host_fake_raw_connection_;

+  int MaxTimeout = 100;
Nit: static constexpr int
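Applied to the diff above, the nit would make the constant compile-time and avoid per-instance storage (a sketch; the name MaxTimeout comes from the PR, the comment is illustrative):

// static constexpr: a compile-time constant shared by all test instances,
// rather than a mutable per-object member.
static constexpr int MaxTimeout = 100;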
@@ -396,6 +412,9 @@ TEST_P(HdsIntegrationTest, SingleEndpointHealthyTcp) {
   RELEASE_ASSERT(result, result.message());
   host_upstream_->set_allow_unexpected_disconnects(true);

+  // Make sure the cluster processed the endpoint's response
+  test_server_->waitForCounterGe("cluster.anna.health_check.success", 1);
It's a bit unclear to me what the win with some of these waits is. We will need to wait for the gRPC response from Envoy to the HDS server below, and those messages are sent independently of what we're doing here wait-wise.
Signed-off-by: markatou <lilika@google.com>
Great, looks robust now. Just two last nits if you have time tomorrow.
ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
while (!(checkEndpointHealthResponse(response_.endpoint_health_response().endpoints_health(0),
                                     envoy::api::v2::core::HealthStatus::UNHEALTHY,
                                     host_upstream_->localAddress()) &
Nit: prefer logical && over bitwise & when expressing a logical condition; it's more declarative, and also safer when one of the conjuncts might not be only 0 or 1.
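A minimal illustration of the hazard the reviewer is pointing at (not code from the PR):

int a = 2; // nonzero, so logically "true", but its low bit is 0
int b = 1; // nonzero, low bit is 1
bool bitwise = a & b;  // 2 & 1 == 0, so false despite both values being "true"
bool logical = a && b; // true, matching the intended logical reading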
while (!checkEndpointHealthResponse(response_.endpoint_health_response().endpoints_health(0),
                                    envoy::api::v2::core::HealthStatus::HEALTHY,
                                    host_upstream_->localAddress())) {
  ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
This while pattern appears a lot; can you factor it into a waitForEndpointHealthResponse method? You can skip doing this for the more complex case of waiting for two conditions further below, but I think most of the cases are simple.
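A sketch of what the requested helper might look like, built from the loop above (the method name comes from the review; the exact signature and parameters are assumptions):

// Drains HDS responses until the first endpoint reports the expected
// health status for the test upstream's address.
void waitForEndpointHealthResponse(envoy::api::v2::core::HealthStatus expected) {
  ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
  while (!checkEndpointHealthResponse(response_.endpoint_health_response().endpoints_health(0),
                                      expected, host_upstream_->localAddress())) {
    ASSERT_TRUE(hds_stream_->waitForGrpcMessage(*dispatcher_, response_));
  }
}

Call sites would then collapse to a single line, e.g. waitForEndpointHealthResponse(envoy::api::v2::core::HealthStatus::HEALTHY);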
Signed-off-by: markatou <lilika@google.com>
Thanks!
Description:
Follow up on #4149.
I increased any unused timers to 100 seconds. Additionally, instead of waiting on a timer to go off, all but one of the tests now wait on cluster stats to change.
I decreased the timer wait times, reducing the test runtime from 25 seconds to 5.
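The general shape of that change is sketched below (the sleep line stands in for the old timer-based wait; the counter name is taken from the diff above):

// Before: sleep for a fixed interval and hope the cluster has caught up
// (slow, and flaky under TSAN/ASAN).
// After: block until the cluster's stats prove the health check ran.
test_server_->waitForCounterGe("cluster.anna.health_check.success", 1);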
Exceptions:
Risk Level:
Low
Testing:
I ran the test locally, and it passed 3000/3000 times.
This work is for #1310.