Basic Implementation of HDS #3973
Conversation
Signed-off-by: Lilika Markatou <lilika@google.com>
Flaky test Signed-off-by: Lilika Markatou <lilika@google.com>
@htuch @alyssawilk who do you want to take a first pass on this?
@mattklein123 will do; I've already done some review feedback in a private PR, so this is close to final shape, but will give another pass.
@htuch OK let me know when you are done and I can take a final pass.
Signed-off-by: Lilika Markatou <lilika@google.com>
Looks good! Exciting to see this coming together. CC @biefy for HDS progress visibility.
health_check_request_.mutable_node()->MergeFrom(node);
retry_timer_ = dispatcher.createTimer([this]() -> void { establishNewStream(); });
response_timer_ = dispatcher.createTimer([this]() -> void { sendHealthCheckRequest(); });
server_response_timer_ = dispatcher.createTimer([this]() -> void { sendResponse(); });
I would rename these two timers to hds_retry_timer_ and hds_stream_response_timer_ respectively to make it clearer what they do.
envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse
HdsDelegate::sendResponse() {
  envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse response;
  for (auto& cluster : hds_clusters_) {
Nit: const auto&
    for (const auto& hosts : cluster->prioritySet().hostSetsPerPriority()) {
      for (const auto& host : hosts->hosts()) {
        auto* endpoint = response.mutable_endpoint_health_response()->add_endpoints_health();
        endpoint->mutable_endpoint()->mutable_address()->mutable_socket_address()->set_address(
You might want to use https://github.com/envoyproxy/envoy/blob/master/source/common/network/utility.h#L208 here.
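A hedged sketch of that, assuming the addressToProtobufAddress(address, proto_address) helper from the linked utility header:

// Sketch (not the PR's actual code): populate the whole endpoint address in
// one call rather than setting socket_address fields by hand.
auto* endpoint = response.mutable_endpoint_health_response()->add_endpoints_health();
Network::Utility::addressToProtobufAddress(*host->address(),
                                           *endpoint->mutable_endpoint()->mutable_address());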
ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());
ASSERT(message);

for (auto& cluster_health_check : message->health_check()) {
Nit: const auto&
for (auto& cluster_health_check : message->health_check()) {
  // Create HdsCluster config
  envoy::api::v2::core::BindConfig bind_config;
Nit: static const
info_factory_));

for (auto& health_check : cluster_config.health_checks()) {
  health_checkers_.push_back(
Arguably, HdsCluster could be responsible for owning and maintaining health_checkers_. Any thoughts here?
ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());

// Process the HealthCheckSpecifier message
processMessage(std::move(message));
We definitely should at least reset hds_clusters_ between messages, otherwise it will just keep growing on each update forever.
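A minimal sketch of that reset, assuming the hds_clusters_ and health_checkers_ members shown in the hunks above:

// Sketch: drop state built from the previous HealthCheckSpecifier so repeated
// updates do not accumulate clusters and health checkers forever.
hds_clusters_.clear();
health_checkers_.clear();

// Process the new HealthCheckSpecifier message.
processMessage(std::move(message));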
I assume no updates in this PR.
ssl_context_manager_, secret_manager_, added_via_api_);

for (const auto& host : cluster.hosts()) {
  initial_hosts_->emplace_back(HostSharedPtr{
Can this just be emplace_back(new HostImpl...)?
ClusterSharedPtr HdsCluster::create() { NOT_IMPLEMENTED_GCOVR_EXCL_LINE; }

HostVectorConstSharedPtr HdsCluster::createHealthyHostList(const HostVector& hosts) {
  HostVectorSharedPtr healthy_list(new HostVector());
Would it make sense to just inherit from ClusterImplBase to get some of these methods for free?
const uint32_t RETRY_DELAY_MS = 5000;
const uint32_t RetryDelayMilliseconds = 5000;
static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
static constexpr uint32_t ClusterTimeoutSeconds = 1;
Can you add comments for all of these constants to explain what they do/represent?
Signed-off-by: Lilika Markatou <lilika@google.com>
Great, I have some test comments on this pass.
}

void HdsDelegate::setServerResponseTimer() {
  hds_stream_response_timer_->enableTimer(std::chrono::milliseconds(server_response_ms_));
Can you rename the method to reflect the new field names?
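For instance (method name hypothetical, chosen to match the renamed field):

// Sketch of the rename; body unchanged from the hunk above.
void HdsDelegate::setHdsStreamResponseTimer() {
  hds_stream_response_timer_->enableTimer(std::chrono::milliseconds(server_response_ms_));
}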
 * server with a set of hosts to healthcheck, healthchecking them, and reporting
 * back the results.
 */

Nit: remove blank line.
// TODO(htuch): Make this configurable or some static.
const uint32_t RETRY_DELAY_MS = 5000;

// How often we retry to establish a stream
Stream to where?
const uint32_t RetryDelayMilliseconds = 5000;

// Soft limit on size of the cluster’s connections read and write buffers.
static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
12345 is a strange buffer size :) Probably best to make it power-of-two.
static constexpr uint32_t ClusterTimeoutSeconds = 1;

// How often envoy reports the healthcheck results to the server
uint32_t server_response_ms_ = 1000;
Should this default initialize to 0?
test/common/upstream/hds_test.cc
dispatcher_, runtime_, stats_store_, ssl_context_manager_,
secret_manager_, random_, test_factory_, log_manager_));
}
envoy::service::discovery::v2::HealthCheckSpecifier* simpleMessage() {
Prefer to name methods as verbs, since they do something, e.g. createSimpleMessage.
test/common/upstream/hds_test.cc
auto* health_check2 = message->add_health_check();
health_check2->set_cluster_name("voronoi");

auto* address4 = health_check2->add_endpoints()->add_endpoints()->mutable_address();
You could factor out a lot of this endpoint addition boilerplate to a helper function.
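A hedged sketch of such a helper; the call chain mirrors the hunk above, while the ClusterHealthCheck type name is taken from hds.proto and the helper name is hypothetical:

// Hypothetical test helper: appends one endpoint with the given address and
// port to a ClusterHealthCheck entry, hiding the nested add_/mutable_ chain.
void addEndpoint(envoy::service::discovery::v2::ClusterHealthCheck& health_check,
                 const std::string& address, uint32_t port) {
  auto* socket_address =
      health_check.add_endpoints()->add_endpoints()->mutable_address()->mutable_socket_address();
  socket_address->set_address(address);
  socket_address->set_port_value(port);
}

With that, each endpoint in the message builder collapses to a single call such as addEndpoint(*health_check2, "127.0.0.1", 1234) (values illustrative).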
test/common/upstream/hds_test.cc
EXPECT_EQ(host6->address()->ip()->port(), 8765);
}

TEST_F(HdsTest, TestProcessMessageHealthChecks) {
Can you add a one-line description above each TEST_F explaining what the test does?
EXPECT_EQ(1, test_server_->counter("cluster.anna.health_check.failure")->value());
}

TEST_P(HdsIntegrationTest, TwoEndpointsSameLocality) {
Also add an explanatory test summary above each TEST_P here.
host2_stream->waitForEndStream(*dispatcher_);

// Check that the healthcheck messages are correct
EXPECT_STREQ(host_stream->headers().Path()->value().c_str(), "/healthcheck");
Can you refactor these tests to reduce some of the boilerplate?
Signed-off-by: Lilika Markatou <lilika@google.com>
Looks good. I haven't fully gone over all tests, but I think this is now ready for another set of eyes. @mattklein123 want to take a pass? I think the main thing to discuss design-wise is to what extent HdsCluster should reuse ClusterImplBase machinery, given it needs to do only a very limited subset of regular cluster behavior, vs. its own simplistic re-implementation of the cluster interface.
};

typedef std::unique_ptr<HdsDelegate> HdsDelegatePtr;

// Friend class of HdsDelegate, making it easier to access private fields
class HdsDelegateFriend {
I think you can just forward declare class HdsDelegateFriend and only fill in its contents in the test module.
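A minimal standalone illustration of the pattern, with a hypothetical private member for demonstration:

// In the production header: a forward declaration is enough to grant
// friendship; the class body never needs to appear here.
class HdsDelegateFriend;

class HdsDelegate {
  friend class HdsDelegateFriend;
  int private_counter_{0}; // hypothetical private state
};

// In the test module: the full definition lives here and can reach the
// private members of HdsDelegate directly.
class HdsDelegateFriend {
public:
  static int privateCounter(const HdsDelegate& delegate) { return delegate.private_counter_; }
};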
hds_stream_->sendGrpcMessage(server_health_check_specifier);
// Wait until the request has been received by Envoy.
test_server_->waitForCounterGe("hds_delegate.requests", ++hds_requests_);
if (cluster2 != "None") {
Maybe just use an empty string instead of "None"; that should also be an invalid cluster name and is more idiomatic C++ for an optional.
Signed-off-by: Lilika Markatou <lilika@google.com>
@mattklein123 friendly ping.
Sorry, I'm behind on reviews; will do today.
At a high level looks great to me. Some comments for TODOs for future work if you have time before your internship ends. Nice work!
/**
 * Factory for creating ClusterInfo
 */

nit: kill newline between comment and thing commented on. Same elsewhere.
// How often we retry to establish a stream to the management server
const uint32_t RetryDelayMilliseconds = 5000;

// Soft limit on size of the cluster’s connections read and write buffers.
FWIW I think it's odd that this setting and timeout are hard coded, and IMO this is a deficiency in the API that we should fix. Can you add a TODO around settings that are currently hard coded that we probably want to eventually add API knobs for?
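A sketch of what such a TODO might look like on the constants above (owner tag and wording illustrative):

// TODO(lilika): these values are hard coded for now; the HDS API should
// eventually grow knobs so the management server can configure them.
static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
static constexpr uint32_t ClusterTimeoutSeconds = 1;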
 * The HdsDelegate class is responsible for receiving requests from a management
 * server with a set of hosts to healthcheck, healthchecking them, and reporting
 * back the results.
 */
class HdsDelegate
Can we add a TODO around adding /config_dump support for this? I think it would be extremely useful to be able to see what hosts we are health checking, like we do for the other xDS endpoints. Also, might consider adding a TODO around whether we want to add health check clusters to the /clusters endpoint to get detailed stats about each HC host. I think that would also be extremely useful for debugging.
Signed-off-by: Lilika Markatou <lilika@google.com>
Thanks! I added the TODOs.
LGTM modulo a few last nits, ready to ship when fixed.
test/common/upstream/hds_test.cc
Stats::IsolatedStoreImpl stats_store_;
MockClusterInfoFactory test_factory_;

std::shared_ptr<Upstream::HdsDelegate> hds_delegate_;
Does this really need to be a shared_ptr? Could it just be unique? Shared pointers should only be used when non-trivial ownership semantics are required.
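A one-line sketch using the HdsDelegatePtr alias already defined in the production header:

Upstream::HdsDelegatePtr hds_delegate_; // unique ownership; no shared semantics needed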
test/common/upstream/hds_test.cc
// Creates a HealthCheckSpecifier message that contains one endpoint and one
// healthcheck
envoy::service::discovery::v2::HealthCheckSpecifier* createSimpleMessage() {

Nit: remove empty line.
test/common/upstream/hds_test.cc
}

// Test if processMessage processes healthchecks from a HealthCheckSpecifier
// message correctly
Nit: double space
test/common/upstream/hds_test.cc
}
}

// Test if processMessage processes healthchecks from a HealthCheckSpecifier
Nit: s/healthchecks/health checks/
Signed-off-by: Lilika Markatou <lilika@google.com>
Awesome.
This reverts commit f3b0f85. Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>
…proxy#4063)" This reverts commit 4e52589. Signed-off-by: Lilika Markatou <lilika@google.com>
This pull request contains a basic implementation of HDS: a management server can request an HTTP health check for any number of endpoints, and the HdsDelegate health checks them and reports the results back. The code also includes TODOs to help identify the work that needs to be done next, such as supporting updates to the set of endpoints that require health checking.
Risk Level: Low
Testing: There are integration tests in test/integration/hds_integration_test.cc and unit tests in test/common/upstream/hds_test.cc.
This work is for #1310.