Skip to content

Conversation

markatou
Copy link
Contributor

This pull request contains a basic implementation of HDS, where a management server can request an HTTP healthcheck for any number of endpoints, the HdsDelegate healthchecks them, and reports back. The code also includes TODOs, to help identify the work that needs to be done next, like supporting updates to the set of endpoints that require healthchecking.

Risk Level: Low

Testing:
There are integration tests in test/integration/hds_integration_test.cc
and unit tests in test/common/upstream/hds_test.cc.

This work is for #1310.

markatou and others added 26 commits June 27, 2018 17:21
Signed-off-by: Lilika Markatou <lilika@google.com>
Flaky test

Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
@mattklein123
Copy link
Member

@htuch @alyssawilk who do you want to take a first pass on this?

@htuch
Copy link
Member

htuch commented Jul 30, 2018

@mattklein123 will do; I've already done some review feedback in a private PR, so this is close to final shape, but will give another pass.

@mattklein123
Copy link
Member

@htuch OK let me know when you are done and I can take a final pass.

Signed-off-by: Lilika Markatou <lilika@google.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Exciting to see this coming together. CC @biefy for HDS progress visibility.

health_check_request_.mutable_node()->MergeFrom(node);
retry_timer_ = dispatcher.createTimer([this]() -> void { establishNewStream(); });
response_timer_ = dispatcher.createTimer([this]() -> void { sendHealthCheckRequest(); });
server_response_timer_ = dispatcher.createTimer([this]() -> void { sendResponse(); });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename these two timers to hds_retry_timer_ and hds_stream_response_timer_ respectively to make it clearer what they do.

envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse
HdsDelegate::sendResponse() {
envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse response;
for (auto& cluster : hds_clusters_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: const auto&

for (const auto& hosts : cluster->prioritySet().hostSetsPerPriority()) {
for (const auto& host : hosts->hosts()) {
auto* endpoint = response.mutable_endpoint_health_response()->add_endpoints_health();
endpoint->mutable_endpoint()->mutable_address()->mutable_socket_address()->set_address(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());
ASSERT(message);

for (auto& cluster_health_check : message->health_check()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: const auto&


for (auto& cluster_health_check : message->health_check()) {
// Create HdsCluster config
envoy::api::v2::core::BindConfig bind_config;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: static const

info_factory_));

for (auto& health_check : cluster_config.health_checks()) {
health_checkers_.push_back(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably, HdsCluster could be response for owning and maintaining health_checkers_. Any thoughts here?

ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());

// Process the HealthCheckSpecifier message
processMessage(std::move(message));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely should at least reset hds_clusters_ between messages, otherwise it will just keep growing on each update forever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume no updates in this PR.

ssl_context_manager_, secret_manager_, added_via_api_);

for (const auto& host : cluster.hosts()) {
initial_hosts_->emplace_back(HostSharedPtr{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be emplace_back(new HostImpl..?

ClusterSharedPtr HdsCluster::create() { NOT_IMPLEMENTED_GCOVR_EXCL_LINE; }

HostVectorConstSharedPtr HdsCluster::createHealthyHostList(const HostVector& hosts) {
HostVectorSharedPtr healthy_list(new HostVector());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to just inherit from ClusterImplBase to get some of these methods for free?

const uint32_t RETRY_DELAY_MS = 5000;
const uint32_t RetryDelayMilliseconds = 5000;
static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
static constexpr uint32_t ClusterTimeoutSeconds = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments for all of these constants to explain what they do/represent?

Signed-off-by: Lilika Markatou <lilika@google.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I have some test comments on this pass.

}

void HdsDelegate::setServerResponseTimer() {
hds_stream_response_timer_->enableTimer(std::chrono::milliseconds(server_response_ms_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename the method to reflect the new field names?

* server with a set of hosts to healthcheck, healthchecking them, and reporting
* back the results.
*/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove blank line.

// TODO(htuch): Make this configurable or some static.
const uint32_t RETRY_DELAY_MS = 5000;

// How often we retry to establish a stream
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream to where?

const uint32_t RetryDelayMilliseconds = 5000;

// Soft limit on size of the cluster’s connections read and write buffers.
static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12345 is a strange buffer size :) Probably best to make it power-of-two.

static constexpr uint32_t ClusterTimeoutSeconds = 1;

// How often envoy reports the healthcheck results to the server
uint32_t server_response_ms_ = 1000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this default initialize to 0?

dispatcher_, runtime_, stats_store_, ssl_context_manager_,
secret_manager_, random_, test_factory_, log_manager_));
}
envoy::service::discovery::v2::HealthCheckSpecifier* simpleMessage() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to name methods as verbs, since they do something, e.g. createSimpleMessage.

auto* health_check2 = message->add_health_check();
health_check2->set_cluster_name("voronoi");

auto* address4 = health_check2->add_endpoints()->add_endpoints()->mutable_address();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could factor out a lot of this endpoint addition boiler plate to a helper function.

EXPECT_EQ(host6->address()->ip()->port(), 8765);
}

TEST_F(HdsTest, TestProcessMessageHealthChecks) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a one line description above each TEST_F explaining what the test does?

EXPECT_EQ(1, test_server_->counter("cluster.anna.health_check.failure")->value());
}

TEST_P(HdsIntegrationTest, TwoEndpointsSameLocality) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add explanatory test summary above each TEST_P here.

host2_stream->waitForEndStream(*dispatcher_);

// Check that the healthcheck messages are correct
EXPECT_STREQ(host_stream->headers().Path()->value().c_str(), "/healthcheck");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you refactor these tests to reduce some of the boiler plate?

Signed-off-by: Lilika Markatou <lilika@google.com>
Signed-off-by: Lilika Markatou <lilika@google.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I haven't fully gone over all tests, but I think this is now ready for another set of eyes. @mattklein123 want to take a pass? I think the main thing to discuss design wise is to what extent HdsCluster should reuse ClusterImplBase machinery, given it needs to only do a very limited subset of regular cluster behavior, vs. its own simplistic re-implementation of the cluster interface.

};

typedef std::unique_ptr<HdsDelegate> HdsDelegatePtr;

// Friend class of HdsDelegate, making it easier to access private fields
class HdsDelegateFriend {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just forward declare class HdsDelegateFriend and only fill in its contents in the test module.

hds_stream_->sendGrpcMessage(server_health_check_specifier);
// Wait until the request has been received by Envoy.
test_server_->waitForCounterGe("hds_delegate.requests", ++hds_requests_);
if (cluster2 != "None") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just use empty string instead of "None", that should also be an invalid cluster name and is more idiomatic C++ for an optional.

Signed-off-by: Lilika Markatou <lilika@google.com>
@htuch
Copy link
Member

htuch commented Aug 3, 2018

@mattklein123 friendly ping.

@mattklein123
Copy link
Member

Sorry I'm behind on reviews will do today.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level looks great to me. Some comments for TODOs for future work if you have time before your internship ends. Nice work!

/**
* Factory for creating ClusterInfo
*/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: kill newline between comment and think commented on. Same elsewhere.

// How often we retry to establish a stream to the management server
const uint32_t RetryDelayMilliseconds = 5000;

// Soft limit on size of the cluster’s connections read and write buffers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think it's odd that this setting and timeout are hard coded, and IMO this is a deficiency in the API that we should fix. Can you add a TODO around settings that are currently hard coded that we probably want to eventually add API knobs for?

* The HdsDelegate class is responsible for receiving requests from a management
* server with a set of hosts to healthcheck, healthchecking them, and reporting
* back the results.
*/
class HdsDelegate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a TODO around adding /config_dump support for this? I think it would be extremely useful to be able to see what hosts we are health checking like we do for the other xDS endpoints. Also, might consider adding a TODO around whether we want to add health check clusters to the /clusters endpoint to get detailed stats about each HC host. Also I think would be extremely useful for debugging.

Signed-off-by: Lilika Markatou <lilika@google.com>
@markatou
Copy link
Contributor Author

markatou commented Aug 3, 2018

Thanks! I added the TODOs.

@htuch htuch changed the title [WIP] Basic Implementation of HDS Basic Implementation of HDS Aug 5, 2018
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo a few last nits, ready to ship when fixed.

Stats::IsolatedStoreImpl stats_store_;
MockClusterInfoFactory test_factory_;

std::shared_ptr<Upstream::HdsDelegate> hds_delegate_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to be a shared_ptr? Could it just be unique? Shared pointers should only be used when non-trivial ownership semantics are required.

// Creates a HealthCheckSpecifier message that contains one endpoint and one
// healthcheck
envoy::service::discovery::v2::HealthCheckSpecifier* createSimpleMessage() {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove empty line.

}

// Test if processMessage processes healthchecks from a HealthCheckSpecifier
// message correctly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: double space

}
}

// Test if processMessage processes healthchecks from a HealthCheckSpecifier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/healthchecks/health checks/

Signed-off-by: Lilika Markatou <lilika@google.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome.

@htuch htuch merged commit f3b0f85 into envoyproxy:master Aug 6, 2018
zuercher added a commit to turbinelabs/envoy that referenced this pull request Aug 6, 2018
This reverts commit f3b0f85.

Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>
zuercher added a commit that referenced this pull request Aug 6, 2018
This reverts commit f3b0f85.

Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>
markatou added a commit to markatou/envoy that referenced this pull request Aug 6, 2018
…proxy#4063)"

This reverts commit 4e52589.

Signed-off-by: Lilika Markatou <lilika@google.com>
htuch pushed a commit that referenced this pull request Aug 7, 2018
Resolving the conflict and reverting the revert.

Signed-off-by: Lilika Markatou <lilika@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants