Basic Implementation of HDS #3973

markatou · 2018-07-27T20:24:08Z

This pull request contains a basic implementation of HDS: a management server can request an HTTP healthcheck for any number of endpoints, and the HdsDelegate healthchecks them and reports back the results. The code also includes TODOs that identify the work to be done next, such as supporting updates to the set of endpoints that require healthchecking.

Risk Level: Low

Testing:
There are integration tests in test/integration/hds_integration_test.cc
and unit tests in test/common/upstream/hds_test.cc.

This work is for #1310.

Signed-off-by: Lilika Markatou <lilika@google.com>

mattklein123 · 2018-07-30T05:18:31Z

@htuch @alyssawilk who do you want to take a first pass on this?

htuch · 2018-07-30T15:17:55Z

@mattklein123 will do; I've already done some review feedback in a private PR, so this is close to final shape, but will give another pass.

mattklein123 · 2018-07-30T16:13:33Z

@htuch OK let me know when you are done and I can take a final pass.

htuch

Looks good! Exciting to see this coming together. CC @biefy for HDS progress visibility.

htuch · 2018-07-31T11:38:24Z

source/common/upstream/health_discovery_service.cc

  health_check_request_.mutable_node()->MergeFrom(node);
  retry_timer_ = dispatcher.createTimer([this]() -> void { establishNewStream(); });
-  response_timer_ = dispatcher.createTimer([this]() -> void { sendHealthCheckRequest(); });
+  server_response_timer_ = dispatcher.createTimer([this]() -> void { sendResponse(); });


I would rename these two timers to hds_retry_timer_ and hds_stream_response_timer_ respectively to make it clearer what they do.

htuch · 2018-07-31T11:38:47Z

source/common/upstream/health_discovery_service.cc

+envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse
+HdsDelegate::sendResponse() {
+  envoy::service::discovery::v2::HealthCheckRequestOrEndpointHealthResponse response;
+  for (auto& cluster : hds_clusters_) {


Nit: const auto&
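
To illustrate the nit, here is a minimal sketch of a read-only traversal using `const auto&`. `FakeCluster`, `FakeClusterPtr`, and `collectNames` are hypothetical stand-ins, not types or functions from this PR.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the real cluster type.
struct FakeCluster {
  std::string name;
};
using FakeClusterPtr = std::shared_ptr<FakeCluster>;

// Read-only traversal: const auto& avoids copying each shared_ptr (no
// reference-count traffic) and prevents the loop body from reseating it.
std::vector<std::string> collectNames(const std::vector<FakeClusterPtr>& clusters) {
  std::vector<std::string> names;
  for (const auto& cluster : clusters) {
    assert(cluster != nullptr);
    names.push_back(cluster->name);
  }
  return names;
}
```

Note that `const auto&` on a `shared_ptr` only makes the pointer itself const; whether the pointee is mutable still depends on the element type.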

htuch · 2018-07-31T11:40:28Z

source/common/upstream/health_discovery_service.cc

+    for (const auto& hosts : cluster->prioritySet().hostSetsPerPriority()) {
+      for (const auto& host : hosts->hosts()) {
+        auto* endpoint = response.mutable_endpoint_health_response()->add_endpoints_health();
+        endpoint->mutable_endpoint()->mutable_address()->mutable_socket_address()->set_address(


You might want to use https://github.com/envoyproxy/envoy/blob/master/source/common/network/utility.h#L208 here.

htuch · 2018-07-31T11:41:00Z

source/common/upstream/health_discovery_service.cc

+  ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());
+  ASSERT(message);
+
+  for (auto& cluster_health_check : message->health_check()) {


Nit: const auto&

htuch · 2018-07-31T11:41:32Z

source/common/upstream/health_discovery_service.cc

+
+  for (auto& cluster_health_check : message->health_check()) {
+    // Create HdsCluster config
+    envoy::api::v2::core::BindConfig bind_config;


Nit: static const

htuch · 2018-07-31T11:46:05Z

source/common/upstream/health_discovery_service.cc

+                                              info_factory_));
+
+    for (auto& health_check : cluster_config.health_checks()) {
+      health_checkers_.push_back(


Arguably, HdsCluster could be responsible for owning and maintaining health_checkers_. Any thoughts here?

htuch · 2018-07-31T11:47:30Z

source/common/upstream/health_discovery_service.cc

+  ENVOY_LOG(debug, "New health check response message {} ", message->DebugString());
+
+  // Process the HealthCheckSpecifier message
+  processMessage(std::move(message));


We definitely should at least reset hds_clusters_ between messages, otherwise it will just keep growing on each update forever.

I assume no updates in this PR.
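
A toy sketch of the concern, assuming a simplified delegate that only tracks cluster names. `FakeDelegate` and its members are hypothetical and do not mirror the real HdsDelegate interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the delegate's cluster bookkeeping.
class FakeDelegate {
public:
  // Rebuilds the cluster list from scratch for every incoming message.
  void processMessage(const std::vector<std::string>& cluster_names) {
    // Without this reset, entries from earlier messages would accumulate.
    clusters_.clear();
    for (const auto& name : cluster_names) {
      clusters_.push_back(name);
    }
    // After processing, the state reflects only the latest message.
    assert(clusters_.size() == cluster_names.size());
  }

  std::size_t clusterCount() const { return clusters_.size(); }

private:
  std::vector<std::string> clusters_;
};
```

With the `clear()` call, processing a two-cluster message followed by a one-cluster message leaves one cluster; without it, the list would hold three.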

htuch · 2018-07-31T11:49:08Z

source/common/upstream/health_discovery_service.cc

+                                         ssl_context_manager_, secret_manager_, added_via_api_);
+
+  for (const auto& host : cluster.hosts()) {
+    initial_hosts_->emplace_back(HostSharedPtr{


Can this just be emplace_back(new HostImpl..?
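
For reference, a small sketch of why this works in general C++: `emplace_back` forwards its argument to the element's constructor, so a raw pointer can construct the `shared_ptr` in place. `FakeHost` and `buildHosts` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for HostImpl.
struct FakeHost {
  explicit FakeHost(std::string a) : address(std::move(a)) {}
  std::string address;
};
using FakeHostSharedPtr = std::shared_ptr<FakeHost>;

std::vector<FakeHostSharedPtr> buildHosts(const std::vector<std::string>& addresses) {
  std::vector<FakeHostSharedPtr> hosts;
  for (const auto& address : addresses) {
    assert(!address.empty());
    // emplace_back forwards the raw pointer to shared_ptr's explicit
    // constructor, so no temporary FakeHostSharedPtr{...} is spelled out.
    hosts.emplace_back(new FakeHost(address));
  }
  return hosts;
}
```

One trade-off worth keeping in mind: if the vector's reallocation throws before the `shared_ptr` takes ownership, the raw pointer leaks, which is why `push_back(std::make_shared<...>(...))` is often preferred.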

htuch · 2018-07-31T11:51:16Z

source/common/upstream/health_discovery_service.cc

+ClusterSharedPtr HdsCluster::create() { NOT_IMPLEMENTED_GCOVR_EXCL_LINE; }
+
+HostVectorConstSharedPtr HdsCluster::createHealthyHostList(const HostVector& hosts) {
+  HostVectorSharedPtr healthy_list(new HostVector());


Would it make sense to just inherit from ClusterImplBase to get some of these methods for free?

htuch · 2018-07-31T11:51:49Z

source/common/upstream/health_discovery_service.h

-  const uint32_t RETRY_DELAY_MS = 5000;
+  const uint32_t RetryDelayMilliseconds = 5000;
+  static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;
+  static constexpr uint32_t ClusterTimeoutSeconds = 1;


Can you add comments for all of these constants to explain what they do/represent?

htuch

Great, I have some test comments on this pass.

htuch · 2018-08-01T11:52:48Z

source/common/upstream/health_discovery_service.cc

+}
+
+void HdsDelegate::setServerResponseTimer() {
+  hds_stream_response_timer_->enableTimer(std::chrono::milliseconds(server_response_ms_));


Can you rename the method to reflect the new field names?

htuch · 2018-08-01T11:54:28Z

source/common/upstream/health_discovery_service.h

+ * server with a set of hosts to healthcheck, healthchecking them, and reporting
+ * back the results.
+ */
+


Nit: remove blank line.

htuch · 2018-08-01T11:54:47Z

source/common/upstream/health_discovery_service.h

  // TODO(htuch): Make this configurable or some static.
-  const uint32_t RETRY_DELAY_MS = 5000;
+
+  // How often we retry to establish a stream


Stream to where?

htuch · 2018-08-01T11:55:13Z

source/common/upstream/health_discovery_service.h

+  const uint32_t RetryDelayMilliseconds = 5000;
+
+  // Soft limit on size of the cluster’s connections read and write buffers.
+  static constexpr uint32_t ClusterConnectionBufferLimitBytes = 12345;


12345 is a strange buffer size :) Probably best to make it power-of-two.
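
A minimal sketch of how a power-of-two limit could be enforced at compile time. The value 32768 is an assumption for illustration only; the thread does not say which size was finally chosen, and `isPowerOfTwo` and `roundUpToPowerOfTwo` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// True when exactly one bit is set.
constexpr bool isPowerOfTwo(std::uint32_t value) {
  return value != 0 && (value & (value - 1)) == 0;
}

// Assumed value for illustration; not taken from the PR.
static constexpr std::uint32_t ClusterConnectionBufferLimitBytes = 32768;
static_assert(isPowerOfTwo(ClusterConnectionBufferLimitBytes),
              "buffer limit should be a power of two");
static_assert(!isPowerOfTwo(12345), "12345 is not a power of two");

// Rounds value up to the nearest power of two; value must be in [1, 2^31].
std::uint32_t roundUpToPowerOfTwo(std::uint32_t value) {
  assert(value >= 1 && value <= (1u << 31));
  std::uint32_t result = 1;
  while (result < value) {
    result <<= 1;
  }
  return result;
}
```

The `static_assert` turns the reviewer's convention into a build failure if someone later changes the constant to a value like 12345.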

htuch · 2018-08-01T11:55:47Z

source/common/upstream/health_discovery_service.h

+  static constexpr uint32_t ClusterTimeoutSeconds = 1;
+
+  // How often envoy reports the healthcheck results to the server
+  uint32_t server_response_ms_ = 1000;


Should this default initialize to 0?

htuch · 2018-08-01T11:58:06Z

test/common/upstream/hds_test.cc

+                                        dispatcher_, runtime_, stats_store_, ssl_context_manager_,
+                                        secret_manager_, random_, test_factory_, log_manager_));
+  }
+  envoy::service::discovery::v2::HealthCheckSpecifier* simpleMessage() {


Prefer to name methods as verbs, since they do something, e.g. createSimpleMessage.

htuch · 2018-08-01T11:59:10Z

test/common/upstream/hds_test.cc

+  auto* health_check2 = message->add_health_check();
+  health_check2->set_cluster_name("voronoi");
+
+  auto* address4 = health_check2->add_endpoints()->add_endpoints()->mutable_address();


You could factor out a lot of this endpoint addition boilerplate to a helper function.
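
One possible shape for such a helper, sketched with plain structs rather than the real protobuf messages. `FakeSocketAddress`, `FakeHealthCheck`, and `addEndpoint` are all hypothetical; the actual test would operate on the generated `HealthCheckSpecifier` types.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the protobuf address and health check messages.
struct FakeSocketAddress {
  std::string address;
  std::uint32_t port_value = 0;
};

struct FakeHealthCheck {
  std::string cluster_name;
  std::vector<FakeSocketAddress> endpoints;
};

// Hypothetical helper: appends one endpoint so each test states only the
// address and port instead of repeating the field-by-field setup.
void addEndpoint(FakeHealthCheck& health_check, const std::string& address,
                 std::uint32_t port) {
  assert(port <= 65535);
  FakeSocketAddress socket_address;
  socket_address.address = address;
  socket_address.port_value = port;
  health_check.endpoints.push_back(socket_address);
}
```

Each additional endpoint in a test then becomes a single call such as `addEndpoint(health_check, "127.0.0.1", 8765)`.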

htuch · 2018-08-01T11:59:58Z

test/common/upstream/hds_test.cc

+  EXPECT_EQ(host6->address()->ip()->port(), 8765);
+}
+
+TEST_F(HdsTest, TestProcessMessageHealthChecks) {


Can you add a one line description above each TEST_F explaining what the test does?

htuch · 2018-08-01T12:01:42Z

test/integration/hds_integration_test.cc

+  EXPECT_EQ(1, test_server_->counter("cluster.anna.health_check.failure")->value());
+}
+
+TEST_P(HdsIntegrationTest, TwoEndpointsSameLocality) {


Also add explanatory test summary above each TEST_P here.

htuch · 2018-08-01T12:02:16Z

test/integration/hds_integration_test.cc

+  host2_stream->waitForEndStream(*dispatcher_);
+
+  // Check that the healthcheck messages are correct
+  EXPECT_STREQ(host_stream->headers().Path()->value().c_str(), "/healthcheck");


Can you refactor these tests to reduce some of the boilerplate?

htuch

Looks good. I haven't fully gone over all tests, but I think this is now ready for another set of eyes. @mattklein123 want to take a pass? I think the main thing to discuss design wise is to what extent HdsCluster should reuse ClusterImplBase machinery, given it needs to only do a very limited subset of regular cluster behavior, vs. its own simplistic re-implementation of the cluster interface.

htuch · 2018-08-01T22:53:58Z

source/common/upstream/health_discovery_service.h

 };

 typedef std::unique_ptr<HdsDelegate> HdsDelegatePtr;

+// Friend class of HdsDelegate, making it easier to access private fields
+class HdsDelegateFriend {


I think you can just forward declare class HdsDelegateFriend and only fill in its contents in the test module.
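
A sketch of the pattern being suggested, shown in one file for brevity. `FakeDelegate` and `FakeDelegateFriend` are hypothetical names; in practice the first part would live in the header and the second in the test module.

```cpp
#include <cassert>
#include <cstdint>

// --- Header (production code) ---
// Forward declaration only; the header carries no test-specific definition.
class FakeDelegateFriend;

class FakeDelegate {
public:
  void recordRequest() { ++requests_; }

private:
  friend class FakeDelegateFriend;
  std::uint32_t requests_ = 0;
};

// --- Test module ---
// The friend class is defined only where the tests need private access.
class FakeDelegateFriend {
public:
  static std::uint32_t requests(const FakeDelegate& delegate) { return delegate.requests_; }

  static void expectRequests(const FakeDelegate& delegate, std::uint32_t expected) {
    assert(delegate.requests_ == expected);
  }
};
```

This keeps the accessor code out of the production header while still granting the test module access to private fields.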

htuch · 2018-08-01T22:56:12Z

test/integration/hds_integration_test.cc

-    hds_stream_->sendGrpcMessage(server_health_check_specifier);
-    // Wait until the request has been received by Envoy.
-    test_server_->waitForCounterGe("hds_delegate.requests", ++hds_requests_);
+    if (cluster2 != "None") {


Maybe just use empty string instead of "None", that should also be an invalid cluster name and is more idiomatic C++ for an optional.
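
A minimal sketch of the empty-string convention, assuming a test helper that takes an optional second cluster name. `clustersToCheck` is a hypothetical function, not one from the integration test.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: the second cluster is optional, and an empty name
// (never a valid cluster name) means "not provided".
std::vector<std::string> clustersToCheck(const std::string& cluster1,
                                         const std::string& cluster2 = "") {
  assert(!cluster1.empty());
  std::vector<std::string> clusters{cluster1};
  if (!cluster2.empty()) {
    clusters.push_back(cluster2);
  }
  return clusters;
}
```

Compared with a `"None"` sentinel, the empty default cannot collide with a real cluster name and lets callers simply omit the argument.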

htuch · 2018-08-03T15:08:09Z

@mattklein123 friendly ping.

mattklein123 · 2018-08-03T15:39:10Z

Sorry, I'm behind on reviews; will do today.

mattklein123

At a high level looks great to me. Some comments for TODOs for future work if you have time before your internship ends. Nice work!

mattklein123 · 2018-08-03T16:30:16Z

include/envoy/upstream/cluster_manager.h

+/**
+ * Factory for creating ClusterInfo
+ */
+


nit: kill newline between comment and thing commented on. Same elsewhere.

mattklein123 · 2018-08-03T16:43:29Z

source/common/upstream/health_discovery_service.h

+  // How often we retry to establish a stream to the management server
+  const uint32_t RetryDelayMilliseconds = 5000;
+
+  // Soft limit on size of the cluster’s connections read and write buffers.


FWIW I think it's odd that this setting and timeout are hard coded, and IMO this is a deficiency in the API that we should fix. Can you add a TODO around settings that are currently hard coded that we probably want to eventually add API knobs for?

mattklein123 · 2018-08-03T16:47:17Z

source/common/upstream/health_discovery_service.h

+ * The HdsDelegate class is responsible for receiving requests from a management
+ * server with a set of hosts to healthcheck, healthchecking them, and reporting
+ * back the results.
+ */
 class HdsDelegate


Can we add a TODO around adding /config_dump support for this? I think it would be extremely useful to be able to see what hosts we are health checking, like we do for the other xDS endpoints. Also, might consider adding a TODO around whether we want to add health check clusters to the /clusters endpoint to get detailed stats about each HC host. I think that would also be extremely useful for debugging.

markatou · 2018-08-03T20:03:10Z

Thanks! I added the TODOs.

htuch

LGTM modulo a few last nits, ready to ship when fixed.

htuch · 2018-08-05T02:24:23Z

test/common/upstream/hds_test.cc

+  Stats::IsolatedStoreImpl stats_store_;
+  MockClusterInfoFactory test_factory_;
+
+  std::shared_ptr<Upstream::HdsDelegate> hds_delegate_;


Does this really need to be a shared_ptr? Could it just be unique? Shared pointers should only be used when non-trivial ownership semantics are required.
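
A small sketch of the single-owner alternative, assuming the fixture is the only owner of the delegate. `FakeDelegate` and `FakeTestFixture` are hypothetical stand-ins for the real test types.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for the delegate under test.
struct FakeDelegate {
  int requests = 0;
};

// The fixture is the sole owner, so unique_ptr expresses the ownership
// exactly and avoids reference-counting overhead.
class FakeTestFixture {
public:
  FakeTestFixture() : delegate_(new FakeDelegate()) {}

  FakeDelegate& delegate() {
    assert(delegate_ != nullptr);
    return *delegate_;
  }

private:
  std::unique_ptr<FakeDelegate> delegate_;
};

// Unique ownership makes the fixture non-copyable, which documents intent.
static_assert(!std::is_copy_constructible<FakeTestFixture>::value,
              "fixture must not be copyable");
```

If ownership later needs to be shared, a `unique_ptr` can be moved into a `shared_ptr`, whereas the reverse conversion is not possible.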

htuch · 2018-08-05T02:24:55Z

test/common/upstream/hds_test.cc

+  // Creates a HealthCheckSpecifier message that contains one endpoint and one
+  // healthcheck
+  envoy::service::discovery::v2::HealthCheckSpecifier* createSimpleMessage() {
+


Nit: remove empty line.

htuch · 2018-08-05T02:26:19Z

test/common/upstream/hds_test.cc

+}
+
+// Test if processMessage processes healthchecks from a HealthCheckSpecifier
+// message  correctly


Nit: double space

htuch · 2018-08-05T02:26:27Z

test/common/upstream/hds_test.cc

+  }
+}
+
+// Test if processMessage processes healthchecks from a HealthCheckSpecifier


Nit: s/healthchecks/health checks/

htuch

Awesome.

markatou and others added 26 commits June 27, 2018 17:21

Skeleton for HdsCluster

4f344cf

Signed-off-by: Lilika Markatou <lilika@google.com>

Basic HdsCluster, and healthchecker initialization

4b4fd6a

Flaky test Signed-off-by: Lilika Markatou <lilika@google.com>

Endpoint responds to healthcheck request

1c39216

Signed-off-by: Lilika Markatou <lilika@google.com>

Server -> Envoy -> Single Endpoint healthcheck

c46a946

Signed-off-by: Lilika Markatou <lilika@google.com>

Naming changes

a49f8b7

Signed-off-by: Lilika Markatou <lilika@google.com>

Removing EDS dependency from hds test

e0977b1

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressing comments

119c4c8

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressing comments, adding a response message

14af459

Signed-off-by: Lilika Markatou <lilika@google.com>

Process more messages and reply to server every interval

606a22a

Signed-off-by: Lilika Markatou <lilika@google.com>

Added capability to the hdsdelegate's message and cleaned up code a bit

05647d6

Signed-off-by: Lilika Markatou <lilika@google.com>

Minor changes

b3c5fe6

Signed-off-by: Lilika Markatou <lilika@google.com>

Added a todo

d69dd92

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressing some comments

4aad3f8

Signed-off-by: Lilika Markatou <lilika@google.com>

Fixing format

03d02e5

Signed-off-by: Lilika Markatou <lilika@google.com>

No magic numbers

c99fff2

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressed comments

8e7b0b5

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressing comments and an attempt at a unit test

f9ce8f7

Signed-off-by: Lilika Markatou <lilika@google.com>

Added a unit test, and the class HdsInfoFactory to help with testing

2d7b9df

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressing comments

054513e

Signed-off-by: Lilika Markatou <lilika@google.com>

Moving ClusterInfoFactory to a more usual location

b9fc350

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressing comments

c27ed1f

Signed-off-by: Lilika Markatou <lilika@google.com>

Polishing

bcad173

Signed-off-by: Lilika Markatou <lilika@google.com>

Added a unit test for HdsDelegate::sendResponse

8babc5e

Signed-off-by: Lilika Markatou <lilika@google.com>

Added a TODO

f4b0a52

Signed-off-by: Lilika Markatou <lilika@google.com>

Merge branch 'master' into hds_cluster

9d62fe8

Merge remote-tracking branch 'upstream/master' into hds_cluster

b6c737f

Signed-off-by: Lilika Markatou <lilika@google.com>

mattklein123 assigned mattklein123 and htuch Jul 30, 2018

Merge remote-tracking branch 'upstream/master' into hds_cluster

2563fe1

Merge branch 'hds_cluster' of ssh://github.com/markatou/envoy into hd…

6a820f7

…s_cluster

Addressing failing tests

46c63e6

Signed-off-by: Lilika Markatou <lilika@google.com>

htuch suggested changes Jul 31, 2018

Addressing comments

b96da53

Signed-off-by: Lilika Markatou <lilika@google.com>

htuch suggested changes Aug 1, 2018

markatou added 2 commits August 1, 2018 16:25

Merge remote-tracking branch 'upstream/master' into hds_cluster

315f561

Signed-off-by: Lilika Markatou <lilika@google.com>

Addressing comments

bb0a9c6

Signed-off-by: Lilika Markatou <lilika@google.com>

htuch reviewed Aug 1, 2018

Addressing comments

85f0ba7

Signed-off-by: Lilika Markatou <lilika@google.com>

mattklein123 reviewed Aug 3, 2018

Adding TODOs and removing a couple empty lines

2a0ab6f

Signed-off-by: Lilika Markatou <lilika@google.com>

htuch changed the title ~~[WIP] Basic Implementation of HDS~~ Basic Implementation of HDS Aug 5, 2018

htuch suggested changes Aug 5, 2018

Addressing comments

79f1dce

Signed-off-by: Lilika Markatou <lilika@google.com>

htuch approved these changes Aug 6, 2018

htuch merged commit f3b0f85 into envoyproxy:master Aug 6, 2018

zuercher added a commit to turbinelabs/envoy that referenced this pull request Aug 6, 2018

Revert "Basic Implementation of HDS (envoyproxy#3973)"

2abfb48

This reverts commit f3b0f85. Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>

zuercher added a commit that referenced this pull request Aug 6, 2018

Revert "Basic Implementation of HDS (#3973)" (#4063)

4e52589

This reverts commit f3b0f85. Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>

markatou added a commit to markatou/envoy that referenced this pull request Aug 6, 2018

Revert "Revert "Basic Implementation of HDS (envoyproxy#3973)" (envoy…

7dced38

…proxy#4063)" This reverts commit 4e52589. Signed-off-by: Lilika Markatou <lilika@google.com>

htuch pushed a commit that referenced this pull request Aug 7, 2018

Revert "Revert "Basic Implementation of HDS (#3973)" (#4063)" (#4068)

fccaead

Resolving the conflict and reverting the revert. Signed-off-by: Lilika Markatou <lilika@google.com>
