support for hystrix dashboard through an admin endpoint, using sink #3425

trabetti · 2018-05-17T19:27:13Z

admin filter: add support for hystrix dashboard output

Description:
This PR adds an admin endoint for hystrix dashboard output.
When there is a request to admin_port/hystrix_event_stream, envoy starts sending SSE stream, using the same socket of the initial request.
The admin URL can be accessed directly from hystrix dashboard.

Risk Level: Medium

Testing:
Added unit testing hystrix_test.cc - WIP
The admin endpoint was not tested, since it returns an infinite event stream, so current mechanisms for testing admin endpoint do not work. If there are any suggestions on how to test it, we will add.

Docs Changes:

Release Notes:
Support was added to hystrix event stream output. From hystrix dashboard, set the Eureka URL to envoy_ip:8001, and add stream from http://envoy_ip:8001/hystrix_event_stream
Note that not all hystrix dashboard statistics are supported. More detail can be found here: <link to the documentation generated from docs/root/operations/admin.rst >

Fixes #1244
split from PR #2736

…echanism Signed-off-by: trabetti <talis@il.ibm.com>

mattklein123 · 2018-05-17T20:41:07Z

@jmarantz and @mrice32 for first pass

mrice32

Great stuff! A few comments to get started.

mrice32 · 2018-05-17T21:42:17Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  EXPECT_EQ(getStreamField(data_message, "rollingCountTimeout"), "0");
+}
+
+// TEST_F(HystrixSinkTest, BasicFlow) {


Why is this all commented?

The tests were using the old sink interface. I need to update.

mrice32 · 2018-05-18T14:32:59Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+      it = callbacks_list_.erase(it);
+      break;
+    } else {
+      ++it;


nit: can't you just move this ++it up into the for loop statement and get rid of the else?

mrice32 · 2018-05-18T14:33:21Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+void HystrixSink::unregisterConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_remove) {
+  for (auto it = callbacks_list_.begin(); it != callbacks_list_.end();) {
+    if ((*it)->streamId() == callbacks_to_remove->streamId()) {
+      it = callbacks_list_.erase(it);


nit: since you're breaking, can't you just ignore the return value?

mrice32 · 2018-05-18T14:38:29Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  Upstream::ClusterStats& cluster_stats = cluster_info->stats();
+
+  if (cluster_stats_cache_map_.find(cluster_name) == cluster_stats_cache_map_.end()) {
+    // ClusterStatsCache cluster_stats_cache_inst(cluster_name);


Why are these commented?

mrice32 · 2018-05-18T15:26:44Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  if (cluster_stats_cache_map_.find(cluster_name) == cluster_stats_cache_map_.end()) {
+    // ClusterStatsCache cluster_stats_cache_inst(cluster_name);
+    // cluster_stats_cache_map_[cluster_name] = cluster_stats_cache_inst;
+    cluster_stats_cache_map_[cluster_name] = std::make_unique<ClusterStatsCache>(cluster_name);


I'm concerned that a per-cluster cache where elements are only added but never deleted is going to effectively be a memory leak. Also, can you describe what you want to happen if a cluster is deleted and then another cluster is, many flushes later, added under the same name? IIUC, with the current implementation, this will result in that cluster using the old, unrelated cluster's cache. I have a few ideas of how to flush the cache, but before proposing anything, I'd like to understand what the desired behavior is for naming collisions.

Good catch, I did not consider those cases. I think the best would be to detect when a cluster no longer exists and remove from the cache map. What are your ideas?

I haven't thought it through completely, but I think one solution would be to basically check the differences in sizes in the two maps (the cluster map in the server and the sink's cache map) after you finish updating the caches. Since you've already added any missing clusters to the sink's map, it's only possible that the sink map is larger. If it is larger, you can iterate through the sink's map and look up each name in the server's cluster map and delete any that aren't found in the server's map. It's not an incredibly elegant solution, but I'm not sure if there are any better options for trying to sync two unordered maps.

Almost done handling the clusters removal + updating the test. How can I simulate removing and adding back clusters during operation? (not in a test)

mrice32 · 2018-05-18T15:35:35Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+}
+
+void HystrixSink::updateRollingWindowMap(Upstream::ClusterInfoConstSharedPtr cluster_info,
+                                         Stats::Store& stats) {


We should be using each cluster's stats scope to get cluster-specific stats. Otherwise, they will not be deallocated when the cluster is removed.

Not sure I understand the comment about not being deallocated when the cluster is removed. Are you referring to using Upstream::ClusterStats? or the Shared pointer Upstream::ClusterInfoConstSharedPtr?

We should be using Upstream::ClusterInfo::statsScope() and calling counter(stat_name) on that (the name will be different since the scope already assumes the cluster prefix) - see here.

The stat scope system is a little confusing with respect to how it manages memory. When you request a stat from a scope using scope.counters(stat_name), the scope will then ensure the underlying stat stays alive as long as that scope is alive. If we pull the cluster stats from the server scope, that means the stats that we're requesting are required to live as long as the server's stats scope, which is the life of the Envoy instance. If we grab them from the cluster scope, they will be deallocated when the cluster is removed as one would expect for cluster-specific stats.

Is it new interface? It is very good for us, since using this, we won't need to save the concatenated string of the statistic's name per cluster. What about the stats we are currently getting from Upstream::ClusterStats::stats()? Take them also from the Upstream::ClusterInfo::statsScope() instead?

Yup, it should make things much easier. Technically, you could pull any stats in Upstream::ClusterStats::stats() from Upstream::ClusterInfo::statsScope(), but it's better the way you have it. Upstream::ClusterStats::stats() is essentially a struct holding stats from Upstream::ClusterInfo::statsScope() to access them without doing a map lookup.

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-05-30T09:18:40Z

Pushed updated code, with handling the clusters removal + tests. How can I simulate removing and adding back clusters during operation? (not in a test)

mrice32

To try this outside of a test, you would need to use CDS to send an update with a cluster and then a second update without the cluster - see here for an explanation of dynamic updates. You can do this with a management server over gRPC or you can use the filesystem as a source.

mrice32 · 2018-05-31T16:55:17Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  /**
+   * Generate the streams to be sent to hystrix dashboard.
+   */
+  void getClusterStats(absl::string_view cluster_name, uint64_t max_concurrent_requests,


nit: method name is misleading. Can we change to something like writeClusterStats(), appendClusterStats(), or addClusterStatsToStream() (as the methods are named below)?

mrice32 · 2018-05-31T20:46:58Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+}
+
+void HystrixSink::addHystrixThreadPool(absl::string_view cluster_name, uint64_t queue_size,
+                                       uint64_t reporting_hosts, uint64_t rolling_window,


nit: since rolling_window represents a time, but it's represented as an integer, can you change the name to indicate that it's a time - like rolling_window_ms or window_period_ms? As a reader, this was a bit confusing to me.

mrice32 · 2018-05-31T20:51:49Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  }
+
+  std::stringstream ss;
+  for (auto& cluster : clusters) {


I think this series of operations could be simplified a bit. Could we combine this loop with the updateRollingWindowMap loop above? We could also move the cache cleanup to the very end (since this is probably the least time sensitive operation). Then we can do a single lookup of the cache for a particular cluster and pass it through to updateRollingWindowMap and getClusterStats.

mrice32 · 2018-05-31T21:08:56Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  // the lookup.
+  ClusterStatsCache(const std::string& cluster_name);
+
+  std::string errors_name_;


Are these names just here for the purpose of testing and logging? I don' think the full stat names are still needed since we're looking the stats up within the cluster scope. It appears that they're only used in printRollingWindows(), which seems to only be used in tests and log messages? Not sure how others feel, but I think we should probably avoid storing this state if it's just for logging and testing. If we want to keep that method for logging and testing, we could just let it construct whatever preferred stat name is needed on the fly.

Pushed new version. Please see if it seems more reasonable now. Another option is to totally remove it. It is there just for logging. Not needed for testing.

mrice32 · 2018-05-31T21:22:22Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  response_headers.insertNoChunks().value().setReference("0");
+
+  Http::StreamDecoderFilterCallbacks& stream_decoder_filter_callbacks =
+      const_cast<Http::StreamDecoderFilterCallbacks&>(admin_stream.getDecoderFilterCallbacks());


Why do we need to const cast this? If this interface returns a const StreamDecoderFilterCallbacks&, I think there's something wrong with it. StreamDecoderFilterCallbacks has no const methods, which means a const one would essentially be useless, right? I think this should be updated to return a non-const StreamDecoderFilterCallbacks&.

We wanted to return only const refs when we worked on the AdminStream. It forced me to use the const_cast for the hystrix implementation. We overlooked the fact there are no const methods on it anyway.

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-06-07T11:26:37Z

Hi, any feedback on last commit? Thanks!

mrice32

This is looking great! I left a few more nits, but I think this is getting close. I haven't had a chance to go through the tests, yet. @jmarantz, would you mind taking a pass, especially through the tests?

mrice32 · 2018-06-07T13:53:42Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+
+void HystrixSink::flush(Stats::Source&) {
+  if (callbacks_list_.empty())
+    return;


nit: I think we generally prefer:

if (callbacks_list_.empty()) { return; }

mrice32 · 2018-06-07T14:09:08Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  Upstream::ClusterStats& cluster_stats = cluster_info->stats();
+  Stats::Scope& cluster_stats_scope = cluster_info->statsScope();
+
+  if (cluster_stats_cache_map_.find(cluster_name) == cluster_stats_cache_map_.end()) {


nit: you could avoid the redundant lookup by doing something like:

std::unique_ptr<ClusterStatsCache>& cluster_stats_cache_ptr = cluster_stats_cache_map_[cluster_name]; if (cluster_stats_cache_ptr == nullptr) { cluster_stats_cache_ptr = std::make_unique<ClusterStatsCache>(cluster_name); } // Use cluster_stats_cache_ptr here

mrice32 · 2018-06-07T14:35:48Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+    Upstream::ClusterInfoConstSharedPtr cluster_info = cluster.second.get().info();
+
+    // update rolling window with cluster stats
+    updateRollingWindowMap(cluster_info);


nit: why not just do the cache lookup based on cluster name out here and pass a reference to the cache into both methods?

IIUC - you mean to move the std::unique_ptr<ClusterStatsCache>& cluster_stats_cache_ptr = cluster_stats_cache_map_[cluster_name]; up to flush()? Please see the new version , is this what you had in mind?

mrice32 · 2018-06-07T14:41:03Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+
+void HystrixSink::addClusterStatsToStream(absl::string_view cluster_name,
+                                          uint64_t max_concurrent_requests,
+                                          uint64_t reporting_hosts, uint64_t rolling_window_ms,


nit: Sorry for being indecisive on this, but could we instead just pass the rolling_window_ms around as type std::chrono::milliseconds rather than a uint64_t? This means that if the flush interval that you're accessing starts being represented as std::chrono::microseconds or std::chrono::seconds, you'll see a compile-time error. I know this is unlikely, but we've had problems with outdated .count() interpretations in the past.

mrice32 · 2018-06-07T14:46:57Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  // check if any clusters were removed, and remove from cache
+  if (clusters.size() < cluster_stats_cache_map_.size()) {
+    std::vector<std::string> clusters_to_remove;
+    for (auto& itr : cluster_stats_cache_map_) {


nit: this could be made a bit simpler/faster by doing the following:

for (auto it = cluster_stats_cache_map_.begin(); it != cluster_stats_cache_map_.end();) { if (clusters.find(it.first) == clusters.end()) { it = cluster_stats_cache_map_.erase(it); } else { ++it; } }

Signed-off-by: trabetti <talis@il.ibm.com>

jmarantz

flushing comments, some of which were pending for the last month :) I'm going to look at the tests next.

jmarantz · 2018-05-18T13:12:46Z

source/common/http/headers.h

@@ -208,6 +211,15 @@ class HeaderValues {
    const std::string AcceptEncoding{"Accept-Encoding"};
    const std::string Wildcard{"*"};
  } VaryValues;
+
+  struct {
+    const std::string AccessControlAllowHeadersHystrix{


Doesn't seem like you need to prefix this variable-name with 'AccessControlAllow', since that's in the struct name.

jmarantz · 2018-05-18T13:14:51Z

source/common/http/http1/codec_impl.cc

+  // chunk transfer encoding but we don't have a content-length so we pass "envoy only"
+  // header to avoid adding chunks
+  if (headers.NoChunks()) {
+    no_chunks = true;


why not just put this in the initializer on line 47?

const bool no_chunks = headers.NoChunks();

actually there's just a single-use, maybe you can just write

if (saw_content_length || headers.NoChunks()) {

below, and skip the temp?

jmarantz · 2018-05-18T13:29:23Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  init();
+}
+
+HystrixSink::HystrixSink(Server::Instance& server)


looking at the call-sites you could kill this version, and have the 2-arg ctor init like this:

: server_(server), current_index_(num_of_buckets ? num_of_buckets : DEFAULT_NUM_OF_BUCKETS_), window_size_(current_index_ + 1) {

That would simplify call-sites.

jmarantz · 2018-06-08T12:46:13Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  void onHistogramComplete(const Stats::Histogram&, uint64_t) override{};
+
+  /**
+   * register a new connection


nit: s/register/Registers/ , and specify doxygen for args per convention, here & below.

and below
a/remove/Removes/, s/Add/Adds/, etc.

jmarantz · 2018-06-08T12:50:18Z

source/extensions/stat_sinks/hystrix/hystrix.h

+                            std::stringstream& ss);
+
+  // HystrixStatCachePtr stats_;
+  std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{};


not necessary to initialize an empty vector; just leave out the {}

jmarantz · 2018-06-08T12:50:33Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  static const uint64_t DEFAULT_NUM_OF_BUCKETS = 10;
+
+  // Map from cluster names to a struct of all of that cluster's stat windows.
+  std::unordered_map<std::string, ClusterStatsCachePtr> cluster_stats_cache_map_{};


jmarantz

OK left some test comments. The tests look really comprehensive because they are long and the names of the tests seem to cover the sort of functionality that I'd expect.

But it would be easier to follow if you factored out common blocks into shared helper methods.

jmarantz · 2018-06-08T13:12:39Z

source/extensions/stat_sinks/hystrix/hystrix.h

+
+  std::string cluster_name_;
+
+  // Rolling windows


move member vars below the methods.

jmarantz · 2018-06-08T13:14:33Z

source/extensions/stat_sinks/hystrix/hystrix.h

+typedef std::vector<uint64_t> RollingWindow;
+typedef std::map<const std::string, RollingWindow> RollingStatsMap;
+
+struct ClusterStatsCache {


I think this should be a class not a struct.

https://google.github.io/styleguide/cppguide.html#Structs_vs._Classes

But if this is a clear pattern in Envoy it's OK to leave it as a struct, in which case you don't need to suffix the member vars with _

Actually by that definition, it seems to me that it should be a struct and not a class. It passively holds the rolling windows, which are manipulated by the sink's methods. What do you think?

OK, that's fine; I guess the methods are just printers. And actually the _ suffix is called out by STYLE.md as a local rule. Leave as is then :)

jmarantz · 2018-06-08T13:16:56Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+    absl::string_view::size_type key_pos = dataMessage.find(key);
+    EXPECT_NE(absl::string_view::npos, key_pos);
+    absl::string_view trimDataBeforeKey = dataMessage.substr(key_pos);
+    key_pos = trimDataBeforeKey.find(" ");


trim_data_before_key, trim_data_after_value.

jmarantz · 2018-06-08T13:18:25Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+    // set callbacks to send data to buffer
+    Buffer::OwnedImpl buffer;
+    auto encode_callback = [&buffer](Buffer::Instance& data, bool) {
+      buffer.add(


I'd put the comment above the call to buffer.add(data); -- I think it'll format more nicely that way.

jmarantz · 2018-06-08T13:19:00Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+
+  Buffer::OwnedImpl createClusterAndCallbacks() {
+
+    // set cluster


Set cluster. (sentences start with a capital letter, end with a period). Here & below.

jmarantz · 2018-06-08T13:23:03Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  //  //std::string rolling_map = sink_->getStats().printRollingWindow();
+  std::string rolling_map = sink_->printRollingWindows();
+  std::size_t pos = rolling_map.find("test_cluster1.total");
+  EXPECT_NE(std::string::npos, pos);


nit: you could do this in one line:

EXPECT_NE(std::string::npos, rolling_map.find("test_cluster1.total"));

the advantage is that if it fails the test error message would be a little easier to debug because it would include the source of the find call.

jmarantz · 2018-06-08T13:24:45Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  std::size_t pos1 = data_message.find(cluster1_name_);
+  std::size_t pos2 = data_message.find(cluster2_name);
+  EXPECT_NE(std::string::npos, pos1);
+  EXPECT_NE(std::string::npos, pos2);


probably want to make these ASSERT_NE because you don't want to use relational operators below on pos1 or pos2 if they are npos.

jmarantz · 2018-06-08T13:28:16Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  EXPECT_EQ(getStreamField(data_message_2, "requestCount"), "0");
+  EXPECT_EQ(getStreamField(data_message_2, "rollingCountSemaphoreRejected"), "0");
+  EXPECT_EQ(getStreamField(data_message_2, "rollingCountSuccess"), "0");
+  EXPECT_EQ(getStreamField(data_message_2, "rollingCountTimeout"), "0");


My general comment here is that I can only peephole review the tests because they are so long, so it's hard for me to tell what's going on. Can you factor out some common blocks into helpers? E.g. this pattern of checking that a specific bunch of values in a message are 0 is repeated 3x in this file. Factor that out into a common method. This is also the case likely with a bunch of other stanzas, such as as setting up a bunch of mocks.

Thanks for pointing it out. I am factoring out a bunch of stuff. Will push once ready.
Problem with helper method for the checkers is, as you pointed out in the past, it would be hard to debug error messages from lines inside the function. But the test code would be more readable. I can group most of the checkers into readable methods. Any suggestion here?

jmarantz · 2018-06-08T13:28:21Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  pos1 = data_message.find(cluster1_name_);
+  pos2 = data_message.find(cluster2_name);
+  EXPECT_NE(std::string::npos, pos1);
+  EXPECT_NE(std::string::npos, pos2);


jmarantz · 2018-06-08T13:30:11Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  std::string data_message = TestUtility::bufferToString(buffer);
+
+  std::size_t pos1 = data_message.find(cluster1_name_);
+  std::size_t pos2 = data_message.find(cluster2_name);


I wonder if this pattern of using find on the cluster-names and this dance with the substr etc could be done easier with absl::StrSplit. That would be the case if there were a known delimiter between the clusters.

There are known delimiters, but the order of the clusters is not guaranteed, so we'll need to add more actions to find which is which. (or otherwise, add sorting in the code that produces the stream)

A helper method to populate & return a map would be be useful to remove this kind of complicated pattern of finding substring delimiters in this way. In addition to require a handful of lines that require some scrutiny, the current pattern could fail if either cluster-name turned out to be a substring of some other syntax in the data.

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-06-10T19:36:21Z

New version with most review comments taken care of. Factored out a lot in the test. I am in doubt about factoring checkers into functions, because it would be hard to debug failures.

jmarantz · 2018-06-10T23:10:07Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  ON_CALL(cluster_manager_, clusters()).WillByDefault(Return(cluster_map_));
+  sink_->flush(source_);
+
+  cluster_message_map.clear();


you don't need this, as on the next line you kill the previous value completely.

jmarantz · 2018-06-10T23:10:52Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+
+  // Generate data to both clusters.
+  sink_->flush(source_);
+  for (int i = 0; i < 12; i++) {


is 12 arbitrary? I feel like some of the golden values you are comparing to below are somehow related to the constants here, but it's not super-obvious what the relationship is. E.g. I see a few cases throughout this file where rollingCountSemaphoreRejected is tested against some expected constant, but I can't tell why that constant is expected. E.g. why is errorCount 160?

Can we at least have comments explaining the relationship?

Some of the values are arbitrary. We set some values to simulate the statistics and then we calculate the expected values in the stream. I was thinking we could do that in a function that will set the counter values and then calculate expected results and check them, this will make things more obvious. This function can be used in many places throughout the test. But again, I have the same concern that when the function fails, it will be hard to debug.

I'm not following you why failures in a function are hard to debug.

Do you mean that if there's an EXPECT failure in a helper function, you might not see the whole call-stack in a test log? I agree it's awesome when you can diagnose and fix failures purely based on a test log. However, I don't think it justifies flattening and inlining what logically should be functions.

You can also make test logs directly actionable by passing enough context to the function to print, but usually the clutter is not worth it.

And you can actually debug tests with gdb, in which case a function does not present any barrier to debuggability. Note that debugging envoy unit tests with gdb is not super-obvious even if you are otherwise familiar with gdb, but there is a helper script:

tools/bazel-test-gdb //test/common/common:utility_test -c dbg

And the high order bit here, is: this test is filled with magic constants and it's not clear why they are the right ones.

ok, I'll put them in functions and then we'll look and see if it can be improved for debugging.

jmarantz · 2018-06-10T23:46:39Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+
+  std::unordered_map<std::string, std::string> cluster_message_map =
+      buildClusterMap(TestUtility::bufferToString(buffer));
+  ASSERT_NE(cluster_message_map.find(cluster1_name_), cluster_message_map.end());


per current ASAN failure, recommend adding

<< "cluster1_name=" << cluster1_name_;

to this ASSERT_NE, and similar below.

this is still pending.

I don't understand you suggestion. ASAN is not failing anymore.

My suggestion was not just to fix your ASAN problem, but also generally to make EXPECT/ASSERT failures actionable, should they occur again in the future.

When checking map/set membership in an ASSERT/EXPECT the test infra will show the code and the actuals, but not automatically show the args, so it would be aid debugging to stream in the value as I showed. This is particularly important if the failures may be flaky as they might not show on repro, so you want to structure your EXPECT/ASSERT calls to get something actionable in the log.

mrice32

A few more minor comments.

mrice32 · 2018-06-11T19:21:19Z

api/envoy/config/metrics/v2/stats.proto

+// Stats configuration proto schema for built-in *envoy.hystrix* sink.
+// The sink emits stats in SSE format for use with Hystrix dashboard
+message HystrixSink {
+  int64 num_of_buckets = 1;


nit: can we just shorten to num_buckets here and throughout?

mrice32 · 2018-06-11T19:24:32Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+}
+
+HystrixSink::HystrixSink(Server::Instance& server, const uint64_t num_of_buckets)
+    : // stats_(new HystrixStatCache(num_of_buckets)),


nit: can this commented line be removed?

mrice32 · 2018-06-11T19:31:45Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  }
+}
+
+void HystrixSink::updateRollingWindowMap(Upstream::ClusterInfoConstSharedPtr cluster_info,


nit: can we just pass this as a const ClusterInfo& (the result of *cluster_info) to save a shared_ptr copy?

mrice32 · 2018-06-11T19:45:10Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+                                    std::chrono::milliseconds rolling_window_ms,
+                                    std::stringstream& ss) {
+
+  std::stringstream cluster_info;


Why create this variable? Why not just append everything to ss directly? (Same comment for the other places where this is done).

mrice32 · 2018-06-11T19:48:40Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+                                          absl::string_view cluster_name,
+                                          uint64_t max_concurrent_requests,
+                                          uint64_t reporting_hosts,
+                                          std::chrono::milliseconds rolling_window_ms,


nit: this one is my bad. I think we can remove the _ms from this variable now that the type tells us that it's milliseconds. Sorry about that!

I left it to distinguish from rolling_window of RollingWindow type. Here and in some other places: https://github.com/trabetti/envoy/blob/d464f8c1f9de8eb21775ee1768006008543a21c6/source/extensions/stat_sinks/hystrix/hystrix.h#L57

Okay, sounds good. Not a huge deal to me either way.

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-06-11T21:06:12Z

Pushed a new version. @jmarantz Does it look like the right direction in terms of factoring out the tests and making them more readable? It still looks quiet messy.. 😕

mrice32

LGTM - after jmarantz@'s test concerns are addressed.

jmarantz · 2018-06-13T16:42:27Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+    Buffer::OwnedImpl buffer;
+    auto encode_callback = [&buffer](Buffer::Instance& data, bool) {
+      // Set callbacks to send data to buffer.
+      buffer.add(


this still looks awkward. I was thinking of:

// Set callbacks to send data to buffer. This will append to the end of the // buffer, so multiple calls will all be dumped one after another into this buffer. buffer.add(data);

jmarantz · 2018-06-13T16:49:18Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+                   timeout_retry_step, rejected_step, window_size_);
+  validate_results(cluster_message_map[cluster2_name], success_step2,
+                   error_4xx_step2 + error_4xx_retry_step2 + error_5xx_step2 +
+                       error_5xx_retry_step2,


what's the reason not to just make a single constant for the sum?

Need each one to set values to counters, somehow forgot to use these after defining them...

jmarantz · 2018-06-13T16:50:43Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  InSequence s;
+
+  // Arbitrary numbers for testing. Make sure error > timeout.
+  uint64_t success_step = 7;


nit: make these const.

I'm not sure I believe these are completely arbitrary though; I think some must be dependent on others. Maybe comment those inline?

e.g.

uint64_t error_step = 17; // must be larger than xxxx + yyyyy + ....

it seems unlikely that error > timeout is sufficient, right? If it just had to be > timeout I'd have expected you to set it to 2.

Well, they are. Will set them to running values so it will be more obvious. And there aren't many restrictions on the actual values for the calculation purposes.

jmarantz · 2018-06-13T16:51:22Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  for (uint64_t i = 0; i < (window_size_ + 1); i++) {
+    buffer.drain(buffer.length());
+    // Cluster 1
+    ON_CALL(cluster1_.error_5xx_counter_, value()).WillByDefault(Return((i + 1) * 17));


is the 17 here the same thing as error_step? If so, use the constant?

Right, forgot to use all those values after I defined them...

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-06-13T19:51:52Z

Hope we're approaching the end... What should be done in terms of documentation?

jmarantz · 2018-06-13T23:39:23Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  uint64_t error_step = 4;
+  uint64_t timeout_step = 3;
+  uint64_t timeout_retry_step = 2;
+  uint64_t rejected_step = 1;


If these are important relative to one another, you could make that clearer in the code in a few ways: e.g. specify them in numeric order, or maybe even assign them from a post-increment accumulator, e.g.

uint64_t accum = 1; uint64_t rejected_step = accum++; uint64_t timeout_retry_step = accum++;

etc

But I'm not sure if that's better than explicit numbers.

Make them const though.

I'll leave them as explicit numbers. Changed some values to make it more random. These numbers should not be dependent on each other, they represent traffic. The only constraint is errors>timeouts, since Envoy counts timeouts together with errors, and Hystrix needs the values separated.

In fact, I shouldn't be using "step", but instead randomly increasing values. I have added randomly increasing values to basic test.

OK looking forward to the refactored version. Hopefully much smaller, and with numbers that are more transparent in their meaning.

jmarantz · 2018-06-13T23:40:03Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+    return buffer;
+  }
+
+  void validate_results(std::string data_message, uint64_t success_step, uint64_t error_step,


validateResults(const std::string& data_message, ...

That's style-guide for envoy functions, and you almost never want to pass std::string by value.

jmarantz · 2018-06-13T23:44:15Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+}
+
+TEST_F(HystrixSinkTest, AddAndRemoveClusters) {
+  InSequence s;


This test has improved a lot. Thank you! It is still a bit long. I'm not sure if it could be broken up into multiple tests. If so, that'd be my first suggestion. I think there's a notion here that the latter part of the test depends on the former, but there's really some distinct tests. here.

It's not obvious whether there are repeating patterns to factor out, but maybe it should be broken into single-call helpers anyway, just for legibility.

A search on this topic yielded https://refactoring.guru/extract-method

One thing that comes to mind is to break this into 2 tests with one shared helper.

TEST_F(HysterixSinkTest, AddClusters) { AddHelper(); EXPECT that add worked } TEST_F(HystrixSinkTest, RemoveClusters) { AddHelper(); RemoveHelper(); EXPECT that remove worked. }

I think it's OK to outline RemoveHelper even though it's single use, for brevity and parallelism.

WDYT?

Thanks for the link, I learned.
Separating into two tests - I can do it. But it means that some actions will be done twice in both tests - I want the cache to be dirty, so AddHelper is not just setting up stuff.
In the new version (pushing soon) I attempted to factor out some stuff in the spirit of the link you sent, and I was also able to move away some unnecessary actions from the clusters test. please see how it looks now, and if it still looks two much, I can split into to test as you suggested.

I think splitting them up would help readability.

I'm not sure why there's an issue with the cache being dirty though. Can you explain?

The idea is to see that if a cluster that was active is removed, and then a new cluster with same name is added, there are no leftover data from the previous cluster (it was an actual bug I fixed)

Sounds good. So can you split it up help readability? My suggestion does not change flow of the test you have here, just makes it easier to read (and additionally adds a simpler test).

Signed-off-by: trabetti <talis@il.ibm.com>

jmarantz

Nice! Thank you for the clear comments, and the tests are much more readable now.

Just a few nits about the comments themselves left.

jmarantz · 2018-06-25T12:30:02Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  // Values are unimportant - they represent traffic statistics, and for the purpose of the test any
+  // arbitrary number will do. Only restriction is that errors >= timeouts, since in Envoy timeouts
+  // are counted as errors and therefore the code that prepares the stream for the dashboard reduces
+  // number of timeouts from total number of errors.


nit: s/reduces/deducts the/

You could use either "reduce ... by" or "deducts ... from".

jmarantz · 2018-06-25T12:34:28Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+public:
+  HystrixSinkTest() { sink_.reset(new HystrixSink(server_, window_size_)); }
+
+  // Return the value corresponding to key in an event stream.


Could you just put into a comment a sample line that you are parsing here?

jmarantz · 2018-06-25T12:36:02Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+    EXPECT_NE(absl::string_view::npos, key_pos);
+    absl::string_view trim_data_before_Key = data_message.substr(key_pos);
+    key_pos = trim_data_before_Key.find(" ");
+    EXPECT_NE(absl::string_view::npos, key_pos);


Probably for all these EXPECTs it would be good to add

<< "data_message=" << data_message;

In case someone besides you runs into this, that would let them know why the data didn't look like what was expected.

jmarantz · 2018-06-25T12:37:29Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  // Return a string without quotes.
+  absl::string_view getStringStreamField(absl::string_view data_message, absl::string_view key) {
+    absl::string_view value = getStreamField(data_message, key);
+    return value.substr(1, value.length() - 2);


is it worth EXPECTing that the first & last chars are quotes?

Actually, with a minor fix, the stream is JSON format. is it worth to use JSON utils to get the value instead of this ugly function?

Yeah that sounds great. I'd also be OK with that as a follow-up.

jmarantz · 2018-06-25T12:44:26Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  }
+
+  TestRandomGenerator rand_;
+  uint64_t window_size_ = rand_.random() % 10 + 5; // Arbitrary reasonable number.


nice! I didn't realize this but, but I just poked around, and this both echoes the seed, and provides a way for you to re-run with the same seed.

mrice32 · 2018-06-25T15:28:06Z

@mattklein123, I think @jmarantz and I are about done with our comments, so I think this is ready for a pass by a maintainer.

mattklein123 · 2018-06-25T15:29:29Z

Thanks will take a look.

mattklein123

Cool stuff! I just skimmed through and everything looks sane to me. Just a few small comments. Thank you!

mattklein123 · 2018-06-25T17:40:35Z

source/common/http/headers.h

@@ -212,6 +215,14 @@ class HeaderValues {
    const std::string AcceptEncoding{"Accept-Encoding"};
    const std::string Wildcard{"*"};
  } VaryValues;
+
+  struct {
+    const std::string AllowHeadersHystrix{"Accept, Cache-Control, X-Requested-With, Last-Event-ID"};


nit: I'm not sure I would define this in this file as it is specific to an extension. Can you move this to somewhere in your extension code?

mattklein123 · 2018-06-25T17:40:49Z

docs/root/operations/admin.rst

+
+  .. _operations_admin_interface_hystrix_event_stream:
+
+.. http:get:: /hystrix_event_stream


Please add a release note also.

mattklein123 · 2018-06-25T17:42:14Z

source/extensions/stat_sinks/hystrix/hystrix.h

+   * Register a new connection.
+   */
+  void registerConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_register);
+  /**


nit: newline before this line

mattklein123 · 2018-06-25T17:43:11Z

source/extensions/stat_sinks/well_known_names.h

@@ -18,6 +18,8 @@ class StatsSinkNameValues {
  const std::string DOG_STATSD = "envoy.dog_statsd";
  // MetricsService sink
  const std::string METRICS_SERVICE = "envoy.metrics_service";
+  // Hystrix sink
+  const std::string HYSTRIX = "envoy.hystrix";


See Note above, please call this "envoy.stat_sinks.hystrix" (and update any necessary docs)

Signed-off-by: trabetti <talis@il.ibm.com>

…trix_dashboard_support_wip_sync Signed-off-by: trabetti <talis@il.ibm.com>

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-06-25T21:07:56Z

I had to merge to master, to get updated release notes file. Now it is not compiling for me (compiling with docker). I pushed it anyway to see if it passes compilation here.

mattklein123 · 2018-06-26T01:43:58Z

docs/root/intro/version_history.rst

@@ -11,6 +11,8 @@ Version history
  Support for the legacy proto :repo:`source/common/ratelimit/ratelimit.proto` is deprecated and will be removed at the start of the 1.9.0 release cycle.
 * tracing: added support for configuration of :ref:`tracing sampling
  <envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.tracing>`.
+* admin: added :http:get:`/hystrix_event_stream` as an endpoint for monitoring envoy's statistics 


please alpha order

…trix_dashboard_support_wip_sync Signed-off-by: trabetti <talis@il.ibm.com>

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-06-26T08:06:52Z

@jmarantz I changed the test to use json utils, maybe you can take a look

mattklein123 · 2018-06-27T20:06:07Z

@trabetti sorry needs another master merge. Please ping me immediately after it passes tests and I will merge!

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-06-28T13:01:50Z

@mattklein123 all tests have passed

mattklein123

Awesome to get this in! Please do a blog post with a demo. ;)

moderation · 2018-06-28T14:39:14Z

source/extensions/extensions_build_config.bzl

@@ -77,6 +77,7 @@ EXTENSIONS = {
    "envoy.stat_sinks.dog_statsd":                      "//source/extensions/stat_sinks/dog_statsd:config",
    "envoy.stat_sinks.metrics_service":                 "//source/extensions/stat_sinks/metrics_service:config",
    "envoy.stat_sinks.statsd":                          "//source/extensions/stat_sinks/statsd:config",
+    "envoy.stat_sinks.hystrix":                         "//source/extensions/stat_sinks/hystrix:config",


@trabetti Small comment - the hystrix entry is the only one in the list that is not alphabetically sorted within the section. For consistency it should be below dog_statsd and above metrics_service

opened #3752

Adding the latency information from existing histogram to Hystrix event stream that was added in #3425 Risk Level: Low Testing: Docs Changes: admin.rst Release Notes: No addition to release note of original Hystrix stream PR, which is still pending for 1.8.0 Fixes #3753 Signed-off-by: trabetti <talis@il.ibm.com>

support for hystrix dashboard through an admin endpoint, using sink m…

16faae5

…echanism Signed-off-by: trabetti <talis@il.ibm.com>

trabetti mentioned this pull request May 17, 2018

support for hystrix event stream output #2736

Closed

mattklein123 assigned jmarantz and mrice32 May 17, 2018

mrice32 reviewed May 18, 2018

View reviewed changes

remove cluster from cache when it is removed. added testing.

5de68a5

Signed-off-by: trabetti <talis@il.ibm.com>

mrice32 reviewed May 31, 2018

View reviewed changes

corrections in response to review comments

37db887

Signed-off-by: trabetti <talis@il.ibm.com>

mrice32 reviewed Jun 7, 2018

View reviewed changes

imporvements around cluster_stats_cache_map_

2f80b0c

Signed-off-by: trabetti <talis@il.ibm.com>

jmarantz reviewed Jun 8, 2018

View reviewed changes

trabetti added 2 commits June 10, 2018 09:54

fixes according to review comments

32f5226

Signed-off-by: trabetti <talis@il.ibm.com>

factor out repeating behaviors to class/functions

d464f8c

Signed-off-by: trabetti <talis@il.ibm.com>

jmarantz reviewed Jun 10, 2018

View reviewed changes

mrice32 reviewed Jun 11, 2018

View reviewed changes

review comments and factoring out more in tests

d743fb1

Signed-off-by: trabetti <talis@il.ibm.com>

mrice32 previously approved these changes Jun 13, 2018

View reviewed changes

jmarantz reviewed Jun 13, 2018

View reviewed changes

bugfix in print. improved code on callbacks disconnect. improved tests.

1a842f0

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti dismissed mrice32’s stale review via 1a842f0 June 13, 2018 19:49

jmarantz reviewed Jun 13, 2018

View reviewed changes

split AddAndRemoveCluster test to help readability

43901b9

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti dismissed mrice32’s stale review via 43901b9 June 25, 2018 12:20

jmarantz previously approved these changes Jun 25, 2018

View reviewed changes

mrice32 previously approved these changes Jun 25, 2018

View reviewed changes

mattklein123 self-assigned this Jun 25, 2018

mattklein123 reviewed Jun 25, 2018

View reviewed changes

trabetti added 3 commits June 25, 2018 22:23

changed test to use json utils, fix some comments

a61bcc0

Signed-off-by: trabetti <talis@il.ibm.com>

Merge branch 'master' of https://github.com/envoyproxy/envoy into hys…

5652687

…trix_dashboard_support_wip_sync Signed-off-by: trabetti <talis@il.ibm.com>

added release notes

82adc91

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti dismissed stale reviews from mrice32 and jmarantz via 82adc91 June 25, 2018 21:06

mattklein123 reviewed Jun 26, 2018

View reviewed changes

trabetti added 2 commits June 26, 2018 09:36

Merge branch 'master' of https://github.com/envoyproxy/envoy into hys…

b6e7707

…trix_dashboard_support_wip_sync Signed-off-by: trabetti <talis@il.ibm.com>

alpha order version history

43ff61b

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti added 2 commits June 28, 2018 13:26

merge master

3999180

Signed-off-by: trabetti <talis@il.ibm.com>

use buffer.toString()

4b4248d

Signed-off-by: trabetti <talis@il.ibm.com>

mattklein123 approved these changes Jun 28, 2018

View reviewed changes

mattklein123 merged commit 9c27339 into envoyproxy:master Jun 28, 2018

moderation reviewed Jun 28, 2018

View reviewed changes

trabetti deleted the hystrix_dashboard_support branch June 28, 2018 20:52

This was referenced Jun 28, 2018

move hystrix entry to keep the section alphabetically sorted #3752

Merged

Add missing timing information to Hystrix event stream output #3753

Closed

trabetti mentioned this pull request Jul 30, 2018

added histogram latency information to Hystrix dashboard stream #3986

Merged


		Buffer::OwnedImpl createClusterAndCallbacks() {

		// set cluster


		.. _operations_admin_interface_hystrix_event_stream:

		.. http:get:: /hystrix_event_stream

support for hystrix dashboard through an admin endpoint, using sink #3425

support for hystrix dashboard through an admin endpoint, using sink #3425

Conversation

trabetti commented May 17, 2018

Uh oh!

mattklein123 commented May 17, 2018

Uh oh!

mrice32 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 May 18, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 May 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

trabetti commented May 30, 2018

Uh oh!

mrice32 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

trabetti commented Jun 7, 2018

Uh oh!

mrice32 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 Jun 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 May 18, 2018 •

edited

Loading

mrice32 May 22, 2018 •

edited

Loading

mrice32 Jun 7, 2018 •

edited

Loading

jmarantz Jun 8, 2018 •

edited

Loading