support for hystrix event stream output #2736

trabetti · 2018-03-06T12:32:04Z

admin filter: add support for hystrix dashboard output

Description:
This PR adds an admin endoint for hystrix dashboard output.
When there is a request to admin_port/hystrix_event_stream, envoy starts sending SSE stream, using the same socket of the initial request.
The admin URL can be accessed directly from hystrix dashboard.

Risk Level: Medium

Testing:
Added unit testing hystrix_test.cc
The admin endpoint was not tested, since it returns an infinite event stream, so current mechanisms for testing admin endpoint do not work. If there are any suggestions on how to test it, we will add.

Docs Changes:
changed docs/root/operations/admin.rst.
PR: envoyproxy/data-plane-api#523

Release Notes:
Support was added to hystrix event stream output. From hystrix dashboard, set the Eureka URL to envoy_ip:8001, and add stream from http://envoy_ip:8001/hystrix_event_stream
Note that not all hystrix dashboard statistics are supported. More detail can be found here: <link to the documentation generated from docs/root/operations/admin.rst >

Fixes #1244

Signed-off-by: talis <talis@il.ibm.com>

Signed-off-by: trabetti <talis@il.ibm.com>

…-support Signed-off-by: trabetti <talis@il.ibm.com>

Signed-off-by: trabetti <talis@il.ibm.com>

mattklein123

cool stuff. Few drive by comments from a quick skim (not a full review).

mattklein123 · 2018-03-06T23:34:37Z

source/server/http/admin.cc

@@ -575,13 +574,66 @@ std::string AdminImpl::runtimeAsJson(
  return strbuf.GetString();
 }

+Http::Code AdminImpl::handlerHystrixEventStream(const std::string&,


Quick drive by comment. This file is getting huge and will continue to get huger. Can you put all the new Hystrix admin stuff into a dedicated file, potentially in a new admin directory? We will need to start decomping the rest of this file also as we add more stuff to it.

moved hystrix stuff from admin.cc - commit a05aea1

mattklein123 · 2018-03-06T23:35:41Z

source/common/http/http1/codec_impl.cc

@@ -69,12 +70,16 @@ void StreamEncoderImpl::encodeHeaders(const HeaderMap& headers, bool end_stream)
    saw_content_length = true;
  }

+  if (headers.NoChunks()) {


Can you say a little more about why this is needed? This is a bit odd. I assume it's because you don't want chunk transfer encoding but you don't have a content-length? Is there any way to handle this without this kind of hack?

Your assumption is right. I saw that the ":" wes created exactly to allow such kind of hook and it is being used in some other places. Alternatives could be (1) changing the interface of encodeHeaders(), which is much more intrusive since it is all over, or (2) building the response without using encodeHeaders(). What do you recommend?

I'm not crazy about it, but it's probably the least bad option for doing this. If we go this route can you please add liberal comments on why this is here, that it is a non-standard header, etc.? @alyssawilk WDYT?

mrice32

This is really cool! Thanks! Did a quick first pass. Looks good for the most part.

mrice32 · 2018-03-06T22:23:53Z

include/envoy/server/admin.h

+class HystrixHandlerInfo : public HandlerInfo {
+public:
+  HystrixHandlerInfo(Http::StreamDecoderFilterCallbacks* callbacks)
+      : stats_(new Stats::Hystrix()), data_timer_(nullptr), ping_timer_(nullptr),


nit: prefer std::make_unique<Stats::Hystrix>() here and throughout (this may be common elsewhere in Envoy, but we recently moved to C++14).

mrice32 · 2018-03-06T22:31:06Z

include/envoy/server/admin.h

+ * This class contains data which will be sent from admin filter to a hystrix_event_stream handler
+ * and build a class which contains the relevant data.
+ */
+class HystrixHandlerInfo : public HandlerInfo {


Why is this class under include/? Generally, in Envoy, only pure interfaces with no implementation go under include/, and all implementations of those interfaces (even if there's just one) go under source/.

mrice32 · 2018-03-06T22:35:59Z

include/envoy/server/admin.h

+ * in admin impl. Each handler which needs to receive data from admin filter can inherit from
+ * HandlerInfo and build a class which contains the relevant data.
+ */
+class HandlerInfo {


Could this be separated into a pure interface here and a default implementation under source/? Generally within Envoy, classes declared under include/ are pure interfaces with all PURE (= 0) methods. Even the default implementations are generally separated from the interface and placed under source/

fixed by commit 61659a3

As indicated in another comment thread, we can eliminate HandlerInfo completely and instead pass the AdminFilter* into each handler, which will clean up the dynamic_cast and remove a special-case from where the HandlerInfo is constructed.

It looks like as a sink you still need to get the Http::StreamDecoderFilterCallbacks* into your system, but can we get this by plumbing through the AdminFilter& rather than having this empty-base-class/dynamic-cast strategy? Or @mattklein123 do you have a better idea?

I agree - Either the Http::StreamDecoderFilterCallbacks* should be passed through or you could add some new public methods to AdminFilter to do what you want and pass it through.

mrice32 · 2018-03-06T23:52:36Z

source/server/http/admin.cc

+    code = parent_.runCallback(path, *header_map, response, *handler_info_);
+  } else {
+    HandlerInfoPtr temp_handler(new HystrixHandlerInfo(callbacks_));
+    handler_info_ = std::move(temp_handler);


nit: Why not just handler_info_ = std::make_unique<HystrixHandlerInfo>(callbacks_);?

mrice32 · 2018-03-07T00:06:48Z

source/server/http/admin.cc

+
+  // using write() since we are sending network level
+  // TODO(trabetti): is there an alternative to the const_cast?
+  (const_cast<Network::Connection*>((hystrix_handler_info->callbacks_)->connection()))


Could you elaborate in your comment on why can't you use callbacks_->encodeData() here instead of getting at the raw connections? I'm having a little trouble understanding.

Right. Probably can use encodeData()..

mrice32 · 2018-03-07T00:08:38Z

test/server/http/admin_test.cc

-  auto callback = [&](const std::string&, Http::HeaderMap&, Buffer::Instance&) -> Http::Code {
-    return Http::Code::Accepted;
-  };
+  auto callback = [&](const std::string&, Http::HeaderMap&, Buffer::Instance&,


nit: remove & from the capture since you don't need to capture anything from the environment - ditto in other test cases.

This is not related to this PR. Want us to fix those here anyway?

Oh, I didn't notice that you were only modifying part of that line. It'd be awesome if you could clean that up, though!

mrice32 · 2018-03-07T00:59:15Z

source/common/stats/hystrix.cc

+const uint64_t Hystrix::PING_INTERVAL_IN_MS;
+
+// add new value to rolling window, in place of oldest one
+void Hystrix::pushNewValue(std::string key, int value) {


nit: generally prefer types with defined sizes - like int64_t. Also, it seems like you use int64_t as the return value for getRollingValue() which is a diff of these inputs. A similar inconsistency exists for num_of_buckets_/current_index_ and DEFAULT_NUM_OF_BUCKETS.

mrice32 · 2018-03-07T01:01:08Z

source/common/stats/hystrix.cc

+
+  // combining timeouts+retries - retries are counted  as separate requests
+  // (alternative: each request including the retries counted as 1)
+  double timeouts = getRollingValue(cluster_name, "upstream_rq_timeout") +


Why are all these being casted to doubles and then casted back to integers?

mrice32 · 2018-03-07T01:21:53Z

source/server/http/admin.cc

+    handler_info_ = std::move(temp_handler);
+    code = parent_.runCallback(path, *header_map, response, *handler_info_);
+  } else {
+    HandlerInfoPtr temp_handler(new HystrixHandlerInfo(callbacks_));


This is only half-baked, but maybe it makes sense to make all the handlers classes instead of just functions and pass them (or provide callbacks to get) certain state, like callbacks_, on each request. Then, they can manage the creation of any per-request state that may need to be created internally. That would eliminate the need to modify this function every time a handler wants to use HandlerInfo, and it would eliminate the dynamic_cast.
What do you think?

I agree, we also thought about something like that. How about opening a separate issue for that?

Yup, that seems reasonable to me. Could you also toss a TODO in there referencing the issue?

mrice32 · 2018-03-07T18:12:24Z

source/common/stats/hystrix.h

+typedef std::vector<uint64_t> RollingStats;
+typedef std::map<std::string, RollingStats> RollingStatsMap;
+
+// Consider implement the HystrixStats as a sink to have access to histograms data


+1 to the idea of moving the bulk of the logic to a stats sink.

We thought of doing this later as an enhancement..

The reason I mentioned this was that I was thinking more about using stats sinks for this. I'd be interested to hear @mattklein123's and @htuch's thoughts. IIUC, the general architecture is a subscription mechanism where a client reaches out to subscribe and then Envoy streams specific stats data indefinitely over that connection. Is this correct?

That would mean that this is a combination of the existing pull-based integrations, like prometheus, that just implement a simple request-query-response handler on the admin port and push-based integrations, like the GRPC stats streamer and statsd, that implement stats sinks to push stats to subscribers. A lot of the custom functionality that you've added, like querying stats and periodic flushes, are already handled by the existing sink interface.

I'm wondering if there's some way that we can keep an admin port subscription model, but pass off the stream to a stats sink where the periodic pushing can be done. This would likely allow the pushing to be more performant since it won't require a new periodic query to be set up for each subscriber (assuming the multi-subscriber case would ever happen in a hystrix deployment). It would also mean that you wouldn't have to reinvent the wheel when it comes to building and configuring a push system within Envoy or querying stats data. And, as you mentioned, it will have built-in support for histograms. I'm curious to hear what others think, though.

+1, agreed. We should do this using a stat sink. I haven't through it through fully, but I don't see any reason why we can't figure out a way in which the active admin stream registers with the sink, and then when the flush calls happen it would just work? @mrice32 can you help guide this PR in that direction?

Stats sink means the stream will be sent at same intervals as other stats which are active. Isn't it a limitation?

@mattklein123, sounds good.

@trabetti, yes, it is a bit of a limitation. However, I think most deployments have only been using one sink so far, so the aligned intervals have not been an issue. If this is something you foresee being an issue, we should open an issue and design a way for stats sinks to optionally specify a time interval that is different from the global one. It shouldn't be too hard to make this more flexible.

jmarantz

+1 on changing this to a stats sink. I felt like most of the superficial testing and string-handling comments below might apply to this as a sink as well.

jmarantz · 2018-03-08T02:05:58Z

source/common/stats/hystrix.cc

+  }
+}
+
+uint64_t Hystrix::getRollingValue(std::string cluster_name, std::string stats) {


const std::string& cluster_name, const std::string& stats.

or maybe:

absl::string_view cluster_name, absl::string_view stats

jmarantz · 2018-03-08T02:07:10Z

source/common/stats/hystrix.cc

+}
+
+uint64_t Hystrix::getRollingValue(std::string cluster_name, std::string stats) {
+  std::string key = "cluster." + cluster_name + "." + stats;


absl::StrCat (which knows how to reserve the right size for the output buffer for all inputs, before it starts appending bytes).

jmarantz · 2018-03-08T02:08:12Z

source/common/stats/hystrix.cc

+    // better return 0, will be back to normal once one rolling window passes
+    // better idea what to return? could change the algorithm to keep last valid delta,
+    // updated with pushNewValue and kept stable when the update is negative.
+    if (rolling_stats_map_[key][current_index_] <


auto& rolling_stats_map_entry = rolling_stats_map_[key] and CSE everywhere that appears.

jmarantz · 2018-03-08T02:08:33Z

source/common/stats/hystrix.cc

+void Hystrix::resetRollingWindow() { rolling_stats_map_.clear(); }
+
+void Hystrix::addStringToStream(std::string key, std::string value, std::stringstream& info) {
+  addInfoToStream(key, "\"" + value + "\"", info);


absl::StrCat.

jmarantz · 2018-03-08T02:08:44Z

source/common/stats/hystrix.cc

+
+void Hystrix::resetRollingWindow() { rolling_stats_map_.clear(); }
+
+void Hystrix::addStringToStream(std::string key, std::string value, std::stringstream& info) {


const refs or string_view

jmarantz · 2018-03-08T02:16:56Z

test/common/stats/hystrix_test.cc

+namespace Envoy {
+namespace Stats {
+
+void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) {


Use string_view here. Also declare non-changing temps as const. Also all your temp strings below should be absl::string_view.

jmarantz · 2018-03-08T02:18:32Z

test/common/stats/hystrix_test.cc

+namespace Stats {
+
+void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) {
+  std::string actual = dataMessage.substr(dataMessage.find(key));


capture dataMessage.find(key) in a temp key_pos and EXPECT_NE(absl::string_view::npos, key_pos);

jmarantz · 2018-03-08T02:20:01Z

test/common/stats/hystrix_test.cc

+  actual = actual.substr(actual.find(" ") + 1);
+  std::size_t length = actual.find(",");
+  actual = actual.substr(0, length);
+  EXPECT_EQ(actual, std::to_string(expected));


a slightly nicer pattern is to return the actual, and do the EXPECT_EQ at the call-site, which gives you useful line-numbers if it fails.

jmarantz · 2018-03-08T02:22:31Z

test/common/stats/hystrix_test.cc

+
+void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) {
+  std::string actual = dataMessage.substr(dataMessage.find(key));
+  actual = actual.substr(actual.find(" ") + 1);


all your .find returns need to be checked. Also, please give a comment showing what the format is you are parsing.

jmarantz · 2018-03-08T02:27:19Z

test/common/stats/hystrix_test.cc

+  checkStreamField(dataMessage, "rollingCountSuccess", 63);
+  checkStreamField(dataMessage, "rollingCountTimeout", 27);
+  checkStreamField(dataMessage, "propertyValue_queueSizeRejectionThreshold", expectedQueueSize);
+  checkStreamField(dataMessage, "reportingHosts", expectedReportingHosts);


I wonder if you could have your helper do a more complete parsing of dataMessage, returning a map<string, uint64_t> which you could then just EXPECT_EQ against a map that's initialized to the above gold values.

The existing checkStreamField looks like it might ignore stray characters in the string, and would definitely ignore extra field/value pairs. It also won't work if one of the keys is a substring of another.

trabetti · 2018-03-08T14:30:38Z

Thanks for so many comments! Will probably take until middle of next week to start pushing the fixes.

jmarantz · 2018-03-08T14:34:28Z

@trabetti recommend focusing on big picture first (hystrix as stats sink). I have a growing concern with dependency-creep in Envoy as lots of deployments differ greatly and we'd like it to be easy to configure the parts we really need without generating a sea of ifdefs.

…nly the needed values Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-03-16T09:39:43Z

Just updating, we got distracted by some urgent task this week, will start pushing fixes next week..

mattklein123 · 2018-03-16T19:50:44Z

Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you!

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

jmarantz · 2018-03-19T11:41:11Z

source/server/http/admin.cc

+  response_headers.insertNoChunks().value().setReference("0");
+
+  Stats::HystrixHandlerInfoImpl& hystrix_handler_info =
+      dynamic_cast<Stats::HystrixHandlerInfoImpl&>(handler_info);


IMO dynamic_cast should rarely be needed, and when used the result certainly need to be checked for nullptr (so it should be pointers not refs).

This all might be moot anyway if you move to using sinks maybe this can all be reverted.

But if we decide not to go with sinks in this PR, and you need to add this as a new stats callback, then my suggestion is to allocate Stats::HystrixHandlerInfoImpl as a temp inside a simple handler. It does look like you currently need to pass through callbacks_ to construct the HystrixHandlerInfoImpl, but since the handlers are non-static methods. you have that available to you in the callback. You don't need to change MAKE_ADMIN_HANDLER, the dispatching mechanism, or all the signatures of the existing handlers.

That would clean this up quite a bit from a code and PR-size perspective.

But really the indicates that this admin handler is really a lot different from all the other ones, and points strongly toward having an architectural discussion. In particular, this admin handler actually starts a timer and does stuff after the request terminates, which is very different semantics from all the other admin handlers, which either do a simple mutation or return some data. Probably this is another signpost pointing us toward sinks as a better solution. WDYT?

In fact, the request never terminates. We must use the same socket that was used in the original request for streaming. If the request was terminated, the socket would have been closed. I think this was the main issue with implementing as a sink. Will need some help to resolve this.

@jmarantz Totally agree with the sentiment. However, IIUC, the handlers don't have the private variables of the filter (or the filter, itself) available to them since they are methods on the AdminImpl object, but I may be missing something.

Above, we discussed above adding a handler class (rather than just a method) to combine the callback and data and formalize the idea of combining handlers with state. This would allow handlers to allocate and manage their own state, be tied to the filter object, and not require any dynamic_casting. We decided to add this as a TODO rather than do it in this PR, but maybe we should get that in as a part of this PR instead? What do you think?

Side note: as long as a bad_cast exception is the desired result of a failed dynamic_cast, I think dynamic_casting a ref should be fine since it automatically throws if the cast fails.

@trabetti I think this will still work exactly the same way as it does now in a sink. I think the request object would need to be registered with the sink, and then the sink could start encoding data through it without closing it, just as the timers do now.

I still think it should be a separate PR, but maybe it should go first, and have the current one on top of it.

@mrice32 Good point; how about supplying the AdminFilter* to the callbacks, rather than an empty HandlerInfo* base class that needs to be cast whenever it is used. That would require passing 'this' to parent_.runCallback() in AdminFilter::onComplete, and then passing that through to the handlers, so they'd have context of the filter instantiation that invoked them.

@trabetti IIUC, you're saying you want to break these into two PRs - first the sink that the handler would register with, and then add the handler, itself, in a separate PR? I think that's a great idea.

@jmarantz yup, seems reasonable to me.

@mrice32 I meant for first PR for adding a handler class instead of a method. Then, modify this PR to use the class.

Oh, I see. Yup, that would be fine too. I'm always in favor of dividing up PRs where possible :)

jmarantz · 2018-03-20T18:27:20Z

include/envoy/server/admin.h

+ * in admin impl. Each handler which needs to receive data from admin filter can inherit from
+ * HandlerInfo and build a class which contains the relevant data.
+ */
+class HandlerInfo {


As indicated in another comment thread, we can eliminate HandlerInfo completely and instead pass the AdminFilter* into each handler, which will clean up the dynamic_cast and remove a special-case from where the HandlerInfo is constructed.

jmarantz · 2018-03-20T19:21:20Z

source/common/stats/hystrix.cc

+void Hystrix::pushNewValue(std::string key, uint64_t value) {
+  // create vector if do not exist
+  if (rolling_stats_map_.find(key) == rolling_stats_map_.end()) {
+    rolling_stats_map_[key].resize(num_of_buckets_, value);


rolling_stats_map_[key].resize(num_o_buckets_, value) is all you need; no need for the if-statement or .find() check.

jmarantz · 2018-03-20T19:23:33Z

source/common/stats/hystrix.cc

+const uint64_t Hystrix::PING_INTERVAL_IN_MS;
+
+// add new value to rolling window, in place of oldest one
+void Hystrix::pushNewValue(std::string key, uint64_t value) {


const std::string& here and in all formals.

jmarantz · 2018-03-20T19:25:55Z

source/common/stats/hystrix.cc

+
+void Hystrix::updateRollingWindowMap(Stats::Store& stats, absl::string_view cluster_name) {
+  std::string prefix;
+  prefix = absl::StrCat("cluster.",cluster_name,".");


Generally the formatter requires that you add spaces between args. You need to run the formatter under docker with every commit.

jmarantz · 2018-03-20T19:47:29Z

source/common/stats/hystrix.cc

+  // (alternative: each request including the retries counted as 1)
+  std::string upstream_rq_timeout_key = absl::StrCat(prefix,"upstream_rq_timeout");
+  std::string upstream_rq_per_try_timeout_key = absl::StrCat(prefix,"upstream_rq_per_try_timeout");
+  uint64_t timeouts = stats.counter(upstream_rq_timeout_key).value() +


If this is going to happen a bunch, do you want to cache the stats.counter return references and store them in the class, rather than doing the name-lookup each time?

jmarantz · 2018-03-20T19:51:39Z

source/common/stats/hystrix.cc

+  uint64_t errors = stats.counter(upstream_rq_5xx_key).value() +
+                    stats.counter(retry_upstream_rq_5xx_key).value() +
+                    stats.counter(upstream_rq_4xx_key).value() +
+                    stats.counter(retry_upstream_rq_4xx_key).value() -


why '-' instead of '+' here? Just add comment if this was intentional.

jmarantz · 2018-03-20T19:54:25Z

source/common/stats/hystrix.cc

+  addIntToStream("currentConcurrentExecutionCount", 0, cluster_info);
+  addIntToStream("latencyExecute_mean", 0, cluster_info);
+
+  // latency information can be  taken rom hystogram, which is only available to sinks


typo "from histogram"? Start comments with a capital letter, end with a period.

jmarantz · 2018-03-20T19:55:59Z

source/common/stats/hystrix.cc

+  std::stringstream out_str;
+
+  for (std::map<std::string, RollingStats>::const_iterator it = rolling_stats_map_.begin();
+       it != rolling_stats_map_.end(); ++it) {


for (auto it : rolling_stats_map_) {

jmarantz · 2018-03-20T19:56:35Z

source/common/stats/hystrix.cc

+  addHystrixThreadPool(ss, cluster_name, max_concurrent_requests, reporting_hosts);
+}
+
+absl::string_view Hystrix::printRollingWindow() {


this method should be const. Probably others should be as well.

jmarantz · 2018-03-20T19:58:04Z

source/common/stats/hystrix.h

+   * Get value of the sampling buckets
+   */
+  static uint64_t GetRollingWindowIntervalInMs() {
+    return static_cast<const uint64_t>(ROLLING_WINDOW_IN_MS / DEFAULT_NUM_OF_BUCKETS);


jmarantz · 2018-03-20T20:00:25Z

source/server/http/admin.cc

+    handler_info_ = std::make_unique<Stats::HystrixHandlerInfoImpl>(callbacks_);
+    code = parent_.runCallback(path, *header_map, response, *handler_info_);
+    end_stream = false;
+  }


as discussed elsewhere I am hoping to replace this whole new block with:
Http::Code code = parent_.runCallback(path, *header_map, this, response);

@jmarantz Thanks for the comments. Will do another fix soon.

@jmarantz @mrice32 Question for the sink implementation.
As far as I understand, currently sinks are only added to the sinks vector (MainImpl::stats_sinks_) during init, according to what's specified in the configuration file "stats_sinks" parameter.
I thought of two ways to implement:
(1) Add a "hystrix sink" through the config file. Once a request is made from the hystrix dashboard, set the connection parameters on that sink and switch on an enable flag in it.
(2) Once a request is made, create a new sink with the connection parameters and add it to the vector of sinks. Remove it from the vector when the connection is destroyed.
Do you see any problem or disadvantage with any of these options?

@trabetti, 1 is almost definitely what we want. This allows you to do all of the formatting once rather than having to do it for each connection. Also, Envoy is not really built to scale in the number of stats sinks because of the way it passes stats to them. And lastly, dynamic registration of sinks without involving the config makes them more difficult to reason about from the user's perspective.

I'm not exactly sure what you mean by setting connection parameters or enabling the flag, but by my understanding, you should be able to just register each stream with the single hystrix sink that is created through the config. We'll also need to detect connection termination and make sure the sink is aware of this so it doesn't try to access a deleted stream. This registration mechanism will take some thought, by the way, because there is no existing communication mechanism between stats sinks and the admin interface.

As you're working through this design change, feel free to reach out with any questions here or on slack. We're happy to provide input wherever needed :). Take your time making incremental updates here, and whenever you're ready for us to take another pass, just let us know.

Hi @mrice32 , the first version of the hystrix support as a sink is almost ready, I hope I'll get to push it before Passover holiday..
Questions:

As we discussed, the hystrix sink will be set through the config file, which means I touched data_plane_api. How do I push it here so it compiles (in my development env I point to my fork of data_plane_api)

I need some help figuring out how to use the histogram data.
Thanks!

You should send out a PR for your data-plane-api changes, which will need to be merged before this PR is merged. Until that PR gets merged, you should do exactly what you're doing now - point the tests to your own fork to allow the CI to pass in the meantime.

What, specifically, about the histogram data is confusing? The model for histograms is different for the other stats. Each sample is sent to the sinks when published. This is done synchronously by calling this method from whatever thread recorded the value. So you need to be sure that that method is thread-safe. The statsd implementation uses TLS to ensure thread-safety (see here), but since you are probably interested in forming some sort of central cache from which to pull at flush time, your implementation will likely be more complicated than that. It's also worth noting that there's work underway on a design to create an Envoy representation of the distributions and central cache to store them as opposed to just handing the samples to the sinks as is done now.

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

Signed-off-by: trabetti <talis@il.ibm.com>

…trix-dashboard-support-handler-upstream Signed-off-by: trabetti <talis@il.ibm.com>

mrice32

This is looking awesome! I have a few design suggestions. I think we should aim for all hystrix-specific code to be under source/extensions. We can extend existing APIs to make the hystrix use case possible, but we should remove any special cases or methods in the core Envoy codebase. I've laid out a few suggestions (this review and the last one) to get us there.

I haven't gone through the sink, itself, yet, but I'll take a look at that next.

mrice32 · 2018-04-14T23:38:22Z

source/extensions/stat_sinks/hystrix/config.h

+namespace Envoy {
+namespace Extensions {
+namespace StatSinks {
+namespace HystrixNameSpace {


Prefer Hystrix over HystrixNameSpace.

mrice32 · 2018-04-16T20:28:36Z

include/envoy/server/admin.h

+ * in admin impl. Each handler which needs to receive data from admin filter can inherit from
+ * HandlerInfo and build a class which contains the relevant data.
+ */
+class HandlerInfo {


I agree - Either the Http::StreamDecoderFilterCallbacks* should be passed through or you could add some new public methods to AdminFilter to do what you want and pass it through.

mrice32 · 2018-04-16T21:37:29Z

source/server/http/admin.cc

+    handler_info_ =
+        std::make_unique<Extensions::StatSinks::Common::HystrixHandlerInfoImpl>(callbacks_);
+    code = parent_.runCallback(path, *request_headers_, *header_map, response, *handler_info_);
+    end_stream = false;


If we decide to remove handler_info_, could we remove the rest of this special case? Could we just allow the callback to set the end_stream (it could still be defaulted to true so we don't need to change the other handlers)?

How about, instead of involving the handlers and adding more arguments to their interface, add a bool streaming_ to struct UrlHandler and let AdminImpl::runCallback handle the end_stream value?

mrice32 · 2018-04-16T21:44:30Z

source/server/http/admin.cc

@@ -668,7 +714,7 @@ void AdminFilter::onComplete() {

  // Under no circumstance should browsers sniff content-type.
  header_map->addReference(headers.XContentTypeOptions, headers.XContentTypeOptionValues.Nosniff);
-  callbacks_->encodeHeaders(std::move(header_map), response.length() == 0);
+  callbacks_->encodeHeaders(std::move(header_map), end_stream && response.length() == 0);

  if (response.length() > 0) {
    callbacks_->encodeData(response, true);


If we want end_stream to be general purpose as proposed in the comment above, can we change this call to:

callbacks_->encodeData(response, end_stream);

Essentially this would mean that if hystrix or some other callback wanted to send data and not end the stream, that would be possible.

mrice32

Another round of comments - the sink looks good so far, thanks for addressing my comments from last time! I'll continue digging into the core sink implementation tomorrow, but this should be good enough to get started.

mrice32 · 2018-04-16T22:08:26Z

source/extensions/stat_sinks/common/hystrix/hystrix.h

@@ -0,0 +1,157 @@
+#include <map>


I think the stats_sinks/common directory is only used for components that are shared by multiple stats sinks. I think it would be fine to put everything under stats_sinks/hystrix (similar to stats_sinks/metrics_service).

mrice32 · 2018-04-16T22:10:21Z

source/extensions/stat_sinks/common/hystrix/hystrix.h

+typedef std::vector<uint64_t> RollingStats;
+typedef std::map<const std::string, RollingStats> RollingStatsMap;
+
+class Hystrix : public Logger::Loggable<Logger::Id::hystrix> {


nit: can we call this something a little more specific to what function it performs, like HystrixStatCache for instance?

mrice32 · 2018-04-16T22:15:01Z

source/extensions/stat_sinks/common/hystrix/hystrix.h

+
+  RollingStatsMap rolling_stats_map_;
+  uint64_t current_index_;
+  uint64_t num_of_buckets_;


nit: const?

mrice32 · 2018-04-16T22:18:53Z

source/extensions/stat_sinks/common/hystrix/hystrix.h

+  Hystrix() : current_index_(DEFAULT_NUM_OF_BUCKETS), num_of_buckets_(DEFAULT_NUM_OF_BUCKETS + 1){};
+
+  Hystrix(uint64_t num_of_buckets)
+      : current_index_(num_of_buckets), num_of_buckets_(num_of_buckets + 1){};


nit: the variable naming here is a little confusing since you're setting num_of_buckets_ equal to num_of_buckets + 1. Can we change the name of the instance variable (num_of_buckets_) to something that more clearly describes what it is - like bucket_mod_, for instance?

mrice32 · 2018-04-16T22:22:32Z

source/extensions/stat_sinks/common/hystrix/hystrix.h

+  void flushGauge(const Stats::Gauge&, uint64_t){};
+  void endFlush();
+  void onHistogramComplete(const Stats::Histogram& histogram, uint64_t value) {
+    std::cout << "histogram complete: " << histogram.name() << ", value: " << std::to_string(value)


Did you mean to leave this print statement here?

mrice32 · 2018-04-16T22:52:29Z

source/extensions/stat_sinks/common/hystrix/hystrix.cc

+  }
+}
+
+uint64_t Hystrix::getRollingValue(absl::string_view cluster_name, absl::string_view stats) {


nit: stats -> stat since you're looking up a single statistic.

mrice32 · 2018-04-16T22:59:12Z

source/extensions/stat_sinks/common/hystrix/hystrix.cc

+}
+
+// void HystrixSink::onHistogramComplete(const Histogram& histogram, uint64_t value);
+void HystrixSink::registerConnection(Http::StreamDecoderFilterCallbacks* callbacks) {


As mentioned in the previous review, I'm hoping we can redesign this so that the sink registers with the Admin as opposed to the other way around.

Overall, though, it seems as if this is currently not designed for multiple streams. I would imagine that multiple subscribers is a valid use case since there's no restriction on the number of subscribers that can register with the hystrix admin handler. Is the intent just to drop the last connection each time a new one comes in? If not, we should probably have a vector of connection/callbacks/streams (or whatever object we end up using to interact with the stream) and send the data to all of them.

mrice32 · 2018-04-16T23:04:31Z

source/extensions/stat_sinks/common/hystrix/hystrix.cc

+  callbacks_ = callbacks;
+}
+
+// TODO (@trabetti) is this correct way?


I think to close the stream, you will need to make an encodeData (or some other encode* method) call on the callbacks with the end_stream argument set to true.

we did not provide a way to terminate the stream from Envoy's side (should we?), this is called when the dashboard client was closed, so we are not sending any more data.

mrice32 · 2018-04-16T23:25:12Z

source/extensions/stat_sinks/common/hystrix/hystrix.cc

+    }
+    out_str << std::endl;
+  }
+  return out_str.str();


@jmarantz can check me on this, but this seems like a dangerous/incorrect use of absl::string_view. According to the source, the std::string must outlive the absl::string_view.

mrice32 · 2018-04-16T23:32:57Z

source/server/http/admin.h

@@ -233,7 +258,8 @@ class AdminFilter : public Http::StreamDecoderFilter, Logger::Loggable<Logger::I
  AdminFilter(AdminImpl& parent);

  // Http::StreamFilterBase
-  void onDestroy() override {}
+  // TODO (@trabetti) : make more generic (how?)


What about creating a new public method on AdminFilter called addOnDestroyCallback()? This method could take a callback and append to a list of callbacks to execute when the filter is destroyed. This will only work if we go with what @jmarantz suggested - an architecture where the AdminFilter is passed to the handlers directly.

Is this the portion that's confusing? Basically, you'd add a public method to AdminFilter that would have the following signature:

void addOnDestroyCallback(std::function<void()> cb);

That method would append to a list of callbacks that are stored inside the AdminFilter. When onDestroy() is called, the AdminFilter can call all of the callbacks in that list. Ultimately, once all this is done, I see your handler looking something like this:

Http::Code HystrixSink::handlerHystrixEventStream(... Server::AdminFilter& admin_filter, ....) { . . . Http::StreamDecoderFilterCallbacks *stream_decoder_filter_callbacks = admin_filter.getDecoderFilterCallbacks(); // Separated out just so it's easier to understand auto on_destroy_callback = [this, stream_decoder_filter_callbacks] () { // Unregister the callbacks from the sink so data is no longer encoded through them. unregisterConnection(stream_decoder_filter_callbacks); } // Add the callback to the admin_filter list of callbacks admin_filter. addOnDestroyCallback(std::move(on_destroy_callback)); . . . }

To do this, you'll need to have the AdminFilter be passed to the handlers, and you'll need to add and implement the addOnDestroyCallback method.

If you do the above, we can get rid of the AdminImpl::unregisterHystrixConnection() and Server::InstanceImpl:: unregisterHystrixSink. This is the last piece of the puzzle to allowing the sink registering itself and requiring no special methods to pass the registration through the server.

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-04-17T13:44:57Z

Pushed a new version. The handler is added in the sink constructor. (plus other fixes).
If this direction looks good, I will continue to passing the AdminFilter to handlers, and the related changes discussed here.
I am not so clear about the direction of registering the sink on the Admin.

With the implementation in this commit, only one sink is possible, since the sink constructor adds the handler. (not sure what happens if someone tries to add more than one sink). I think it makes sense, but it is different than other sinks.
I did add support for multiple dashboard clients on one sink.

jmarantz · 2018-04-17T19:05:06Z

include/envoy/server/admin.h

@@ -23,8 +24,8 @@ namespace Server {
 */
 #define MAKE_ADMIN_HANDLER(X)                                                                      \
  [this](absl::string_view path_and_query, Http::HeaderMap& response_headers,                      \
-         Buffer::Instance& data) -> Http::Code {                                                   \
-    return X(path_and_query, response_headers, data);                                              \
+         Buffer::Instance& data, Http::StreamDecoderFilterCallbacks* callbacks) -> Http::Code {    \


As mentioned earlier, IMO all these args could be saved in the AdminFilter object -- and the callbacks already are. That would simplify this interface once and for all, rather than adding a very specialized complicated object to all of them.

mrice32

Thanks for making all those changes! I added a longer explanation to my suggestion below about getting the unregister portion to work. Let us know if you have any questions!

mrice32 · 2018-04-17T20:27:20Z

source/server/http/admin.h

@@ -233,7 +258,8 @@ class AdminFilter : public Http::StreamDecoderFilter, Logger::Loggable<Logger::I
  AdminFilter(AdminImpl& parent);

  // Http::StreamFilterBase
-  void onDestroy() override {}
+  // TODO (@trabetti) : make more generic (how?)


Is this the portion that's confusing? Basically, you'd add a public method to AdminFilter that would have the following signature:

void addOnDestroyCallback(std::function<void()> cb);

That method would append to a list of callbacks that are stored inside the AdminFilter. When onDestroy() is called, the AdminFilter can call all of the callbacks in that list. Ultimately, once all this is done, I see your handler looking something like this:

Http::Code HystrixSink::handlerHystrixEventStream(... Server::AdminFilter& admin_filter, ....) { . . . Http::StreamDecoderFilterCallbacks *stream_decoder_filter_callbacks = admin_filter.getDecoderFilterCallbacks(); // Separated out just so it's easier to understand auto on_destroy_callback = [this, stream_decoder_filter_callbacks] () { // Unregister the callbacks from the sink so data is no longer encoded through them. unregisterConnection(stream_decoder_filter_callbacks); } // Add the callback to the admin_filter list of callbacks admin_filter. addOnDestroyCallback(std::move(on_destroy_callback)); . . . }

To do this, you'll need to have the AdminFilter be passed to the handlers, and you'll need to add and implement the addOnDestroyCallback method.

If you do the above, we can get rid of the AdminImpl::unregisterHystrixConnection() and Server::InstanceImpl:: unregisterHystrixSink. This is the last piece of the puzzle to allowing the sink registering itself and requiring no special methods to pass the registration through the server.

Signed-off-by: trabetti <talis@il.ibm.com>

trabetti · 2018-04-18T08:40:28Z

Pushed a commit with AdminFilter passed to handler instead of the callbacks, and followed @mrice32 explanation about the destroy callback.
As @jmarantz mentioned, now we can remove the request_headers parameter, and maybe even others. But at first step I did not change this, as this will result in many changes across various handlers, and also requires some work to do on testing (there is no mock for the AdminFilter).
Please give me feedback on this version, and if it looks as the right direction I will proceed.

mrice32

Looks good. This is definitely the right direction, thanks! I did a quick review and left a few nits and questions, but I think we're getting close. I'll do a full pass over the code tomorrow. Side note: I suspect that you'll need to add a few tests for the codec (no chunk) change, new server accessor method (should be super simple to get coverage there), and admin filter changes.

mrice32 · 2018-04-18T23:10:28Z

api/envoy/config/metrics/v2/stats.proto

@@ -184,6 +184,11 @@ message StatsdSink {
  string prefix = 3;
 }

+// can it be empty?


Technically, yes, but then it might make sense to not provide a config at in the API all by using this implementation in your sink:

ProtobufTypes::MessagePtr HystrixSinkFactory::createEmptyConfigProto() { return ProtobufTypes::MessagePtr{new Envoy::ProtobufWkt::Empty()}; }

However, the API portion of the repo is generally used for docs as well (the comments are scraped, formatted, and published to a docs site), so this config might be a good place to provide general information about the Hystrix stats sink.

I thought of having the number of buckets as a parameter to the constructor. And theoretically it can be changed even during action. But not sure if this level of control is really necessary.

I think adding it as a config option would be better than making it a constructor argument since the constructor is just called from the factory, and there is no factory configuration, so it would have to be hardcoded there. If this is truly something that a user might want to tweak, it would be good to have it as a config parameter :).

mrice32 · 2018-04-18T23:12:12Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  void endFlush() override;
+  void onHistogramComplete(const Stats::Histogram&, uint64_t) override{};
+
+  // TODO (@trabetti) : support multiple connections


nit: hasn't this been done? If so, can we remove the TODO comment?

mrice32 · 2018-04-18T23:13:23Z

source/server/http/admin.cc

@@ -150,6 +150,16 @@ Http::FilterTrailersStatus AdminFilter::decodeTrailers(Http::HeaderMap&) {
  return Http::FilterTrailersStatus::StopIteration;
 }

+void AdminFilter::onDestroy() {
+  for (auto callback : on_destroy_callbacks_) {


nit: const auto& callback

mrice32 · 2018-04-18T23:14:02Z

source/server/http/admin.cc

+}
+
+void AdminFilter::addOnDestroyCallback(std::function<void()> cb) {
+  on_destroy_callbacks_.push_back(cb);


nit: on_destroy_callbacks_.push_back(std::move(cb));

mrice32 · 2018-04-18T23:15:56Z

source/server/http/admin.cc

+  Http::Code code = parent_.runCallback(path, *request_headers_, *header_map, response, *this);
+  bool end_stream = true;
+
+  if (path.find("/hystrix_event_stream") != std::string::npos) {


You may already be planning to do this, but can we get rid of the special case by allowing the handler to be able to set the end_stream variable (we can still default it to true)?

How about, instead of involving the handlers and adding more arguments to their interface, add a bool streaming_ to struct UrlHandler and let AdminImpl::runCallback handle the end_stream value?

I don't have any particular issues with that. Although, if we end up putting most of the handler arguments into the AdminFilter itself, we could add end_stream there, too. I think that would be be the easiest way.

mrice32 · 2018-04-18T23:22:22Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+void HystrixSink::unregisterConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_remove) {
+  for (auto it = callbacks_list_.begin(); it != callbacks_list_.end();) {
+    if ((*it)->streamId() == callbacks_to_remove->streamId()) {
+      it = callbacks_list_.erase(it);


Why not break after erasing? Is it possible to have multiple callbacks with the same stream id here? If it is, are you sure you want to delete both? Maybe it makes sense to de-dup in the registerConnection() method so this can't happen? What do you think?

I think that the stream id should be unique, can break.

mrice32 · 2018-04-18T23:30:52Z

source/extensions/stat_sinks/hystrix/hystrix.h

+private:
+  HystrixPtr stats_;
+  std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{};
+  Server::Instance* server_;


nit: why not Server::Instance& server_; since it never changes what it's pointing to?

mrice32 · 2018-04-18T23:31:41Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  std::map<std::string, std::map<std::string, std::string>> counter_name_lookup;
+};
+
+typedef std::unique_ptr<HystrixStatCache> HystrixPtr;


nit: HystrixStatCachePtr

mrice32 · 2018-04-19T22:25:43Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+
+uint64_t HystrixStatCache::getRollingValue(absl::string_view cluster_name, absl::string_view stat) {
+  std::string key;
+  key = absl::StrCat("cluster.", cluster_name, ".", stat);


nit: why not single line std::string key = absl::StrCat("cluster.", cluster_name, ".", stat);?

mrice32 · 2018-04-19T22:26:27Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+// Add new value to rolling window, in place of oldest one.
+void HystrixStatCache::pushNewValue(const std::string& key, uint64_t value) {
+  // Create vector if do not exist.
+  // TODO trabetti: why resize + value param didn't work without the if?


What is this TODO about?

we wanted to do a changed suggested by @jmarantz , it did not work for some reason, did not get to check

Yeah the trigger for me here was that in both branches, you are doing two map lookups, and I think you should only need one. Maybe mention that in the TODO.

mrice32 · 2018-04-19T22:48:47Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  HystrixStatCache()
+      : current_index_(DEFAULT_NUM_OF_BUCKETS), window_size_(DEFAULT_NUM_OF_BUCKETS + 1){};
+
+  HystrixStatCache(uint64_t num_of_buckets)


Sorry, I don't think I gave the right feedback the first time around. I think window_size_ should be equal to num_of_buckets since you want to mod by the total number of buckets. I think current_index_ should be set to num_of_buckets - 1, since that's the highest index. Right now, there are technically num_of_buckets + 1 total buckets, which is a bit confusing. Also, there should be some input validation when you read in the config to ensure that num_of_buckets is at least 2 (I assume this wouldn't make much sense with only one bucket...?). You could also toss in an assert in the constructor to validate.

It is confusing.. The statsCache is a rolling window. If for example, we have flush interval of 1 ms, and the number_of_buckets is 5, it means that we send every 1 ms the statistics for the last 5 ms. The values stored in it are the counters values, which are constantly increasing. So the result we want to return every endFlush() is the value of current counter, minus the counter before the rolling window started. That's why we keep num_of_buckets+1 values in the rolling window. Does it make sense?

Yep, that makes sense. Thanks for the explanation. On another note, are 1 or 0 acceptable inputs to this function? If not, could we add some input validation (either where you read from the config or here)?

I assume '0' means unset value, in this case it will be set to default value (10) . '1' should be legitimate value, but it is good to make sure it works. I already did that but config.cc I see here is not the updated one (the github scalability issue?)

mrice32 · 2018-04-19T23:27:57Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  if (callbacks_list_.empty()) {
+    return;
+  }
+  if (counter.name().find("upstream_rq_") != std::string::npos) {


Rather than string compare each of these, I think you should just grab the stats directly from each cluster in the endFlush() method. See the stats() method on the cluster object:
https://github.com/envoyproxy/envoy/blob/master/include/envoy/upstream/upstream.h#L494

These are all the stats that are available through that struct:
https://github.com/envoyproxy/envoy/blob/master/include/envoy/upstream/upstream.h#L292

I know the macros are a bit confusing, but it's just creating a struct where you can access the Counter&, Gauge&, etc objects. I think this should cover most (or all?) of the stats you need to capture here and the .membership_total one you look up in endFlush(). IIUC this should also allow you to completely get rid of counter_name_lookup_.

The existing struct do not cover all the stats we need. missing - upstream_rq_2xx, upstream_rq_4xx, upstream_rq_5xx. To get rid of the string compare, we'll need to read those directly in end_flush(). Or is it possible to create struct with only the stats we need?
By the way, we use the stats() method here: https://github.com/trabetti/envoy/blob/bb876ee141069371ad0624bc3efc08074a0d7602/source/extensions/stat_sinks/hystrix/hystrix.cc#L338

That's a good point - my bad, those are actually created and set dynamically -

envoy/source/common/http/codes.cc

Line 23 in b8f2268

scope.counter(fmt::format("{}upstream_rq_{}", prefix, enumToInt(response_code))).inc();

. For those, I would recommend looking them up dynamically similar to how you currently look up the .membership_total stat.

The stats() method you are referring to is the method on the Server::Instance object, which returns a stats scope that allows looking up/creating stats by name. I'm referring to the stats() method on the cluster object, which returns a struct of specific stats that are relevant to the cluster that can be accessed directly without a lookup. The latter should be used for as many stats as possible as it is the cheapest lookup. The former should be used for specific stats that are not accessible directly via a struct, which IIUC, should just be those 3 stats you just mentioned. We can't really create our own struct for all of these stats because it would need to constantly be updated whenever Envoy adds or removes a cluster.

I would really only advise using the flushCounter, flushGauge, etc methods when you need to export all or most of Envoy's stats. It seems this sink is interested in a pretty limited subset of stats, which means individual lookups will likely be the cheapest way to do it.

I know the stats system can be pretty confusing because there are a lot of different ways to do the same thing :). Does all of that make sense?

Just for my own learnings....is there some reason an interested class can't do the name-lookup once on construction and then save the desired stat in their own member var?

Signed-off-by: trabetti <talis@il.ibm.com>

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com> Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

jmarantz · 2018-04-23T14:27:00Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+    // If the counter was reset, the result is negative
+    // better return 0, will be back to normal once one rolling window passes.
+    if (rolling_stats_map_[key][current_index_] <
+        rolling_stats_map_[key][(current_index_ + 1) % window_size_]) {


same here. You are doing lots of map lookups with the same key. This can be avoided.

jmarantz · 2018-04-23T14:28:20Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  const uint64_t window_size_;
+  // TODO(trabetti): do we want this to be configurable through the HystrixSink in config file?
+  static const uint64_t DEFAULT_NUM_OF_BUCKETS = 10;
+  std::map<std::string, std::map<std::string, std::string>> counter_name_lookup;


suffix member vars with and underscore.

jmarantz · 2018-04-23T14:29:36Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  // Building lookup name map for all specific cluster values.
+  // Every call to the updateRollingWindowMap function should get the appropriate name from the map.
+  std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, ".");
+  counter_name_lookup[cluster_name]["upstream_rq_timeout"] =


Save counter_name_lookup[cluster_name] into a temp.

jmarantz · 2018-04-23T14:30:34Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+void HystrixStatCache::updateRollingWindowMap(std::map<std::string, uint64_t> current_stat_values,
+                                              std::string cluster_name) {
+
+  if (counter_name_lookup.find(cluster_name) == counter_name_lookup.end()) {


ditto -- lots of lookups to cluster_name all of which return the same exact string->string map.

jmarantz

This PR is running into github's current limits; may need to make a new one. When you do, I'd recommend.

Use a single commit in the new PR; don't just cut a new PR from all your existing commits, otherwise it will quickly blow up again.
Be careful about DCO which is failing now.

jmarantz · 2018-04-23T15:34:35Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  /**
+   * Generate the streams to be sent to hystrix dashboard.
+   */
+  void getClusterStats(std::stringstream& ss, absl::string_view cluster_name,


Move output parameter ss last. https://google.github.io/styleguide/cppguide.html#Output_Parameters

jmarantz · 2018-04-23T15:35:19Z

source/extensions/stat_sinks/hystrix/hystrix.h

+  /**
+   * Calculate values needed to create the stream and write into the map.
+   */
+  void updateRollingWindowMap(Upstream::ClusterInfoConstSharedPtr cluster_info,


const Upstream::ClusterInfoConstSharedPtr&

jmarantz · 2018-04-23T15:37:02Z

source/extensions/stat_sinks/hystrix/hystrix.h

+
+typedef std::unique_ptr<HystrixStatCache> HystrixStatCachePtr;
+
+namespace Hystrix {


Do you need this extra level of namespace? The only names in it are already prefixed with 'Hystrix'.

Good point. I think this namespace should surround everything inside the hystrix/ dir, not just this class. This is how it's done for other stats sinks.

jmarantz · 2018-04-23T15:37:24Z

source/extensions/stat_sinks/hystrix/hystrix.h

+
+class HystrixSink : public Stats::Sink, public Logger::Loggable<Logger::Id::hystrix> {
+public:
+  HystrixSink(Server::Instance& server, uint64_t num_of_buckets);


output params 'server' should be last: https://google.github.io/styleguide/cppguide.html#Output_Parameters

jmarantz · 2018-04-23T15:38:07Z

source/extensions/stat_sinks/hystrix/hystrix.h

+
+private:
+  HystrixStatCachePtr stats_;
+  std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{};


Is there any value to explicitly putting {} in this declaration?

jmarantz · 2018-04-23T15:44:49Z

test/extensions/stats_sinks/hystrix/hystrix_test.cc

+  std::string rolling_map = sink_->getStats().printRollingWindow();
+  std::size_t pos = rolling_map.find("cluster.test_cluster.total");
+  EXPECT_NE(std::string::npos, pos);
+  //  //EXPECT_NE(absl::string_view::npos, map.find("cluster.test_cluster.total"));


jmarantz · 2018-04-23T15:53:03Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  for (auto stats_map_itr = rolling_stats_map_.begin(); stats_map_itr != rolling_stats_map_.end();
+       ++stats_map_itr) {
+    out_str << stats_map_itr->first << " | ";
+    RollingWindow rolling_window = stats_map_itr->second;


const RollingWindow& rolling_window = name_value.second;

jmarantz · 2018-04-23T15:54:41Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+    out_str << stats_map_itr->first << " | ";
+    RollingWindow rolling_window = stats_map_itr->second;
+    for (auto specific_stat_vec_itr = rolling_window.begin();
+         specific_stat_vec_itr != rolling_window.end(); ++specific_stat_vec_itr) {


use a range-loop here too, with a specific type for the iterator variable (not auto). It's not obvious from looking at this code locally what the type is, which seems interesting.

jmarantz · 2018-04-23T15:55:24Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+    RollingWindow rolling_window = stats_map_itr->second;
+    for (auto specific_stat_vec_itr = rolling_window.begin();
+         specific_stat_vec_itr != rolling_window.end(); ++specific_stat_vec_itr) {
+      out_str << *specific_stat_vec_itr << " | ";


Won't it be weird to have a " | " at the end of the list? If you use absl::StrJoin you won't get a trailing one.

I wanted it to look like a table.. although the lines are not aligned so it doesn't really look like a table in reality..

jmarantz · 2018-04-23T16:00:12Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  // Building lookup name map for all specific cluster values.
+  // Every call to the updateRollingWindowMap function should get the appropriate name from the map.
+  std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, ".");
+  counter_name_lookup[cluster_name]["upstream_rq_5xx"] =


this seems indirect. Can you capture the counter object pointer in your map, rather than capturing a long string that you'll then have to look up in updateRollingWindowMap, which seems like it might be called frequently enough to matter.

mrice32

One comment about the data storage in the Hystrix sink. Once we get that all sorted out, I think we'll be pretty close to getting this merged! Thanks for all the work on this!

mrice32 · 2018-04-23T15:14:42Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+void HystrixStatCache::CreateCounterNameLookupForCluster(const std::string& cluster_name) {
+  // Building lookup name map for all specific cluster values.
+  // Every call to the updateRollingWindowMap function should get the appropriate name from the map.
+  std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, ".");


For context, I'm trying to get rid of as many map lookups as possible since static struct accesses are generally much more efficient.

Instead of using a nested map, why can't we combine counter_name_lookup and rolling_stats_map_ into a single map from cluster name to a struct containing all cluster specific information?

For example:

struct ClusterStats { // This constructor should initialize all the stat names by concatenating // the cluster name and the rest of the stat name that's required for // the lookup. ClusterStats(const std::string &cluster_name); const std::string retry_upstream_rq_5xx_name_; const std::string upstream_rq_5xx_name_; const std::string retry_upstream_rq_4xx__name_; . . . // Rolling windows RollingWindow errors_; RollingWindow success_; RollingWindow total_; . . . }; // Map from cluster names to a struct of all of that cluster's stat windows. std::unordered_map<std::string, ClusterStats> cluster_stats_map_;

This would allow us to remove all of this initialization code inside CreateCounterNameLookupForCluster and just allow you to do one map lookup for each cluster. Then your pushNewValue calls could look more like this (you would need to modify it to take in a RollingWindow& and the value):

ClusterStats& cluster_stats = cluster_stats_map_[cluster_name]; . . . uint64_t success = stats.counter(cluster_stats.upstream_rq_2xx_name_).value(); pushNewValue(cluster_stats.success_, success); . . .

Sorry for the wall of text. Let me know if the above suggestion doesn't make sense or you don't think it will work. I think this stat export is happening often enough that we should make sure it's not doing any unnecessary work.

mrice32 · 2018-04-23T16:49:06Z

source/extensions/stat_sinks/hystrix/hystrix.cc

+  uint64_t error_rate =
+      total == 0
+          ? 0
+          : (static_cast<double>(errors + timeouts + rejected) / static_cast<double>(total)) * 100;


nit: I think you can remove all the static casts by doing the multiply on the numerator before dividing:

(100 * (errors + timeouts + rejected)) / total;

mrice32

As @jmarantz mentioned, please squash all commits and open a new PR because of the github scalability issue. I can move my recent comments over to the new PR.

jmarantz · 2018-04-23T19:59:38Z

Actually, while we are talking about new PRs, I think it would be worthwhile to break out the admin.cc changes as a separate easy-to-review PR. It touches a bunch of lines but those issues are now pretty much settled, so might as well iterate on the map data structure issues @mrice32 and I have been discussing in a smaller PR.

trabetti · 2018-04-23T20:06:36Z

@jmarantz fine with me. will open a separate PR for the admin changes. There's nothing directly related to the hystrix support there.

jmarantz · 2018-04-23T20:18:00Z

RE hystrix.cc being skipped: I think this is really a github UI filter working as intended. You can get to hysterix.cc you just have a to click through a prompt in the 'files' diff that says "diff too large, click here to show". It's just a little counterintuitive because if you do a find in the page, you won't cover any of the contents of hystrix.cc.

…

On Mon, Apr 23, 2018 at 3:53 PM trabetti ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In source/extensions/stat_sinks/hystrix/hystrix.h <#2736 (comment)>: > + + /** + * register a new connection + */ + void registerConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_register); + /** + * remove registered connection + */ + void unregisterConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_remove); + HystrixStatCache& getStats() { return *stats_; } + +private: + HystrixStatCachePtr stats_; + std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{}; + Server::Instance& server_; + std::map<std::string, uint64_t> current_stat_values_; @jmarantz <https://github.com/jmarantz> why does hysrtix.cc being skipped? I also noticed that some files are not updated here. Is it because the github PR capacity issue? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2736 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AB2kPQTu8UdrBlN5Khew_ZnbPTquPSs2ks5trjEwgaJpZM4SeoHY> .

trabetti · 2018-04-23T21:20:18Z

Opened PR #3172 for admin changes only. Will need to add coverage tests.

trabetti · 2018-05-17T19:28:55Z

Closing this PR. Opened a new PR #3425.

support for hystrix event stream output

d2d6d16

Signed-off-by: talis <talis@il.ibm.com>

trabetti mentioned this pull request Mar 6, 2018

Documentation for hystrix dashboard support feature (issue #1244) envoyproxy/data-plane-api#523

Closed

trabetti added 4 commits March 6, 2018 14:38

fix formatting

5c865c5

Signed-off-by: talis <talis@il.ibm.com>

fix clang compilation error

035b38e

Signed-off-by: trabetti <talis@il.ibm.com>

Merge remote-tracking branch 'upstream/master' into hystrix-dashboard…

8acb044

…-support Signed-off-by: trabetti <talis@il.ibm.com>

fix formatting

b9a4301

Signed-off-by: trabetti <talis@il.ibm.com>

htuch assigned mrice32 Mar 6, 2018

mattklein123 reviewed Mar 6, 2018

View reviewed changes

mrice32 reviewed Mar 7, 2018

View reviewed changes

jmarantz reviewed Mar 8, 2018

View reviewed changes

htuch assigned jmarantz Mar 8, 2018

taiki45 mentioned this pull request Mar 13, 2018

[envoy] new integration DataDog/integrations-core#1156

Merged

fix some PR comments, and change content of rolling window to match o…

cacf90e

…nly the needed values Signed-off-by: trabetti <talis@il.ibm.com>

eliranroffe added 2 commits March 19, 2018 09:40

Move HandlerInfoClasses from include folder to source folder

61659a3

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

Move Hystrix code from Admin to a decdicated file

a05aea1

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

jmarantz reviewed Mar 19, 2018

View reviewed changes

Change std::string to absl:string_view in hystrix files

9eac8cf

jmarantz reviewed Mar 20, 2018

View reviewed changes

eliranroffe and others added 5 commits March 25, 2018 10:04

Refactoring in Hystrix and HystrixTest classes and fixing comments style

5f6b6ee

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

Name lookup for counters in Hystrix class

0e92bd5

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

move hystrix support to base on a sink

6e9e807

Signed-off-by: trabetti <talis@il.ibm.com>

fixed typo in function name

847b241

Signed-off-by: trabetti <talis@il.ibm.com>

small fixes

f91126c

Signed-off-by: trabetti <talis@il.ibm.com>

mrice32 mentioned this pull request Mar 28, 2018

stats: add built-in log linear histogram support #1965

Closed

trabetti added 2 commits April 16, 2018 23:54

sync to upstream

4e68380

Signed-off-by: trabetti <talis@il.ibm.com>

Merge branch 'master' of https://github.com/envoyproxy/envoy into hys…

9b29b7c

…trix-dashboard-support-handler-upstream Signed-off-by: trabetti <talis@il.ibm.com>

mrice32 reviewed Apr 16, 2018

View reviewed changes

trabetti added 2 commits April 17, 2018 11:33

review comments

aac876d

Signed-off-by: trabetti <talis@il.ibm.com>

support multiple connections

9a3c0bf

Signed-off-by: trabetti <talis@il.ibm.com>

jmarantz reviewed Apr 17, 2018

View reviewed changes

mrice32 reviewed Apr 17, 2018

View reviewed changes

passing AdminFilter to callback, adding on_destroy_callbacks

bb876ee

Signed-off-by: trabetti <talis@il.ibm.com>

mrice32 reviewed Apr 18, 2018

View reviewed changes

mrice32 reviewed Apr 20, 2018

View reviewed changes

trabetti and others added 3 commits April 23, 2018 16:29

fix review comments

f5e21f2

Signed-off-by: trabetti <talis@il.ibm.com>

Set end_stream as an attribute of AdminFilter

98de40a

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com> Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

Move request_headers to admin_filter when calling runCallback function

09dd340

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

jmarantz reviewed Apr 23, 2018

View reviewed changes

mrice32 reviewed Apr 23, 2018

View reviewed changes

trabetti mentioned this pull request Apr 23, 2018

admin: improve infrastructure #3172

Merged

trabetti mentioned this pull request May 17, 2018

support for hystrix dashboard through an admin endpoint, using sink #3425

Merged

trabetti closed this May 17, 2018


		void Hystrix::resetRollingWindow() { rolling_stats_map_.clear(); }

		void Hystrix::addStringToStream(std::string key, std::string value, std::stringstream& info) {


		typedef std::unique_ptr<HystrixStatCache> HystrixStatCachePtr;

		namespace Hystrix {

support for hystrix event stream output #2736

support for hystrix event stream output #2736

Uh oh!

Conversation

trabetti commented Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mattklein123 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 left a comment

Choose a reason for hiding this comment

Uh oh!

mrice32 Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz Apr 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 Apr 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 Mar 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrice32 Mar 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

trabetti commented Mar 6, 2018 •

edited

Loading

mrice32 Mar 6, 2018 •

edited

Loading

jmarantz Apr 11, 2018 •

edited

Loading

mrice32 Apr 16, 2018 •

edited

Loading

mrice32 Mar 6, 2018 •

edited

Loading

mrice32 Mar 7, 2018 •

edited

Loading

mrice32 Mar 7, 2018 •

edited

Loading