Skip to content

support for hystrix event stream output #2736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 31 commits into from

Conversation

trabetti
Copy link
Contributor

@trabetti trabetti commented Mar 6, 2018

Signed-off-by: talis talis@il.ibm.com

admin filter: add support for hystrix dashboard output

Description:
This PR adds an admin endoint for hystrix dashboard output.
When there is a request to admin_port/hystrix_event_stream, envoy starts sending SSE stream, using the same socket of the initial request.
The admin URL can be accessed directly from hystrix dashboard.

Risk Level: Medium

Testing:
Added unit testing hystrix_test.cc
The admin endpoint was not tested, since it returns an infinite event stream, so current mechanisms for testing admin endpoint do not work. If there are any suggestions on how to test it, we will add.

Docs Changes:
changed docs/root/operations/admin.rst.
PR: envoyproxy/data-plane-api#523

Release Notes:
Support was added to hystrix event stream output. From hystrix dashboard, set the Eureka URL to envoy_ip:8001, and add stream from http://envoy_ip:8001/hystrix_event_stream
Note that not all hystrix dashboard statistics are supported. More detail can be found here: <link to the documentation generated from docs/root/operations/admin.rst >

Fixes #1244

Signed-off-by: talis <talis@il.ibm.com>
trabetti added 4 commits March 6, 2018 14:38
Signed-off-by: talis <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
…-support

Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool stuff. Few drive by comments from a quick skim (not a full review).

@@ -575,13 +574,66 @@ std::string AdminImpl::runtimeAsJson(
return strbuf.GetString();
}

Http::Code AdminImpl::handlerHystrixEventStream(const std::string&,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick drive by comment. This file is getting huge and will continue to get huger. Can you put all the new Hystrix admin stuff into a dedicated file, potentially in a new admin directory? We will need to start decomping the rest of this file also as we add more stuff to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved hystrix stuff from admin.cc - commit a05aea1

@@ -69,12 +70,16 @@ void StreamEncoderImpl::encodeHeaders(const HeaderMap& headers, bool end_stream)
saw_content_length = true;
}

if (headers.NoChunks()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say a little more about why this is needed? This is a bit odd. I assume it's because you don't want chunk transfer encoding but you don't have a content-length? Is there any way to handle this without this kind of hack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your assumption is right. I saw that the ":" wes created exactly to allow such kind of hook and it is being used in some other places. Alternatives could be (1) changing the interface of encodeHeaders(), which is much more intrusive since it is all over, or (2) building the response without using encodeHeaders(). What do you recommend?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about it, but it's probably the least bad option for doing this. If we go this route can you please add liberal comments on why this is here, that it is a non-standard header, etc.? @alyssawilk WDYT?

Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool! Thanks! Did a quick first pass. Looks good for the most part.

class HystrixHandlerInfo : public HandlerInfo {
public:
HystrixHandlerInfo(Http::StreamDecoderFilterCallbacks* callbacks)
: stats_(new Stats::Hystrix()), data_timer_(nullptr), ping_timer_(nullptr),
Copy link
Member

@mrice32 mrice32 Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer std::make_unique<Stats::Hystrix>() here and throughout (this may be common elsewhere in Envoy, but we recently moved to C++14).

* This class contains data which will be sent from admin filter to a hystrix_event_stream handler
* and build a class which contains the relevant data.
*/
class HystrixHandlerInfo : public HandlerInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this class under include/? Generally, in Envoy, only pure interfaces with no implementation go under include/, and all implementations of those interfaces (even if there's just one) go under source/.

* in admin impl. Each handler which needs to receive data from admin filter can inherit from
* HandlerInfo and build a class which contains the relevant data.
*/
class HandlerInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be separated into a pure interface here and a default implementation under source/? Generally within Envoy, classes declared under include/ are pure interfaces with all PURE (= 0) methods. Even the default implementations are generally separated from the interface and placed under source/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed by commit 61659a3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As indicated in another comment thread, we can eliminate HandlerInfo completely and instead pass the AdminFilter* into each handler, which will clean up the dynamic_cast and remove a special-case from where the HandlerInfo is constructed.

Copy link
Contributor

@jmarantz jmarantz Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like as a sink you still need to get the Http::StreamDecoderFilterCallbacks* into your system, but can we get this by plumbing through the AdminFilter& rather than having this empty-base-class/dynamic-cast strategy? Or @mattklein123 do you have a better idea?

Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - Either the Http::StreamDecoderFilterCallbacks* should be passed through or you could add some new public methods to AdminFilter to do what you want and pass it through.

code = parent_.runCallback(path, *header_map, response, *handler_info_);
} else {
HandlerInfoPtr temp_handler(new HystrixHandlerInfo(callbacks_));
handler_info_ = std::move(temp_handler);
Copy link
Member

@mrice32 mrice32 Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why not just handler_info_ = std::make_unique<HystrixHandlerInfo>(callbacks_);?


// using write() since we are sending network level
// TODO(trabetti): is there an alternative to the const_cast?
(const_cast<Network::Connection*>((hystrix_handler_info->callbacks_)->connection()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate in your comment on why can't you use callbacks_->encodeData() here instead of getting at the raw connections? I'm having a little trouble understanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Probably can use encodeData()..

auto callback = [&](const std::string&, Http::HeaderMap&, Buffer::Instance&) -> Http::Code {
return Http::Code::Accepted;
};
auto callback = [&](const std::string&, Http::HeaderMap&, Buffer::Instance&,
Copy link
Member

@mrice32 mrice32 Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove & from the capture since you don't need to capture anything from the environment - ditto in other test cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not related to this PR. Want us to fix those here anyway?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't notice that you were only modifying part of that line. It'd be awesome if you could clean that up, though!

const uint64_t Hystrix::PING_INTERVAL_IN_MS;

// add new value to rolling window, in place of oldest one
void Hystrix::pushNewValue(std::string key, int value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: generally prefer types with defined sizes - like int64_t. Also, it seems like you use int64_t as the return value for getRollingValue() which is a diff of these inputs. A similar inconsistency exists for num_of_buckets_/current_index_ and DEFAULT_NUM_OF_BUCKETS.


// combining timeouts+retries - retries are counted as separate requests
// (alternative: each request including the retries counted as 1)
double timeouts = getRollingValue(cluster_name, "upstream_rq_timeout") +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all these being casted to doubles and then casted back to integers?

handler_info_ = std::move(temp_handler);
code = parent_.runCallback(path, *header_map, response, *handler_info_);
} else {
HandlerInfoPtr temp_handler(new HystrixHandlerInfo(callbacks_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only half-baked, but maybe it makes sense to make all the handlers classes instead of just functions and pass them (or provide callbacks to get) certain state, like callbacks_, on each request. Then, they can manage the creation of any per-request state that may need to be created internally. That would eliminate the need to modify this function every time a handler wants to use HandlerInfo, and it would eliminate the dynamic_cast.
What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we also thought about something like that. How about opening a separate issue for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that seems reasonable to me. Could you also toss a TODO in there referencing the issue?

typedef std::vector<uint64_t> RollingStats;
typedef std::map<std::string, RollingStats> RollingStatsMap;

// Consider implement the HystrixStats as a sink to have access to histograms data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the idea of moving the bulk of the logic to a stats sink.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We thought of doing this later as an enhancement..

Copy link
Member

@mrice32 mrice32 Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I mentioned this was that I was thinking more about using stats sinks for this. I'd be interested to hear @mattklein123's and @htuch's thoughts. IIUC, the general architecture is a subscription mechanism where a client reaches out to subscribe and then Envoy streams specific stats data indefinitely over that connection. Is this correct?

That would mean that this is a combination of the existing pull-based integrations, like prometheus, that just implement a simple request-query-response handler on the admin port and push-based integrations, like the GRPC stats streamer and statsd, that implement stats sinks to push stats to subscribers. A lot of the custom functionality that you've added, like querying stats and periodic flushes, are already handled by the existing sink interface.

I'm wondering if there's some way that we can keep an admin port subscription model, but pass off the stream to a stats sink where the periodic pushing can be done. This would likely allow the pushing to be more performant since it won't require a new periodic query to be set up for each subscriber (assuming the multi-subscriber case would ever happen in a hystrix deployment). It would also mean that you wouldn't have to reinvent the wheel when it comes to building and configuring a push system within Envoy or querying stats data. And, as you mentioned, it will have built-in support for histograms. I'm curious to hear what others think, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, agreed. We should do this using a stat sink. I haven't through it through fully, but I don't see any reason why we can't figure out a way in which the active admin stream registers with the sink, and then when the flush calls happen it would just work? @mrice32 can you help guide this PR in that direction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stats sink means the stream will be sent at same intervals as other stats which are active. Isn't it a limitation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123, sounds good.

@trabetti, yes, it is a bit of a limitation. However, I think most deployments have only been using one sink so far, so the aligned intervals have not been an issue. If this is something you foresee being an issue, we should open an issue and design a way for stats sinks to optionally specify a time interval that is different from the global one. It shouldn't be too hard to make this more flexible.

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on changing this to a stats sink. I felt like most of the superficial testing and string-handling comments below might apply to this as a sink as well.

}
}

uint64_t Hystrix::getRollingValue(std::string cluster_name, std::string stats) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::string& cluster_name, const std::string& stats.

or maybe:

absl::string_view cluster_name, absl::string_view stats

}

uint64_t Hystrix::getRollingValue(std::string cluster_name, std::string stats) {
std::string key = "cluster." + cluster_name + "." + stats;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absl::StrCat (which knows how to reserve the right size for the output buffer for all inputs, before it starts appending bytes).

// better return 0, will be back to normal once one rolling window passes
// better idea what to return? could change the algorithm to keep last valid delta,
// updated with pushNewValue and kept stable when the update is negative.
if (rolling_stats_map_[key][current_index_] <
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto& rolling_stats_map_entry = rolling_stats_map_[key] and CSE everywhere that appears.

void Hystrix::resetRollingWindow() { rolling_stats_map_.clear(); }

void Hystrix::addStringToStream(std::string key, std::string value, std::stringstream& info) {
addInfoToStream(key, "\"" + value + "\"", info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absl::StrCat.


void Hystrix::resetRollingWindow() { rolling_stats_map_.clear(); }

void Hystrix::addStringToStream(std::string key, std::string value, std::stringstream& info) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const refs or string_view

namespace Envoy {
namespace Stats {

void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use string_view here. Also declare non-changing temps as const. Also all your temp strings below should be absl::string_view.

namespace Stats {

void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) {
std::string actual = dataMessage.substr(dataMessage.find(key));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capture dataMessage.find(key) in a temp key_pos and EXPECT_NE(absl::string_view::npos, key_pos);

actual = actual.substr(actual.find(" ") + 1);
std::size_t length = actual.find(",");
actual = actual.substr(0, length);
EXPECT_EQ(actual, std::to_string(expected));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a slightly nicer pattern is to return the actual, and do the EXPECT_EQ at the call-site, which gives you useful line-numbers if it fails.


void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) {
std::string actual = dataMessage.substr(dataMessage.find(key));
actual = actual.substr(actual.find(" ") + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all your .find returns need to be checked. Also, please give a comment showing what the format is you are parsing.

checkStreamField(dataMessage, "rollingCountSuccess", 63);
checkStreamField(dataMessage, "rollingCountTimeout", 27);
checkStreamField(dataMessage, "propertyValue_queueSizeRejectionThreshold", expectedQueueSize);
checkStreamField(dataMessage, "reportingHosts", expectedReportingHosts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you could have your helper do a more complete parsing of dataMessage, returning a map<string, uint64_t> which you could then just EXPECT_EQ against a map that's initialized to the above gold values.

The existing checkStreamField looks like it might ignore stray characters in the string, and would definitely ignore extra field/value pairs. It also won't work if one of the keys is a substring of another.

@trabetti
Copy link
Contributor Author

trabetti commented Mar 8, 2018

Thanks for so many comments! Will probably take until middle of next week to start pushing the fixes.

@jmarantz
Copy link
Contributor

jmarantz commented Mar 8, 2018

@trabetti recommend focusing on big picture first (hystrix as stats sink). I have a growing concern with dependency-creep in Envoy as lots of deployments differ greatly and we'd like it to be easy to configure the parts we really need without generating a sea of ifdefs.

…nly the needed values

Signed-off-by: trabetti <talis@il.ibm.com>
@trabetti
Copy link
Contributor Author

Just updating, we got distracted by some urgent task this week, will start pushing fixes next week..

@mattklein123
Copy link
Member

Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you!

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
response_headers.insertNoChunks().value().setReference("0");

Stats::HystrixHandlerInfoImpl& hystrix_handler_info =
dynamic_cast<Stats::HystrixHandlerInfoImpl&>(handler_info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO dynamic_cast should rarely be needed, and when used the result certainly need to be checked for nullptr (so it should be pointers not refs).

This all might be moot anyway if you move to using sinks maybe this can all be reverted.

But if we decide not to go with sinks in this PR, and you need to add this as a new stats callback, then my suggestion is to allocate Stats::HystrixHandlerInfoImpl as a temp inside a simple handler. It does look like you currently need to pass through callbacks_ to construct the HystrixHandlerInfoImpl, but since the handlers are non-static methods. you have that available to you in the callback. You don't need to change MAKE_ADMIN_HANDLER, the dispatching mechanism, or all the signatures of the existing handlers.

That would clean this up quite a bit from a code and PR-size perspective.

But really the indicates that this admin handler is really a lot different from all the other ones, and points strongly toward having an architectural discussion. In particular, this admin handler actually starts a timer and does stuff after the request terminates, which is very different semantics from all the other admin handlers, which either do a simple mutation or return some data. Probably this is another signpost pointing us toward sinks as a better solution. WDYT?

Copy link
Contributor Author

@trabetti trabetti Mar 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the request never terminates. We must use the same socket that was used in the original request for streaming. If the request was terminated, the socket would have been closed. I think this was the main issue with implementing as a sink. Will need some help to resolve this.

Copy link
Member

@mrice32 mrice32 Mar 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarantz Totally agree with the sentiment. However, IIUC, the handlers don't have the private variables of the filter (or the filter, itself) available to them since they are methods on the AdminImpl object, but I may be missing something.

Above, we discussed above adding a handler class (rather than just a method) to combine the callback and data and formalize the idea of combining handlers with state. This would allow handlers to allocate and manage their own state, be tied to the filter object, and not require any dynamic_casting. We decided to add this as a TODO rather than do it in this PR, but maybe we should get that in as a part of this PR instead? What do you think?

Side note: as long as a bad_cast exception is the desired result of a failed dynamic_cast, I think dynamic_casting a ref should be fine since it automatically throws if the cast fails.

Copy link
Member

@mrice32 mrice32 Mar 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trabetti I think this will still work exactly the same way as it does now in a sink. I think the request object would need to be registered with the sink, and then the sink could start encoding data through it without closing it, just as the timers do now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it should be a separate PR, but maybe it should go first, and have the current one on top of it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrice32 Good point; how about supplying the AdminFilter* to the callbacks, rather than an empty HandlerInfo* base class that needs to be cast whenever it is used. That would require passing 'this' to parent_.runCallback() in AdminFilter::onComplete, and then passing that through to the handlers, so they'd have context of the filter instantiation that invoked them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trabetti IIUC, you're saying you want to break these into two PRs - first the sink that the handler would register with, and then add the handler, itself, in a separate PR? I think that's a great idea.

@jmarantz yup, seems reasonable to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrice32 I meant for first PR for adding a handler class instead of a method. Then, modify this PR to use the class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Yup, that would be fine too. I'm always in favor of dividing up PRs where possible :)

* in admin impl. Each handler which needs to receive data from admin filter can inherit from
* HandlerInfo and build a class which contains the relevant data.
*/
class HandlerInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As indicated in another comment thread, we can eliminate HandlerInfo completely and instead pass the AdminFilter* into each handler, which will clean up the dynamic_cast and remove a special-case from where the HandlerInfo is constructed.

void Hystrix::pushNewValue(std::string key, uint64_t value) {
// create vector if do not exist
if (rolling_stats_map_.find(key) == rolling_stats_map_.end()) {
rolling_stats_map_[key].resize(num_of_buckets_, value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rolling_stats_map_[key].resize(num_o_buckets_, value) is all you need; no need for the if-statement or .find() check.

const uint64_t Hystrix::PING_INTERVAL_IN_MS;

// add new value to rolling window, in place of oldest one
void Hystrix::pushNewValue(std::string key, uint64_t value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::string& here and in all formals.


void Hystrix::updateRollingWindowMap(Stats::Store& stats, absl::string_view cluster_name) {
std::string prefix;
prefix = absl::StrCat("cluster.",cluster_name,".");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally the formatter requires that you add spaces between args. You need to run the formatter under docker with every commit.

// (alternative: each request including the retries counted as 1)
std::string upstream_rq_timeout_key = absl::StrCat(prefix,"upstream_rq_timeout");
std::string upstream_rq_per_try_timeout_key = absl::StrCat(prefix,"upstream_rq_per_try_timeout");
uint64_t timeouts = stats.counter(upstream_rq_timeout_key).value() +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to happen a bunch, do you want to cache the stats.counter return references and store them in the class, rather than doing the name-lookup each time?

uint64_t errors = stats.counter(upstream_rq_5xx_key).value() +
stats.counter(retry_upstream_rq_5xx_key).value() +
stats.counter(upstream_rq_4xx_key).value() +
stats.counter(retry_upstream_rq_4xx_key).value() -
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why '-' instead of '+' here? Just add comment if this was intentional.

addIntToStream("currentConcurrentExecutionCount", 0, cluster_info);
addIntToStream("latencyExecute_mean", 0, cluster_info);

// latency information can be taken rom hystogram, which is only available to sinks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "from histogram"? Start comments with a capital letter, end with a period.

std::stringstream out_str;

for (std::map<std::string, RollingStats>::const_iterator it = rolling_stats_map_.begin();
it != rolling_stats_map_.end(); ++it) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (auto it : rolling_stats_map_) {

addHystrixThreadPool(ss, cluster_name, max_concurrent_requests, reporting_hosts);
}

absl::string_view Hystrix::printRollingWindow() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method should be const. Probably others should be as well.

* Get value of the sampling buckets
*/
static uint64_t GetRollingWindowIntervalInMs() {
return static_cast<const uint64_t>(ROLLING_WINDOW_IN_MS / DEFAULT_NUM_OF_BUCKETS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/const//

handler_info_ = std::make_unique<Stats::HystrixHandlerInfoImpl>(callbacks_);
code = parent_.runCallback(path, *header_map, response, *handler_info_);
end_stream = false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed elsewhere I am hoping to replace this whole new block with:
Http::Code code = parent_.runCallback(path, *header_map, this, response);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarantz Thanks for the comments. Will do another fix soon.

@jmarantz @mrice32 Question for the sink implementation.
As far as I understand, currently sinks are only added to the sinks vector (MainImpl::stats_sinks_) during init, according to what's specified in the configuration file "stats_sinks" parameter.
I thought of two ways to implement:
(1) Add a "hystrix sink" through the config file. Once a request is made from the hystrix dashboard, set the connection parameters on that sink and switch on an enable flag in it.
(2) Once a request is made, create a new sink with the connection parameters and add it to the vector of sinks. Remove it from the vector when the connection is destroyed.
Do you see any problem or disadvantage with any of these options?

Copy link
Member

@mrice32 mrice32 Mar 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trabetti, 1 is almost definitely what we want. This allows you to do all of the formatting once rather than having to do it for each connection. Also, Envoy is not really built to scale in the number of stats sinks because of the way it passes stats to them. And lastly, dynamic registration of sinks without involving the config makes them more difficult to reason about from the user's perspective.

I'm not exactly sure what you mean by setting connection parameters or enabling the flag, but by my understanding, you should be able to just register each stream with the single hystrix sink that is created through the config. We'll also need to detect connection termination and make sure the sink is aware of this so it doesn't try to access a deleted stream. This registration mechanism will take some thought, by the way, because there is no existing communication mechanism between stats sinks and the admin interface.

As you're working through this design change, feel free to reach out with any questions here or on slack. We're happy to provide input wherever needed :). Take your time making incremental updates here, and whenever you're ready for us to take another pass, just let us know.

Copy link
Contributor Author

@trabetti trabetti Mar 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mrice32 , the first version of the hystrix support as a sink is almost ready, I hope I'll get to push it before Passover holiday..
Questions:

  1. As we discussed, the hystrix sink will be set through the config file, which means I touched data_plane_api. How do I push it here so it compiles (in my development env I point to my fork of data_plane_api)
  2. I need some help figuring out how to use the histogram data.
    Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. You should send out a PR for your data-plane-api changes, which will need to be merged before this PR is merged. Until that PR gets merged, you should do exactly what you're doing now - point the tests to your own fork to allow the CI to pass in the meantime.

  2. What, specifically, about the histogram data is confusing? The model for histograms is different for the other stats. Each sample is sent to the sinks when published. This is done synchronously by calling this method from whatever thread recorded the value. So you need to be sure that that method is thread-safe. The statsd implementation uses TLS to ensure thread-safety (see here), but since you are probably interested in forming some sort of central cache from which to pull at flush time, your implementation will likely be more complicated than that. It's also worth noting that there's work underway on a design to create an Envoy representation of the distributions and central cache to store them as opposed to just handing the samples to the sinks as is done now.

eliranroffe and others added 5 commits March 25, 2018 10:04
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
…trix-dashboard-support-handler-upstream

Signed-off-by: trabetti <talis@il.ibm.com>
Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking awesome! I have a few design suggestions. I think we should aim for all hystrix-specific code to be under source/extensions. We can extend existing APIs to make the hystrix use case possible, but we should remove any special cases or methods in the core Envoy codebase. I've laid out a few suggestions (this review and the last one) to get us there.

I haven't gone through the sink, itself, yet, but I'll take a look at that next.

namespace Envoy {
namespace Extensions {
namespace StatSinks {
namespace HystrixNameSpace {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer Hystrix over HystrixNameSpace.

* in admin impl. Each handler which needs to receive data from admin filter can inherit from
* HandlerInfo and build a class which contains the relevant data.
*/
class HandlerInfo {
Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - Either the Http::StreamDecoderFilterCallbacks* should be passed through or you could add some new public methods to AdminFilter to do what you want and pass it through.

handler_info_ =
std::make_unique<Extensions::StatSinks::Common::HystrixHandlerInfoImpl>(callbacks_);
code = parent_.runCallback(path, *request_headers_, *header_map, response, *handler_info_);
end_stream = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to remove handler_info_, could we remove the rest of this special case? Could we just allow the callback to set the end_stream (it could still be defaulted to true so we don't need to change the other handlers)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about, instead of involving the handlers and adding more arguments to their interface, add a bool streaming_ to struct UrlHandler and let AdminImpl::runCallback handle the end_stream value?

@@ -668,7 +714,7 @@ void AdminFilter::onComplete() {

// Under no circumstance should browsers sniff content-type.
header_map->addReference(headers.XContentTypeOptions, headers.XContentTypeOptionValues.Nosniff);
callbacks_->encodeHeaders(std::move(header_map), response.length() == 0);
callbacks_->encodeHeaders(std::move(header_map), end_stream && response.length() == 0);

if (response.length() > 0) {
callbacks_->encodeData(response, true);
Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want end_stream to be general purpose as proposed in the comment above, can we change this call to:

callbacks_->encodeData(response, end_stream);

Essentially this would mean that if hystrix or some other callback wanted to send data and not end the stream, that would be possible.

Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of comments - the sink looks good so far, thanks for addressing my comments from last time! I'll continue digging into the core sink implementation tomorrow, but this should be good enough to get started.

@@ -0,0 +1,157 @@
#include <map>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the stats_sinks/common directory is only used for components that are shared by multiple stats sinks. I think it would be fine to put everything under stats_sinks/hystrix (similar to stats_sinks/metrics_service).

typedef std::vector<uint64_t> RollingStats;
typedef std::map<const std::string, RollingStats> RollingStatsMap;

class Hystrix : public Logger::Loggable<Logger::Id::hystrix> {
Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we call this something a little more specific to what function it performs, like HystrixStatCache for instance?


RollingStatsMap rolling_stats_map_;
uint64_t current_index_;
uint64_t num_of_buckets_;
Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const?

Hystrix() : current_index_(DEFAULT_NUM_OF_BUCKETS), num_of_buckets_(DEFAULT_NUM_OF_BUCKETS + 1){};

Hystrix(uint64_t num_of_buckets)
: current_index_(num_of_buckets), num_of_buckets_(num_of_buckets + 1){};
Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the variable naming here is a little confusing since you're setting num_of_buckets_ equal to num_of_buckets + 1. Can we change the name of the instance variable (num_of_buckets_) to something that more clearly describes what it is - like bucket_mod_, for instance?

void flushGauge(const Stats::Gauge&, uint64_t){};
void endFlush();
void onHistogramComplete(const Stats::Histogram& histogram, uint64_t value) {
std::cout << "histogram complete: " << histogram.name() << ", value: " << std::to_string(value)
Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to leave this print statement here?

}
}

uint64_t Hystrix::getRollingValue(absl::string_view cluster_name, absl::string_view stats) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: stats -> stat since you're looking up a single statistic.

}

// void HystrixSink::onHistogramComplete(const Histogram& histogram, uint64_t value);
void HystrixSink::registerConnection(Http::StreamDecoderFilterCallbacks* callbacks) {
Copy link
Member

@mrice32 mrice32 Apr 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the previous review, I'm hoping we can redesign this so that the sink registers with the Admin as opposed to the other way around.

Overall, though, it seems as if this is currently not designed for multiple streams. I would imagine that multiple subscribers is a valid use case since there's no restriction on the number of subscribers that can register with the hystrix admin handler. Is the intent just to drop the last connection each time a new one comes in? If not, we should probably have a vector of connection/callbacks/streams (or whatever object we end up using to interact with the stream) and send the data to all of them.

callbacks_ = callbacks;
}

// TODO (@trabetti) is this correct way?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to close the stream, you will need to make an encodeData (or some other encode* method) call on the callbacks with the end_stream argument set to true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we did not provide a way to terminate the stream from Envoy's side (should we?), this is called when the dashboard client was closed, so we are not sending any more data.

}
out_str << std::endl;
}
return out_str.str();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarantz can check me on this, but this seems like a dangerous/incorrect use of absl::string_view. According to the source, the std::string must outlive the absl::string_view.

@@ -233,7 +258,8 @@ class AdminFilter : public Http::StreamDecoderFilter, Logger::Loggable<Logger::I
AdminFilter(AdminImpl& parent);

// Http::StreamFilterBase
void onDestroy() override {}
// TODO (@trabetti) : make more generic (how?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about creating a new public method on AdminFilter called addOnDestroyCallback()? This method could take a callback and append to a list of callbacks to execute when the filter is destroyed. This will only work if we go with what @jmarantz suggested - an architecture where the AdminFilter is passed to the handlers directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the portion that's confusing? Basically, you'd add a public method to AdminFilter that would have the following signature:

void addOnDestroyCallback(std::function<void()> cb);

That method would append to a list of callbacks that are stored inside the AdminFilter. When onDestroy() is called, the AdminFilter can call all of the callbacks in that list. Ultimately, once all this is done, I see your handler looking something like this:

Http::Code HystrixSink::handlerHystrixEventStream(... Server::AdminFilter& admin_filter, ....) {
  .
  .
  .
  Http::StreamDecoderFilterCallbacks *stream_decoder_filter_callbacks = admin_filter.getDecoderFilterCallbacks();

  // Separated out just so it's easier to understand
  auto on_destroy_callback = [this, stream_decoder_filter_callbacks] () {
    // Unregister the callbacks from the sink so data is no longer encoded through them.
    unregisterConnection(stream_decoder_filter_callbacks);
  }
  
  // Add the callback to the admin_filter list of callbacks
  admin_filter. addOnDestroyCallback(std::move(on_destroy_callback));
  .
  .
  .
}

To do this, you'll need to have the AdminFilter be passed to the handlers, and you'll need to add and implement the addOnDestroyCallback method.

If you do the above, we can get rid of the AdminImpl::unregisterHystrixConnection() and Server::InstanceImpl:: unregisterHystrixSink. This is the last piece of the puzzle to allowing the sink registering itself and requiring no special methods to pass the registration through the server.

Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
@trabetti
Copy link
Contributor Author

trabetti commented Apr 17, 2018

Pushed a new version. The handler is added in the sink constructor. (plus other fixes).
If this direction looks good, I will continue to passing the AdminFilter to handlers, and the related changes discussed here.
I am not so clear about the direction of registering the sink on the Admin.

With the implementation in this commit, only one sink is possible, since the sink constructor adds the handler. (not sure what happens if someone tries to add more than one sink). I think it makes sense, but it is different than other sinks.
I did add support for multiple dashboard clients on one sink.

@@ -23,8 +24,8 @@ namespace Server {
*/
#define MAKE_ADMIN_HANDLER(X) \
[this](absl::string_view path_and_query, Http::HeaderMap& response_headers, \
Buffer::Instance& data) -> Http::Code { \
return X(path_and_query, response_headers, data); \
Buffer::Instance& data, Http::StreamDecoderFilterCallbacks* callbacks) -> Http::Code { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned earlier, IMO all these args could be saved in the AdminFilter object -- and the callbacks already are. That would simplify this interface once and for all, rather than adding a very specialized complicated object to all of them.

Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making all those changes! I added a longer explanation to my suggestion below about getting the unregister portion to work. Let us know if you have any questions!

@@ -233,7 +258,8 @@ class AdminFilter : public Http::StreamDecoderFilter, Logger::Loggable<Logger::I
AdminFilter(AdminImpl& parent);

// Http::StreamFilterBase
void onDestroy() override {}
// TODO (@trabetti) : make more generic (how?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the portion that's confusing? Basically, you'd add a public method to AdminFilter that would have the following signature:

void addOnDestroyCallback(std::function<void()> cb);

That method would append to a list of callbacks that are stored inside the AdminFilter. When onDestroy() is called, the AdminFilter can call all of the callbacks in that list. Ultimately, once all this is done, I see your handler looking something like this:

Http::Code HystrixSink::handlerHystrixEventStream(... Server::AdminFilter& admin_filter, ....) {
  .
  .
  .
  Http::StreamDecoderFilterCallbacks *stream_decoder_filter_callbacks = admin_filter.getDecoderFilterCallbacks();

  // Separated out just so it's easier to understand
  auto on_destroy_callback = [this, stream_decoder_filter_callbacks] () {
    // Unregister the callbacks from the sink so data is no longer encoded through them.
    unregisterConnection(stream_decoder_filter_callbacks);
  }
  
  // Add the callback to the admin_filter list of callbacks
  admin_filter. addOnDestroyCallback(std::move(on_destroy_callback));
  .
  .
  .
}

To do this, you'll need to have the AdminFilter be passed to the handlers, and you'll need to add and implement the addOnDestroyCallback method.

If you do the above, we can get rid of the AdminImpl::unregisterHystrixConnection() and Server::InstanceImpl:: unregisterHystrixSink. This is the last piece of the puzzle to allowing the sink registering itself and requiring no special methods to pass the registration through the server.

Signed-off-by: trabetti <talis@il.ibm.com>
@trabetti
Copy link
Contributor Author

trabetti commented Apr 18, 2018

Pushed a commit with AdminFilter passed to handler instead of the callbacks, and followed @mrice32 explanation about the destroy callback.
As @jmarantz mentioned, now we can remove the request_headers parameter, and maybe even others. But at first step I did not change this, as this will result in many changes across various handlers, and also requires some work to do on testing (there is no mock for the AdminFilter).
Please give me feedback on this version, and if it looks as the right direction I will proceed.

Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. This is definitely the right direction, thanks! I did a quick review and left a few nits and questions, but I think we're getting close. I'll do a full pass over the code tomorrow. Side note: I suspect that you'll need to add a few tests for the codec (no chunk) change, new server accessor method (should be super simple to get coverage there), and admin filter changes.

@@ -184,6 +184,11 @@ message StatsdSink {
string prefix = 3;
}

// can it be empty?
Copy link
Member

@mrice32 mrice32 Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, yes, but then it might make sense to not provide a config at in the API all by using this implementation in your sink:

ProtobufTypes::MessagePtr HystrixSinkFactory::createEmptyConfigProto() {
    return ProtobufTypes::MessagePtr{new Envoy::ProtobufWkt::Empty()};
  }

However, the API portion of the repo is generally used for docs as well (the comments are scraped, formatted, and published to a docs site), so this config might be a good place to provide general information about the Hystrix stats sink.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of having the number of buckets as a parameter to the constructor. And theoretically it can be changed even during action. But not sure if this level of control is really necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding it as a config option would be better than making it a constructor argument since the constructor is just called from the factory, and there is no factory configuration, so it would have to be hardcoded there. If this is truly something that a user might want to tweak, it would be good to have it as a config parameter :).

void endFlush() override;
void onHistogramComplete(const Stats::Histogram&, uint64_t) override{};

// TODO (@trabetti) : support multiple connections
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: hasn't this been done? If so, can we remove the TODO comment?

@@ -150,6 +150,16 @@ Http::FilterTrailersStatus AdminFilter::decodeTrailers(Http::HeaderMap&) {
return Http::FilterTrailersStatus::StopIteration;
}

void AdminFilter::onDestroy() {
for (auto callback : on_destroy_callbacks_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const auto& callback

}

void AdminFilter::addOnDestroyCallback(std::function<void()> cb) {
on_destroy_callbacks_.push_back(cb);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: on_destroy_callbacks_.push_back(std::move(cb));

Http::Code code = parent_.runCallback(path, *request_headers_, *header_map, response, *this);
bool end_stream = true;

if (path.find("/hystrix_event_stream") != std::string::npos) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may already be planning to do this, but can we get rid of the special case by allowing the handler to be able to set the end_stream variable (we can still default it to true)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about, instead of involving the handlers and adding more arguments to their interface, add a bool streaming_ to struct UrlHandler and let AdminImpl::runCallback handle the end_stream value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any particular issues with that. Although, if we end up putting most of the handler arguments into the AdminFilter itself, we could add end_stream there, too. I think that would be be the easiest way.

void HystrixSink::unregisterConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_remove) {
for (auto it = callbacks_list_.begin(); it != callbacks_list_.end();) {
if ((*it)->streamId() == callbacks_to_remove->streamId()) {
it = callbacks_list_.erase(it);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not break after erasing? Is it possible to have multiple callbacks with the same stream id here? If it is, are you sure you want to delete both? Maybe it makes sense to de-dup in the registerConnection() method so this can't happen? What do you think?

Copy link
Contributor Author

@trabetti trabetti Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the stream id should be unique, can break.

private:
HystrixPtr stats_;
std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{};
Server::Instance* server_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not Server::Instance& server_; since it never changes what it's pointing to?

std::map<std::string, std::map<std::string, std::string>> counter_name_lookup;
};

typedef std::unique_ptr<HystrixStatCache> HystrixPtr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: HystrixStatCachePtr


uint64_t HystrixStatCache::getRollingValue(absl::string_view cluster_name, absl::string_view stat) {
std::string key;
key = absl::StrCat("cluster.", cluster_name, ".", stat);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not single line std::string key = absl::StrCat("cluster.", cluster_name, ".", stat);?

// Add new value to rolling window, in place of oldest one.
void HystrixStatCache::pushNewValue(const std::string& key, uint64_t value) {
// Create vector if do not exist.
// TODO trabetti: why resize + value param didn't work without the if?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this TODO about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we wanted to do a changed suggested by @jmarantz , it did not work for some reason, did not get to check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the trigger for me here was that in both branches, you are doing two map lookups, and I think you should only need one. Maybe mention that in the TODO.

HystrixStatCache()
: current_index_(DEFAULT_NUM_OF_BUCKETS), window_size_(DEFAULT_NUM_OF_BUCKETS + 1){};

HystrixStatCache(uint64_t num_of_buckets)
Copy link
Member

@mrice32 mrice32 Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't think I gave the right feedback the first time around. I think window_size_ should be equal to num_of_buckets since you want to mod by the total number of buckets. I think current_index_ should be set to num_of_buckets - 1, since that's the highest index. Right now, there are technically num_of_buckets + 1 total buckets, which is a bit confusing. Also, there should be some input validation when you read in the config to ensure that num_of_buckets is at least 2 (I assume this wouldn't make much sense with only one bucket...?). You could also toss in an assert in the constructor to validate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is confusing.. The statsCache is a rolling window. If for example, we have flush interval of 1 ms, and the number_of_buckets is 5, it means that we send every 1 ms the statistics for the last 5 ms. The values stored in it are the counters values, which are constantly increasing. So the result we want to return every endFlush() is the value of current counter, minus the counter before the rolling window started. That's why we keep num_of_buckets+1 values in the rolling window. Does it make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense. Thanks for the explanation. On another note, are 1 or 0 acceptable inputs to this function? If not, could we add some input validation (either where you read from the config or here)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume '0' means unset value, in this case it will be set to default value (10) . '1' should be legitimate value, but it is good to make sure it works. I already did that but config.cc I see here is not the updated one (the github scalability issue?)

if (callbacks_list_.empty()) {
return;
}
if (counter.name().find("upstream_rq_") != std::string::npos) {
Copy link
Member

@mrice32 mrice32 Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than string compare each of these, I think you should just grab the stats directly from each cluster in the endFlush() method. See the stats() method on the cluster object:
https://github.com/envoyproxy/envoy/blob/master/include/envoy/upstream/upstream.h#L494

These are all the stats that are available through that struct:
https://github.com/envoyproxy/envoy/blob/master/include/envoy/upstream/upstream.h#L292

I know the macros are a bit confusing, but it's just creating a struct where you can access the Counter&, Gauge&, etc objects. I think this should cover most (or all?) of the stats you need to capture here and the .membership_total one you look up in endFlush(). IIUC this should also allow you to completely get rid of counter_name_lookup_.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing struct do not cover all the stats we need. missing - upstream_rq_2xx, upstream_rq_4xx, upstream_rq_5xx. To get rid of the string compare, we'll need to read those directly in end_flush(). Or is it possible to create struct with only the stats we need?
By the way, we use the stats() method here: https://github.com/trabetti/envoy/blob/bb876ee141069371ad0624bc3efc08074a0d7602/source/extensions/stat_sinks/hystrix/hystrix.cc#L338

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - my bad, those are actually created and set dynamically -

scope.counter(fmt::format("{}upstream_rq_{}", prefix, enumToInt(response_code))).inc();
. For those, I would recommend looking them up dynamically similar to how you currently look up the .membership_total stat.

The stats() method you are referring to is the method on the Server::Instance object, which returns a stats scope that allows looking up/creating stats by name. I'm referring to the stats() method on the cluster object, which returns a struct of specific stats that are relevant to the cluster that can be accessed directly without a lookup. The latter should be used for as many stats as possible as it is the cheapest lookup. The former should be used for specific stats that are not accessible directly via a struct, which IIUC, should just be those 3 stats you just mentioned. We can't really create our own struct for all of these stats because it would need to constantly be updated whenever Envoy adds or removes a cluster.

I would really only advise using the flushCounter, flushGauge, etc methods when you need to export all or most of Envoy's stats. It seems this sink is interested in a pretty limited subset of stats, which means individual lookups will likely be the cheapest way to do it.

I know the stats system can be pretty confusing because there are a lot of different ways to do the same thing :). Does all of that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own learnings....is there some reason an interested class can't do the name-lookup once on construction and then save the desired stat in their own member var?

trabetti and others added 3 commits April 23, 2018 16:29
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>

Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
// If the counter was reset, the result is negative
// better return 0, will be back to normal once one rolling window passes.
if (rolling_stats_map_[key][current_index_] <
rolling_stats_map_[key][(current_index_ + 1) % window_size_]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. You are doing lots of map lookups with the same key. This can be avoided.

const uint64_t window_size_;
// TODO(trabetti): do we want this to be configurable through the HystrixSink in config file?
static const uint64_t DEFAULT_NUM_OF_BUCKETS = 10;
std::map<std::string, std::map<std::string, std::string>> counter_name_lookup;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suffix member vars with and underscore.

// Building lookup name map for all specific cluster values.
// Every call to the updateRollingWindowMap function should get the appropriate name from the map.
std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, ".");
counter_name_lookup[cluster_name]["upstream_rq_timeout"] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Save counter_name_lookup[cluster_name] into a temp.

void HystrixStatCache::updateRollingWindowMap(std::map<std::string, uint64_t> current_stat_values,
std::string cluster_name) {

if (counter_name_lookup.find(cluster_name) == counter_name_lookup.end()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto -- lots of lookups to cluster_name all of which return the same exact string->string map.

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is running into github's current limits; may need to make a new one. When you do, I'd recommend.

  1. Use a single commit in the new PR; don't just cut a new PR from all your existing commits, otherwise it will quickly blow up again.
  2. Be careful about DCO which is failing now.

/**
* Generate the streams to be sent to hystrix dashboard.
*/
void getClusterStats(std::stringstream& ss, absl::string_view cluster_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/**
* Calculate values needed to create the stream and write into the map.
*/
void updateRollingWindowMap(Upstream::ClusterInfoConstSharedPtr cluster_info,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const Upstream::ClusterInfoConstSharedPtr&


typedef std::unique_ptr<HystrixStatCache> HystrixStatCachePtr;

namespace Hystrix {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need this extra level of namespace? The only names in it are already prefixed with 'Hystrix'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think this namespace should surround everything inside the hystrix/ dir, not just this class. This is how it's done for other stats sinks.


class HystrixSink : public Stats::Sink, public Logger::Loggable<Logger::Id::hystrix> {
public:
HystrixSink(Server::Instance& server, uint64_t num_of_buckets);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


private:
HystrixStatCachePtr stats_;
std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value to explicitly putting {} in this declaration?

std::string rolling_map = sink_->getStats().printRollingWindow();
std::size_t pos = rolling_map.find("cluster.test_cluster.total");
EXPECT_NE(std::string::npos, pos);
// //EXPECT_NE(absl::string_view::npos, map.find("cluster.test_cluster.total"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm line

for (auto stats_map_itr = rolling_stats_map_.begin(); stats_map_itr != rolling_stats_map_.end();
++stats_map_itr) {
out_str << stats_map_itr->first << " | ";
RollingWindow rolling_window = stats_map_itr->second;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const RollingWindow& rolling_window = name_value.second;

out_str << stats_map_itr->first << " | ";
RollingWindow rolling_window = stats_map_itr->second;
for (auto specific_stat_vec_itr = rolling_window.begin();
specific_stat_vec_itr != rolling_window.end(); ++specific_stat_vec_itr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a range-loop here too, with a specific type for the iterator variable (not auto). It's not obvious from looking at this code locally what the type is, which seems interesting.

RollingWindow rolling_window = stats_map_itr->second;
for (auto specific_stat_vec_itr = rolling_window.begin();
specific_stat_vec_itr != rolling_window.end(); ++specific_stat_vec_itr) {
out_str << *specific_stat_vec_itr << " | ";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be weird to have a " | " at the end of the list? If you use absl::StrJoin you won't get a trailing one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted it to look like a table.. although the lines are not aligned so it doesn't really look like a table in reality..

// Building lookup name map for all specific cluster values.
// Every call to the updateRollingWindowMap function should get the appropriate name from the map.
std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, ".");
counter_name_lookup[cluster_name]["upstream_rq_5xx"] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems indirect. Can you capture the counter object pointer in your map, rather than capturing a long string that you'll then have to look up in updateRollingWindowMap, which seems like it might be called frequently enough to matter.

Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment about the data storage in the Hystrix sink. Once we get that all sorted out, I think we'll be pretty close to getting this merged! Thanks for all the work on this!

void HystrixStatCache::CreateCounterNameLookupForCluster(const std::string& cluster_name) {
// Building lookup name map for all specific cluster values.
// Every call to the updateRollingWindowMap function should get the appropriate name from the map.
std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, ".");
Copy link
Member

@mrice32 mrice32 Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, I'm trying to get rid of as many map lookups as possible since static struct accesses are generally much more efficient.

Instead of using a nested map, why can't we combine counter_name_lookup and rolling_stats_map_ into a single map from cluster name to a struct containing all cluster specific information?

For example:

struct ClusterStats {
  // This constructor should initialize all the stat names by concatenating 
  // the cluster name and the rest of the stat name that's required for
  // the lookup.
  ClusterStats(const std::string &cluster_name);
  const std::string retry_upstream_rq_5xx_name_;
  const std::string upstream_rq_5xx_name_;
  const std::string retry_upstream_rq_4xx__name_;
  .
  .
  .
  // Rolling windows
  RollingWindow errors_;
  RollingWindow success_;
  RollingWindow total_;
  .
  .
  .
};

// Map from cluster names to a struct of all of that cluster's stat windows.
std::unordered_map<std::string, ClusterStats> cluster_stats_map_;

This would allow us to remove all of this initialization code inside CreateCounterNameLookupForCluster and just allow you to do one map lookup for each cluster. Then your pushNewValue calls could look more like this (you would need to modify it to take in a RollingWindow& and the value):

ClusterStats& cluster_stats = cluster_stats_map_[cluster_name];
.
.
.
uint64_t success = stats.counter(cluster_stats.upstream_rq_2xx_name_).value();
pushNewValue(cluster_stats.success_, success);
.
.
.

Sorry for the wall of text. Let me know if the above suggestion doesn't make sense or you don't think it will work. I think this stat export is happening often enough that we should make sure it's not doing any unnecessary work.

uint64_t error_rate =
total == 0
? 0
: (static_cast<double>(errors + timeouts + rejected) / static_cast<double>(total)) * 100;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think you can remove all the static casts by doing the multiply on the numerator before dividing:

(100 * (errors + timeouts + rejected)) / total;

Copy link
Member

@mrice32 mrice32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @jmarantz mentioned, please squash all commits and open a new PR because of the github scalability issue. I can move my recent comments over to the new PR.

@jmarantz
Copy link
Contributor

Actually, while we are talking about new PRs, I think it would be worthwhile to break out the admin.cc changes as a separate easy-to-review PR. It touches a bunch of lines but those issues are now pretty much settled, so might as well iterate on the map data structure issues @mrice32 and I have been discussing in a smaller PR.

@trabetti
Copy link
Contributor Author

@jmarantz fine with me. will open a separate PR for the admin changes. There's nothing directly related to the hystrix support there.

@jmarantz
Copy link
Contributor

jmarantz commented Apr 23, 2018 via email

@trabetti
Copy link
Contributor Author

Opened PR #3172 for admin changes only. Will need to add coverage tests.

@trabetti
Copy link
Contributor Author

Closing this PR. Opened a new PR #3425.

@trabetti trabetti closed this May 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants