-
Notifications
You must be signed in to change notification settings - Fork 5.1k
support for hystrix event stream output #2736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: talis <talis@il.ibm.com>
Signed-off-by: talis <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
…-support Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool stuff. Few drive by comments from a quick skim (not a full review).
source/server/http/admin.cc
Outdated
@@ -575,13 +574,66 @@ std::string AdminImpl::runtimeAsJson( | |||
return strbuf.GetString(); | |||
} | |||
|
|||
Http::Code AdminImpl::handlerHystrixEventStream(const std::string&, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick drive by comment. This file is getting huge and will continue to get huger. Can you put all the new Hystrix admin stuff into a dedicated file, potentially in a new admin directory? We will need to start decomping the rest of this file also as we add more stuff to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved hystrix stuff from admin.cc - commit a05aea1
@@ -69,12 +70,16 @@ void StreamEncoderImpl::encodeHeaders(const HeaderMap& headers, bool end_stream) | |||
saw_content_length = true; | |||
} | |||
|
|||
if (headers.NoChunks()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you say a little more about why this is needed? This is a bit odd. I assume it's because you don't want chunk transfer encoding but you don't have a content-length? Is there any way to handle this without this kind of hack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your assumption is right. I saw that the ":" wes created exactly to allow such kind of hook and it is being used in some other places. Alternatives could be (1) changing the interface of encodeHeaders(), which is much more intrusive since it is all over, or (2) building the response without using encodeHeaders(). What do you recommend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not crazy about it, but it's probably the least bad option for doing this. If we go this route can you please add liberal comments on why this is here, that it is a non-standard header, etc.? @alyssawilk WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool! Thanks! Did a quick first pass. Looks good for the most part.
include/envoy/server/admin.h
Outdated
class HystrixHandlerInfo : public HandlerInfo { | ||
public: | ||
HystrixHandlerInfo(Http::StreamDecoderFilterCallbacks* callbacks) | ||
: stats_(new Stats::Hystrix()), data_timer_(nullptr), ping_timer_(nullptr), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: prefer std::make_unique<Stats::Hystrix>()
here and throughout (this may be common elsewhere in Envoy, but we recently moved to C++14).
include/envoy/server/admin.h
Outdated
* This class contains data which will be sent from admin filter to a hystrix_event_stream handler | ||
* and build a class which contains the relevant data. | ||
*/ | ||
class HystrixHandlerInfo : public HandlerInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this class under include/
? Generally, in Envoy, only pure interfaces with no implementation go under include/
, and all implementations of those interfaces (even if there's just one) go under source/
.
include/envoy/server/admin.h
Outdated
* in admin impl. Each handler which needs to receive data from admin filter can inherit from | ||
* HandlerInfo and build a class which contains the relevant data. | ||
*/ | ||
class HandlerInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be separated into a pure interface here and a default implementation under source/
? Generally within Envoy, classes declared under include/
are pure interfaces with all PURE
(= 0
) methods. Even the default implementations are generally separated from the interface and placed under source/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed by commit 61659a3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As indicated in another comment thread, we can eliminate HandlerInfo completely and instead pass the AdminFilter* into each handler, which will clean up the dynamic_cast and remove a special-case from where the HandlerInfo is constructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like as a sink you still need to get the Http::StreamDecoderFilterCallbacks* into your system, but can we get this by plumbing through the AdminFilter& rather than having this empty-base-class/dynamic-cast strategy? Or @mattklein123 do you have a better idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - Either the Http::StreamDecoderFilterCallbacks*
should be passed through or you could add some new public methods to AdminFilter
to do what you want and pass it through.
source/server/http/admin.cc
Outdated
code = parent_.runCallback(path, *header_map, response, *handler_info_); | ||
} else { | ||
HandlerInfoPtr temp_handler(new HystrixHandlerInfo(callbacks_)); | ||
handler_info_ = std::move(temp_handler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why not just handler_info_ = std::make_unique<HystrixHandlerInfo>(callbacks_);
?
source/server/http/admin.cc
Outdated
|
||
// using write() since we are sending network level | ||
// TODO(trabetti): is there an alternative to the const_cast? | ||
(const_cast<Network::Connection*>((hystrix_handler_info->callbacks_)->connection())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate in your comment on why can't you use callbacks_->encodeData()
here instead of getting at the raw connections? I'm having a little trouble understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Probably can use encodeData()..
test/server/http/admin_test.cc
Outdated
auto callback = [&](const std::string&, Http::HeaderMap&, Buffer::Instance&) -> Http::Code { | ||
return Http::Code::Accepted; | ||
}; | ||
auto callback = [&](const std::string&, Http::HeaderMap&, Buffer::Instance&, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove &
from the capture since you don't need to capture anything from the environment - ditto in other test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not related to this PR. Want us to fix those here anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't notice that you were only modifying part of that line. It'd be awesome if you could clean that up, though!
source/common/stats/hystrix.cc
Outdated
const uint64_t Hystrix::PING_INTERVAL_IN_MS; | ||
|
||
// add new value to rolling window, in place of oldest one | ||
void Hystrix::pushNewValue(std::string key, int value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: generally prefer types with defined sizes - like int64_t
. Also, it seems like you use int64_t
as the return value for getRollingValue()
which is a diff of these inputs. A similar inconsistency exists for num_of_buckets_
/current_index_
and DEFAULT_NUM_OF_BUCKETS
.
source/common/stats/hystrix.cc
Outdated
|
||
// combining timeouts+retries - retries are counted as separate requests | ||
// (alternative: each request including the retries counted as 1) | ||
double timeouts = getRollingValue(cluster_name, "upstream_rq_timeout") + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are all these being casted to double
s and then casted back to integers?
source/server/http/admin.cc
Outdated
handler_info_ = std::move(temp_handler); | ||
code = parent_.runCallback(path, *header_map, response, *handler_info_); | ||
} else { | ||
HandlerInfoPtr temp_handler(new HystrixHandlerInfo(callbacks_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only half-baked, but maybe it makes sense to make all the handlers classes instead of just functions and pass them (or provide callbacks to get) certain state, like callbacks_
, on each request. Then, they can manage the creation of any per-request state that may need to be created internally. That would eliminate the need to modify this function every time a handler wants to use HandlerInfo, and it would eliminate the dynamic_cast
.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we also thought about something like that. How about opening a separate issue for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that seems reasonable to me. Could you also toss a TODO in there referencing the issue?
source/common/stats/hystrix.h
Outdated
typedef std::vector<uint64_t> RollingStats; | ||
typedef std::map<std::string, RollingStats> RollingStatsMap; | ||
|
||
// Consider implement the HystrixStats as a sink to have access to histograms data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to the idea of moving the bulk of the logic to a stats sink.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We thought of doing this later as an enhancement..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I mentioned this was that I was thinking more about using stats sinks for this. I'd be interested to hear @mattklein123's and @htuch's thoughts. IIUC, the general architecture is a subscription mechanism where a client reaches out to subscribe and then Envoy streams specific stats data indefinitely over that connection. Is this correct?
That would mean that this is a combination of the existing pull-based integrations, like prometheus, that just implement a simple request-query-response handler on the admin port and push-based integrations, like the GRPC stats streamer and statsd, that implement stats sinks to push stats to subscribers. A lot of the custom functionality that you've added, like querying stats and periodic flushes, are already handled by the existing sink interface.
I'm wondering if there's some way that we can keep an admin port subscription model, but pass off the stream to a stats sink where the periodic pushing can be done. This would likely allow the pushing to be more performant since it won't require a new periodic query to be set up for each subscriber (assuming the multi-subscriber case would ever happen in a hystrix deployment). It would also mean that you wouldn't have to reinvent the wheel when it comes to building and configuring a push system within Envoy or querying stats data. And, as you mentioned, it will have built-in support for histograms. I'm curious to hear what others think, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, agreed. We should do this using a stat sink. I haven't through it through fully, but I don't see any reason why we can't figure out a way in which the active admin stream registers with the sink, and then when the flush calls happen it would just work? @mrice32 can you help guide this PR in that direction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stats sink means the stream will be sent at same intervals as other stats which are active. Isn't it a limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123, sounds good.
@trabetti, yes, it is a bit of a limitation. However, I think most deployments have only been using one sink so far, so the aligned intervals have not been an issue. If this is something you foresee being an issue, we should open an issue and design a way for stats sinks to optionally specify a time interval that is different from the global one. It shouldn't be too hard to make this more flexible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on changing this to a stats sink. I felt like most of the superficial testing and string-handling comments below might apply to this as a sink as well.
source/common/stats/hystrix.cc
Outdated
} | ||
} | ||
|
||
uint64_t Hystrix::getRollingValue(std::string cluster_name, std::string stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::string& cluster_name, const std::string& stats.
or maybe:
absl::string_view cluster_name, absl::string_view stats
source/common/stats/hystrix.cc
Outdated
} | ||
|
||
uint64_t Hystrix::getRollingValue(std::string cluster_name, std::string stats) { | ||
std::string key = "cluster." + cluster_name + "." + stats; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absl::StrCat (which knows how to reserve the right size for the output buffer for all inputs, before it starts appending bytes).
source/common/stats/hystrix.cc
Outdated
// better return 0, will be back to normal once one rolling window passes | ||
// better idea what to return? could change the algorithm to keep last valid delta, | ||
// updated with pushNewValue and kept stable when the update is negative. | ||
if (rolling_stats_map_[key][current_index_] < |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto& rolling_stats_map_entry = rolling_stats_map_[key] and CSE everywhere that appears.
source/common/stats/hystrix.cc
Outdated
void Hystrix::resetRollingWindow() { rolling_stats_map_.clear(); } | ||
|
||
void Hystrix::addStringToStream(std::string key, std::string value, std::stringstream& info) { | ||
addInfoToStream(key, "\"" + value + "\"", info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absl::StrCat.
source/common/stats/hystrix.cc
Outdated
|
||
void Hystrix::resetRollingWindow() { rolling_stats_map_.clear(); } | ||
|
||
void Hystrix::addStringToStream(std::string key, std::string value, std::stringstream& info) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const refs or string_view
test/common/stats/hystrix_test.cc
Outdated
namespace Envoy { | ||
namespace Stats { | ||
|
||
void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use string_view here. Also declare non-changing temps as const. Also all your temp strings below should be absl::string_view.
test/common/stats/hystrix_test.cc
Outdated
namespace Stats { | ||
|
||
void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) { | ||
std::string actual = dataMessage.substr(dataMessage.find(key)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
capture dataMessage.find(key) in a temp key_pos and EXPECT_NE(absl::string_view::npos, key_pos);
test/common/stats/hystrix_test.cc
Outdated
actual = actual.substr(actual.find(" ") + 1); | ||
std::size_t length = actual.find(","); | ||
actual = actual.substr(0, length); | ||
EXPECT_EQ(actual, std::to_string(expected)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a slightly nicer pattern is to return the actual, and do the EXPECT_EQ at the call-site, which gives you useful line-numbers if it fails.
test/common/stats/hystrix_test.cc
Outdated
|
||
void checkStreamField(std::string dataMessage, std::string key, uint64_t expected) { | ||
std::string actual = dataMessage.substr(dataMessage.find(key)); | ||
actual = actual.substr(actual.find(" ") + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all your .find returns need to be checked. Also, please give a comment showing what the format is you are parsing.
test/common/stats/hystrix_test.cc
Outdated
checkStreamField(dataMessage, "rollingCountSuccess", 63); | ||
checkStreamField(dataMessage, "rollingCountTimeout", 27); | ||
checkStreamField(dataMessage, "propertyValue_queueSizeRejectionThreshold", expectedQueueSize); | ||
checkStreamField(dataMessage, "reportingHosts", expectedReportingHosts); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you could have your helper do a more complete parsing of dataMessage, returning a map<string, uint64_t> which you could then just EXPECT_EQ against a map that's initialized to the above gold values.
The existing checkStreamField looks like it might ignore stray characters in the string, and would definitely ignore extra field/value pairs. It also won't work if one of the keys is a substring of another.
Thanks for so many comments! Will probably take until middle of next week to start pushing the fixes. |
@trabetti recommend focusing on big picture first (hystrix as stats sink). I have a growing concern with dependency-creep in Envoy as lots of deployments differ greatly and we'd like it to be easy to configure the parts we really need without generating a sea of ifdefs. |
…nly the needed values Signed-off-by: trabetti <talis@il.ibm.com>
Just updating, we got distracted by some urgent task this week, will start pushing fixes next week.. |
Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you! |
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
source/server/http/admin.cc
Outdated
response_headers.insertNoChunks().value().setReference("0"); | ||
|
||
Stats::HystrixHandlerInfoImpl& hystrix_handler_info = | ||
dynamic_cast<Stats::HystrixHandlerInfoImpl&>(handler_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO dynamic_cast should rarely be needed, and when used the result certainly need to be checked for nullptr (so it should be pointers not refs).
This all might be moot anyway if you move to using sinks maybe this can all be reverted.
But if we decide not to go with sinks in this PR, and you need to add this as a new stats callback, then my suggestion is to allocate Stats::HystrixHandlerInfoImpl as a temp inside a simple handler. It does look like you currently need to pass through callbacks_ to construct the HystrixHandlerInfoImpl, but since the handlers are non-static methods. you have that available to you in the callback. You don't need to change MAKE_ADMIN_HANDLER, the dispatching mechanism, or all the signatures of the existing handlers.
That would clean this up quite a bit from a code and PR-size perspective.
But really the indicates that this admin handler is really a lot different from all the other ones, and points strongly toward having an architectural discussion. In particular, this admin handler actually starts a timer and does stuff after the request terminates, which is very different semantics from all the other admin handlers, which either do a simple mutation or return some data. Probably this is another signpost pointing us toward sinks as a better solution. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, the request never terminates. We must use the same socket that was used in the original request for streaming. If the request was terminated, the socket would have been closed. I think this was the main issue with implementing as a sink. Will need some help to resolve this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmarantz Totally agree with the sentiment. However, IIUC, the handlers don't have the private variables of the filter (or the filter, itself) available to them since they are methods on the AdminImpl
object, but I may be missing something.
Above, we discussed above adding a handler class (rather than just a method) to combine the callback and data and formalize the idea of combining handlers with state. This would allow handlers to allocate and manage their own state, be tied to the filter object, and not require any dynamic_cast
ing. We decided to add this as a TODO rather than do it in this PR, but maybe we should get that in as a part of this PR instead? What do you think?
Side note: as long as a bad_cast
exception is the desired result of a failed dynamic_cast
, I think dynamic_cast
ing a ref should be fine since it automatically throws if the cast fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trabetti I think this will still work exactly the same way as it does now in a sink. I think the request object would need to be registered with the sink, and then the sink could start encoding data through it without closing it, just as the timers do now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it should be a separate PR, but maybe it should go first, and have the current one on top of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrice32 Good point; how about supplying the AdminFilter* to the callbacks, rather than an empty HandlerInfo* base class that needs to be cast whenever it is used. That would require passing 'this' to parent_.runCallback() in AdminFilter::onComplete, and then passing that through to the handlers, so they'd have context of the filter instantiation that invoked them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrice32 I meant for first PR for adding a handler class instead of a method. Then, modify this PR to use the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. Yup, that would be fine too. I'm always in favor of dividing up PRs where possible :)
include/envoy/server/admin.h
Outdated
* in admin impl. Each handler which needs to receive data from admin filter can inherit from | ||
* HandlerInfo and build a class which contains the relevant data. | ||
*/ | ||
class HandlerInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As indicated in another comment thread, we can eliminate HandlerInfo completely and instead pass the AdminFilter* into each handler, which will clean up the dynamic_cast and remove a special-case from where the HandlerInfo is constructed.
source/common/stats/hystrix.cc
Outdated
void Hystrix::pushNewValue(std::string key, uint64_t value) { | ||
// create vector if do not exist | ||
if (rolling_stats_map_.find(key) == rolling_stats_map_.end()) { | ||
rolling_stats_map_[key].resize(num_of_buckets_, value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rolling_stats_map_[key].resize(num_o_buckets_, value) is all you need; no need for the if-statement or .find() check.
source/common/stats/hystrix.cc
Outdated
const uint64_t Hystrix::PING_INTERVAL_IN_MS; | ||
|
||
// add new value to rolling window, in place of oldest one | ||
void Hystrix::pushNewValue(std::string key, uint64_t value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::string& here and in all formals.
source/common/stats/hystrix.cc
Outdated
|
||
void Hystrix::updateRollingWindowMap(Stats::Store& stats, absl::string_view cluster_name) { | ||
std::string prefix; | ||
prefix = absl::StrCat("cluster.",cluster_name,"."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally the formatter requires that you add spaces between args. You need to run the formatter under docker with every commit.
source/common/stats/hystrix.cc
Outdated
// (alternative: each request including the retries counted as 1) | ||
std::string upstream_rq_timeout_key = absl::StrCat(prefix,"upstream_rq_timeout"); | ||
std::string upstream_rq_per_try_timeout_key = absl::StrCat(prefix,"upstream_rq_per_try_timeout"); | ||
uint64_t timeouts = stats.counter(upstream_rq_timeout_key).value() + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is going to happen a bunch, do you want to cache the stats.counter return references and store them in the class, rather than doing the name-lookup each time?
source/common/stats/hystrix.cc
Outdated
uint64_t errors = stats.counter(upstream_rq_5xx_key).value() + | ||
stats.counter(retry_upstream_rq_5xx_key).value() + | ||
stats.counter(upstream_rq_4xx_key).value() + | ||
stats.counter(retry_upstream_rq_4xx_key).value() - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why '-' instead of '+' here? Just add comment if this was intentional.
source/common/stats/hystrix.cc
Outdated
addIntToStream("currentConcurrentExecutionCount", 0, cluster_info); | ||
addIntToStream("latencyExecute_mean", 0, cluster_info); | ||
|
||
// latency information can be taken rom hystogram, which is only available to sinks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "from histogram"? Start comments with a capital letter, end with a period.
source/common/stats/hystrix.cc
Outdated
std::stringstream out_str; | ||
|
||
for (std::map<std::string, RollingStats>::const_iterator it = rolling_stats_map_.begin(); | ||
it != rolling_stats_map_.end(); ++it) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (auto it : rolling_stats_map_) {
source/common/stats/hystrix.cc
Outdated
addHystrixThreadPool(ss, cluster_name, max_concurrent_requests, reporting_hosts); | ||
} | ||
|
||
absl::string_view Hystrix::printRollingWindow() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method should be const. Probably others should be as well.
source/common/stats/hystrix.h
Outdated
* Get value of the sampling buckets | ||
*/ | ||
static uint64_t GetRollingWindowIntervalInMs() { | ||
return static_cast<const uint64_t>(ROLLING_WINDOW_IN_MS / DEFAULT_NUM_OF_BUCKETS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/const//
source/server/http/admin.cc
Outdated
handler_info_ = std::make_unique<Stats::HystrixHandlerInfoImpl>(callbacks_); | ||
code = parent_.runCallback(path, *header_map, response, *handler_info_); | ||
end_stream = false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed elsewhere I am hoping to replace this whole new block with:
Http::Code code = parent_.runCallback(path, *header_map, this, response);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmarantz Thanks for the comments. Will do another fix soon.
@jmarantz @mrice32 Question for the sink implementation.
As far as I understand, currently sinks are only added to the sinks vector (MainImpl::stats_sinks_) during init, according to what's specified in the configuration file "stats_sinks" parameter.
I thought of two ways to implement:
(1) Add a "hystrix sink" through the config file. Once a request is made from the hystrix dashboard, set the connection parameters on that sink and switch on an enable flag in it.
(2) Once a request is made, create a new sink with the connection parameters and add it to the vector of sinks. Remove it from the vector when the connection is destroyed.
Do you see any problem or disadvantage with any of these options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trabetti, 1 is almost definitely what we want. This allows you to do all of the formatting once rather than having to do it for each connection. Also, Envoy is not really built to scale in the number of stats sinks because of the way it passes stats to them. And lastly, dynamic registration of sinks without involving the config makes them more difficult to reason about from the user's perspective.
I'm not exactly sure what you mean by setting connection parameters or enabling the flag, but by my understanding, you should be able to just register each stream with the single hystrix sink that is created through the config. We'll also need to detect connection termination and make sure the sink is aware of this so it doesn't try to access a deleted stream. This registration mechanism will take some thought, by the way, because there is no existing communication mechanism between stats sinks and the admin interface.
As you're working through this design change, feel free to reach out with any questions here or on slack. We're happy to provide input wherever needed :). Take your time making incremental updates here, and whenever you're ready for us to take another pass, just let us know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mrice32 , the first version of the hystrix support as a sink is almost ready, I hope I'll get to push it before Passover holiday..
Questions:
- As we discussed, the hystrix sink will be set through the config file, which means I touched data_plane_api. How do I push it here so it compiles (in my development env I point to my fork of data_plane_api)
- I need some help figuring out how to use the histogram data.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
You should send out a PR for your data-plane-api changes, which will need to be merged before this PR is merged. Until that PR gets merged, you should do exactly what you're doing now - point the tests to your own fork to allow the CI to pass in the meantime.
-
What, specifically, about the histogram data is confusing? The model for histograms is different for the other stats. Each sample is sent to the sinks when published. This is done synchronously by calling this method from whatever thread recorded the value. So you need to be sure that that method is thread-safe. The statsd implementation uses TLS to ensure thread-safety (see here), but since you are probably interested in forming some sort of central cache from which to pull at flush time, your implementation will likely be more complicated than that. It's also worth noting that there's work underway on a design to create an Envoy representation of the distributions and central cache to store them as opposed to just handing the samples to the sinks as is done now.
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
…trix-dashboard-support-handler-upstream Signed-off-by: trabetti <talis@il.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking awesome! I have a few design suggestions. I think we should aim for all hystrix-specific code to be under source/extensions. We can extend existing APIs to make the hystrix use case possible, but we should remove any special cases or methods in the core Envoy codebase. I've laid out a few suggestions (this review and the last one) to get us there.
I haven't gone through the sink, itself, yet, but I'll take a look at that next.
namespace Envoy { | ||
namespace Extensions { | ||
namespace StatSinks { | ||
namespace HystrixNameSpace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer Hystrix
over HystrixNameSpace
.
include/envoy/server/admin.h
Outdated
* in admin impl. Each handler which needs to receive data from admin filter can inherit from | ||
* HandlerInfo and build a class which contains the relevant data. | ||
*/ | ||
class HandlerInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - Either the Http::StreamDecoderFilterCallbacks*
should be passed through or you could add some new public methods to AdminFilter
to do what you want and pass it through.
source/server/http/admin.cc
Outdated
handler_info_ = | ||
std::make_unique<Extensions::StatSinks::Common::HystrixHandlerInfoImpl>(callbacks_); | ||
code = parent_.runCallback(path, *request_headers_, *header_map, response, *handler_info_); | ||
end_stream = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to remove handler_info_
, could we remove the rest of this special case? Could we just allow the callback to set the end_stream (it could still be defaulted to true so we don't need to change the other handlers)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about, instead of involving the handlers and adding more arguments to their interface, add a bool streaming_
to struct UrlHandler
and let AdminImpl::runCallback
handle the end_stream
value?
source/server/http/admin.cc
Outdated
@@ -668,7 +714,7 @@ void AdminFilter::onComplete() { | |||
|
|||
// Under no circumstance should browsers sniff content-type. | |||
header_map->addReference(headers.XContentTypeOptions, headers.XContentTypeOptionValues.Nosniff); | |||
callbacks_->encodeHeaders(std::move(header_map), response.length() == 0); | |||
callbacks_->encodeHeaders(std::move(header_map), end_stream && response.length() == 0); | |||
|
|||
if (response.length() > 0) { | |||
callbacks_->encodeData(response, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want end_stream
to be general purpose as proposed in the comment above, can we change this call to:
callbacks_->encodeData(response, end_stream);
Essentially this would mean that if hystrix or some other callback wanted to send data and not end the stream, that would be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round of comments - the sink looks good so far, thanks for addressing my comments from last time! I'll continue digging into the core sink implementation tomorrow, but this should be good enough to get started.
@@ -0,0 +1,157 @@ | |||
#include <map> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the stats_sinks/common
directory is only used for components that are shared by multiple stats sinks. I think it would be fine to put everything under stats_sinks/hystrix
(similar to stats_sinks/metrics_service
).
typedef std::vector<uint64_t> RollingStats; | ||
typedef std::map<const std::string, RollingStats> RollingStatsMap; | ||
|
||
class Hystrix : public Logger::Loggable<Logger::Id::hystrix> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we call this something a little more specific to what function it performs, like HystrixStatCache
for instance?
|
||
RollingStatsMap rolling_stats_map_; | ||
uint64_t current_index_; | ||
uint64_t num_of_buckets_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const
?
Hystrix() : current_index_(DEFAULT_NUM_OF_BUCKETS), num_of_buckets_(DEFAULT_NUM_OF_BUCKETS + 1){}; | ||
|
||
Hystrix(uint64_t num_of_buckets) | ||
: current_index_(num_of_buckets), num_of_buckets_(num_of_buckets + 1){}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the variable naming here is a little confusing since you're setting num_of_buckets_
equal to num_of_buckets + 1
. Can we change the name of the instance variable (num_of_buckets_
) to something that more clearly describes what it is - like bucket_mod_
, for instance?
void flushGauge(const Stats::Gauge&, uint64_t){}; | ||
void endFlush(); | ||
void onHistogramComplete(const Stats::Histogram& histogram, uint64_t value) { | ||
std::cout << "histogram complete: " << histogram.name() << ", value: " << std::to_string(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to leave this print statement here?
} | ||
} | ||
|
||
uint64_t Hystrix::getRollingValue(absl::string_view cluster_name, absl::string_view stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: stats
-> stat
since you're looking up a single statistic.
} | ||
|
||
// void HystrixSink::onHistogramComplete(const Histogram& histogram, uint64_t value); | ||
void HystrixSink::registerConnection(Http::StreamDecoderFilterCallbacks* callbacks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the previous review, I'm hoping we can redesign this so that the sink registers with the Admin
as opposed to the other way around.
Overall, though, it seems as if this is currently not designed for multiple streams. I would imagine that multiple subscribers is a valid use case since there's no restriction on the number of subscribers that can register with the hystrix admin handler. Is the intent just to drop the last connection each time a new one comes in? If not, we should probably have a vector of connection/callbacks/streams (or whatever object we end up using to interact with the stream) and send the data to all of them.
callbacks_ = callbacks; | ||
} | ||
|
||
// TODO (@trabetti) is this correct way? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to close the stream, you will need to make an encodeData
(or some other encode* method) call on the callbacks with the end_stream
argument set to true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we did not provide a way to terminate the stream from Envoy's side (should we?), this is called when the dashboard client was closed, so we are not sending any more data.
} | ||
out_str << std::endl; | ||
} | ||
return out_str.str(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmarantz can check me on this, but this seems like a dangerous/incorrect use of absl::string_view
. According to the source, the std::string
must outlive the absl::string_view
.
source/server/http/admin.h
Outdated
@@ -233,7 +258,8 @@ class AdminFilter : public Http::StreamDecoderFilter, Logger::Loggable<Logger::I | |||
AdminFilter(AdminImpl& parent); | |||
|
|||
// Http::StreamFilterBase | |||
void onDestroy() override {} | |||
// TODO (@trabetti) : make more generic (how?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about creating a new public method on AdminFilter
called addOnDestroyCallback()
? This method could take a callback and append to a list of callbacks to execute when the filter is destroyed. This will only work if we go with what @jmarantz suggested - an architecture where the AdminFilter
is passed to the handlers directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the portion that's confusing? Basically, you'd add a public method to AdminFilter
that would have the following signature:
void addOnDestroyCallback(std::function<void()> cb);
That method would append to a list of callbacks that are stored inside the AdminFilter
. When onDestroy()
is called, the AdminFilter
can call all of the callbacks in that list. Ultimately, once all this is done, I see your handler looking something like this:
Http::Code HystrixSink::handlerHystrixEventStream(... Server::AdminFilter& admin_filter, ....) {
.
.
.
Http::StreamDecoderFilterCallbacks *stream_decoder_filter_callbacks = admin_filter.getDecoderFilterCallbacks();
// Separated out just so it's easier to understand
auto on_destroy_callback = [this, stream_decoder_filter_callbacks] () {
// Unregister the callbacks from the sink so data is no longer encoded through them.
unregisterConnection(stream_decoder_filter_callbacks);
}
// Add the callback to the admin_filter list of callbacks
admin_filter. addOnDestroyCallback(std::move(on_destroy_callback));
.
.
.
}
To do this, you'll need to have the AdminFilter
be passed to the handlers, and you'll need to add and implement the addOnDestroyCallback
method.
If you do the above, we can get rid of the AdminImpl::unregisterHystrixConnection()
and Server::InstanceImpl:: unregisterHystrixSink
. This is the last piece of the puzzle to allowing the sink registering itself and requiring no special methods to pass the registration through the server.
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
Pushed a new version. The handler is added in the sink constructor. (plus other fixes). With the implementation in this commit, only one sink is possible, since the sink constructor adds the handler. (not sure what happens if someone tries to add more than one sink). I think it makes sense, but it is different than other sinks. |
include/envoy/server/admin.h
Outdated
@@ -23,8 +24,8 @@ namespace Server { | |||
*/ | |||
#define MAKE_ADMIN_HANDLER(X) \ | |||
[this](absl::string_view path_and_query, Http::HeaderMap& response_headers, \ | |||
Buffer::Instance& data) -> Http::Code { \ | |||
return X(path_and_query, response_headers, data); \ | |||
Buffer::Instance& data, Http::StreamDecoderFilterCallbacks* callbacks) -> Http::Code { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned earlier, IMO all these args could be saved in the AdminFilter object -- and the callbacks already are. That would simplify this interface once and for all, rather than adding a very specialized complicated object to all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making all those changes! I added a longer explanation to my suggestion below about getting the unregister portion to work. Let us know if you have any questions!
source/server/http/admin.h
Outdated
@@ -233,7 +258,8 @@ class AdminFilter : public Http::StreamDecoderFilter, Logger::Loggable<Logger::I | |||
AdminFilter(AdminImpl& parent); | |||
|
|||
// Http::StreamFilterBase | |||
void onDestroy() override {} | |||
// TODO (@trabetti) : make more generic (how?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the portion that's confusing? Basically, you'd add a public method to AdminFilter
that would have the following signature:
void addOnDestroyCallback(std::function<void()> cb);
That method would append to a list of callbacks that are stored inside the AdminFilter
. When onDestroy()
is called, the AdminFilter
can call all of the callbacks in that list. Ultimately, once all this is done, I see your handler looking something like this:
Http::Code HystrixSink::handlerHystrixEventStream(... Server::AdminFilter& admin_filter, ....) {
.
.
.
Http::StreamDecoderFilterCallbacks *stream_decoder_filter_callbacks = admin_filter.getDecoderFilterCallbacks();
// Separated out just so it's easier to understand
auto on_destroy_callback = [this, stream_decoder_filter_callbacks] () {
// Unregister the callbacks from the sink so data is no longer encoded through them.
unregisterConnection(stream_decoder_filter_callbacks);
}
// Add the callback to the admin_filter list of callbacks
admin_filter. addOnDestroyCallback(std::move(on_destroy_callback));
.
.
.
}
To do this, you'll need to have the AdminFilter
be passed to the handlers, and you'll need to add and implement the addOnDestroyCallback
method.
If you do the above, we can get rid of the AdminImpl::unregisterHystrixConnection()
and Server::InstanceImpl:: unregisterHystrixSink
. This is the last piece of the puzzle to allowing the sink registering itself and requiring no special methods to pass the registration through the server.
Signed-off-by: trabetti <talis@il.ibm.com>
Pushed a commit with AdminFilter passed to handler instead of the callbacks, and followed @mrice32 explanation about the destroy callback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. This is definitely the right direction, thanks! I did a quick review and left a few nits and questions, but I think we're getting close. I'll do a full pass over the code tomorrow. Side note: I suspect that you'll need to add a few tests for the codec (no chunk) change, new server accessor method (should be super simple to get coverage there), and admin filter changes.
@@ -184,6 +184,11 @@ message StatsdSink { | |||
string prefix = 3; | |||
} | |||
|
|||
// can it be empty? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, yes, but then it might make sense to not provide a config at in the API all by using this implementation in your sink:
ProtobufTypes::MessagePtr HystrixSinkFactory::createEmptyConfigProto() {
return ProtobufTypes::MessagePtr{new Envoy::ProtobufWkt::Empty()};
}
However, the API portion of the repo is generally used for docs as well (the comments are scraped, formatted, and published to a docs site), so this config might be a good place to provide general information about the Hystrix stats sink.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of having the number of buckets as a parameter to the constructor. And theoretically it can be changed even during action. But not sure if this level of control is really necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding it as a config option would be better than making it a constructor argument since the constructor is just called from the factory, and there is no factory configuration, so it would have to be hardcoded there. If this is truly something that a user might want to tweak, it would be good to have it as a config parameter :).
void endFlush() override; | ||
void onHistogramComplete(const Stats::Histogram&, uint64_t) override{}; | ||
|
||
// TODO (@trabetti) : support multiple connections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: hasn't this been done? If so, can we remove the TODO comment?
source/server/http/admin.cc
Outdated
@@ -150,6 +150,16 @@ Http::FilterTrailersStatus AdminFilter::decodeTrailers(Http::HeaderMap&) { | |||
return Http::FilterTrailersStatus::StopIteration; | |||
} | |||
|
|||
void AdminFilter::onDestroy() { | |||
for (auto callback : on_destroy_callbacks_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const auto& callback
source/server/http/admin.cc
Outdated
} | ||
|
||
void AdminFilter::addOnDestroyCallback(std::function<void()> cb) { | ||
on_destroy_callbacks_.push_back(cb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: on_destroy_callbacks_.push_back(std::move(cb));
source/server/http/admin.cc
Outdated
Http::Code code = parent_.runCallback(path, *request_headers_, *header_map, response, *this); | ||
bool end_stream = true; | ||
|
||
if (path.find("/hystrix_event_stream") != std::string::npos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may already be planning to do this, but can we get rid of the special case by allowing the handler to be able to set the end_stream
variable (we can still default it to true
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about, instead of involving the handlers and adding more arguments to their interface, add a bool streaming_ to struct UrlHandler and let AdminImpl::runCallback handle the end_stream value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any particular issues with that. Although, if we end up putting most of the handler arguments into the AdminFilter
itself, we could add end_stream
there, too. I think that would be be the easiest way.
void HystrixSink::unregisterConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_remove) { | ||
for (auto it = callbacks_list_.begin(); it != callbacks_list_.end();) { | ||
if ((*it)->streamId() == callbacks_to_remove->streamId()) { | ||
it = callbacks_list_.erase(it); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not break
after erasing? Is it possible to have multiple callbacks with the same stream id here? If it is, are you sure you want to delete both? Maybe it makes sense to de-dup in the registerConnection()
method so this can't happen? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the stream id should be unique, can break.
private: | ||
HystrixPtr stats_; | ||
std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{}; | ||
Server::Instance* server_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not Server::Instance& server_;
since it never changes what it's pointing to?
std::map<std::string, std::map<std::string, std::string>> counter_name_lookup; | ||
}; | ||
|
||
typedef std::unique_ptr<HystrixStatCache> HystrixPtr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: HystrixStatCachePtr
|
||
uint64_t HystrixStatCache::getRollingValue(absl::string_view cluster_name, absl::string_view stat) { | ||
std::string key; | ||
key = absl::StrCat("cluster.", cluster_name, ".", stat); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not single line std::string key = absl::StrCat("cluster.", cluster_name, ".", stat);
?
// Add new value to rolling window, in place of oldest one. | ||
void HystrixStatCache::pushNewValue(const std::string& key, uint64_t value) { | ||
// Create vector if do not exist. | ||
// TODO trabetti: why resize + value param didn't work without the if? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this TODO about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we wanted to do a changed suggested by @jmarantz , it did not work for some reason, did not get to check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the trigger for me here was that in both branches, you are doing two map lookups, and I think you should only need one. Maybe mention that in the TODO.
HystrixStatCache() | ||
: current_index_(DEFAULT_NUM_OF_BUCKETS), window_size_(DEFAULT_NUM_OF_BUCKETS + 1){}; | ||
|
||
HystrixStatCache(uint64_t num_of_buckets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't think I gave the right feedback the first time around. I think window_size_
should be equal to num_of_buckets
since you want to mod by the total number of buckets. I think current_index_
should be set to num_of_buckets - 1
, since that's the highest index. Right now, there are technically num_of_buckets + 1
total buckets, which is a bit confusing. Also, there should be some input validation when you read in the config to ensure that num_of_buckets is at least 2
(I assume this wouldn't make much sense with only one bucket...?). You could also toss in an assert in the constructor to validate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is confusing.. The statsCache is a rolling window. If for example, we have flush interval of 1 ms, and the number_of_buckets
is 5, it means that we send every 1 ms the statistics for the last 5 ms. The values stored in it are the counters values, which are constantly increasing. So the result we want to return every endFlush()
is the value of current counter, minus the counter before the rolling window started. That's why we keep num_of_buckets+1 values in the rolling window. Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that makes sense. Thanks for the explanation. On another note, are 1 or 0 acceptable inputs to this function? If not, could we add some input validation (either where you read from the config or here)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume '0' means unset value, in this case it will be set to default value (10) . '1' should be legitimate value, but it is good to make sure it works. I already did that but config.cc I see here is not the updated one (the github scalability issue?)
if (callbacks_list_.empty()) { | ||
return; | ||
} | ||
if (counter.name().find("upstream_rq_") != std::string::npos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than string compare each of these, I think you should just grab the stats directly from each cluster in the endFlush()
method. See the stats()
method on the cluster object:
https://github.com/envoyproxy/envoy/blob/master/include/envoy/upstream/upstream.h#L494
These are all the stats that are available through that struct:
https://github.com/envoyproxy/envoy/blob/master/include/envoy/upstream/upstream.h#L292
I know the macros are a bit confusing, but it's just creating a struct where you can access the Counter&
, Gauge&
, etc objects. I think this should cover most (or all?) of the stats you need to capture here and the .membership_total
one you look up in endFlush()
. IIUC this should also allow you to completely get rid of counter_name_lookup_
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing struct do not cover all the stats we need. missing - upstream_rq_2xx
, upstream_rq_4xx
, upstream_rq_5xx
. To get rid of the string compare, we'll need to read those directly in end_flush()
. Or is it possible to create struct with only the stats we need?
By the way, we use the stats() method here: https://github.com/trabetti/envoy/blob/bb876ee141069371ad0624bc3efc08074a0d7602/source/extensions/stat_sinks/hystrix/hystrix.cc#L338
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point - my bad, those are actually created and set dynamically -
envoy/source/common/http/codes.cc
Line 23 in b8f2268
scope.counter(fmt::format("{}upstream_rq_{}", prefix, enumToInt(response_code))).inc(); |
.membership_total
stat.
The stats()
method you are referring to is the method on the Server::Instance object, which returns a stats scope that allows looking up/creating stats by name. I'm referring to the stats()
method on the cluster object, which returns a struct of specific stats that are relevant to the cluster that can be accessed directly without a lookup. The latter should be used for as many stats as possible as it is the cheapest lookup. The former should be used for specific stats that are not accessible directly via a struct, which IIUC, should just be those 3 stats you just mentioned. We can't really create our own struct for all of these stats because it would need to constantly be updated whenever Envoy adds or removes a cluster.
I would really only advise using the flushCounter
, flushGauge
, etc methods when you need to export all or most of Envoy's stats. It seems this sink is interested in a pretty limited subset of stats, which means individual lookups will likely be the cheapest way to do it.
I know the stats system can be pretty confusing because there are a lot of different ways to do the same thing :). Does all of that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my own learnings....is there some reason an interested class can't do the name-lookup once on construction and then save the desired stat in their own member var?
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com> Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
Signed-off-by: Eliran Roffe <eliranr@il.ibm.com>
// If the counter was reset, the result is negative | ||
// better return 0, will be back to normal once one rolling window passes. | ||
if (rolling_stats_map_[key][current_index_] < | ||
rolling_stats_map_[key][(current_index_ + 1) % window_size_]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. You are doing lots of map lookups with the same key. This can be avoided.
const uint64_t window_size_; | ||
// TODO(trabetti): do we want this to be configurable through the HystrixSink in config file? | ||
static const uint64_t DEFAULT_NUM_OF_BUCKETS = 10; | ||
std::map<std::string, std::map<std::string, std::string>> counter_name_lookup; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suffix member vars with and underscore.
// Building lookup name map for all specific cluster values. | ||
// Every call to the updateRollingWindowMap function should get the appropriate name from the map. | ||
std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, "."); | ||
counter_name_lookup[cluster_name]["upstream_rq_timeout"] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Save counter_name_lookup[cluster_name] into a temp.
void HystrixStatCache::updateRollingWindowMap(std::map<std::string, uint64_t> current_stat_values, | ||
std::string cluster_name) { | ||
|
||
if (counter_name_lookup.find(cluster_name) == counter_name_lookup.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto -- lots of lookups to cluster_name all of which return the same exact string->string map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is running into github's current limits; may need to make a new one. When you do, I'd recommend.
- Use a single commit in the new PR; don't just cut a new PR from all your existing commits, otherwise it will quickly blow up again.
- Be careful about DCO which is failing now.
/** | ||
* Generate the streams to be sent to hystrix dashboard. | ||
*/ | ||
void getClusterStats(std::stringstream& ss, absl::string_view cluster_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move output parameter ss last. https://google.github.io/styleguide/cppguide.html#Output_Parameters
/** | ||
* Calculate values needed to create the stream and write into the map. | ||
*/ | ||
void updateRollingWindowMap(Upstream::ClusterInfoConstSharedPtr cluster_info, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const Upstream::ClusterInfoConstSharedPtr&
|
||
typedef std::unique_ptr<HystrixStatCache> HystrixStatCachePtr; | ||
|
||
namespace Hystrix { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this extra level of namespace? The only names in it are already prefixed with 'Hystrix'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think this namespace should surround everything inside the hystrix/
dir, not just this class. This is how it's done for other stats sinks.
|
||
class HystrixSink : public Stats::Sink, public Logger::Loggable<Logger::Id::hystrix> { | ||
public: | ||
HystrixSink(Server::Instance& server, uint64_t num_of_buckets); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output params 'server' should be last: https://google.github.io/styleguide/cppguide.html#Output_Parameters
|
||
private: | ||
HystrixStatCachePtr stats_; | ||
std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any value to explicitly putting {} in this declaration?
std::string rolling_map = sink_->getStats().printRollingWindow(); | ||
std::size_t pos = rolling_map.find("cluster.test_cluster.total"); | ||
EXPECT_NE(std::string::npos, pos); | ||
// //EXPECT_NE(absl::string_view::npos, map.find("cluster.test_cluster.total")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rm line
for (auto stats_map_itr = rolling_stats_map_.begin(); stats_map_itr != rolling_stats_map_.end(); | ||
++stats_map_itr) { | ||
out_str << stats_map_itr->first << " | "; | ||
RollingWindow rolling_window = stats_map_itr->second; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const RollingWindow& rolling_window = name_value.second;
out_str << stats_map_itr->first << " | "; | ||
RollingWindow rolling_window = stats_map_itr->second; | ||
for (auto specific_stat_vec_itr = rolling_window.begin(); | ||
specific_stat_vec_itr != rolling_window.end(); ++specific_stat_vec_itr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a range-loop here too, with a specific type for the iterator variable (not auto). It's not obvious from looking at this code locally what the type is, which seems interesting.
RollingWindow rolling_window = stats_map_itr->second; | ||
for (auto specific_stat_vec_itr = rolling_window.begin(); | ||
specific_stat_vec_itr != rolling_window.end(); ++specific_stat_vec_itr) { | ||
out_str << *specific_stat_vec_itr << " | "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't it be weird to have a " | " at the end of the list? If you use absl::StrJoin you won't get a trailing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted it to look like a table.. although the lines are not aligned so it doesn't really look like a table in reality..
// Building lookup name map for all specific cluster values. | ||
// Every call to the updateRollingWindowMap function should get the appropriate name from the map. | ||
std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, "."); | ||
counter_name_lookup[cluster_name]["upstream_rq_5xx"] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems indirect. Can you capture the counter object pointer in your map, rather than capturing a long string that you'll then have to look up in updateRollingWindowMap, which seems like it might be called frequently enough to matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment about the data storage in the Hystrix sink. Once we get that all sorted out, I think we'll be pretty close to getting this merged! Thanks for all the work on this!
void HystrixStatCache::CreateCounterNameLookupForCluster(const std::string& cluster_name) { | ||
// Building lookup name map for all specific cluster values. | ||
// Every call to the updateRollingWindowMap function should get the appropriate name from the map. | ||
std::string cluster_name_with_prefix = absl::StrCat("cluster.", cluster_name, "."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context, I'm trying to get rid of as many map lookups as possible since static struct accesses are generally much more efficient.
Instead of using a nested map, why can't we combine counter_name_lookup
and rolling_stats_map_
into a single map from cluster name to a struct containing all cluster specific information?
For example:
struct ClusterStats {
// This constructor should initialize all the stat names by concatenating
// the cluster name and the rest of the stat name that's required for
// the lookup.
ClusterStats(const std::string &cluster_name);
const std::string retry_upstream_rq_5xx_name_;
const std::string upstream_rq_5xx_name_;
const std::string retry_upstream_rq_4xx__name_;
.
.
.
// Rolling windows
RollingWindow errors_;
RollingWindow success_;
RollingWindow total_;
.
.
.
};
// Map from cluster names to a struct of all of that cluster's stat windows.
std::unordered_map<std::string, ClusterStats> cluster_stats_map_;
This would allow us to remove all of this initialization code inside CreateCounterNameLookupForCluster
and just allow you to do one map lookup for each cluster. Then your pushNewValue
calls could look more like this (you would need to modify it to take in a RollingWindow&
and the value):
ClusterStats& cluster_stats = cluster_stats_map_[cluster_name];
.
.
.
uint64_t success = stats.counter(cluster_stats.upstream_rq_2xx_name_).value();
pushNewValue(cluster_stats.success_, success);
.
.
.
Sorry for the wall of text. Let me know if the above suggestion doesn't make sense or you don't think it will work. I think this stat export is happening often enough that we should make sure it's not doing any unnecessary work.
uint64_t error_rate = | ||
total == 0 | ||
? 0 | ||
: (static_cast<double>(errors + timeouts + rejected) / static_cast<double>(total)) * 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think you can remove all the static casts by doing the multiply on the numerator before dividing:
(100 * (errors + timeouts + rejected)) / total;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @jmarantz mentioned, please squash all commits and open a new PR because of the github scalability issue. I can move my recent comments over to the new PR.
Actually, while we are talking about new PRs, I think it would be worthwhile to break out the admin.cc changes as a separate easy-to-review PR. It touches a bunch of lines but those issues are now pretty much settled, so might as well iterate on the map data structure issues @mrice32 and I have been discussing in a smaller PR. |
@jmarantz fine with me. will open a separate PR for the admin changes. There's nothing directly related to the hystrix support there. |
RE hystrix.cc being skipped: I think this is really a github UI filter
working as intended. You can get to hysterix.cc you just have a to click
through a prompt in the 'files' diff that says "diff too large, click here
to show". It's just a little counterintuitive because if you do a find in
the page, you won't cover any of the contents of hystrix.cc.
…On Mon, Apr 23, 2018 at 3:53 PM trabetti ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In source/extensions/stat_sinks/hystrix/hystrix.h
<#2736 (comment)>:
> +
+ /**
+ * register a new connection
+ */
+ void registerConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_register);
+ /**
+ * remove registered connection
+ */
+ void unregisterConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_remove);
+ HystrixStatCache& getStats() { return *stats_; }
+
+private:
+ HystrixStatCachePtr stats_;
+ std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{};
+ Server::Instance& server_;
+ std::map<std::string, uint64_t> current_stat_values_;
@jmarantz <https://github.com/jmarantz> why does hysrtix.cc being
skipped? I also noticed that some files are not updated here. Is it because
the github PR capacity issue?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2736 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AB2kPQTu8UdrBlN5Khew_ZnbPTquPSs2ks5trjEwgaJpZM4SeoHY>
.
|
Opened PR #3172 for admin changes only. Will need to add coverage tests. |
Closing this PR. Opened a new PR #3425. |
Signed-off-by: talis talis@il.ibm.com
admin filter: add support for hystrix dashboard output
Description:
This PR adds an admin endoint for hystrix dashboard output.
When there is a request to admin_port/hystrix_event_stream, envoy starts sending SSE stream, using the same socket of the initial request.
The admin URL can be accessed directly from hystrix dashboard.
Risk Level: Medium
Testing:
Added unit testing hystrix_test.cc
The admin endpoint was not tested, since it returns an infinite event stream, so current mechanisms for testing admin endpoint do not work. If there are any suggestions on how to test it, we will add.
Docs Changes:
changed docs/root/operations/admin.rst.
PR: envoyproxy/data-plane-api#523
Release Notes:
Support was added to hystrix event stream output. From hystrix dashboard, set the Eureka URL to envoy_ip:8001, and add stream from http://envoy_ip:8001/hystrix_event_stream
Note that not all hystrix dashboard statistics are supported. More detail can be found here: <link to the documentation generated from docs/root/operations/admin.rst >
Fixes #1244