-
Notifications
You must be signed in to change notification settings - Fork 5.1k
support for hystrix dashboard through an admin endpoint, using sink #3425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support for hystrix dashboard through an admin endpoint, using sink #3425
Conversation
…echanism Signed-off-by: trabetti <talis@il.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff! A few comments to get started.
EXPECT_EQ(getStreamField(data_message, "rollingCountTimeout"), "0"); | ||
} | ||
|
||
// TEST_F(HystrixSinkTest, BasicFlow) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this all commented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests were using the old sink interface. I need to update.
it = callbacks_list_.erase(it); | ||
break; | ||
} else { | ||
++it; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can't you just move this ++it
up into the for loop statement and get rid of the else?
void HystrixSink::unregisterConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_remove) { | ||
for (auto it = callbacks_list_.begin(); it != callbacks_list_.end();) { | ||
if ((*it)->streamId() == callbacks_to_remove->streamId()) { | ||
it = callbacks_list_.erase(it); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since you're breaking, can't you just ignore the return value?
Upstream::ClusterStats& cluster_stats = cluster_info->stats(); | ||
|
||
if (cluster_stats_cache_map_.find(cluster_name) == cluster_stats_cache_map_.end()) { | ||
// ClusterStatsCache cluster_stats_cache_inst(cluster_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these commented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old code
if (cluster_stats_cache_map_.find(cluster_name) == cluster_stats_cache_map_.end()) { | ||
// ClusterStatsCache cluster_stats_cache_inst(cluster_name); | ||
// cluster_stats_cache_map_[cluster_name] = cluster_stats_cache_inst; | ||
cluster_stats_cache_map_[cluster_name] = std::make_unique<ClusterStatsCache>(cluster_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that a per-cluster cache where elements are only added but never deleted is going to effectively be a memory leak. Also, can you describe what you want to happen if a cluster is deleted and then another cluster is, many flushes later, added under the same name? IIUC, with the current implementation, this will result in that cluster using the old, unrelated cluster's cache. I have a few ideas of how to flush the cache, but before proposing anything, I'd like to understand what the desired behavior is for naming collisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I did not consider those cases. I think the best would be to detect when a cluster no longer exists and remove from the cache map. What are your ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't thought it through completely, but I think one solution would be to basically check the differences in sizes in the two maps (the cluster map in the server and the sink's cache map) after you finish updating the caches. Since you've already added any missing clusters to the sink's map, it's only possible that the sink map is larger. If it is larger, you can iterate through the sink's map and look up each name in the server's cluster map and delete any that aren't found in the server's map. It's not an incredibly elegant solution, but I'm not sure if there are any better options for trying to sync two unordered maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done handling the clusters removal + updating the test. How can I simulate removing and adding back clusters during operation? (not in a test)
} | ||
|
||
void HystrixSink::updateRollingWindowMap(Upstream::ClusterInfoConstSharedPtr cluster_info, | ||
Stats::Store& stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using each cluster's stats scope to get cluster-specific stats. Otherwise, they will not be deallocated when the cluster is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the comment about not being deallocated when the cluster is removed. Are you referring to using Upstream::ClusterStats
? or the Shared pointer Upstream::ClusterInfoConstSharedPtr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using Upstream::ClusterInfo::statsScope()
and calling counter(stat_name)
on that (the name will be different since the scope already assumes the cluster prefix) - see here.
The stat scope system is a little confusing with respect to how it manages memory. When you request a stat from a scope using scope.counters(stat_name)
, the scope will then ensure the underlying stat stays alive as long as that scope is alive. If we pull the cluster stats from the server scope, that means the stats that we're requesting are required to live as long as the server's stats scope, which is the life of the Envoy instance. If we grab them from the cluster scope, they will be deallocated when the cluster is removed as one would expect for cluster-specific stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it new interface? It is very good for us, since using this, we won't need to save the concatenated string of the statistic's name per cluster. What about the stats we are currently getting from Upstream::ClusterStats::stats()
? Take them also from the Upstream::ClusterInfo::statsScope()
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it should make things much easier. Technically, you could pull any stats in Upstream::ClusterStats::stats()
from Upstream::ClusterInfo::statsScope()
, but it's better the way you have it. Upstream::ClusterStats::stats()
is essentially a struct holding stats from Upstream::ClusterInfo::statsScope()
to access them without doing a map lookup.
Signed-off-by: trabetti <talis@il.ibm.com>
Pushed updated code, with handling the clusters removal + tests. How can I simulate removing and adding back clusters during operation? (not in a test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To try this outside of a test, you would need to use CDS to send an update with a cluster and then a second update without the cluster - see here for an explanation of dynamic updates. You can do this with a management server over gRPC or you can use the filesystem as a source.
/** | ||
* Generate the streams to be sent to hystrix dashboard. | ||
*/ | ||
void getClusterStats(absl::string_view cluster_name, uint64_t max_concurrent_requests, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: method name is misleading. Can we change to something like writeClusterStats()
, appendClusterStats()
, or addClusterStatsToStream()
(as the methods are named below)?
} | ||
|
||
void HystrixSink::addHystrixThreadPool(absl::string_view cluster_name, uint64_t queue_size, | ||
uint64_t reporting_hosts, uint64_t rolling_window, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since rolling_window
represents a time, but it's represented as an integer, can you change the name to indicate that it's a time - like rolling_window_ms
or window_period_ms
? As a reader, this was a bit confusing to me.
} | ||
|
||
std::stringstream ss; | ||
for (auto& cluster : clusters) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this series of operations could be simplified a bit. Could we combine this loop with the updateRollingWindowMap
loop above? We could also move the cache cleanup to the very end (since this is probably the least time sensitive operation). Then we can do a single lookup of the cache for a particular cluster and pass it through to updateRollingWindowMap
and getClusterStats
.
// the lookup. | ||
ClusterStatsCache(const std::string& cluster_name); | ||
|
||
std::string errors_name_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these names just here for the purpose of testing and logging? I don' think the full stat names are still needed since we're looking the stats up within the cluster scope. It appears that they're only used in printRollingWindows()
, which seems to only be used in tests and log messages? Not sure how others feel, but I think we should probably avoid storing this state if it's just for logging and testing. If we want to keep that method for logging and testing, we could just let it construct whatever preferred stat name is needed on the fly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed new version. Please see if it seems more reasonable now. Another option is to totally remove it. It is there just for logging. Not needed for testing.
response_headers.insertNoChunks().value().setReference("0"); | ||
|
||
Http::StreamDecoderFilterCallbacks& stream_decoder_filter_callbacks = | ||
const_cast<Http::StreamDecoderFilterCallbacks&>(admin_stream.getDecoderFilterCallbacks()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to const cast this? If this interface returns a const StreamDecoderFilterCallbacks&
, I think there's something wrong with it. StreamDecoderFilterCallbacks
has no const methods, which means a const one would essentially be useless, right? I think this should be updated to return a non-const StreamDecoderFilterCallbacks&
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wanted to return only const refs when we worked on the AdminStream. It forced me to use the const_cast for the hystrix implementation. We overlooked the fact there are no const methods on it anyway.
Signed-off-by: trabetti <talis@il.ibm.com>
Hi, any feedback on last commit? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! I left a few more nits, but I think this is getting close. I haven't had a chance to go through the tests, yet. @jmarantz, would you mind taking a pass, especially through the tests?
|
||
void HystrixSink::flush(Stats::Source&) { | ||
if (callbacks_list_.empty()) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we generally prefer:
if (callbacks_list_.empty()) {
return;
}
Upstream::ClusterStats& cluster_stats = cluster_info->stats(); | ||
Stats::Scope& cluster_stats_scope = cluster_info->statsScope(); | ||
|
||
if (cluster_stats_cache_map_.find(cluster_name) == cluster_stats_cache_map_.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you could avoid the redundant lookup by doing something like:
std::unique_ptr<ClusterStatsCache>& cluster_stats_cache_ptr = cluster_stats_cache_map_[cluster_name];
if (cluster_stats_cache_ptr == nullptr) {
cluster_stats_cache_ptr = std::make_unique<ClusterStatsCache>(cluster_name);
}
// Use cluster_stats_cache_ptr here
Upstream::ClusterInfoConstSharedPtr cluster_info = cluster.second.get().info(); | ||
|
||
// update rolling window with cluster stats | ||
updateRollingWindowMap(cluster_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not just do the cache lookup based on cluster name out here and pass a reference to the cache into both methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC - you mean to move the std::unique_ptr<ClusterStatsCache>& cluster_stats_cache_ptr = cluster_stats_cache_map_[cluster_name];
up to flush()? Please see the new version , is this what you had in mind?
|
||
void HystrixSink::addClusterStatsToStream(absl::string_view cluster_name, | ||
uint64_t max_concurrent_requests, | ||
uint64_t reporting_hosts, uint64_t rolling_window_ms, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Sorry for being indecisive on this, but could we instead just pass the rolling_window_ms around as type std::chrono::milliseconds
rather than a uint64_t
? This means that if the flush interval that you're accessing starts being represented as std::chrono::microseconds
or std::chrono::seconds
, you'll see a compile-time error. I know this is unlikely, but we've had problems with outdated .count()
interpretations in the past.
// check if any clusters were removed, and remove from cache | ||
if (clusters.size() < cluster_stats_cache_map_.size()) { | ||
std::vector<std::string> clusters_to_remove; | ||
for (auto& itr : cluster_stats_cache_map_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this could be made a bit simpler/faster by doing the following:
for (auto it = cluster_stats_cache_map_.begin(); it != cluster_stats_cache_map_.end();) {
if (clusters.find(it.first) == clusters.end()) {
it = cluster_stats_cache_map_.erase(it);
} else {
++it;
}
}
Signed-off-by: trabetti <talis@il.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing comments, some of which were pending for the last month :) I'm going to look at the tests next.
source/common/http/headers.h
Outdated
@@ -208,6 +211,15 @@ class HeaderValues { | |||
const std::string AcceptEncoding{"Accept-Encoding"}; | |||
const std::string Wildcard{"*"}; | |||
} VaryValues; | |||
|
|||
struct { | |||
const std::string AccessControlAllowHeadersHystrix{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem like you need to prefix this variable-name with 'AccessControlAllow', since that's in the struct name.
// chunk transfer encoding but we don't have a content-length so we pass "envoy only" | ||
// header to avoid adding chunks | ||
if (headers.NoChunks()) { | ||
no_chunks = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just put this in the initializer on line 47?
const bool no_chunks = headers.NoChunks();
actually there's just a single-use, maybe you can just write
if (saw_content_length || headers.NoChunks()) {
below, and skip the temp?
init(); | ||
} | ||
|
||
HystrixSink::HystrixSink(Server::Instance& server) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at the call-sites you could kill this version, and have the 2-arg ctor init like this:
: server_(server),
current_index_(num_of_buckets ? num_of_buckets : DEFAULT_NUM_OF_BUCKETS_),
window_size_(current_index_ + 1) {
That would simplify call-sites.
void onHistogramComplete(const Stats::Histogram&, uint64_t) override{}; | ||
|
||
/** | ||
* register a new connection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/register/Registers/ , and specify doxygen for args per convention, here & below.
and below
a/remove/Removes/, s/Add/Adds/, etc.
std::stringstream& ss); | ||
|
||
// HystrixStatCachePtr stats_; | ||
std::vector<Http::StreamDecoderFilterCallbacks*> callbacks_list_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessary to initialize an empty vector; just leave out the {}
static const uint64_t DEFAULT_NUM_OF_BUCKETS = 10; | ||
|
||
// Map from cluster names to a struct of all of that cluster's stat windows. | ||
std::unordered_map<std::string, ClusterStatsCachePtr> cluster_stats_cache_map_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/{}//
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK left some test comments. The tests look really comprehensive because they are long and the names of the tests seem to cover the sort of functionality that I'd expect.
But it would be easier to follow if you factored out common blocks into shared helper methods.
|
||
std::string cluster_name_; | ||
|
||
// Rolling windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move member vars below the methods.
typedef std::vector<uint64_t> RollingWindow; | ||
typedef std::map<const std::string, RollingWindow> RollingStatsMap; | ||
|
||
struct ClusterStatsCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a class not a struct.
https://google.github.io/styleguide/cppguide.html#Structs_vs._Classes
But if this is a clear pattern in Envoy it's OK to leave it as a struct, in which case you don't need to suffix the member vars with _
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually by that definition, it seems to me that it should be a struct and not a class. It passively holds the rolling windows, which are manipulated by the sink's methods. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's fine; I guess the methods are just printers. And actually the _ suffix is called out by STYLE.md as a local rule. Leave as is then :)
absl::string_view::size_type key_pos = dataMessage.find(key); | ||
EXPECT_NE(absl::string_view::npos, key_pos); | ||
absl::string_view trimDataBeforeKey = dataMessage.substr(key_pos); | ||
key_pos = trimDataBeforeKey.find(" "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trim_data_before_key, trim_data_after_value.
// set callbacks to send data to buffer | ||
Buffer::OwnedImpl buffer; | ||
auto encode_callback = [&buffer](Buffer::Instance& data, bool) { | ||
buffer.add( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put the comment above the call to buffer.add(data);
-- I think it'll format more nicely that way.
|
||
Buffer::OwnedImpl createClusterAndCallbacks() { | ||
|
||
// set cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set cluster. (sentences start with a capital letter, end with a period). Here & below.
// //std::string rolling_map = sink_->getStats().printRollingWindow(); | ||
std::string rolling_map = sink_->printRollingWindows(); | ||
std::size_t pos = rolling_map.find("test_cluster1.total"); | ||
EXPECT_NE(std::string::npos, pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you could do this in one line:
EXPECT_NE(std::string::npos, rolling_map.find("test_cluster1.total"));
the advantage is that if it fails the test error message would be a little easier to debug because it would include the source of the find call.
std::size_t pos1 = data_message.find(cluster1_name_); | ||
std::size_t pos2 = data_message.find(cluster2_name); | ||
EXPECT_NE(std::string::npos, pos1); | ||
EXPECT_NE(std::string::npos, pos2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably want to make these ASSERT_NE because you don't want to use relational operators below on pos1 or pos2 if they are npos.
EXPECT_EQ(getStreamField(data_message_2, "requestCount"), "0"); | ||
EXPECT_EQ(getStreamField(data_message_2, "rollingCountSemaphoreRejected"), "0"); | ||
EXPECT_EQ(getStreamField(data_message_2, "rollingCountSuccess"), "0"); | ||
EXPECT_EQ(getStreamField(data_message_2, "rollingCountTimeout"), "0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My general comment here is that I can only peephole review the tests because they are so long, so it's hard for me to tell what's going on. Can you factor out some common blocks into helpers? E.g. this pattern of checking that a specific bunch of values in a message are 0 is repeated 3x in this file. Factor that out into a common method. This is also the case likely with a bunch of other stanzas, such as as setting up a bunch of mocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing it out. I am factoring out a bunch of stuff. Will push once ready.
Problem with helper method for the checkers is, as you pointed out in the past, it would be hard to debug error messages from lines inside the function. But the test code would be more readable. I can group most of the checkers into readable methods. Any suggestion here?
pos1 = data_message.find(cluster1_name_); | ||
pos2 = data_message.find(cluster2_name); | ||
EXPECT_NE(std::string::npos, pos1); | ||
EXPECT_NE(std::string::npos, pos2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASSERT_EQ
std::string data_message = TestUtility::bufferToString(buffer); | ||
|
||
std::size_t pos1 = data_message.find(cluster1_name_); | ||
std::size_t pos2 = data_message.find(cluster2_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this pattern of using find on the cluster-names and this dance with the substr etc could be done easier with absl::StrSplit. That would be the case if there were a known delimiter between the clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are known delimiters, but the order of the clusters is not guaranteed, so we'll need to add more actions to find which is which. (or otherwise, add sorting in the code that produces the stream)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A helper method to populate & return a map would be be useful to remove this kind of complicated pattern of finding substring delimiters in this way. In addition to require a handful of lines that require some scrutiny, the current pattern could fail if either cluster-name turned out to be a substring of some other syntax in the data.
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
New version with most review comments taken care of. Factored out a lot in the test. I am in doubt about factoring checkers into functions, because it would be hard to debug failures. |
ON_CALL(cluster_manager_, clusters()).WillByDefault(Return(cluster_map_)); | ||
sink_->flush(source_); | ||
|
||
cluster_message_map.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need this, as on the next line you kill the previous value completely.
|
||
// Generate data to both clusters. | ||
sink_->flush(source_); | ||
for (int i = 0; i < 12; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 12 arbitrary? I feel like some of the golden values you are comparing to below are somehow related to the constants here, but it's not super-obvious what the relationship is. E.g. I see a few cases throughout this file where rollingCountSemaphoreRejected is tested against some expected constant, but I can't tell why that constant is expected. E.g. why is errorCount 160?
Can we at least have comments explaining the relationship?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the values are arbitrary. We set some values to simulate the statistics and then we calculate the expected values in the stream. I was thinking we could do that in a function that will set the counter values and then calculate expected results and check them, this will make things more obvious. This function can be used in many places throughout the test. But again, I have the same concern that when the function fails, it will be hard to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following you why failures in a function are hard to debug.
Do you mean that if there's an EXPECT failure in a helper function, you might not see the whole call-stack in a test log? I agree it's awesome when you can diagnose and fix failures purely based on a test log. However, I don't think it justifies flattening and inlining what logically should be functions.
You can also make test logs directly actionable by passing enough context to the function to print, but usually the clutter is not worth it.
And you can actually debug tests with gdb, in which case a function does not present any barrier to debuggability. Note that debugging envoy unit tests with gdb is not super-obvious even if you are otherwise familiar with gdb, but there is a helper script:
tools/bazel-test-gdb //test/common/common:utility_test -c dbg
And the high order bit here, is: this test is filled with magic constants and it's not clear why they are the right ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll put them in functions and then we'll look and see if it can be improved for debugging.
|
||
std::unordered_map<std::string, std::string> cluster_message_map = | ||
buildClusterMap(TestUtility::bufferToString(buffer)); | ||
ASSERT_NE(cluster_message_map.find(cluster1_name_), cluster_message_map.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per current ASAN failure, recommend adding
<< "cluster1_name=" << cluster1_name_;
to this ASSERT_NE, and similar below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is still pending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand you suggestion. ASAN is not failing anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion was not just to fix your ASAN problem, but also generally to make EXPECT/ASSERT failures actionable, should they occur again in the future.
When checking map/set membership in an ASSERT/EXPECT the test infra will show the code and the actuals, but not automatically show the args, so it would be aid debugging to stream in the value as I showed. This is particularly important if the failures may be flaky as they might not show on repro, so you want to structure your EXPECT/ASSERT calls to get something actionable in the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more minor comments.
// Stats configuration proto schema for built-in *envoy.hystrix* sink. | ||
// The sink emits stats in SSE format for use with Hystrix dashboard | ||
message HystrixSink { | ||
int64 num_of_buckets = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we just shorten to num_buckets
here and throughout?
} | ||
|
||
HystrixSink::HystrixSink(Server::Instance& server, const uint64_t num_of_buckets) | ||
: // stats_(new HystrixStatCache(num_of_buckets)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can this commented line be removed?
} | ||
} | ||
|
||
void HystrixSink::updateRollingWindowMap(Upstream::ClusterInfoConstSharedPtr cluster_info, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we just pass this as a const ClusterInfo&
(the result of *cluster_info
) to save a shared_ptr
copy?
std::chrono::milliseconds rolling_window_ms, | ||
std::stringstream& ss) { | ||
|
||
std::stringstream cluster_info; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why create this variable? Why not just append everything to ss
directly? (Same comment for the other places where this is done).
absl::string_view cluster_name, | ||
uint64_t max_concurrent_requests, | ||
uint64_t reporting_hosts, | ||
std::chrono::milliseconds rolling_window_ms, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this one is my bad. I think we can remove the _ms
from this variable now that the type tells us that it's milliseconds. Sorry about that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it to distinguish from rolling_window of RollingWindow type. Here and in some other places: https://github.com/trabetti/envoy/blob/d464f8c1f9de8eb21775ee1768006008543a21c6/source/extensions/stat_sinks/hystrix/hystrix.h#L57
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, sounds good. Not a huge deal to me either way.
Signed-off-by: trabetti <talis@il.ibm.com>
Pushed a new version. @jmarantz Does it look like the right direction in terms of factoring out the tests and making them more readable? It still looks quiet messy.. 😕 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - after jmarantz@'s test concerns are addressed.
Buffer::OwnedImpl buffer; | ||
auto encode_callback = [&buffer](Buffer::Instance& data, bool) { | ||
// Set callbacks to send data to buffer. | ||
buffer.add( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still looks awkward. I was thinking of:
// Set callbacks to send data to buffer. This will append to the end of the
// buffer, so multiple calls will all be dumped one after another into this buffer.
buffer.add(data);
timeout_retry_step, rejected_step, window_size_); | ||
validate_results(cluster_message_map[cluster2_name], success_step2, | ||
error_4xx_step2 + error_4xx_retry_step2 + error_5xx_step2 + | ||
error_5xx_retry_step2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason not to just make a single constant for the sum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need each one to set values to counters, somehow forgot to use these after defining them...
InSequence s; | ||
|
||
// Arbitrary numbers for testing. Make sure error > timeout. | ||
uint64_t success_step = 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: make these const.
I'm not sure I believe these are completely arbitrary though; I think some must be dependent on others. Maybe comment those inline?
e.g.
uint64_t error_step = 17; // must be larger than xxxx + yyyyy + ....
it seems unlikely that error > timeout is sufficient, right? If it just had to be > timeout I'd have expected you to set it to 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, they are. Will set them to running values so it will be more obvious. And there aren't many restrictions on the actual values for the calculation purposes.
for (uint64_t i = 0; i < (window_size_ + 1); i++) { | ||
buffer.drain(buffer.length()); | ||
// Cluster 1 | ||
ON_CALL(cluster1_.error_5xx_counter_, value()).WillByDefault(Return((i + 1) * 17)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the 17 here the same thing as error_step? If so, use the constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, forgot to use all those values after I defined them...
Signed-off-by: trabetti <talis@il.ibm.com>
Hope we're approaching the end... What should be done in terms of documentation? |
uint64_t error_step = 4; | ||
uint64_t timeout_step = 3; | ||
uint64_t timeout_retry_step = 2; | ||
uint64_t rejected_step = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these are important relative to one another, you could make that clearer in the code in a few ways: e.g. specify them in numeric order, or maybe even assign them from a post-increment accumulator, e.g.
uint64_t accum = 1;
uint64_t rejected_step = accum++;
uint64_t timeout_retry_step = accum++;
etc
But I'm not sure if that's better than explicit numbers.
Make them const though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave them as explicit numbers. Changed some values to make it more random. These numbers should not be dependent on each other, they represent traffic. The only constraint is errors>timeouts, since Envoy counts timeouts together with errors, and Hystrix needs the values separated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I shouldn't be using "step", but instead randomly increasing values. I have added randomly increasing values to basic test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK looking forward to the refactored version. Hopefully much smaller, and with numbers that are more transparent in their meaning.
return buffer; | ||
} | ||
|
||
void validate_results(std::string data_message, uint64_t success_step, uint64_t error_step, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validateResults(const std::string& data_message, ...
That's style-guide for envoy functions, and you almost never want to pass std::string by value.
} | ||
|
||
TEST_F(HystrixSinkTest, AddAndRemoveClusters) { | ||
InSequence s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has improved a lot. Thank you! It is still a bit long. I'm not sure if it could be broken up into multiple tests. If so, that'd be my first suggestion. I think there's a notion here that the latter part of the test depends on the former, but there's really some distinct tests. here.
It's not obvious whether there are repeating patterns to factor out, but maybe it should be broken into single-call helpers anyway, just for legibility.
A search on this topic yielded https://refactoring.guru/extract-method
One thing that comes to mind is to break this into 2 tests with one shared helper.
TEST_F(HysterixSinkTest, AddClusters) {
AddHelper();
EXPECT that add worked
}
TEST_F(HystrixSinkTest, RemoveClusters) {
AddHelper();
RemoveHelper();
EXPECT that remove worked.
}
I think it's OK to outline RemoveHelper even though it's single use, for brevity and parallelism.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the link, I learned.
Separating into two tests - I can do it. But it means that some actions will be done twice in both tests - I want the cache to be dirty, so AddHelper is not just setting up stuff.
In the new version (pushing soon) I attempted to factor out some stuff in the spirit of the link you sent, and I was also able to move away some unnecessary actions from the clusters test. please see how it looks now, and if it still looks two much, I can split into to test as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think splitting them up would help readability.
I'm not sure why there's an issue with the cache being dirty though. Can you explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to see that if a cluster that was active is removed, and then a new cluster with same name is added, there are no leftover data from the previous cluster (it was an actual bug I fixed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. So can you split it up help readability? My suggestion does not change flow of the test you have here, just makes it easier to read (and additionally adds a simpler test).
Signed-off-by: trabetti <talis@il.ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thank you for the clear comments, and the tests are much more readable now.
Just a few nits about the comments themselves left.
// Values are unimportant - they represent traffic statistics, and for the purpose of the test any | ||
// arbitrary number will do. Only restriction is that errors >= timeouts, since in Envoy timeouts | ||
// are counted as errors and therefore the code that prepares the stream for the dashboard reduces | ||
// number of timeouts from total number of errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/reduces/deducts the/
You could use either "reduce ... by" or "deducts ... from".
public: | ||
HystrixSinkTest() { sink_.reset(new HystrixSink(server_, window_size_)); } | ||
|
||
// Return the value corresponding to key in an event stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you just put into a comment a sample line that you are parsing here?
EXPECT_NE(absl::string_view::npos, key_pos); | ||
absl::string_view trim_data_before_Key = data_message.substr(key_pos); | ||
key_pos = trim_data_before_Key.find(" "); | ||
EXPECT_NE(absl::string_view::npos, key_pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably for all these EXPECTs it would be good to add
<< "data_message=" << data_message;
In case someone besides you runs into this, that would let them know why the data didn't look like what was expected.
// Return a string without quotes. | ||
absl::string_view getStringStreamField(absl::string_view data_message, absl::string_view key) { | ||
absl::string_view value = getStreamField(data_message, key); | ||
return value.substr(1, value.length() - 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth EXPECTing that the first & last chars are quotes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, with a minor fix, the stream is JSON format. is it worth to use JSON utils to get the value instead of this ugly function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds great. I'd also be OK with that as a follow-up.
} | ||
|
||
TestRandomGenerator rand_; | ||
uint64_t window_size_ = rand_.random() % 10 + 5; // Arbitrary reasonable number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! I didn't realize this but, but I just poked around, and this both echoes the seed, and provides a way for you to re-run with the same seed.
@mattklein123, I think @jmarantz and I are about done with our comments, so I think this is ready for a pass by a maintainer. |
Thanks will take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool stuff! I just skimmed through and everything looks sane to me. Just a few small comments. Thank you!
source/common/http/headers.h
Outdated
@@ -212,6 +215,14 @@ class HeaderValues { | |||
const std::string AcceptEncoding{"Accept-Encoding"}; | |||
const std::string Wildcard{"*"}; | |||
} VaryValues; | |||
|
|||
struct { | |||
const std::string AllowHeadersHystrix{"Accept, Cache-Control, X-Requested-With, Last-Event-ID"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm not sure I would define this in this file as it is specific to an extension. Can you move this to somewhere in your extension code?
|
||
.. _operations_admin_interface_hystrix_event_stream: | ||
|
||
.. http:get:: /hystrix_event_stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a release note also.
* Register a new connection. | ||
*/ | ||
void registerConnection(Http::StreamDecoderFilterCallbacks* callbacks_to_register); | ||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline before this line
@@ -18,6 +18,8 @@ class StatsSinkNameValues { | |||
const std::string DOG_STATSD = "envoy.dog_statsd"; | |||
// MetricsService sink | |||
const std::string METRICS_SERVICE = "envoy.metrics_service"; | |||
// Hystrix sink | |||
const std::string HYSTRIX = "envoy.hystrix"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See Note above, please call this "envoy.stat_sinks.hystrix" (and update any necessary docs)
Signed-off-by: trabetti <talis@il.ibm.com>
…trix_dashboard_support_wip_sync Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
I had to merge to master, to get updated release notes file. Now it is not compiling for me (compiling with docker). I pushed it anyway to see if it passes compilation here. |
docs/root/intro/version_history.rst
Outdated
@@ -11,6 +11,8 @@ Version history | |||
Support for the legacy proto :repo:`source/common/ratelimit/ratelimit.proto` is deprecated and will be removed at the start of the 1.9.0 release cycle. | |||
* tracing: added support for configuration of :ref:`tracing sampling | |||
<envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.tracing>`. | |||
* admin: added :http:get:`/hystrix_event_stream` as an endpoint for monitoring envoy's statistics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please alpha order
…trix_dashboard_support_wip_sync Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
@jmarantz I changed the test to use json utils, maybe you can take a look |
@trabetti sorry needs another master merge. Please ping me immediately after it passes tests and I will merge! |
Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: trabetti <talis@il.ibm.com>
@mattklein123 all tests have passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome to get this in! Please do a blog post with a demo. ;)
@@ -77,6 +77,7 @@ EXTENSIONS = { | |||
"envoy.stat_sinks.dog_statsd": "//source/extensions/stat_sinks/dog_statsd:config", | |||
"envoy.stat_sinks.metrics_service": "//source/extensions/stat_sinks/metrics_service:config", | |||
"envoy.stat_sinks.statsd": "//source/extensions/stat_sinks/statsd:config", | |||
"envoy.stat_sinks.hystrix": "//source/extensions/stat_sinks/hystrix:config", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trabetti Small comment - the hystrix entry is the only one in the list that is not alphabetically sorted within the section. For consistency it should be below dog_statsd
and above metrics_service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opened #3752
Adding the latency information from existing histogram to Hystrix event stream that was added in #3425 Risk Level: Low Testing: Docs Changes: admin.rst Release Notes: No addition to release note of original Hystrix stream PR, which is still pending for 1.8.0 Fixes #3753 Signed-off-by: trabetti <talis@il.ibm.com>
Signed-off-by: talis talis@il.ibm.com
admin filter: add support for hystrix dashboard output
Description:
This PR adds an admin endoint for hystrix dashboard output.
When there is a request to admin_port/hystrix_event_stream, envoy starts sending SSE stream, using the same socket of the initial request.
The admin URL can be accessed directly from hystrix dashboard.
Risk Level: Medium
Testing:
Added unit testing hystrix_test.cc - WIP
The admin endpoint was not tested, since it returns an infinite event stream, so current mechanisms for testing admin endpoint do not work. If there are any suggestions on how to test it, we will add.
Docs Changes:
Release Notes:
Support was added to hystrix event stream output. From hystrix dashboard, set the Eureka URL to envoy_ip:8001, and add stream from http://envoy_ip:8001/hystrix_event_stream
Note that not all hystrix dashboard statistics are supported. More detail can be found here: <link to the documentation generated from docs/root/operations/admin.rst >
Fixes #1244
split from PR #2736