
Conversation

mergeconflict
Member

Description: Add per-thread dispatcher statistics for loop duration and poll delay, based on new "prepare" and "check" watchers added in libevent (libevent/libevent#793). See discussion in #4952.
Risk Level: medium
Testing: Added unit test, all existing tests pass, and running locally yields sane results.
Docs Changes: Added a new page on "performance" discussing event loop.
Release Notes: Added an entry noting new stats.
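
For context, a minimal sketch of how the new watchers can drive these stats (illustrative only, not this PR's code; `LoopTiming`, `recordPollDelayUs`, and `installWatchers` are hypothetical names):

```cpp
// Measure poll delay with libevent's evwatch API (libevent/libevent#793).
#include <event2/event.h>
#include <event2/util.h>
#include <event2/watch.h>

struct LoopTiming {
  timeval prepare_time{}; // when the loop last entered poll
  timeval timeout{};      // the timeout libevent chose to pass to poll
  bool have_timeout{false};
};

void recordPollDelayUs(long delay_us); // hypothetical stats sink

// Runs just before the loop blocks in epoll/select/etc.
static void onPrepare(evwatch*, const evwatch_prepare_cb_info* info, void* arg) {
  auto* timing = static_cast<LoopTiming*>(arg);
  timing->have_timeout = evwatch_prepare_get_timeout(info, &timing->timeout) != 0;
  evutil_gettimeofday(&timing->prepare_time, nullptr);
}

// Runs just after the loop wakes up, before event callbacks fire. Loop
// duration can be measured symmetrically, from here to the next prepare.
static void onCheck(evwatch*, const evwatch_check_cb_info*, void* arg) {
  auto* timing = static_cast<LoopTiming*>(arg);
  if (!timing->have_timeout) {
    return;
  }
  timeval now, expected, delay;
  evutil_gettimeofday(&now, nullptr);
  evutil_timeradd(&timing->prepare_time, &timing->timeout, &expected);
  evutil_timersub(&now, &expected, &delay); // negative => woke up early
  if (delay.tv_sec >= 0) {
    recordPollDelayUs(delay.tv_sec * 1000000 + delay.tv_usec);
  }
}

void installWatchers(event_base* base, LoopTiming* timing) {
  evwatch_prepare_new(base, onPrepare, timing);
  evwatch_check_new(base, onCheck, timing);
}
```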

Signed-off-by: Dan Rosen <mergeconflict@google.com>

Signed-off-by: Dan Rosen <mergeconflict@google.com>
Member Author

@mergeconflict mergeconflict left a comment


Some questions and comments for reviewers below.
/review @htuch @mattklein123

@@ -16,6 +16,7 @@ namespace Envoy {
namespace Event {

MockDispatcher::MockDispatcher() {
ON_CALL(*this, initializeStats(_)).WillByDefault(Return());
Member Author

This is needed for a handful of existing tests.

Member

Are you going to write some new tests? I don't think we need to do anything super fancy with mock time etc., but if we could just run a dispatcher and then check some invariants, e.g. that the histograms are non-empty, the delays are >= expected, I think that would be sufficient.

Member Author

I've written one in DispatcherImplTest to verify that we create the expected stats. I haven't written anything to assert that the stats are populated, since I assumed that would be too flaky, but lemme think about that a bit more...

Member

@htuch htuch left a comment

Very concise and clear, I like it. Excited to see this appearing in Envoy!
/wait

@@ -41,6 +41,14 @@ DispatcherImpl::DispatcherImpl(Buffer::WatermarkFactoryPtr&& factory, Api::Api&

DispatcherImpl::~DispatcherImpl() {}

void DispatcherImpl::initializeStats(Stats::Scope& parent_scope) {
std::string thread_id = api_.threadFactory().currentThreadId()->debugString();
Member

Prefer not to use a debug string interface, because this is subject to change, i.e. there are no long term guarantees on debug representations.

Member Author

The ThreadId class doesn't really offer anything better currently. I could do a bit of additional spelunking and add an `int64_t id() const` method there, but I also wonder if the underlying platform-specific thread IDs are hidden from the interface on purpose.

Member

It is hidden on purpose. How do you think we can provide a stable ID given that debug is not guaranteed to be that?

Member

Could we just assign workers an ordinal ID, e.g. [0, 1, 2, ..., N-1] for N workers, and use that? The advantage is this will be consistent across platforms.

Member Author

A trivial way would be to just assign sequential IDs ("worker 1", "worker 2"...), which seems fine to me. But then you can't correlate anything going awry in these stats with anything you might see in the logs, which are tagged with actual thread IDs...

Member

@htuch htuch Apr 10, 2019

What if we have threads log their known name when they start up? That way, you can do the join, albeit with some additional work.

Member Author

Opened #6548 after chatting with @htuch.

Member Author

... and closed #6548 again after chatting with @jmarantz. I'm just going to identify worker dispatchers with a sequential ID as discussed above, and we can worry about how to map those to thread IDs in a separate PR.

Member Author

Ok, per @htuch, I now log the thread ID for debugging (in run, below). Note that I can't do it here because initializeStats isn't necessarily called from the same thread as the dispatcher's event loop.
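
The log line described might look roughly like this (hypothetical wording; `name_` is assumed to hold the dispatcher's name):

```cpp
// Logged from run(), so it is guaranteed to execute on the loop's own thread,
// letting operators join these stats against thread IDs seen in other logs.
ENVOY_LOG(debug, "running dispatcher '{}' on thread {}", name_,
          api_.threadFactory().currentThreadId()->debugString());
```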


Signed-off-by: Dan Rosen <mergeconflict@google.com>
@mergeconflict
Member Author

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: build_image (failed build)
🔨 rebuilding ci/circleci: ipv6_tests (failed build)

🐱

Caused by: a #6517 (comment) was created by @mergeconflict.


Signed-off-by: Dan Rosen <mergeconflict@google.com>
Signed-off-by: Dan Rosen <mergeconflict@google.com>
* Start writing stats once thread-local storage is ready to receive them (see
* ThreadLocalStoreImpl::initializeThreading).
* @param parent_scope the parent scope for the new per-dispatcher stats scope created here.
* @param name the name for the new stats scope, identifying this dispatcher.
Contributor

What happens for stats for dispatchers that are allocated and run without calling this?

I think this might be OK but just specify it here. Auto-name them?

Member Author

What I've put in the docs, at least for now, is that we only record event loop stats for the main thread and workers, not any other auxiliary threads. I assume other threads might be transient, and it would just be noisy to include stats for them automatically. What do you think?

Member

I think what you have is fine for now.

Contributor

I think that's OK, but can we (a) document that here and (b) have a stat for the number of anonymous untracked threads with dispatchers?

You can also survey source/.../*.cc for references to allocateDispatcher(). There's at least one you should probably add, in server/guarddog_impl.cc

Member

@mattklein123 mattklein123 left a comment

Very cool. Flushing out some misc comments.

/wait-any


// feeling saucy. Disregard negative delays in stats, since they don't indicate anything
// particularly useful.
if (delay.tv_sec >= 0) {
self->stats_->poll_delay_us_.recordValue(delay.tv_sec * 1000000 + delay.tv_usec);
Member

Any reason to not just use `std::chrono` stuff in this file to avoid this math?

Member Author

The honest answer is, I didn't know it existed :) But it looks like converting from timeval to the std::chrono stuff involves doing that math anyway. I'll factor it out into a helper method, though, just to keep it isolated and eliminate the duplication.
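
The helper might look something like this (hypothetical name; only standard `<chrono>` involved):

```cpp
#include <sys/time.h> // timeval
#include <chrono>

// Convert libevent's timeval to std::chrono microseconds in one place, so the
// usec arithmetic isn't repeated at every recordValue() call site.
std::chrono::microseconds timevalToDuration(const timeval& tv) {
  return std::chrono::seconds(tv.tv_sec) + std::chrono::microseconds(tv.tv_usec);
}
```

Call sites then read `stats_->poll_delay_us_.recordValue(timevalToDuration(delay).count())`.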

Member

Sorry, I think what I'm proposing is to not use any C-style time stuff at all here, and just use a TimeSource interface which can give you monotonic time; then you should be able to use std::chrono::microseconds etc.?

Member Author

I could get rid of the evutil_gettimeofday calls, but I can't replace the evwatch_prepare_get_timeout call -- that gives me the timeout that libevent has internally decided to pass into epoll/select/whatever.
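
For reference, a sketch of the call in question (declared in libevent's `event2/watch.h`), which copies out the timeout libevent is about to hand to the backend poll -- exactly the information a TimeSource can't supply:

```cpp
// Inside a prepare callback: returns nonzero if libevent decided to pass a
// timeout to epoll/select, and copies it into `timeout`.
timeval timeout;
if (evwatch_prepare_get_timeout(info, &timeout) != 0) {
  // `timeout` is how long the loop is about to sleep for, at most.
}
```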

@@ -137,6 +138,18 @@ void ThreadLocalStoreImpl::initializeThreading(Event::Dispatcher& main_thread_di
tls_->set([](Event::Dispatcher&) -> ThreadLocal::ThreadLocalObjectSharedPtr {
return std::make_shared<TlsCache>();
});

// Initialize stats for each thread's dispatcher. This can't be done in the InitializeCb passed
Member

Thoughts on whether this is the right place to do this? Should we instead be initializing worker stats in the server or listener manager when we create the workers? It feels like that might be the right place vs. baking this into the store itself? WDYT?

Member Author

Good question. The way my initial draft implementation worked was, the DispatcherStats object for each DispatcherImpl was initialized in its constructor, rather than allocated on the heap in initializeStats(). I still needed to have a separate initializeStats() method, though, to indicate when it's safe to start actually writing to them, because the workers are allocated before TLS::initializeThreading() is called (meaning there's no slot to write to yet).

The main thing I didn't like about that approach was just that it required plumbing the parent Stats::Scope into DispatcherImpl's ctor, which ended up touching a lot of stuff (e.g. ValidationDispatcher, Api::allocateDispatcher(), etc.).

Having said that, it still might be the right way to go, if we think it's useful for all dispatchers (not just the main and worker threads) to have these stats.

Member

Sorry, I think what I'm proposing is what you are doing, but not initializing dispatcher stats inside the thread-local store, which seems not quite right to me. I would probably actually do this at the server level after the workers are created and initializeThreading is called. Then you could iterate through all workers and initialize dispatcher stats on the worker. I think this would make it a lot more clear that we are only doing this on workers, and then in the future we could do it on other dispatchers if we want to.

Member Author

Yeah, I think I understand what you're suggesting. The workers are all managed by the listener manager, so I'll have to do a little plumbing there, but that shouldn't be bad.

Member Author

Ok, plumbing is added in this revision. Stats are now initialized for each worker in ListenerManagerImpl::startWorkers, which seems more appropriate.
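
Roughly the shape described, as a sketch (`startWorkerStats` and `dispatcher()` are illustrative names, not the PR's exact code):

```cpp
// After ThreadLocal::initializeThreading(), give each worker's dispatcher a
// sequentially numbered stats scope: "worker_0.", "worker_1.", ...
void startWorkerStats(Stats::Scope& scope, std::vector<WorkerPtr>& workers) {
  uint64_t i = 0;
  for (auto& worker : workers) {
    worker->dispatcher().initializeStats(scope, fmt::format("worker_{}.", i++));
  }
}
```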

@jmarantz
Contributor

jmarantz commented Apr 12, 2019 via email

@mattklein123
Member

> I thought the challenge here was we got the C time struct from libevent now, so it has to be converted to C++ time somewhere. We could centralize that though.

If that's the case then never mind; I thought we were just getting event callbacks and calling gettimeofday ourselves, in which case we could do whatever we want.

Signed-off-by: Dan Rosen <mergeconflict@google.com>
Signed-off-by: Dan Rosen <mergeconflict@google.com>
Signed-off-by: Dan Rosen <mergeconflict@google.com>
Member

@mattklein123 mattklein123 left a comment

LGTM. A few random comments. Super cool. Will defer to @htuch and @jmarantz for further reviews.

Member

@htuch htuch left a comment

Looks good modulo Matt's comments and a couple I have. Very clean.
/wait


// Delay can be negative, meaning polling completed early. This happens in normal operation,
// either because I/O was ready before we hit the timeout, or just because the kernel was
// feeling saucy. Disregard negative delays in stats, since they don't indicate anything
Member

Should we discard or record a 0 delay when negative? I'm thinking the latter could be good to ensure we don't get anomalies like the following. Imagine there is a slight negative delay normally, and you see this on a million iterations. But, scattered amongst these, we see 10 iterations where there is an elevated delay. These are really only a tiny fraction and don't indicate a huge problem, but they will appear as 100% with elevated delay in the histogram if we discard here.

Member Author

Recording the early wakeups as 0 delay would give you a number that means something different from what we're looking for. It might be worth recording as well (although I'm not sure what name I'd give it), but the thing we care about most here is specifically the distribution of positive delays.

One thing I tried to be very clear about in the docs is that delays are normal; they don't indicate a problem if they're within some expected range. For example, if you see the poll_delay_us number floating around a millisecond (or whatever your time slice is), with no long tail, then you're happy. It doesn't mean that every iteration of the event loop has a millisecond delay, it just means that when there is a delay, it's reasonably bounded.

If we included early delays as zeroes, it would skew the histogram such that 0 appears to be the correct ideal delay, and that the 1ms delay appears to be problematic tail latency.

Member

What I'm struggling with here is the difference between -1ms, 0ms and 1ms delays. To me, they seem basically the same thing, the kernel woke you up roughly when you expected. Or, is there something special about -1ms vs 0ms?

Member Author

Ah, yeah. The main difference is that libevent treats a just-slightly-too-early wakeup the same as a way-too-early wakeup: it doesn't process the timeout event and just goes back to sleep. It sometimes does this a few iterations in a row, sleeping for 0us on each, before finally sleeping for a bit longer.

One way to avoid the difference is to have a configurable minimum threshold. For example, don't record any delays less than 100us, because they're not interesting. But I don't know if I'd actually recommend implementing that, at least not until we've observed how this behaves in production.
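
If such a threshold were ever added, it could be as simple as the following (hypothetical, explicitly not part of this PR):

```cpp
// Drop sub-threshold delays as noise; the cutoff would come from config.
constexpr int64_t kMinRecordedDelayUs = 100;

void maybeRecordPollDelay(Stats::Histogram& histogram, int64_t delay_us) {
  if (delay_us >= kMinRecordedDelayUs) {
    histogram.recordValue(delay_us);
  }
}
```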

@@ -8,12 +8,17 @@
#include "common/stats/isolated_store_impl.h"
Member

Where did we end up with the "should new tests be written?" thread?

Member Author

We ended up with me dragging my heels. I have the one new unit test here, but I guess it could be worthwhile to add an integration test somewhere, asserting that the stats we expect to populate are actually there...

Member

Yeah, this could easily be added to something like server_test or one of the integration tests. Verify that we have the server and worker histogram stats available. You could also consider posting an event to one of the dispatchers that sleeps for 100ms (real time, not simulated, since that's what libevent uses) and verifying that the epoll duration is at least 100ms.
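
A sketch of that check (hypothetical; assumes the histogram is named `loop_duration_us` under the dispatcher's scope):

```cpp
// Block the loop for a known wall-clock interval, then assert the recorded
// loop duration covers it. Sleep real time, not simulated time, since
// libevent measures actual elapsed time.
dispatcher->post([] { std::this_thread::sleep_for(std::chrono::milliseconds(100)); });
dispatcher->run(Event::Dispatcher::RunType::NonBlock);
// Expect: max recorded value of <prefix>dispatcher.loop_duration_us >= 100000.
```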

Member Author

Well, my first attempt at adding something to server_test was a non-starter: see #6582 for details. I'll try adding to an integration test instead, I may have better luck with that.

Member Author

... No luck there either, same issue: integration tests use IsolatedStoreImpl which doesn't really understand histograms. @htuch, would you mind if I punt on this until we have a fix for #6582?

Member

Yeah, this is fine, but please do track this, as I think we have some functional validation missing.

Signed-off-by: Dan Rosen <mergeconflict@google.com>
Signed-off-by: Dan Rosen <mergeconflict@google.com>
Member

@mattklein123 mattklein123 left a comment

Sweet!

Member

@htuch htuch left a comment

Thanks, this is going to be very useful to operators.

@htuch htuch merged commit 77748b2 into envoyproxy:master Apr 15, 2019
@mergeconflict mergeconflict deleted the event_dispatcher_stats branch April 16, 2019 14:05
mattklein123 added a commit that referenced this pull request Apr 18, 2019
This reverts commit 77748b2.

Signed-off-by: Matt Klein <mklein@lyft.com>