Skip to content

Conversation

mattklein123
Copy link
Member

This change reworks how stats are flushed to sinks by removing the
Source class and replacing it with a MetricSnapshot which provides
constant stats for all sinks to flush with counters pre-latched.
In addition to cleaning up the flushing logic to unify and constify
it, this change has the side effect of performing counter latching
every flush interval (even if there are no sinks), this fixing
the below reported issue.

Fixes #6924

Risk Level: Low
Testing: Modified existing tests and new unit tests
Docs Changes: N/A
Release Notes: N/A

This change reworks how stats are flushed to sinks by removing the
Source class and replacing it with a MetricSnapshot which provides
constant stats for all sinks to flush with counters pre-latched.
In addition to cleaning up the flushing logic to unify and constify
it, this change has the side effect of performing counter latching
every flush interval (even if there are no sinks), this fixing
the below reported issue.

Fixes #6924

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Member Author

@rgs1 this should fix #6924. Can you give it a spin if possible in your env? cc @fredlas @jmarantz.

I need to add more comments and more tests but wanted to get an early assessment if this fixes the reported problem.

Signed-off-by: Matt Klein <mklein@lyft.com>
@fishcakez
Copy link
Contributor

Testing the issue reported in #6924, the rate of increments no longer has the dip and spike during a hot restart. The counter values aren't passed between hot restarts so if we look at absolute values in /stats they reset.

Screen Shot 2019-05-20 at 11 14 50 AM

@mattklein123
Copy link
Member Author

@fishcakez the reset to 0 is expected. Presumably your metrics system can deal with a counter reset to 0 and we don't need to persist the current value across hot restarts (since this is what happens during a full restart)?

@fishcakez
Copy link
Contributor

Presumably your metrics system can deal with a counter reset to 0 and we don't need to persist the current value across hot restarts (since this is what happens during a full restart)?

Yes, it can handle that just fine 👌

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123 mattklein123 changed the title [WIP] stats: rework stat sink flushing to centralize counter latching stats: rework stat sink flushing to centralize counter latching May 21, 2019
@mattklein123 mattklein123 marked this pull request as ready for review May 21, 2019 04:09
@mattklein123
Copy link
Member Author

@fredlas @jmarantz PTAL this is ready for final review.

fredlas
fredlas previously approved these changes May 21, 2019
// fixfix comment about latching.
// Create a snapshot and flush to all sinks.
// NOTE: Even if there are no sinks, creating the snapshot has the important property that it
// latches all counters on a periodic basis. The hot restart code assumes this is being
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment!

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
writer.write(fmt::format("{}.{}:{}|c{}", prefix_, getName(*counter), delta,
buildTagStr(counter->tags())));
for (const auto& counter : snapshot.counters()) {
if (counter.counter_.get().used()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counter.counter_->used() does not work here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counter_ is a reference_wrapper

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Member Author

@jmarantz updated

@mattklein123
Copy link
Member Author

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: compile_time_options (failed build)

🐱

Caused by: a #6996 (comment) was created by @mattklein123.

see: more, trace.

@mattklein123 mattklein123 merged commit f78295b into master May 22, 2019
@mattklein123 mattklein123 deleted the fix_hr_stat_latch branch May 22, 2019 04:12
mpuncel added a commit to mpuncel/envoy that referenced this pull request May 22, 2019
* master: (65 commits)
  proto: Add PATCH method to RequestMethod enum (envoyproxy#6737)
  exe: drop unused deps on zlib compressor code (envoyproxy#7022)
  coverage: fix some misc coverage (envoyproxy#7033)
  Enable proto schema for router_check_tool (envoyproxy#6992)
  stats: rework stat sink flushing to centralize counter latching (envoyproxy#6996)
  [test] convert lds api test config stubs to v2 (envoyproxy#7021)
  router: scoped rds (2c): implement scoped rds API (envoyproxy#6932)
  build: Add option for size-optimized binary (envoyproxy#6960)
  test: adding an integration test framework for file-based LDS (envoyproxy#6933)
  doc: update obsolete ref to api/XDS_PROTOCOL.md (envoyproxy#7002)
  dispatcher: faster runOnAllThreads (envoyproxy#7011)
  example: add csrf sandbox (envoyproxy#6805)
  fix syntax of gcov exclusion zone. (envoyproxy#7023)
  /runtime_modify: add support for query params in body (envoyproxy#6977)
  stats: Create stats for http codes with the symbol table. (envoyproxy#6733)
  health check: fix more fallout from inline deletion change (envoyproxy#6988)
  Max heap fix (envoyproxy#7016)
  Add support to unregister from lifecycle notifications (envoyproxy#6984)
  build spdy_core_alt_svc_wire_format (envoyproxy#7010)
  ext_authz: Make sure initiateCall only called once (envoyproxy#6949)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stats drop after hot restart
4 participants