[observability] Disable histogram idle timeout by default #3609

AhmedSoliman · 2025-07-31T11:53:17Z

In a recent Restate release we changed the default idle timeout of histograms to 3 minutes. This caused some histograms to be dropped after being idle and never appear again, due to metrics-rs/metrics#372.

The issue affects histograms for which we keep a stable handle; we keep those handles to avoid repeated heap allocation in our hot paths. Although this PR disables the idle timeout by default, users will still see the issue if they set this value in their configuration files.

This should be considered a short-term mitigation; we still need a longer-term solution.
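To make the stable-handle failure mode described above concrete, here is a minimal sketch using only the standard library. It is a simplified model, not the metrics-rs implementation: `ToyRegistry`, `handle`, `evict`, and `render` are hypothetical names invented for illustration, and the histogram is reduced to a single sample count. It assumes that eviction removes the registry's entry while a previously obtained handle keeps pointing at the old, now detached state.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Toy stand-in for a metrics registry (hypothetical, NOT the metrics-rs API).
#[derive(Default)]
struct ToyRegistry {
    histograms: HashMap<String, Arc<AtomicU64>>,
}

impl ToyRegistry {
    /// Returns a handle to the named histogram's sample count, creating it if needed.
    fn handle(&mut self, name: &str) -> Arc<AtomicU64> {
        self.histograms.entry(name.to_string()).or_default().clone()
    }

    /// Simulates idle-timeout eviction: the registry forgets the entry.
    fn evict(&mut self, name: &str) {
        self.histograms.remove(name);
    }

    /// Simulates a scrape: only entries the registry still knows about are visible.
    fn render(&self, name: &str) -> Option<u64> {
        self.histograms.get(name).map(|c| c.load(Ordering::Relaxed))
    }
}

/// Returns what a scrape sees before and after the entry is evicted.
fn stale_handle_scenario() -> (Option<u64>, Option<u64>) {
    let mut registry = ToyRegistry::default();

    // Hot path keeps a long-lived handle to avoid repeated lookups and allocations.
    let handle = registry.handle("request_latency");
    handle.fetch_add(1, Ordering::Relaxed);
    let before = registry.render("request_latency");

    // The metric goes idle and is evicted; the retained handle is now detached.
    registry.evict("request_latency");
    handle.fetch_add(1, Ordering::Relaxed); // update lands in state nobody renders
    let after = registry.render("request_latency");

    (before, after)
}

fn main() {
    let (before, after) = stale_handle_scenario();
    println!("before eviction: {before:?}, after eviction: {after:?}");
}
```

In this model the update made through the retained handle after eviction is never visible to a scrape, which matches the "never appearing again" symptom. Re-resolving the handle on every update would avoid that, at the cost of the per-call lookup the stable handles were meant to remove.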

// intentionally empty

Stack created with Sapling. Best reviewed with ReviewStack.

github-actions · 2025-07-31T12:06:05Z

Test Results

7 files ±0 7 suites ±0 4m 36s ⏱️ -27s
54 tests ±0 53 ✅ ±0 1 💤 ±0 0 ❌ ±0
223 runs ±0 220 ✅ ±0 3 💤 ±0 0 ❌ ±0

Results for commit 1ef58b3. ± Comparison against base commit a96055d.

♻️ This comment has been updated with latest results.

Cleans up worker tasks and adds a new task center feature for obtaining a TaskGuard for unmanaged tasks that should be cancelled on drop. This also waits explicitly for worker shutdown before shutting down the rest of the system, so that partitions can drain cleanly. A follow-up PR could improve this further by letting the rocksdb flush happen once PartitionStoreManager is dropped. Bonus: This includes a minor fix to let PPM cancel processors while they are still starting up, i.e. while they are still opening the partition store or are in the middle of the initialization delay.

This PR removes the dedicated ingress runtime and lets ingress share the `default` runtime. This is a step towards reducing the number of runtimes, threads, and metric dimensions. In my testing, the impact of this change is negligible, and in any case we will soon improve poll latencies in the `default` runtime by avoiding sync IO operations from log-server and metadata server.

tillrohrmann

LGTM. +1 for merging :-)

peterbourgon · 2025-08-03T19:12:57Z

Just for the record -- the issue here isn't really metrics-rs/metrics#372, handle retention, or performance considerations. The issue is that .idle_timeout(...) in metrics-exporter-prometheus evicts metrics by deleting their internal state, so any later updates recreate histograms from zero. But Prometheus expects histograms to be cumulative and monotonic for the full lifetime of each process. So AFAICT any eviction and subsequent re-population of any histogram metric in a single process lifetime makes that metric semantically invalid. Same is true for counters, FWIW -- the only (Prometheus) metric type you can evict and later re-add in this way is a gauge.
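As a rough illustration of the point above, the following sketch models an exporter that keeps cumulative counts and deletes a metric's state on eviction. It uses only the standard library; `ToyExporter`, `observe`, `evict_idle`, and `scrape` are hypothetical names, not the metrics-exporter-prometheus API, and a histogram is reduced to its sample count. The assumption is simply that a later update recreates the metric from zero.

```rust
use std::collections::HashMap;

/// Toy exporter holding one cumulative count per metric
/// (hypothetical, NOT the metrics-exporter-prometheus API).
#[derive(Default)]
struct ToyExporter {
    counts: HashMap<String, u64>,
}

impl ToyExporter {
    /// Records one observation, recreating the metric from zero if it was evicted.
    fn observe(&mut self, name: &str) {
        *self.counts.entry(name.to_string()).or_insert(0) += 1;
    }

    /// Simulates idle-timeout eviction by deleting the metric's internal state.
    fn evict_idle(&mut self, name: &str) {
        self.counts.remove(name);
    }

    /// Returns the value a scrape would see, if the metric currently exists.
    fn scrape(&self, name: &str) -> Option<u64> {
        self.counts.get(name).copied()
    }
}

/// True if each scraped value is at least as large as the previous one.
fn is_monotonic(samples: &[u64]) -> bool {
    samples.windows(2).all(|w| w[0] <= w[1])
}

/// Scrapes a metric before and after an eviction within one process lifetime.
fn scrapes_with_eviction() -> Vec<u64> {
    let mut exporter = ToyExporter::default();
    let mut scraped = Vec::new();

    for _ in 0..5 {
        exporter.observe("latency_count");
    }
    scraped.extend(exporter.scrape("latency_count"));

    // Metric goes idle and is evicted, then traffic resumes.
    exporter.evict_idle("latency_count");
    for _ in 0..2 {
        exporter.observe("latency_count");
    }
    scraped.extend(exporter.scrape("latency_count"));

    scraped
}

fn main() {
    let scraped = scrapes_with_eviction();
    println!("scraped: {scraped:?}, monotonic: {}", is_monotonic(&scraped));
}
```

The scraped series goes from 5 to 2 without a process restart, so it is no longer cumulative. A gauge has no such constraint, which is why it is the only type that tolerates this kind of eviction.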

This was referenced Jul 31, 2025

[minor] Worker task management cleanup #3601

Merged

[minor] Minor improvements to partition snapshotting task #3605

Merged

[minor] Durability tracker shouldn't pin partition store manager #3604

Merged

Remove ingress dedicated runtime #3607

Merged

AhmedSoliman requested review from tillrohrmann and jackkleeman July 31, 2025 12:06

AhmedSoliman added 6 commits July 31, 2025 13:09

[minor] Durability tracker shouldn't pin partition store manager

bc4f5de

Durability tracker shouldn't hold a strong reference to PartitionStoreManager. It's a minor step towards the worker taking full control over PSM.

Disable auto-improvement in replicated-loglet integration tests

b99ef7a

Should fix the flaky three-node trim test

[minor] Minor improvements to partition snapshotting task

c597277

- _Actually_ cancel in-flight snapshots on worker shutdown.
- Fix error logging that didn't show the underlying error in log messages and make it stylistically consistent with the other errors.

AhmedSoliman force-pushed the pr3609 branch from 0b371f6 to 1ef58b3 Compare July 31, 2025 12:45

AhmedSoliman mentioned this pull request Jul 31, 2025

Disable auto-improvement in replicated-loglet integration tests #3610

Merged

jackkleeman approved these changes Jul 31, 2025

tillrohrmann approved these changes Jul 31, 2025

AhmedSoliman merged commit 1ef58b3 into main Jul 31, 2025
53 checks passed

AhmedSoliman deleted the pr3609 branch July 31, 2025 21:43

github-actions bot locked and limited conversation to collaborators Jul 31, 2025
