Skip to content

Conversation

AhmedSoliman
Copy link
Contributor

@AhmedSoliman AhmedSoliman commented Jul 31, 2025

In recent restate release we changed the default idle timeout of histograms to 3 minutes, this caused some of the histograms to drop after being idle and never appearing again due to this issue in metrics-rs metrics-rs/metrics#372.

The issue happens for histograms that we keep a stable handle for, we keep those handles to avoid the repetitive heap allocation in our hot paths. Although we are disabling the default idle timeout, users will still see the issue if they are setting this value in configuration files.

This should be considered as a short-term mitigation, we'll need a longer-term solution still.

// intentionally empty

Stack created with Sapling. Best reviewed with ReviewStack.

Copy link

github-actions bot commented Jul 31, 2025

Test Results

  7 files  ±0    7 suites  ±0   4m 36s ⏱️ -27s
 54 tests ±0   53 ✅ ±0  1 💤 ±0  0 ❌ ±0 
223 runs  ±0  220 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 1ef58b3. ± Comparison against base commit a96055d.

♻️ This comment has been updated with latest results.

Cleans up worker tasks and adds a new feature for task center to get a TaskGuard for unmanaged tasks that should be cancelled on drop.
This also waits for worker shutdown explicitly before shutting down the rest of the system to allow partitions to drain cleanly with potential improvement in a follow-up PR that would let rocksdb flush to happen once PartitionStoreManager is dropped.

Bonus: This includes a minor fix to let PPM cancel processors while they are still starting up, if they are still opening partition store or mid-initialization delay.
Durability tracker shouldn't hold a strong reference for PartitionStoreManager. It's a minor step towards the worker taking full control over PSM.
Should fix the flaky three-node trim test
- _Actually_ cancel in-flight snapshots on worker shutdown.
- Fix error logging that didn't show the underlying error in log messages and make it stylistically consistent with the other errors.
This PR removes the dedicated ingress runtime and allows ingress to share the runtime with `default`. This is a step towards reducing the number of runtimes, threads, and metric dimensions. In my testing, the impact of this change is negligible and in all cases we'd be improving poll latencies in default runtime by avoiding sync IO operations from log-server and metadata server soon.
In recent restate release we changed the default idle timeout of histograms to 3 minutes, this caused some of the histograms to drop after being idle and never appearing again due to this issue in metrics-rs metrics-rs/metrics#372.

The issue happens for histograms that we keep a stable handle for, we keep those handles to avoid the repetitive heap allocation in our hot paths. Although we are disabling the default idle timeout, users will still see the issue if they are setting this value in configuration files.

This should be considered as a short-term mitigation, we'll need a longer-term solution still.

```
// intentionally empty
```
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 for merging :-)

@AhmedSoliman AhmedSoliman merged commit 1ef58b3 into main Jul 31, 2025
53 checks passed
@AhmedSoliman AhmedSoliman deleted the pr3609 branch July 31, 2025 21:43
@github-actions github-actions bot locked and limited conversation to collaborators Jul 31, 2025
@peterbourgon
Copy link
Contributor

Just for the record -- the issue here isn't really metrics-rs/metrics#372, handle retention, or performance considerations. The issue is that .idle_timeout(...) in metrics-exporter-prometheus evicts metrics by deleting their internal state, so any later updates recreate histograms from zero. But Prometheus expects histograms to be cumulative and monotonic for the full lifetime of each process. So AFAICT any eviction and subsequent re-population of any histogram metric in a single process lifetime makes that metric semantically invalid. Same is true for counters, FWIW -- the only (Prometheus) metric type you can evict and later re-add in this way is a gauge.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants