[observability] Disable histogram idle timeout by default #3609
Conversation
Cleans up worker tasks and adds a new task-center feature to get a TaskGuard for unmanaged tasks that should be cancelled on drop. This also waits explicitly for worker shutdown before shutting down the rest of the system, so that partitions can drain cleanly; a potential follow-up PR would let the rocksdb flush happen once PartitionStoreManager is dropped. Bonus: includes a minor fix to let PPM cancel processors while they are still starting up, i.e. while they are still opening the partition store or waiting out the initialization delay.
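For illustration, a minimal sketch of the drop-cancellation idea (the `TaskGuard` below is a stand-in built on tokio's `AbortHandle`, not restate's actual task-center API):

```rust
use tokio::task::{AbortHandle, JoinHandle};

// Illustrative guard: aborts the wrapped task when the guard goes out of scope.
struct TaskGuard {
    abort: AbortHandle,
}

impl TaskGuard {
    fn new<T>(handle: &JoinHandle<T>) -> Self {
        Self { abort: handle.abort_handle() }
    }
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        // Cancel the unmanaged task as soon as the guard is dropped.
        self.abort.abort();
    }
}

#[tokio::main]
async fn main() {
    // Stands in for an unmanaged background task.
    let handle = tokio::spawn(async {
        tokio::time::sleep(std::time::Duration::from_secs(3600)).await;
    });
    let _guard = TaskGuard::new(&handle);
    // When `_guard` is dropped at the end of this scope, the task is aborted.
}
```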
The durability tracker shouldn't hold a strong reference to PartitionStoreManager. It's a minor step towards the worker taking full control over PSM.
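A minimal sketch of that ownership change, with a placeholder `PartitionStoreManager` type (not the real one): the tracker holds a `Weak` reference and only upgrades it when needed, so the worker stays the sole owner.

```rust
use std::sync::{Arc, Weak};

// Placeholder type; the real PartitionStoreManager lives in restate's worker.
struct PartitionStoreManager;

struct DurabilityTracker {
    // Previously a strong Arc<PartitionStoreManager>; a Weak reference lets the
    // worker drop the manager on its own schedule.
    partition_store_manager: Weak<PartitionStoreManager>,
}

impl DurabilityTracker {
    fn new(psm: &Arc<PartitionStoreManager>) -> Self {
        Self { partition_store_manager: Arc::downgrade(psm) }
    }

    fn with_manager(&self, f: impl FnOnce(&PartitionStoreManager)) {
        // If the worker has already dropped the manager, simply skip the work.
        if let Some(psm) = self.partition_store_manager.upgrade() {
            f(&psm);
        }
    }
}
```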
Should fix the flaky three-node trim test
- _Actually_ cancel in-flight snapshots on worker shutdown.
- Fix error logging that didn't show the underlying error in log messages, and make it stylistically consistent with the other errors (see the sketch below).
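A minimal sketch of the logging fix, assuming the `tracing` crate and a hypothetical `create_snapshot` function (made up for illustration): the point is to attach the underlying error to the log event as a structured field instead of dropping it.

```rust
use std::io;

// Hypothetical stand-in for the snapshot task; not restate's actual function.
async fn create_snapshot() -> io::Result<()> {
    Ok(())
}

async fn on_snapshot(partition_id: u16) {
    if let Err(err) = create_snapshot().await {
        // Before the fix, a message like this was logged without the underlying error.
        // Recording it as a structured field keeps the root cause visible and matches
        // the style used for other errors.
        tracing::warn!(%err, %partition_id, "failed to create partition snapshot");
    }
}
```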
This PR removes the dedicated ingress runtime and lets ingress share the `default` runtime. This is a step towards reducing the number of runtimes, threads, and metric dimensions. In my testing, the impact of this change is negligible, and in all cases poll latencies in the default runtime should improve once sync IO operations from the log-server and metadata server are moved off it, which is coming soon.
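A minimal sketch of the shape of the change (made-up names, not restate's task-center API): instead of building a dedicated tokio runtime just for ingress, ingress is spawned onto the already-running default runtime.

```rust
use tokio::runtime::Handle;

async fn serve_ingress() {
    // ... accept and route ingress requests ...
}

// Called from code that is already running on the `default` runtime; the extra
// runtime, its threads, and its metric dimensions go away.
fn start_ingress_on_default_runtime() {
    Handle::current().spawn(serve_ingress());
}

#[tokio::main]
async fn main() {
    start_ingress_on_default_runtime();
}
```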
In a recent restate release we changed the default idle timeout of histograms to 3 minutes. This caused some histograms to be dropped after going idle and to never appear again, due to this issue in metrics-rs: metrics-rs/metrics#372. The issue affects histograms that we keep a stable handle for; we keep those handles to avoid repetitive heap allocations on our hot paths. Although we are disabling the default idle timeout, users will still hit the issue if they set this value in their configuration files. This should be considered a short-term mitigation; we'll still need a longer-term solution.
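For context, a minimal sketch of how the idle timeout is wired up via `metrics-exporter-prometheus` and why a retained handle interacts badly with eviction. This is not restate's actual exporter setup; it assumes compatible recent versions of the `metrics`, `metrics-util`, and `metrics-exporter-prometheus` crates (where `histogram!` returns a reusable handle), and the metric name is made up.

```rust
use metrics_exporter_prometheus::PrometheusBuilder;
use metrics_util::MetricKindMask;

fn main() {
    // Disable idle-based eviction for histograms (the new default described above).
    // Passing Some(std::time::Duration::from_secs(180)) instead would evict any
    // histogram that stays idle for 3 minutes, which is what triggered the bug.
    PrometheusBuilder::new()
        .idle_timeout(MetricKindMask::HISTOGRAM, None)
        .install_recorder()
        .expect("failed to install Prometheus recorder");

    // Hot paths keep a long-lived handle to avoid re-resolving the metric (and the
    // associated heap allocations) on every record. If the exporter evicts the
    // histogram's internal state while this handle is alive, later records go to the
    // detached state and the series never reappears (metrics-rs/metrics#372).
    let latency = metrics::histogram!("example_request_duration_seconds");
    latency.record(0.042);
}
```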
LGTM. +1 for merging :-)
Just for the record -- the issue here isn't really metrics-rs/metrics#372, handle retention, or performance considerations. The issue is that .idle_timeout(...) in metrics-exporter-prometheus evicts metrics by deleting their internal state, so any later updates recreate histograms from zero. But Prometheus expects histograms to be cumulative and monotonic for the full lifetime of each process. So AFAICT any eviction and subsequent re-population of any histogram metric in a single process lifetime makes that metric semantically invalid. Same is true for counters, FWIW -- the only (Prometheus) metric type you can evict and later re-add in this way is a gauge.
Stack created with Sapling. Best reviewed with ReviewStack.