-
Notifications
You must be signed in to change notification settings - Fork 94
Partition Processor uses explicit restart delay policy #3616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements explicit back-off policies to delay partition processor restarts, replacing the previous ad-hoc sleep mechanism with a structured retry system.
- Removes hardcoded sleep delays from partition store opening error handling
- Introduces a structured
RestartDelay
enum with multiple backoff strategies (immediate, fixed, exponential, max backoff) - Tracks processor start times and previous delays to implement exponential backoff with reset capabilities
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
spawn_processor_task.rs | Removes hardcoded 10-20 second sleep for snapshot-related errors |
processor_state.rs | Adds timing and delay tracking fields to processor states, updates constructors |
partition_processor_manager.rs | Implements structured restart delay logic with exponential backoff and jitter |
partition_store_manager.rs | Updates parameter naming for clarity (min_applied_lsn → target_lsn) |
Comments suppressed due to low confidence (1)
crates/worker/src/partition_processor_manager.rs:566
- The field name 'started_time' should be 'start_time' to match the field name used in ProcessorState for consistency.
RestartDelay::Exponential {
Err( | ||
e @ restate_partition_store::OpenError::SnapshotRepositoryRequired | ||
| e @ restate_partition_store::OpenError::SnapshotRequired | ||
| e @ restate_partition_store::OpenError::SnapshotUnsuitable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Special case no longer required as PPM now handles this directly.
@@ -155,6 +156,16 @@ pub enum Error { | |||
InvokerBuild(#[from] BuildError), | |||
} | |||
|
|||
enum RestartDelay { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A reasonable critique would be that these add a lot of verbosity when we could just be passing delays around. I personally find the code much easier to read with explicit names.
Since we don't have a single retry policy that can be used, and we need to track some per-partition state, it's not possible to reuse the existing RetryPolicy
implementation here. This enables us to shift the delay calculation out of the already large async event handler.
Test Results 7 files ± 0 7 suites ±0 2m 52s ⏱️ - 1m 51s Results for commit 3c3385a. ± Comparison against base commit 1ef58b3. This pull request removes 2 tests.
♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. Just a couple of small nits.
@@ -48,19 +49,27 @@ pub enum LeaderState { | |||
pub enum ProcessorState { | |||
Starting { | |||
target_run_mode: RunMode, | |||
start_time: MillisSinceEpoch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need the wall time for anything or can an Instant
work instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out, it can! I lifted this out of StartedProcessor
and was wondering why it was using this type.
use std::collections::hash_map::Entry; | ||
use std::collections::{BTreeMap, HashMap}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason for reverting the ahash HashMap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just me being clumsy with the editor, I suspect! I'd still like to replace the BTreeMap
s in here with ahash::HashMap
and only sort on output but that's for another day.
This change explicitly models partition processor restarts' startup delay in response to various conditions.
Closes: #3512