Skip to content

Conversation

pcholakov
Copy link
Contributor

@pcholakov pcholakov commented Aug 1, 2025

This change explicitly models partition processor restarts' startup delay in response to various conditions.

Closes: #3512

@pcholakov pcholakov requested a review from Copilot August 1, 2025 10:04
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements explicit back-off policies to delay partition processor restarts, replacing the previous ad-hoc sleep mechanism with a structured retry system.

  • Removes hardcoded sleep delays from partition store opening error handling
  • Introduces a structured RestartDelay enum with multiple backoff strategies (immediate, fixed, exponential, max backoff)
  • Tracks processor start times and previous delays to implement exponential backoff with reset capabilities

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
spawn_processor_task.rs Removes hardcoded 10-20 second sleep for snapshot-related errors
processor_state.rs Adds timing and delay tracking fields to processor states, updates constructors
partition_processor_manager.rs Implements structured restart delay logic with exponential backoff and jitter
partition_store_manager.rs Updates parameter naming for clarity (min_applied_lsn → target_lsn)
Comments suppressed due to low confidence (1)

crates/worker/src/partition_processor_manager.rs:566

  • The field name 'started_time' should be 'start_time' to match the field name used in ProcessorState for consistency.
                                    RestartDelay::Exponential {

Comment on lines -141 to -144
Err(
e @ restate_partition_store::OpenError::SnapshotRepositoryRequired
| e @ restate_partition_store::OpenError::SnapshotRequired
| e @ restate_partition_store::OpenError::SnapshotUnsuitable,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Special case no longer required as PPM now handles this directly.

@@ -155,6 +156,16 @@ pub enum Error {
InvokerBuild(#[from] BuildError),
}

enum RestartDelay {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reasonable critique would be that these add a lot of verbosity when we could just be passing delays around. I personally find the code much easier to read with explicit names.

Since we don't have a single retry policy that can be used, and we need to track some per-partition state, it's not possible to reuse the existing RetryPolicy implementation here. This enables us to shift the delay calculation out of the already large async event handler.

@pcholakov pcholakov requested a review from tillrohrmann August 1, 2025 10:14
Copy link

github-actions bot commented Aug 1, 2025

Test Results

  7 files  ±0    7 suites  ±0   3m 9s ⏱️ - 1m 34s
 54 tests ±0   53 ✅ ±0  1 💤 ±0  0 ❌ ±0 
223 runs  ±0  220 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 3a9ca53. ± Comparison against base commit 1ef58b3.

Copy link

github-actions bot commented Aug 1, 2025

Test Results

  7 files  ± 0    7 suites  ±0   2m 52s ⏱️ - 1m 51s
 52 tests  -  2   51 ✅  -  2  1 💤 ±0  0 ❌ ±0 
213 runs   - 10  210 ✅  - 10  3 💤 ±0  0 ❌ ±0 

Results for commit 3c3385a. ± Comparison against base commit 1ef58b3.

This pull request removes 2 tests.
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwaitAny(Client)
dev.restate.sdktesting.tests.Combinators ‑ firstSuccessfulCompletedAwakeable(Client)

♻️ This comment has been updated with latest results.

@pcholakov pcholakov requested review from AhmedSoliman and removed request for tillrohrmann August 4, 2025 13:47
Copy link
Contributor

@AhmedSoliman AhmedSoliman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Just a couple of small nits.

@@ -48,19 +49,27 @@ pub enum LeaderState {
pub enum ProcessorState {
Starting {
target_run_mode: RunMode,
start_time: MillisSinceEpoch,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need the wall time for anything or can an Instant work instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out, it can! I lifted this out of StartedProcessor and was wondering why it was using this type.

use std::collections::hash_map::Entry;
use std::collections::{BTreeMap, HashMap};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason for reverting the ahash HashMap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just me being clumsy with the editor, I suspect! I'd still like to replace the BTreeMaps in here with ahash::HashMap and only sort on output but that's for another day.

@pcholakov pcholakov changed the title Partition Processor uses explicit back-off policies to delay restarts Partition Processor uses explicit restart delay policy Aug 5, 2025
@pcholakov pcholakov merged commit 49f3cfb into main Aug 5, 2025
27 checks passed
@pcholakov pcholakov deleted the pp-restart-retry-policy branch August 5, 2025 12:55
@github-actions github-actions bot locked and limited conversation to collaborators Aug 5, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Excessive S3 retries when latest.json cannot be fetched
2 participants