[Core] Use scoped cancellations and work towards clean graceful shutdown #3613

AhmedSoliman · 2025-07-31T22:16:43Z

Stack created with Sapling. Best reviewed with ReviewStack.

github-actions · 2025-07-31T22:38:51Z

Test Results

7 files ±0 7 suites ±0 3m 48s ⏱️ - 1m 0s
54 tests ±0 53 ✅ ±0 1 💤 ±0 0 ❌ ±0
223 runs ±0 220 ✅ ±0 3 💤 ±0 0 ❌ ±0

Results for commit 83e6966. ± Comparison against base commit 742b77e.

♻️ This comment has been updated with latest results.

This was not correct. is_alive() should return true only if the node is alive and not failing over. This fix improves the cluster controller's failover response time and allows it to fail over leader partitions as soon as it observes that a node shutdown has _started_.

```
// intentionally empty
```
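To make the intended semantics concrete, here is a minimal sketch, not the actual ClusterState code. The `NodeState` enum and its variants are hypothetical names introduced only for illustration; the point is that a node that has started failing over is no longer reported as alive.

```rust
/// Hypothetical node states (assumption: the real ClusterState type
/// may model liveness differently).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Alive,
    FailingOver,
    Dead,
}

impl NodeState {
    /// Returns true only when the node is alive and has not started
    /// failing over.
    fn is_alive(self) -> bool {
        matches!(self, NodeState::Alive)
    }
}

fn main() {
    for state in [NodeState::Alive, NodeState::FailingOver, NodeState::Dead] {
        println!("{state:?}: is_alive = {}", state.is_alive());
    }
}
```

With this shape, a controller that polls `is_alive()` can react as soon as a node enters the failing-over state instead of waiting for it to be declared dead.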

This introduces a few crucial changes to how we handle panics in restate. Prior to this change, we would abort the process at panic time without attempting a clean shutdown or a rocksdb WAL fsync. The summary of changes is as follows:

- We now always unwind the stack on panics. TaskCenter is designed to catch panics of important tasks, trigger a clean shutdown, and report a non-zero exit code.
- On graceful shutdown timeout, we attempt to cleanly flush and shut down the rocksdb manager. This is important to avoid massive backfills of lost memtables on unclean shutdown.
- Catch panics at the top-level task-center runtime control loop and trigger an emergency rocksdb WAL fsync, to avoid losing the in-memory WAL buffer if/when we add support for manual WAL flushing in the future.
- Make sure that panics from network connection tasks do not trigger a system shutdown; instead, they are caught and properly logged. This avoids a situation where a bad network request/handler can cause the entire node to panic.
- In situations where tracing might have been lost/dropped, ensure that we also log critical information on stderr.
- Allow the user to send a second signal (SIGTERM or SIGINT) to force the shutdown.
- Shutdown timeout is controlled by TaskCenter itself, which means that a self-triggered shutdown will also respect the timeout.

This hardens restate against unclean crashes and ensures we perform a clean handoff to other cluster members in case of an unrecoverable crash. Other PRs in this stack focus on more granular control over the shutdown process to unlock better hand-off and cleaner shutdowns.
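The difference between a panic in a critical task and a panic in a network connection task can be sketched with the standard library alone. This is an illustration under stated assumptions, not TaskCenter's implementation: `run_critical_task`, `run_connection_task`, `panic_message`, and the `emergency_flush` callback are hypothetical names, and the callback merely stands in for a rocksdb WAL fsync. It also assumes the binary is built with `panic = "unwind"`, since `catch_unwind` cannot catch aborting panics.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

/// Extracts a readable message from a panic payload, if it is a string.
fn panic_message(payload: &(dyn Any + Send)) -> &str {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        *s
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.as_str()
    } else {
        "non-string panic payload"
    }
}

/// Runs a critical task. On panic, logs to stderr, invokes the
/// (hypothetical) emergency flush, and returns a non-zero exit code.
fn run_critical_task(task: impl FnOnce(), emergency_flush: impl FnOnce()) -> i32 {
    match panic::catch_unwind(AssertUnwindSafe(task)) {
        Ok(()) => 0,
        Err(payload) => {
            // stderr is used in case the tracing pipeline is already gone.
            eprintln!("critical task panicked: {}", panic_message(&*payload));
            emergency_flush();
            1
        }
    }
}

/// Runs a connection task. A panic is logged but does not request a
/// shutdown; the return value only says whether the task completed.
fn run_connection_task(task: impl FnOnce()) -> bool {
    match panic::catch_unwind(AssertUnwindSafe(task)) {
        Ok(()) => true,
        Err(payload) => {
            eprintln!("connection task panicked: {}", panic_message(&*payload));
            false
        }
    }
}

fn main() {
    let completed = run_connection_task(|| {
        panic!("bad request");
    });
    println!("connection task completed: {completed}");

    let code = run_critical_task(
        || {
            panic!("unrecoverable error");
        },
        || eprintln!("flushing WAL (stand-in)"),
    );
    println!("exit code to report: {code}");
}
```

A real runtime would hand the exit code to its shutdown path rather than print it, and would bound the clean shutdown with the configured timeout.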

This was referenced Jul 31, 2025

[WIP] PartitionStoreManager exclusive ownership by Worker role #3606

Draft

[TaskCenter] spawn_unmanaged_child and scoped cancellations #3612

Merged

[Core] Always unwind and ensure rocksdb's WAL is flushed on panics #3611

Merged

AhmedSoliman force-pushed the pr3613 branch from 93ca8f8 to c2de7c2 Compare August 1, 2025 15:47

AhmedSoliman changed the title ~~WIP Use scoped cancellations and spawn_unmanaged_child for better shutdown control~~ [Core] Use scoped cancellations and work towards clean graceful shutdown Aug 1, 2025

AhmedSoliman mentioned this pull request Aug 1, 2025

[ClusterState] Fixing is_alive() reporting true for failing-over nodes #3618

Merged

AhmedSoliman force-pushed the pr3613 branch from c2de7c2 to 98de12d Compare August 1, 2025 17:53

AhmedSoliman added 2 commits August 1, 2025 18:58

[TaskCenter] spawn_unmanaged_child and scoped cancellations

6731ab5

AhmedSoliman marked this pull request as ready for review August 1, 2025 18:00

AhmedSoliman force-pushed the pr3613 branch from 98de12d to 4c004c8 Compare August 1, 2025 18:08

[Core] Use scoped cancellations and work towards clean graceful shutdown

83e6966

AhmedSoliman force-pushed the pr3613 branch from 4c004c8 to 83e6966 Compare August 1, 2025 18:25

This was referenced Aug 2, 2025

[Fabric] Adds ability to disable network compression via config #3619

Merged

[LogServer] Do not share record cache with log-server #3620

Merged

AhmedSoliman merged commit 83e6966 into main Aug 4, 2025
57 checks passed

AhmedSoliman deleted the pr3613 branch August 4, 2025 09:47

github-actions bot locked and limited conversation to collaborators Aug 4, 2025
