github-actions bot commented Jul 31, 2025

Test Results

  7 files  ±0    7 suites  ±0   3m 48s ⏱️ - 1m 0s
 54 tests ±0   53 ✅ ±0  1 💤 ±0  0 ❌ ±0 
223 runs  ±0  220 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 83e6966. ± Comparison against base commit 742b77e.

♻️ This comment has been updated with latest results.

This was not correct. is_alive() should return true only if the node is alive and not failing over. This fix improves the cluster controller's failover response time and allows it to fail over leader partitions as soon as it observes that a node shutdown has _started_.
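
The intended semantics can be sketched as follows. This is a minimal illustration, not the code from this PR: `NodeState`, its variants, and `nodes_needing_failover` are hypothetical names chosen for the example, and only `is_alive()` is mentioned in the description above.

```rust
/// Hypothetical node lifecycle states; the names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeState {
    /// Node is up and serving.
    Alive,
    /// Node has started shutting down and is handing off its work.
    FailingOver,
    /// Node is gone.
    Dead,
}

impl NodeState {
    /// True only for a node that is up and has not started failing over.
    pub fn is_alive(self) -> bool {
        matches!(self, NodeState::Alive)
    }
}

/// Returns the indices of nodes whose leader partitions should be moved.
/// A node that is failing over is included, so a controller using this
/// check can react as soon as a shutdown has started.
pub fn nodes_needing_failover(states: &[NodeState]) -> Vec<usize> {
    states
        .iter()
        .enumerate()
        .filter(|(_, state)| !state.is_alive())
        .map(|(index, _)| index)
        .collect()
}
```

With this definition, a node in the hypothetical `FailingOver` state is treated the same as a dead node for leadership purposes, which is the behavior the fix describes.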

```
// intentionally empty
```
@AhmedSoliman AhmedSoliman changed the title from "WIP Use scoped cancellations and spawn_unmanaged_child for better shutdown control" to "[Core] Use scoped cancellations and work towards clean graceful shutdown" Aug 1, 2025
This introduces a few crucial changes to how we handle panics in restate. Prior to this change, we would abort the process at panic time without attempting a clean shutdown or a rocksdb WAL fsync.

The summary of changes is as follows:
- We now always unwind the stack on panics. TaskCenter is designed to catch panics in important tasks, trigger a clean shutdown, and report a non-zero exit code.
- Ensure that on graceful shutdown timeout we still attempt to cleanly flush and shut down the rocksdb manager. This is important to avoid massive backfills of lost memtables after an unclean shutdown.
- Catch panics at the top-level task-center runtime control loop and trigger an emergency rocksdb WAL fsync. This guards against losing the in-memory WAL buffer if/when we add support for manual WAL flushing in the future.
- Make sure that panics from network connection tasks do not trigger a system shutdown. Instead, they are caught and properly logged. This avoids a situation where a bad network request or handler can cause the entire node to panic.
- In situations where tracing might have been lost or dropped, ensure that we also log critical information to stderr.
- Allow the user to send a second signal (SIGTERM or SIGINT) to force the shutdown.
- The shutdown timeout is controlled by TaskCenter itself, which means that a self-triggered shutdown also respects the timeout.
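
The difference between important tasks and network connection tasks described above can be sketched with the standard library's `std::panic::catch_unwind`. This is a minimal sketch under stated assumptions, not TaskCenter's implementation: `TaskOutcome`, `run_guarded`, and `exit_code_for` are hypothetical names, and the sketch assumes the binary is built with the default unwinding panic strategy rather than `panic = "abort"`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical outcome of running a task under a panic guard.
#[derive(Debug, PartialEq, Eq)]
pub enum TaskOutcome {
    Completed,
    Panicked(String),
}

/// Runs `task` and converts a panic into a value instead of letting it
/// take down the caller. Requires the unwinding panic strategy.
pub fn run_guarded<F: FnOnce()>(task: F) -> TaskOutcome {
    match panic::catch_unwind(AssertUnwindSafe(task)) {
        Ok(()) => TaskOutcome::Completed,
        Err(payload) => {
            // Panic payloads are usually a &str or a String.
            let message = if let Some(s) = payload.downcast_ref::<&str>() {
                s.to_string()
            } else if let Some(s) = payload.downcast_ref::<String>() {
                s.clone()
            } else {
                "unknown panic payload".to_string()
            };
            // Also report on stderr in case structured logging is gone.
            eprintln!("task panicked: {message}");
            TaskOutcome::Panicked(message)
        }
    }
}

/// Maps an outcome to a process exit code. Only a panic in a task marked
/// `critical` (an assumption of this sketch) yields a non-zero code; a
/// panic in a non-critical task, such as a connection handler, is logged
/// and otherwise ignored.
pub fn exit_code_for(outcome: &TaskOutcome, critical: bool) -> i32 {
    match outcome {
        TaskOutcome::Panicked(_) if critical => 1,
        _ => 0,
    }
}
```

In a real shutdown path, the non-zero branch would be the place to flush and close storage before exiting; that step is omitted here because it depends on APIs not shown in this description.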

This hardens restate against unclean crashes and ensures we perform a clean handoff to other cluster members in case of an unrecoverable crash.

Other PRs in this stack focus on more granular control over the shutdown process to unlock better hand-off and cleaner shutdowns.
@AhmedSoliman AhmedSoliman marked this pull request as ready for review August 1, 2025 18:00
@AhmedSoliman AhmedSoliman merged commit 83e6966 into main Aug 4, 2025
57 checks passed
@AhmedSoliman AhmedSoliman deleted the pr3613 branch August 4, 2025 09:47
@github-actions github-actions bot locked and limited conversation to collaborators Aug 4, 2025