[Core] Always unwind and ensure rocksdb's WAL is flushed on panics #3611
Conversation
This has been tested in debug and release profiles with random panic/fault injection while the db lock is held in the default runtime, and with panics from partition processor managed runtimes.
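For illustration only, this is roughly what that kind of random panic injection can look like; the helper name and probability are hypothetical and not taken from the PR's actual test setup:

```rust
use rand::Rng;

/// Hypothetical fault-injection helper: panics with the given probability.
/// A test would call this inside a critical section (e.g. while the db lock
/// is held) to exercise the unwind-and-flush path added by this PR.
fn maybe_panic(probability: f64) {
    if rand::thread_rng().gen_bool(probability) {
        panic!("injected fault while holding the db lock");
    }
}

fn main() {
    let db = std::sync::Mutex::new(0u64);
    let mut guard = db.lock().unwrap();
    // 10% chance of panicking while the lock is held.
    maybe_panic(0.1);
    *guard += 1;
}
```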
Is there any risk then that critical network connections can get silently 'stuck'? In some scenarios we might want a system shutdown rather than a node operating in half-closed mode or something like that.
Thanks for improving our emergency shutdown behavior @AhmedSoliman. The changes make sense to me. The one thing that I am a little bit nervous about is whether all our unmanaged tasks, Tokio tasks, and TaskCenter tasks that log on error handle panics correctly, or whether panics are simply swallowed by the system, causing it to get stuck. I think this is something we'll hopefully see in our tests. Unfortunately, I didn't manage to go through all our tasks during the review, but you've probably already taken care of this. +1 for merging if this is the case.
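For context on that concern: a plain `tokio::spawn` catches a task's panic and only surfaces it through the `JoinHandle`, so a panic in a task whose handle is never awaited can effectively disappear. A minimal sketch (not code from this PR):

```rust
#[tokio::main]
async fn main() {
    // The panic inside this task is caught by the Tokio runtime and stored
    // in the JoinHandle; it does not bring the process down on its own.
    let handle = tokio::spawn(async {
        panic!("boom");
    });

    // Only code that awaits the handle observes the panic, via JoinError.
    match handle.await {
        Err(e) if e.is_panic() => eprintln!("task panicked: {e}"),
        _ => println!("task finished normally"),
    }
    // If the handle were dropped instead of awaited, nothing would report
    // the panic, which is exactly the "silently swallowed" failure mode.
}
```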
server/src/main.rs
Outdated
shutdown = true;
_ = &mut tc_cancelled => {
    // Shutdown was requested by task center and it has completed.
    break;
If a `TaskKind` with `OnError = "shutdown"` fails, then the main thread will break here, right? If yes, would we then miss running `on_ungraceful_shutdown` because we complete the future with `()`?
Addressed in the latest changes.
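To make the scenario in this thread concrete, here is a rough sketch of the kind of shutdown select loop being discussed; the names (`tc_cancelled`, `on_graceful_shutdown`, `on_ungraceful_shutdown`) follow the conversation, not the exact code in server/src/main.rs:

```rust
use std::future::Future;

use tokio::signal::unix::{signal, SignalKind};

// Rough sketch of the shutdown loop under discussion; not the actual Restate main loop.
async fn run_until_shutdown(mut tc_cancelled: impl Future<Output = ()> + Unpin) {
    let mut sigterm = signal(SignalKind::terminate()).expect("install SIGTERM handler");
    let mut shutdown = false;
    loop {
        tokio::select! {
            _ = sigterm.recv() => {
                if shutdown {
                    // A second signal forces an ungraceful shutdown.
                    on_ungraceful_shutdown();
                    break;
                }
                shutdown = true;
                on_graceful_shutdown();
            }
            _ = &mut tc_cancelled => {
                // Shutdown was requested by task center and it has completed.
                // If a task with OnError = "shutdown" failed, we also end up
                // here, which is what the question above is about: breaking
                // out of the loop without ever calling on_ungraceful_shutdown.
                break;
            }
        }
    }
}

fn on_graceful_shutdown() { /* ask the task center to shut down */ }
fn on_ungraceful_shutdown() { /* e.g. emergency WAL fsync, then exit */ }
```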
This was not correct. is_alive() should strictly return true only if the node is alive and not failing over. This fix improves the cluster controller's failover response time and allows it to fail over leader partitions as soon as it observes that a node shutdown has _started_.
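A minimal sketch of the semantics described in that fix (the field and method names are illustrative, not the actual Restate types):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative node state; not the actual Restate type.
struct NodeState {
    started: AtomicBool,
    shutdown_requested: AtomicBool,
}

impl NodeState {
    /// Strictly true only while the node is running and no shutdown
    /// (i.e. failover) has started. Returning true during shutdown would
    /// delay the cluster controller's failover of leader partitions.
    fn is_alive(&self) -> bool {
        self.started.load(Ordering::Acquire)
            && !self.shutdown_requested.load(Ordering::Acquire)
    }
}
```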
This introduces a few crucial changes to how we handle panics in restate. Prior to this change, we would abort the process at panic time without attempting a clean shutdown or a rocksdb WAL fsync.

The summary of changes is as follows:
- We now always unwind the stack on panics. TaskCenter is designed to catch panics of important tasks, trigger a clean shutdown, and report a non-zero exit code.
- Ensure that on graceful-shutdown timeout we attempt to cleanly flush/shutdown the rocksdb manager. This is important to avoid massive backfills of lost memtables after an unclean shutdown.
- Catch panics at the top-level task-center runtime control loop and trigger an emergency rocksdb WAL fsync, so that we do not lose the in-memory WAL buffer if/when we add support for manual WAL flushing in the future.
- Make sure that panics from network connection tasks do not trigger a system shutdown; instead, they are caught and properly logged. This avoids a situation where a bad network request/handler can cause the entire node to panic.
- In situations where tracing output might have been lost/dropped, ensure that we also log critical information on stderr.
- Allow the user to send a second signal (SIGTERM or SIGINT) to force the shutdown.
- The shutdown timeout is controlled by TaskCenter itself, which means that self-triggered shutdowns also respect the timeout.

This hardens restate against unclean crashes and ensures we perform a clean handoff to other cluster members in case of an unrecoverable crash.

Other PRs in this stack focus on more granular control over the shutdown process to unlock better hand-off and cleaner shutdowns.
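As a rough illustration of the unwind-catch-and-fsync idea described above (RocksDbManager, emergency_wal_sync, and the surrounding structure are simplified stand-ins rather than the actual Restate types, assuming the rust rocksdb bindings' flush_wal):

```rust
use std::panic::{self, AssertUnwindSafe};

// Simplified stand-in for the real rocksdb manager.
struct RocksDbManager {
    db: rocksdb::DB,
}

impl RocksDbManager {
    /// Best-effort emergency fsync of rocksdb's WAL so the in-memory WAL
    /// buffer is not lost while the process is going down.
    fn emergency_wal_sync(&self) {
        if let Err(e) = self.db.flush_wal(/* sync: */ true) {
            eprintln!("emergency WAL fsync failed: {e}");
        }
    }
}

/// Sketch of catching a panic at a control-loop boundary: with the default
/// panic = "unwind" strategy, the panic propagates up to here instead of
/// aborting the process, giving us a chance to flush the WAL and report a
/// non-zero exit code.
fn run_control_loop(db_manager: &RocksDbManager, body: impl FnOnce()) -> i32 {
    if panic::catch_unwind(AssertUnwindSafe(body)).is_err() {
        // Tracing output may already be lost at this point, so also log
        // straight to stderr.
        eprintln!("control loop panicked; attempting emergency WAL fsync");
        db_manager.emergency_wal_sync();
        return 1;
    }
    0
}
```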
We should keep an eye on that. The change here primarily focuses on panics, and if connections get stuck because of panics, we'll still see the panics logged as errors.
Stack created with Sapling. Best reviewed with ReviewStack.