[Core] Always unwind and ensure rocksdb's WAL is flushed on panics #3611
Conversation
This has been tested in debug and release profiles with random panic/fault injection while the db lock is held in the default runtime, and with panics from partition processor managed runtimes.
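For illustration only, this is roughly what that kind of random panic injection can look like; the helper name and probability are hypothetical and not taken from the PR's actual test setup:

```rust
use rand::Rng;

/// Hypothetical fault-injection helper: panics with the given probability.
/// A test would call this inside a critical section (e.g. while the db lock
/// is held) to exercise the unwind-and-flush path added by this PR.
fn maybe_panic(probability: f64) {
    if rand::thread_rng().gen_bool(probability) {
        panic!("injected fault while holding the db lock");
    }
}

fn main() {
    let db = std::sync::Mutex::new(0u64);
    let mut guard = db.lock().unwrap();
    // 10% chance of panicking while the lock is held.
    maybe_panic(0.1);
    *guard += 1;
}
```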
Is there any risk then that critical network connections can get silently 'stuck'? In some scenarios we might want a system shutdown rather than a node operating in half-closed mode or something like that.
Thanks for improving our emergency shutdown behavior @AhmedSoliman. The changes make sense to me. The one thing that I am a little bit nervous about is whether all our unmanaged tasks, Tokio tasks, and TaskCenter tasks that log on error handle panics correctly, or whether panics are simply swallowed by the system, causing it to get stuck. I think this is something we'll hopefully see in our tests. Unfortunately, I didn't manage to go through all our tasks during the review, but you've probably already taken care of this. +1 for merging if this is the case.
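For context on that concern: a plain `tokio::spawn` catches a task's panic and only surfaces it through the `JoinHandle`, so a panic in a task whose handle is never awaited can effectively disappear. A minimal sketch (not code from this PR):

```rust
#[tokio::main]
async fn main() {
    // The panic inside this task is caught by the Tokio runtime and stored
    // in the JoinHandle; it does not bring the process down on its own.
    let handle = tokio::spawn(async {
        panic!("boom");
    });

    // Only code that awaits the handle observes the panic, via JoinError.
    match handle.await {
        Err(e) if e.is_panic() => eprintln!("task panicked: {e}"),
        _ => println!("task finished normally"),
    }
    // If the handle were dropped instead of awaited, nothing would report
    // the panic, which is exactly the "silently swallowed" failure mode.
}
```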
server/src/main.rs
Outdated
shutdown = true;
_ = &mut tc_cancelled => {
    // Shutdown was requested by task center and it has completed.
    break;
If a `TaskKind` with `OnError = "shutdown"` fails, then the main thread will break here, right? If yes, would we then miss running `on_ungraceful_shutdown` because we complete the future with `()`?
Addressed in the latest changes.
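To make the scenario in this thread concrete, here is a rough sketch of the kind of shutdown select loop being discussed; the names (`tc_cancelled`, `on_graceful_shutdown`, `on_ungraceful_shutdown`) follow the conversation, not the exact code in server/src/main.rs:

```rust
use std::future::Future;

use tokio::signal::unix::{signal, SignalKind};

// Rough sketch of the shutdown loop under discussion; not the actual Restate main loop.
async fn run_until_shutdown(mut tc_cancelled: impl Future<Output = ()> + Unpin) {
    let mut sigterm = signal(SignalKind::terminate()).expect("install SIGTERM handler");
    let mut shutdown = false;
    loop {
        tokio::select! {
            _ = sigterm.recv() => {
                if shutdown {
                    // A second signal forces an ungraceful shutdown.
                    on_ungraceful_shutdown();
                    break;
                }
                shutdown = true;
                on_graceful_shutdown();
            }
            _ = &mut tc_cancelled => {
                // Shutdown was requested by task center and it has completed.
                // If a task with OnError = "shutdown" failed, we also end up
                // here, which is what the question above is about: breaking
                // out of the loop without ever calling on_ungraceful_shutdown.
                break;
            }
        }
    }
}

fn on_graceful_shutdown() { /* ask the task center to shut down */ }
fn on_ungraceful_shutdown() { /* e.g. emergency WAL fsync, then exit */ }
```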
This was not correct. is_alive() should strictly return true only if the node is alive and not failing over. This fix improves the cluster controller's failover response time and allows it to fail over leader partitions as soon as it observes that a node shutdown has _started_.
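A minimal sketch of the semantics described in that fix (the field and method names are illustrative, not the actual Restate types):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative node state; not the actual Restate type.
struct NodeState {
    started: AtomicBool,
    shutdown_requested: AtomicBool,
}

impl NodeState {
    /// Strictly true only while the node is running and no shutdown
    /// (i.e. failover) has started. Returning true during shutdown would
    /// delay the cluster controller's failover of leader partitions.
    fn is_alive(&self) -> bool {
        self.started.load(Ordering::Acquire)
            && !self.shutdown_requested.load(Ordering::Acquire)
    }
}
```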
This introduces a few crucial changes to how we handle panics in restate. Prior to this change, we would abort the process at panic time without attempting a clean shutdown or a rocksdb WAL fsync.

The summary of changes is as follows:
- We now always unwind the stack on panics. TaskCenter is designed to catch panics of important tasks, trigger a clean shutdown, and report a non-zero exit code.
- Ensure that on graceful-shutdown timeout we attempt to cleanly flush/shutdown the rocksdb manager. This is important to avoid massive backfills of lost memtables after an unclean shutdown.
- Catch panics at the top-level task-center runtime control loop and trigger an emergency rocksdb WAL fsync, so that we do not lose the in-memory WAL buffer if/when we add support for manual WAL flushing in the future.
- Make sure that panics from network connection tasks do not trigger a system shutdown; instead, they are caught and properly logged. This avoids a situation where a bad network request/handler can cause the entire node to panic.
- In situations where tracing output might have been lost/dropped, ensure that we also log critical information on stderr.
- Allow the user to send a second signal (SIGTERM or SIGINT) to force the shutdown.
- The shutdown timeout is controlled by TaskCenter itself, which means that self-triggered shutdowns also respect the timeout.

This hardens restate against unclean crashes and ensures we perform a clean handoff to other cluster members in case of an unrecoverable crash.

Other PRs in this stack focus on more granular control over the shutdown process to unlock better hand-off and cleaner shutdowns.
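As a rough illustration of the unwind-catch-and-fsync idea described above (RocksDbManager, emergency_wal_sync, and the surrounding structure are simplified stand-ins rather than the actual Restate types, assuming the rust rocksdb bindings' flush_wal):

```rust
use std::panic::{self, AssertUnwindSafe};

// Simplified stand-in for the real rocksdb manager.
struct RocksDbManager {
    db: rocksdb::DB,
}

impl RocksDbManager {
    /// Best-effort emergency fsync of rocksdb's WAL so the in-memory WAL
    /// buffer is not lost while the process is going down.
    fn emergency_wal_sync(&self) {
        if let Err(e) = self.db.flush_wal(/* sync: */ true) {
            eprintln!("emergency WAL fsync failed: {e}");
        }
    }
}

/// Sketch of catching a panic at a control-loop boundary: with the default
/// panic = "unwind" strategy, the panic propagates up to here instead of
/// aborting the process, giving us a chance to flush the WAL and report a
/// non-zero exit code.
fn run_control_loop(db_manager: &RocksDbManager, body: impl FnOnce()) -> i32 {
    if panic::catch_unwind(AssertUnwindSafe(body)).is_err() {
        // Tracing output may already be lost at this point, so also log
        // straight to stderr.
        eprintln!("control loop panicked; attempting emergency WAL fsync");
        db_manager.emergency_wal_sync();
        return 1;
    }
    0
}
```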
We should keep an eye on that. The change here primarily focuses on panics, and if connections get stuck because of panics, we'll still see the panics logged as errors.
Stack created with Sapling. Best reviewed with ReviewStack.