
Conversation

@rueian (Contributor) commented Jul 7, 2025

Why are these changes needed?

Following the previous discussion, this PR merges the flags used by Worker::MarkDead and Worker::KillAsync.

Currently, Worker::MarkDead is only called after we ask the worker to shut down via the Exit RPC. After the Exit RPC, the worker should shut down by itself.

This change essentially prevents sending SIGTERM from the node manager to a worker once its Worker::MarkDead method has been called, ensuring that we don't interrupt its shutdown process.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the replace-worker-markdead branch 2 times, most recently from f9e3499 to 1eec2b8 Compare July 8, 2025 19:46
@rueian rueian marked this pull request as ready for review July 8, 2025 22:40
@Copilot Copilot AI review requested due to automatic review settings July 8, 2025 22:40
@rueian rueian added the go add ONLY when ready to merge, run all tests label Jul 8, 2025

@rueian (Contributor, Author) commented Jul 8, 2025

Hi @codope, could you help review this? Thanks!

@israbbani israbbani self-assigned this Jul 9, 2025
@israbbani (Contributor) left a comment:

A few comments about atomics. I'll leave the shutdown behavior review to @codope.

@codope (Contributor) left a comment:

Is this change in behavior strictly necessary? Previously you could distinguish between "marked for death" and "actually killing". Now we send SIGTERM to workers after failed Exit RPCs, which could be more disruptive if workers are legitimately unable to exit immediately (e.g., due to ongoing object references). I think this might be an issue for borrowers: if workers are killed forcefully, they cannot send WaitForRefRemoved messages to object owners, which could cause other workers to suddenly lose access to objects they were borrowing. In the case of KillAsync, we would still want to allow idle workers to drain their references properly. I am wondering if we should retain MarkDead() for graceful shutdown and use KillAsync() only for truly unresponsive workers.

void Worker::MarkDead() { dead_ = true; }

-bool Worker::IsDead() const { return dead_; }
+bool Worker::IsDead() const { return killing_.load(); }
Review comment:

Let's add a doc to clarify the new semantics of IsDead()

Comment on lines 1250 to 1252
// If the Exit RPC request failed, we still do a graceful kill.
// If the worker is dead already, KillAsync will be a no-op.
idle_worker->KillAsync(*io_service_, false);
Review comment:

Do we have tests to validate the two scenarios mentioned in the comment?

@@ -989,7 +990,7 @@ TEST_F(WorkerPoolDriverRegisteredTest, PopWorkerForRequestWithRootDetachedActor)
ASSERT_NE(worker_pool_->PopWorkerSync(task_spec_job_1_detached_actor_1),
worker_job_2_no_detached_actor);
ASSERT_EQ(worker_pool_->GetIdleWorkerSize(), 1);
-worker_job_2_no_detached_actor->MarkDead();
+worker_job_2_no_detached_actor->KillAsync(io_service, true);
Review comment:

All these tests are doing force kill. Let's also cover when force is false.

@@ -989,7 +990,7 @@ TEST_F(WorkerPoolDriverRegisteredTest, PopWorkerForRequestWithRootDetachedActor)
ASSERT_NE(worker_pool_->PopWorkerSync(task_spec_job_1_detached_actor_1),
worker_job_2_no_detached_actor);
ASSERT_EQ(worker_pool_->GetIdleWorkerSize(), 1);
-worker_job_2_no_detached_actor->MarkDead();
+worker_job_2_no_detached_actor->KillAsync(io_service, true);
worker_pool_->TryKillingIdleWorkers();
Review comment:

Do we need to call TryKillingIdleWorkers if force kill is true?

@rueian (Contributor, Author) commented Jul 9, 2025

Thanks for the input @codope. It is super clear now that we should not send SIGTERM to workers after the Exit calls to them succeed.

Now we send SIGTERM to workers after failed Exit RPCs, which could be more disruptive if workers are legitimately unable to exit immediately (e.g., due to ongoing object references).

Do you mean that we should also not send SIGTERM in the case of !status.ok()?

@codope (Contributor) commented Jul 9, 2025

Do you mean that we should also not send SIGTERM in the case of !status.ok()?

Yes, I think we should not send SIGTERM in the !status.ok() case, at least not immediately. The RPC failure is ambiguous - it could be a network issue, and the worker might still be able to drain references properly. Consider implementing retry logic first (if there's already a retry then we're good), then process liveness checks, and only use SIGTERM as a last resort with longer timeouts.

@codope codope self-assigned this Jul 9, 2025
@cszhu cszhu added community-contribution Contributed by the community core Issues that should be addressed in Ray Core labels Jul 10, 2025
@rueian rueian force-pushed the replace-worker-markdead branch from 1eec2b8 to d75bfb6 Compare July 10, 2025 17:35
@rueian rueian changed the title [core] Replace Worker::MarkDead with Worker::KillAsync [core] prevent sending SIGTERM after calling Worker::MarkDead Jul 10, 2025
Signed-off-by: Rueian <rueian@anyscale.com>
@rueian rueian force-pushed the replace-worker-markdead branch from d75bfb6 to 5da4d13 Compare July 10, 2025 17:45
@rueian (Contributor, Author) commented Jul 10, 2025

Do you mean that we should also not send SIGTERM in the case of !status.ok()?

Yes, I think we should not send SIGTERM in the !status.ok() case, at least not immediately. The RPC failure is ambiguous - it could be a network issue, and the worker might still be able to drain references properly. Consider implementing retry logic first (if there's already a retry then we're good), then process liveness checks, and only use SIGTERM as a last resort with longer timeouts.

Ok! Now, I only merge the flags but keep the MarkDead method. This PR now essentially prevents sending SIGTERM after the Exit RPC is called, ensuring that the node manager doesn't interrupt worker shutdown.

Just curious, shouldn't the SIGTERM handling in the core worker do nothing after the Exit RPC is called?

@rueian rueian requested review from israbbani and codope July 14, 2025 15:06
@codope (Contributor) left a comment:

@rueian Thanks for addressing the comments. Looks good to me.
To answer your question: the SIGTERM handler doesn't run on the C++ io service event loop --

sigaddset(&mask, SIGTERM);

It's handled in the main thread.

@codope codope requested a review from a team as a code owner July 21, 2025 10:55
@edoakes edoakes merged commit 1de822f into ray-project:master Jul 21, 2025
5 checks passed
alimaazamat pushed a commit to alimaazamat/ray that referenced this pull request Jul 24, 2025
krishnakalyan3 pushed a commit to krishnakalyan3/ray that referenced this pull request Jul 30, 2025