
Conversation

@rueian (Contributor) commented Jul 7, 2025

Why are these changes needed?

Following the previous discussion, this PR merges the flags used by Worker::MarkDead and Worker::KillAsync.

Currently, Worker::MarkDead is only called after we ask the worker to shut down via the Exit RPC. After the Exit RPC, the worker should shut down by itself.

This change essentially prevents sending SIGTERM from the node manager to a worker once its Worker::MarkDead method has been called, ensuring that we don't interrupt its shutdown process.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the replace-worker-markdead branch 2 times, most recently from f9e3499 to 1eec2b8 Compare July 8, 2025 19:46
@rueian rueian marked this pull request as ready for review July 8, 2025 22:40
@Copilot Copilot AI review requested due to automatic review settings July 8, 2025 22:40
@rueian rueian added the go add ONLY when ready to merge, run all tests label Jul 8, 2025

@rueian (Contributor, Author) commented Jul 8, 2025

Hi @codope, could you help review this? Thanks!

@israbbani israbbani self-assigned this Jul 9, 2025
@israbbani (Contributor) left a comment:

A few comments about atomics. I'll leave the shutdown behavior review to @codope.

@codope (Contributor) left a comment:

Is this change in behavior strictly necessary? Previously you could distinguish between "marked for death" and "actually killing". Now we send SIGTERM to workers after failed Exit RPCs, which could be more disruptive if workers are legitimately unable to exit immediately (e.g., due to ongoing object references). I think this might be an issue for borrowers: if workers are killed forcefully, they cannot send WaitForRefRemoved messages to object owners, which could cause other workers to suddenly lose access to objects they were borrowing. In the case of KillAsync, we would still want to allow idle workers to drain their references properly. I am wondering if we should retain MarkDead() for graceful shutdown and use KillAsync() only for truly unresponsive workers.

void Worker::MarkDead() { dead_ = true; }

-bool Worker::IsDead() const { return dead_; }
+bool Worker::IsDead() const { return killing_.load(); }
Review comment:

Let's add a doc to clarify the new semantics of IsDead()

Comment on lines 1250 to 1252
// If the Exit RPC request failed, we still do a graceful kill.
// If the worker is dead already, KillAsync will be a no-op.
idle_worker->KillAsync(*io_service_, false);
Review comment:

Do we have tests to validate the two scenarios mentioned in the comment?

@@ -989,7 +990,7 @@ TEST_F(WorkerPoolDriverRegisteredTest, PopWorkerForRequestWithRootDetachedActor)
ASSERT_NE(worker_pool_->PopWorkerSync(task_spec_job_1_detached_actor_1),
worker_job_2_no_detached_actor);
ASSERT_EQ(worker_pool_->GetIdleWorkerSize(), 1);
-worker_job_2_no_detached_actor->MarkDead();
+worker_job_2_no_detached_actor->KillAsync(io_service, true);
Review comment:

All these tests are doing force kill. Let's also cover when force is false.

@@ -989,7 +990,7 @@ TEST_F(WorkerPoolDriverRegisteredTest, PopWorkerForRequestWithRootDetachedActor)
ASSERT_NE(worker_pool_->PopWorkerSync(task_spec_job_1_detached_actor_1),
worker_job_2_no_detached_actor);
ASSERT_EQ(worker_pool_->GetIdleWorkerSize(), 1);
-worker_job_2_no_detached_actor->MarkDead();
+worker_job_2_no_detached_actor->KillAsync(io_service, true);
worker_pool_->TryKillingIdleWorkers();
Review comment:

Do we need to call TryKillingIdleWorkers if force kill is true?

@rueian (Contributor, Author) commented Jul 9, 2025

Thanks for the input @codope. It is super clear now that we should not send SIGTERM to workers after the Exit calls to them succeed.

Now we send SIGTERM to workers after failed Exit RPCs, which could be more disruptive if workers are legitimately unable to exit immediately (e.g., due to ongoing object references).

Do you mean that we should also not send SIGTERM in the case of !status.ok()?

@codope (Contributor) commented Jul 9, 2025

Do you mean that we should also not send SIGTERM in the case of !status.ok()?

Yes, I think we should not send SIGTERM in the !status.ok() case, at least not immediately. The RPC failure is ambiguous - it could be a network issue, and the worker might still be able to drain references properly. Consider implementing retry logic first (if there's already a retry then we're good), then process liveness checks, and only use SIGTERM as a last resort with longer timeouts.

@codope codope self-assigned this Jul 9, 2025
@cszhu cszhu added community-contribution Contributed by the community core Issues that should be addressed in Ray Core labels Jul 10, 2025
@rueian rueian force-pushed the replace-worker-markdead branch from 1eec2b8 to d75bfb6 Compare July 10, 2025 17:35
@rueian rueian changed the title [core] Replace Worker::MarkDead with Worker::KillAsync [core] prevent sending SIGTERM after calling Worker::MarkDead Jul 10, 2025
Signed-off-by: Rueian <rueian@anyscale.com>
@rueian rueian force-pushed the replace-worker-markdead branch from d75bfb6 to 5da4d13 Compare July 10, 2025 17:45
@rueian (Contributor, Author) commented Jul 10, 2025

Do you mean that we should also not send SIGTERM in the case of !status.ok()?

Yes, I think we should not send SIGTERM in the !status.ok() case, at least not immediately. The RPC failure is ambiguous - it could be a network issue, and the worker might still be able to drain references properly. Consider implementing retry logic first (if there's already a retry then we're good), then process liveness checks, and only use SIGTERM as a last resort with longer timeouts.

Ok! Now, I only merge the flags but keep the MarkDead method. This PR now essentially prevents sending SIGTERM after the Exit RPC is called, ensuring that the node manager doesn't interrupt worker shutdown.

Just curious, shouldn't the SIGTERM handling in the core worker do nothing after the Exit RPC is called?

@rueian rueian requested review from israbbani and codope July 14, 2025 15:06
@codope (Contributor) left a comment:

@rueian Thanks for addressing the comments. Looks good to me.
To answer your question: the SIGTERM handler doesn't run on the C++ io service event loop --

sigaddset(&mask, SIGTERM);

It's handled in the main thread.

@codope codope requested a review from a team as a code owner July 21, 2025 10:55
@edoakes edoakes merged commit 1de822f into ray-project:master Jul 21, 2025
5 checks passed
alimaazamat pushed a commit to alimaazamat/ray that referenced this pull request Jul 24, 2025
krishnakalyan3 pushed a commit to krishnakalyan3/ray that referenced this pull request Jul 30, 2025