Skip to content

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Jun 12, 2025

Why are these changes needed?

While doing the #53562, we decided to refactor the NodeManager first to allow us to inject a WorkerPoolInterface implementation to it from the main.cc. This PR does the refactoring. That is:

  1. Updated the WorkerPoolInterface to cover all methods of WorkerPool. Previously the interface was only a subset.
  2. Updated all the existing mock implementations of WorkerPoolInterface to cover new missing methods.
  3. Replaced WorkerPool worker_pool_ to WorkerPoolInterface &worker_pool_ in the NodeManger so that we can swap it out for testing, which is required by [core] fix detached actor being unexpectedly killed #53562.
  4. Modified the NodeManager constructor to accept a WorkerPoolInterface &worker_pool_.
  5. In addition to the new WorkerPoolInterface &worker_pool_ injection, we also need to inject all its dependencies. So we ended up with all the following are constructed and owned in the main.cc:
  std::shared_ptr<plasma::PlasmaClient> plasma_client;
  std::shared_ptr<ray::raylet::NodeManager> node_manager;
  std::shared_ptr<ray::rpc::ClientCallManager> client_call_manager;
  std::shared_ptr<ray::rpc::CoreWorkerClientPool> worker_rpc_pool;
  std::shared_ptr<ray::raylet::WorkerPoolInterface> worker_pool;
  std::shared_ptr<ray::raylet::LocalObjectManager> local_object_manager;
  std::shared_ptr<ray::ClusterResourceScheduler> cluster_resource_scheduler;
  std::shared_ptr<ray::raylet::LocalTaskManager> local_task_manager;
  std::shared_ptr<ray::raylet::ClusterTaskManagerInterface> cluster_task_manager;
  std::shared_ptr<ray::pubsub::SubscriberInterface> core_worker_subscriber;
  std::shared_ptr<ray::IObjectDirectory> object_directory;
  std::shared_ptr<ray::ObjectManagerInterface> object_manager;
  std::shared_ptr<ray::raylet::DependencyManager> dependency_manager;
  absl::flat_hash_map<WorkerID, std::shared_ptr<ray::raylet::WorkerInterface>> leased_workers;

This PR does not introduce any behavioral changes.

Related issue number

Related to
#53562 #40864
ray-project/kuberay#3701 and ray-project/kuberay#3700

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian force-pushed the refactor-nodemanager-deps branch 3 times, most recently from b0765c8 to 174f86a Compare June 12, 2025 22:21
@rueian rueian changed the title [core] Move dependencies of NodeManger to main.cc [core] Move dependencies of NodeManger to main.cc for better testability Jun 12, 2025
@rueian rueian force-pushed the refactor-nodemanager-deps branch from 174f86a to cdd99e4 Compare June 12, 2025 22:26
@@ -2958,31 +2847,6 @@ std::optional<syncer::RaySyncMessage> NodeManager::CreateSyncMessage(
return std::make_optional(std::move(msg));
}

void NodeManager::PublishInfeasibleTaskError(const RayTask &task) const {
Copy link
Contributor Author

@rueian rueian Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is moved to the announce_infeasible_task in main.cc.

@@ -228,7 +280,7 @@ inline std::ostream &operator<<(std::ostream &os,
///
/// The WorkerPool is responsible for managing a pool of Workers. Each Worker
/// is a container for a unit of work.
class WorkerPool : public WorkerPoolInterface, public IOWorkerPoolInterface {
class WorkerPool : public WorkerPoolInterface {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkerPoolInterface already covers IOWorkerPoolInterface now.

@@ -466,10 +500,322 @@ int main(int argc, char *argv[]) {
{ray::stats::SessionNameKey, session_name}};
ray::stats::Init(global_tags, metrics_agent_port, WorkerID::Nil());

ray::NodeID raylet_node_id = ray::NodeID::FromHex(node_id);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving out raylet_node_id to fix its lifetime.

@rueian rueian marked this pull request as ready for review June 12, 2025 23:40
@edoakes edoakes requested a review from a team June 13, 2025 01:11
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jun 13, 2025
@edoakes edoakes enabled auto-merge (squash) June 13, 2025 18:39
@edoakes edoakes merged commit 844a0f5 into ray-project:master Jun 13, 2025
7 checks passed
@dayshah
Copy link
Contributor

dayshah commented Jun 13, 2025

late to this review and ok with this for now

But in the future we should avoid just making everything a shared ptr just to avoid thinking about lifetimes. It isn't necessary for most of these things to be shared ptr's and by doing this, we're making the lifetimes of everything more complicated. The lifetime of everything should be obvious from looking at one spot in the code. This also makes it easier to reason about shutdown. I also don't see why we need to create everything in main instead of raylet.cc if the tests are just trying to inject into the node manager.

cc @edoakes

elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…ity (#53782)

## Why are these changes needed?

While doing the #53562, we
[decided](#53562 (comment))
to refactor the `NodeManager` first to allow us to inject a
`WorkerPoolInterface` implementation to it from the `main.cc`. This PR
does the refactoring. That is:

1. Updated the `WorkerPoolInterface` to cover all methods of
`WorkerPool`. Previously the interface was only a subset.
2. Updated all the existing mock implementations of
`WorkerPoolInterface` to cover new missing methods.
3. Replaced `WorkerPool worker_pool_` to `WorkerPoolInterface
&worker_pool_` in the `NodeManger` so that we can swap it out for
testing, which is required by
#53562.
4. Modified the `NodeManager` constructor to accept a
`WorkerPoolInterface &worker_pool_`.
5. In addition to the new `WorkerPoolInterface &worker_pool_` injection,
we also need to inject all its dependencies. So we ended up with all the
following are constructed and owned in the `main.cc`:

```c
  std::shared_ptr<plasma::PlasmaClient> plasma_client;
  std::shared_ptr<ray::raylet::NodeManager> node_manager;
  std::shared_ptr<ray::rpc::ClientCallManager> client_call_manager;
  std::shared_ptr<ray::rpc::CoreWorkerClientPool> worker_rpc_pool;
  std::shared_ptr<ray::raylet::WorkerPoolInterface> worker_pool;
  std::shared_ptr<ray::raylet::LocalObjectManager> local_object_manager;
  std::shared_ptr<ray::ClusterResourceScheduler> cluster_resource_scheduler;
  std::shared_ptr<ray::raylet::LocalTaskManager> local_task_manager;
  std::shared_ptr<ray::raylet::ClusterTaskManagerInterface> cluster_task_manager;
  std::shared_ptr<ray::pubsub::SubscriberInterface> core_worker_subscriber;
  std::shared_ptr<ray::IObjectDirectory> object_directory;
  std::shared_ptr<ray::ObjectManagerInterface> object_manager;
  std::shared_ptr<ray::raylet::DependencyManager> dependency_manager;
  absl::flat_hash_map<WorkerID, std::shared_ptr<ray::raylet::WorkerInterface>> leased_workers;
```

This PR does not introduce any behavioral changes.

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

Related to
#53562
#40864
ray-project/kuberay#3701 and
ray-project/kuberay#3700

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@israbbani
Copy link
Contributor

But in the future we should avoid just making everything a shared ptr just to avoid thinking about lifetimes. It isn't necessary for most of these things to be shared ptr's and by doing this, we're making the lifetimes of everything more complicated. The lifetime of everything should be obvious from looking at one spot in the code. This also makes it easier to reason about shutdown. I also don't see why we need to create everything in main instead of raylet.cc if the tests are just trying to inject into the node manager.

I don't disagree. Can you provide a concrete example or a link that explains this problem and the pitfalls of using shared_ptrs everywhere?

@dayshah
Copy link
Contributor

dayshah commented Jun 25, 2025

But in the future we should avoid just making everything a shared ptr just to avoid thinking about lifetimes. It isn't necessary for most of these things to be shared ptr's and by doing this, we're making the lifetimes of everything more complicated. The lifetime of everything should be obvious from looking at one spot in the code. This also makes it easier to reason about shutdown. I also don't see why we need to create everything in main instead of raylet.cc if the tests are just trying to inject into the node manager.

I don't disagree. Can you provide a concrete example or a link that explains this problem and the pitfalls of using shared_ptrs everywhere?

https://ddanilov.me/shared-ptr-is-evil/

edoakes pushed a commit that referenced this pull request Jun 25, 2025
… references in raylet (#54062)

Following the suggestion in
#53782 (comment),
this PR replace all unnecessary shared_ptrs with unique_ptrs and
references in Raylet.

Signed-off-by: Rueian <rueian@anyscale.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
… references in raylet (ray-project#54062)

Following the suggestion in
ray-project#53782 (comment),
this PR replace all unnecessary shared_ptrs with unique_ptrs and
references in Raylet.

Signed-off-by: Rueian <rueian@anyscale.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
…ity (#53782)

## Why are these changes needed?

While doing the #53562, we
[decided](#53562 (comment))
to refactor the `NodeManager` first to allow us to inject a
`WorkerPoolInterface` implementation to it from the `main.cc`. This PR
does the refactoring. That is:

1. Updated the `WorkerPoolInterface` to cover all methods of
`WorkerPool`. Previously the interface was only a subset.
2. Updated all the existing mock implementations of
`WorkerPoolInterface` to cover new missing methods.
3. Replaced `WorkerPool worker_pool_` to `WorkerPoolInterface
&worker_pool_` in the `NodeManger` so that we can swap it out for
testing, which is required by
#53562.
4. Modified the `NodeManager` constructor to accept a
`WorkerPoolInterface &worker_pool_`.
5. In addition to the new `WorkerPoolInterface &worker_pool_` injection,
we also need to inject all its dependencies. So we ended up with all the
following are constructed and owned in the `main.cc`:

```c
  std::shared_ptr<plasma::PlasmaClient> plasma_client;
  std::shared_ptr<ray::raylet::NodeManager> node_manager;
  std::shared_ptr<ray::rpc::ClientCallManager> client_call_manager;
  std::shared_ptr<ray::rpc::CoreWorkerClientPool> worker_rpc_pool;
  std::shared_ptr<ray::raylet::WorkerPoolInterface> worker_pool;
  std::shared_ptr<ray::raylet::LocalObjectManager> local_object_manager;
  std::shared_ptr<ray::ClusterResourceScheduler> cluster_resource_scheduler;
  std::shared_ptr<ray::raylet::LocalTaskManager> local_task_manager;
  std::shared_ptr<ray::raylet::ClusterTaskManagerInterface> cluster_task_manager;
  std::shared_ptr<ray::pubsub::SubscriberInterface> core_worker_subscriber;
  std::shared_ptr<ray::IObjectDirectory> object_directory;
  std::shared_ptr<ray::ObjectManagerInterface> object_manager;
  std::shared_ptr<ray::raylet::DependencyManager> dependency_manager;
  absl::flat_hash_map<WorkerID, std::shared_ptr<ray::raylet::WorkerInterface>> leased_workers;
```

This PR does not introduce any behavioral changes.

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

Related to
#53562
#40864
ray-project/kuberay#3701 and
ray-project/kuberay#3700

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
… references in raylet (#54062)

Following the suggestion in
#53782 (comment),
this PR replace all unnecessary shared_ptrs with unique_ptrs and
references in Raylet.

Signed-off-by: Rueian <rueian@anyscale.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants