-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[core] Move dependencies of NodeManger to main.cc for better testability #53782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b0765c8
to
174f86a
Compare
Signed-off-by: Rueian <rueiancsie@gmail.com>
174f86a
to
cdd99e4
Compare
@@ -2958,31 +2847,6 @@ std::optional<syncer::RaySyncMessage> NodeManager::CreateSyncMessage( | |||
return std::make_optional(std::move(msg)); | |||
} | |||
|
|||
void NodeManager::PublishInfeasibleTaskError(const RayTask &task) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is moved to the announce_infeasible_task
in main.cc
.
@@ -228,7 +280,7 @@ inline std::ostream &operator<<(std::ostream &os, | |||
/// | |||
/// The WorkerPool is responsible for managing a pool of Workers. Each Worker | |||
/// is a container for a unit of work. | |||
class WorkerPool : public WorkerPoolInterface, public IOWorkerPoolInterface { | |||
class WorkerPool : public WorkerPoolInterface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WorkerPoolInterface
already covers IOWorkerPoolInterface
now.
@@ -466,10 +500,322 @@ int main(int argc, char *argv[]) { | |||
{ray::stats::SessionNameKey, session_name}}; | |||
ray::stats::Init(global_tags, metrics_agent_port, WorkerID::Nil()); | |||
|
|||
ray::NodeID raylet_node_id = ray::NodeID::FromHex(node_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving out raylet_node_id
to fix its lifetime.
late to this review and ok with this for now But in the future we should avoid just making everything a shared ptr just to avoid thinking about lifetimes. It isn't necessary for most of these things to be shared ptr's and by doing this, we're making the lifetimes of everything more complicated. The lifetime of everything should be obvious from looking at one spot in the code. This also makes it easier to reason about shutdown. I also don't see why we need to create everything in main instead of raylet.cc if the tests are just trying to inject into the node manager. cc @edoakes |
…ity (#53782) ## Why are these changes needed? While doing the #53562, we [decided](#53562 (comment)) to refactor the `NodeManager` first to allow us to inject a `WorkerPoolInterface` implementation to it from the `main.cc`. This PR does the refactoring. That is: 1. Updated the `WorkerPoolInterface` to cover all methods of `WorkerPool`. Previously the interface was only a subset. 2. Updated all the existing mock implementations of `WorkerPoolInterface` to cover new missing methods. 3. Replaced `WorkerPool worker_pool_` to `WorkerPoolInterface &worker_pool_` in the `NodeManger` so that we can swap it out for testing, which is required by #53562. 4. Modified the `NodeManager` constructor to accept a `WorkerPoolInterface &worker_pool_`. 5. In addition to the new `WorkerPoolInterface &worker_pool_` injection, we also need to inject all its dependencies. So we ended up with all the following are constructed and owned in the `main.cc`: ```c std::shared_ptr<plasma::PlasmaClient> plasma_client; std::shared_ptr<ray::raylet::NodeManager> node_manager; std::shared_ptr<ray::rpc::ClientCallManager> client_call_manager; std::shared_ptr<ray::rpc::CoreWorkerClientPool> worker_rpc_pool; std::shared_ptr<ray::raylet::WorkerPoolInterface> worker_pool; std::shared_ptr<ray::raylet::LocalObjectManager> local_object_manager; std::shared_ptr<ray::ClusterResourceScheduler> cluster_resource_scheduler; std::shared_ptr<ray::raylet::LocalTaskManager> local_task_manager; std::shared_ptr<ray::raylet::ClusterTaskManagerInterface> cluster_task_manager; std::shared_ptr<ray::pubsub::SubscriberInterface> core_worker_subscriber; std::shared_ptr<ray::IObjectDirectory> object_directory; std::shared_ptr<ray::ObjectManagerInterface> object_manager; std::shared_ptr<ray::raylet::DependencyManager> dependency_manager; absl::flat_hash_map<WorkerID, std::shared_ptr<ray::raylet::WorkerInterface>> leased_workers; ``` This PR does not introduce any behavioral changes. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Related to #53562 #40864 ray-project/kuberay#3701 and ray-project/kuberay#3700 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Rueian <rueiancsie@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
I don't disagree. Can you provide a concrete example or a link that explains this problem and the pitfalls of using shared_ptrs everywhere? |
|
… references in raylet (#54062) Following the suggestion in #53782 (comment), this PR replace all unnecessary shared_ptrs with unique_ptrs and references in Raylet. Signed-off-by: Rueian <rueian@anyscale.com> Signed-off-by: Rueian <rueiancsie@gmail.com>
… references in raylet (ray-project#54062) Following the suggestion in ray-project#53782 (comment), this PR replace all unnecessary shared_ptrs with unique_ptrs and references in Raylet. Signed-off-by: Rueian <rueian@anyscale.com> Signed-off-by: Rueian <rueiancsie@gmail.com>
…ity (#53782) ## Why are these changes needed? While doing the #53562, we [decided](#53562 (comment)) to refactor the `NodeManager` first to allow us to inject a `WorkerPoolInterface` implementation to it from the `main.cc`. This PR does the refactoring. That is: 1. Updated the `WorkerPoolInterface` to cover all methods of `WorkerPool`. Previously the interface was only a subset. 2. Updated all the existing mock implementations of `WorkerPoolInterface` to cover new missing methods. 3. Replaced `WorkerPool worker_pool_` to `WorkerPoolInterface &worker_pool_` in the `NodeManger` so that we can swap it out for testing, which is required by #53562. 4. Modified the `NodeManager` constructor to accept a `WorkerPoolInterface &worker_pool_`. 5. In addition to the new `WorkerPoolInterface &worker_pool_` injection, we also need to inject all its dependencies. So we ended up with all the following are constructed and owned in the `main.cc`: ```c std::shared_ptr<plasma::PlasmaClient> plasma_client; std::shared_ptr<ray::raylet::NodeManager> node_manager; std::shared_ptr<ray::rpc::ClientCallManager> client_call_manager; std::shared_ptr<ray::rpc::CoreWorkerClientPool> worker_rpc_pool; std::shared_ptr<ray::raylet::WorkerPoolInterface> worker_pool; std::shared_ptr<ray::raylet::LocalObjectManager> local_object_manager; std::shared_ptr<ray::ClusterResourceScheduler> cluster_resource_scheduler; std::shared_ptr<ray::raylet::LocalTaskManager> local_task_manager; std::shared_ptr<ray::raylet::ClusterTaskManagerInterface> cluster_task_manager; std::shared_ptr<ray::pubsub::SubscriberInterface> core_worker_subscriber; std::shared_ptr<ray::IObjectDirectory> object_directory; std::shared_ptr<ray::ObjectManagerInterface> object_manager; std::shared_ptr<ray::raylet::DependencyManager> dependency_manager; absl::flat_hash_map<WorkerID, std::shared_ptr<ray::raylet::WorkerInterface>> leased_workers; ``` This PR does not introduce any behavioral changes. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Related to #53562 #40864 ray-project/kuberay#3701 and ray-project/kuberay#3700 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Rueian <rueiancsie@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
… references in raylet (#54062) Following the suggestion in #53782 (comment), this PR replace all unnecessary shared_ptrs with unique_ptrs and references in Raylet. Signed-off-by: Rueian <rueian@anyscale.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Why are these changes needed?
While doing the #53562, we decided to refactor the
NodeManager
first to allow us to inject aWorkerPoolInterface
implementation to it from themain.cc
. This PR does the refactoring. That is:WorkerPoolInterface
to cover all methods ofWorkerPool
. Previously the interface was only a subset.WorkerPoolInterface
to cover new missing methods.WorkerPool worker_pool_
toWorkerPoolInterface &worker_pool_
in theNodeManger
so that we can swap it out for testing, which is required by [core] fix detached actor being unexpectedly killed #53562.NodeManager
constructor to accept aWorkerPoolInterface &worker_pool_
.WorkerPoolInterface &worker_pool_
injection, we also need to inject all its dependencies. So we ended up with all the following are constructed and owned in themain.cc
:This PR does not introduce any behavioral changes.
Related issue number
Related to
#53562 #40864
ray-project/kuberay#3701 and ray-project/kuberay#3700
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.