Open
Labels
bug · community-backlog · core · gpu-objects · stability
Description
What happened + What you expected to happen
I'm testing verl with GPU tensor RDMA:
- 1 group with 4 GPUs for the actor model: FSDP training
- 1 group with 4 GPUs for the rollout model: vLLM inference
I want to synchronize weights between the FSDP and vLLM groups, but I hit the error below.
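To make the intended flow concrete, here is a hypothetical, heavily simplified sketch of the weight-sync pattern the test exercises (no Ray, no torch): each "actor" worker holds a shard of the model weights, the shards are gathered into one state dict, and each "rollout" worker loads it. All names here (`ActorWorker`, `RolloutWorker`, `sync_weights`) are made up for illustration; the real script uses Ray actor groups and ships GPU tensors over RDMA.

```python
class ActorWorker:
    """Stands in for one FSDP training worker holding a weight shard."""
    def __init__(self, rank, shard):
        self.rank = rank
        self.shard = shard  # e.g. {"layer.weight": [0.1, 0.2]}

    def state_dict(self):
        # In the real setup this returns ObjectRefs to GPU tensors.
        return self.shard


class RolloutWorker:
    """Stands in for one vLLM inference worker."""
    def __init__(self):
        self.weights = {}

    def load_state_dict(self, state_dict):
        # In the real setup this would ray.get() the tensor refs first,
        # which is where the traceback below originates.
        self.weights.update(state_dict)


def sync_weights(actor_group, rollout_group):
    """Gather all shards into one dict and hand it to every rollout worker."""
    merged = {}
    for worker in actor_group:
        merged.update(worker.state_dict())
    for rollout in rollout_group:
        rollout.load_state_dict(merged)
    return merged
```

In the actual test both groups have 4 workers each; the sketch only shows the data flow, not the placement groups or the RDMA transport.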
Versions / Dependencies
I'm using the master branch at commit 8528231e886dfcf926f24e362dac12ef198a0cff:
git clone https://github.com/ray-project/ray.git
cd ray/python
pip install -e . --verbose
Reproduction script
git clone -b wuxibin/ray_gpu_object_store https://github.com/wuxibin89/verl.git
python3 tests/workers/rollout/test_ray_gpu_object_weight_sync.py
The test script outputs the following error:
(WorkerDict pid=56099) WARNING 07-26 23:59:38 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa9f9aa74d0> [repeated 3x across cluster]
Traceback (most recent call last):
  File "/opt/tiger/open_verl/tests/workers/rollout/test_ray_gpu_object_weight_sync.py", line 88, in <module>
    ray.get(rollout_group.load_state_dict(actor_state_dicts))
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 2847, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 948, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RaySystemError): ray::WorkerDict.rollout_load_state_dict() (pid=55429, ip=127.0.0.1, actor_id=d1c1baf3d1077be1ca546db201000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f13330d0650>)
  File "/opt/tiger/open_verl/verl/single_controller/ray/base.py", line 708, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/opt/tiger/open_verl/verl/single_controller/base/decorator.py", line 549, in inner
    return func(*args, **kwargs)
  File "/opt/tiger/open_verl/verl/workers/fsdp_workers.py", line 692, in load_state_dict
    state_dicts = ray.get(state_dicts)
ray.exceptions.RaySystemError: System error:
traceback: Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/experimental/channel/torch_tensor_type.py", line 128, in deserialize
    return ctx.serialization_context.deserialize_tensor(b, self.device)
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/experimental/channel/serialization_context.py", line 152, in deserialize_tensor
    assert placeholder < len(self._out_of_band_tensors)
AssertionError
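For context on what the failing assert checks, here is a hypothetical, simplified model of placeholder-based out-of-band tensor serialization (all names are illustrative, not Ray's actual implementation): during serialization each tensor is replaced by an integer index and shipped separately, and on deserialization that index is looked up in the received tensor list. The `AssertionError` above fires when a placeholder index points past the end of that list, i.e. fewer out-of-band tensors are present in the deserializing context than the serialized object expects.

```python
class SerializationContext:
    """Toy stand-in for a context that tracks out-of-band tensors."""

    def __init__(self):
        self._out_of_band_tensors = []

    def serialize_tensor(self, tensor):
        # Replace the tensor with its index in the out-of-band list.
        placeholder = len(self._out_of_band_tensors)
        self._out_of_band_tensors.append(tensor)
        return placeholder

    def deserialize_tensor(self, placeholder):
        # Mirrors the check at serialization_context.py line 152 in the
        # traceback: the placeholder must index into the tensor list.
        assert placeholder < len(self._out_of_band_tensors)
        return self._out_of_band_tensors[placeholder]
```

Deserializing a placeholder in a context whose tensor list was never populated (or only partially populated) reproduces the same `AssertionError`, which is consistent with the state dicts being deserialized on a worker that did not receive the corresponding GPU tensors.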
Issue Severity
None