[Core] verl gpu tensor rdma integration #54943

@wuxibin89

What happened + What you expected to happen

I'm testing verl with GPU tensor RDMA, using two worker groups:

  • 1 group with 4 GPUs for the actor model: FSDP training
  • 1 group with 4 GPUs for the rollout model: vLLM inference

I want to sync weights between the FSDP and vLLM groups, but the sync fails with the error shown under "Reproduction script".

@kevin85421

Versions / Dependencies

I'm using the Ray master branch at commit 8528231e886dfcf926f24e362dac12ef198a0cff, installed from source:

git clone https://github.com/ray-project/ray.git
cd ray/python
pip install -e . --verbose

Reproduction script

git clone -b wuxibin/ray_gpu_object_store https://github.com/wuxibin89/verl.git
python3 tests/workers/rollout/test_ray_gpu_object_weight_sync.py

The test script fails with the following error:

(WorkerDict pid=56099) WARNING 07-26 23:59:38 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa9f9aa74d0> [repeated 3x across cluster]
Traceback (most recent call last):
  File "/opt/tiger/open_verl/tests/workers/rollout/test_ray_gpu_object_weight_sync.py", line 88, in <module>
    ray.get(rollout_group.load_state_dict(actor_state_dicts))
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 2847, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 948, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RaySystemError): ray::WorkerDict.rollout_load_state_dict() (pid=55429, ip=127.0.0.1, actor_id=d1c1baf3d1077be1ca546db201000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f13330d0650>)
  File "/opt/tiger/open_verl/verl/single_controller/ray/base.py", line 708, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/open_verl/verl/single_controller/base/decorator.py", line 549, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/open_verl/verl/workers/fsdp_workers.py", line 692, in load_state_dict
    state_dicts = ray.get(state_dicts)
                  ^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RaySystemError: System error: 
traceback: Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/experimental/channel/torch_tensor_type.py", line 128, in deserialize
    return ctx.serialization_context.deserialize_tensor(b, self.device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/ray/experimental/channel/serialization_context.py", line 152, in deserialize_tensor
    assert placeholder < len(self._out_of_band_tensors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
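
The final assertion (`placeholder < len(self._out_of_band_tensors)`) suggests that the deserialized payload refers to a tensor placeholder for which the receiving side has no out-of-band tensor registered. The snippet below is only a toy model of that kind of placeholder lookup, written to illustrate how such an assertion can trip. It is not Ray's implementation, it does not claim to identify the root cause, and every name in it (`ToySerializationContext`, `serialize_tensor`, and so on) is hypothetical.

```python
# Toy model only: NOT Ray's implementation. All names here are hypothetical.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ToySerializationContext:
    """Holds tensors that travel out of band, indexed by integer placeholder."""

    out_of_band_tensors: List[Any] = field(default_factory=list)

    def serialize_tensor(self, tensor: Any) -> int:
        # The payload carries only an index; the tensor itself is stored aside.
        self.out_of_band_tensors.append(tensor)
        return len(self.out_of_band_tensors) - 1

    def deserialize_tensor(self, placeholder: int) -> Any:
        # Same shape of check as the assertion shown in the traceback.
        assert placeholder < len(self.out_of_band_tensors), (
            f"placeholder {placeholder} has no out-of-band tensor "
            f"({len(self.out_of_band_tensors)} registered)"
        )
        return self.out_of_band_tensors[placeholder]


# Sender replaces each tensor (plain lists stand in for tensors) with an index.
sender = ToySerializationContext()
payload = {
    "weight": sender.serialize_tensor([1.0, 2.0]),
    "bias": sender.serialize_tensor([0.5]),
}

# A receiver whose out-of-band list was populated can resolve every placeholder.
good_receiver = ToySerializationContext(list(sender.out_of_band_tensors))
restored = {name: good_receiver.deserialize_tensor(idx) for name, idx in payload.items()}

# A receiver that got the payload but not the tensors fails the same check.
empty_receiver = ToySerializationContext()
try:
    empty_receiver.deserialize_tensor(payload["weight"])
    failed = False
except AssertionError:
    failed = True
```

In this toy model the failure appears whenever the payload reaches a context whose out-of-band list was never filled. Whether that is what happens in the `ray.get(state_dicts)` call inside `load_state_dict` is an open question for this issue.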

Issue Severity

None

Labels

bug, community-backlog, core, gpu-objects, stability
