[RFC] Improving Ray for Post-Training / RL for LLM Projects #54021

Over the past year, many projects have launched that use Ray to scale out post-training and RL for LLMs. We want to ensure that Ray remains a great fit for these use cases and to address any bugs or usability gaps in Ray.

We've spoken to a variety of project creators over the last couple of weeks and have gotten great feedback.

Below is the list of issues and features we have identified so far and plan to address. We'd also be eager to hear any other feedback.


## List of issues to address:

- [ ] ActorGroup / ActorMesh abstraction https://github.com/ray-project/ray/pull/54760 (cc @pcmoritz)
- [ ] Fast GPU object transfer (GPU Objects): [RFC](https://github.com/ray-project/ray/issues/51173) [Issues](https://github.com/ray-project/ray/issues?q=is%3Aissue%20state%3Aopen%20%20label%3Agpu-objects)

Core-specific issues:
- [ ] Add docs to debug `SYSTEM_ERROR` (cc @jjyao)
  - Related issues: [https://github.com/volcengine/verl/issues/1595](https://github.com/volcengine/verl/issues/1595), [https://github.com/volcengine/verl/pull/1331#issuecomment-2852980345](https://github.com/volcengine/verl/pull/1331#issuecomment-2852980345), [https://github.com/volcengine/verl/issues/1299](https://github.com/volcengine/verl/issues/1299), and [https://github.com/volcengine/verl/issues/472](https://github.com/volcengine/verl/issues/472)
- [x] Too many threads / Add documentation on thread control  
  - [https://github.com/volcengine/verl/issues/719](https://github.com/volcengine/verl/issues/719)  
  - [https://github.com/volcengine/verl/issues/1335](https://github.com/volcengine/verl/issues/1335) 
  - Fixed by https://github.com/ray-project/ray/pull/54988
- [ ] More observability for hanging workloads  
  - [https://github.com/volcengine/verl/issues/242](https://github.com/volcengine/verl/issues/242)  
  - https://github.com/volcengine/verl/issues/1126  
- [ ] Improve Ray type annotations: https://github.com/ray-project/ray/issues/54149
- [ ] https://github.com/ray-project/ray/issues/55212 
- [ ] https://github.com/ray-project/ray/issues/55171
- [ ] Improve `CUDA_VISIBLE_DEVICES` handling
- [ ] Make uv faster
  - [https://github.com/NVIDIA/NeMo-RL/issues/46](https://github.com/NVIDIA/NeMo-RL/issues/46)
- [ ] https://github.com/ray-project/ray/issues/54989
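
As background for the `CUDA_VISIBLE_DEVICES` item above: Ray sets this variable for workers that request GPUs, so the device indices a process sees are logical rather than physical. The sketch below is only an illustration of that mapping using the Python standard library; the helper names `visible_devices` and `physical_id` are hypothetical and are not part of Ray or of any project listed here, and it does not describe the fix that is planned.

```python
import os
from typing import List, Mapping, Optional


def visible_devices(env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset.

    Entries are kept as strings because they may be indices or GPU UUIDs.
    (Hypothetical helper, for illustration only.)
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: the process is not restricted to a subset
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def physical_id(logical_index: int, env: Optional[Mapping[str, str]] = None) -> str:
    """Map a logical device index (as seen by the process) to its listed ID."""
    devices = visible_devices(env)
    if devices is None:
        return str(logical_index)  # no remapping in effect
    if not 0 <= logical_index < len(devices):
        raise IndexError(
            f"logical device {logical_index} is not visible; visible={devices}"
        )
    return devices[logical_index]
```

For example, with `CUDA_VISIBLE_DEVICES=2,3`, logical device 1 corresponds to the entry `"3"`. An empty value means no devices are visible, so any lookup raises `IndexError`.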

## Open Questions
* Is there anything we should do to improve SLURM support?

## Key Projects

* VeRL (cc @eric-haibin-lin)
* NeMo-RL (cc @terrykong)
* OpenRLHF (cc @xiaoxigua999 @hijkzzz)
* ROLL (cc @PanAndy @StephenRi @wwxFromTju)
* AReaL (cc @garrett4wade)
* SkyRL (cc @tyler-griggs @caoshiyi @lynnliu030 @DachengLi1)


We welcome participation from everyone, and please let us know if there are other items we should address.

cc @robertnishihara  @SumanthRH @erictang000 @kouroshHakha @kevin85421 @stephanie-wang 
