Open
Labels
RFC, core, llm, observability, performance, usability
Description
Over the past year, many projects have been launched leveraging Ray to scale out post-training and RL for LLMs. From our perspective, we’d like to ensure that Ray continues to be a great fit for these use cases and address any bugs or usability gaps in Ray.
We've spoken to a variety of project creators over the last couple of weeks and have gotten great feedback.
Below is the list of issues and features we currently plan to address. We’d also be eager to hear any other feedback.
List of issues to address:
- ActorGroup / ActorMesh abstraction: [WIP] [Prototype] ActorMesh API #54760 (cc @pcmoritz)
- Fast GPU object transfer (GPU Objects): RFC Issues
Core specific issues:
- Add docs to debug SYSTEM_ERROR (cc @jjyao). Related issues:
  - The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR volcengine/verl#1595
  - fix: ray worker exit with SYSTEM_ERROR caused by SIGALRM from math re… volcengine/verl#1331 (comment)
  - "A worker died or was killed by an unexpected system error" when the training completes volcengine/verl#1299
  - A worker died or was killed while executing a task by an unexpected system error volcengine/verl#472
- Too many threads / Add documentation on thread control
- More observability for hanging workloads
- Improving Ray Typing annotation: [core] Improving Ray Typing annotation #54149
- [core/dashboard] Separate actor scheduling vs slow initialization ('pending_creation is not clear') #55212
- [core] Unserializable Exceptions should fallback gracefully #55171
- Improve CUDA_VISIBLE_DEVICES handling
- Make uv faster
- [core] Functionality to kill a specific Ray cluster with ray stop #54989
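To make the "Unserializable Exceptions should fallback gracefully" item above more concrete, here is a minimal sketch of one possible fallback behavior. It is an illustration only, not Ray's implementation or the design proposed in #55171: it uses the standard library `pickle` as a stand-in for Ray's own serialization path, and the names `UnpicklableError` and `to_transportable` are hypothetical.

```python
import pickle
import threading


class UnpicklableError(Exception):
    """Hypothetical user exception that carries a non-picklable attribute."""

    def __init__(self, message):
        super().__init__(message)
        self.lock = threading.Lock()  # thread locks cannot be pickled


def to_transportable(exc):
    """Return `exc` unchanged if it survives a pickle round trip.

    Otherwise return a plain RuntimeError that keeps the original type name
    and message, so the caller still sees a useful error. This is an assumed
    fallback policy for illustration, not Ray's actual behavior.
    """
    try:
        # Round-trip rather than only dumps(): some exceptions serialize
        # fine but fail when they are reconstructed on the other side.
        pickle.loads(pickle.dumps(exc))
        return exc
    except Exception:
        return RuntimeError(
            f"{type(exc).__name__}: {exc} "
            "(original exception was not serializable)"
        )


if __name__ == "__main__":
    print(repr(to_transportable(ValueError("fine"))))
    print(repr(to_transportable(UnpicklableError("boom"))))
```

The point of the sketch is that the failure to serialize is handled at the boundary, so the original error message reaches the caller instead of being replaced by a serialization error.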
Open Questions
- Is there anything we should do to improve SLURM support?
Key Projects
- VeRL (cc @eric-haibin-lin)
- NemoRL (cc @terrykong)
- OpenRLHF (cc @xiaoxigua999 @hijkzzz)
- ROLL (cc @PanAndy @StephenRi @wwxFromTju)
- AReaL (cc @garrett4wade)
- SkyRL (cc @tyler-griggs @caoshiyi @lynnliu030 @DachengLi1)
We welcome folks to participate, and please feel free to let us know if there are other items to address.
cc @robertnishihara @SumanthRH @erictang000 @kouroshHakha @kevin85421 @stephanie-wang