[Roadmap] Ray Q3 2025 #54923

@cszhu

Hello everyone! 👋 I'm excited to share what we have planned for Ray in Q3 2025. I will try to keep this updated as features are merged and rolled out.

Goal: Deliver foundational reliability, performance, and DX improvements across Ray Core, Data, Train, LLM, Serve, RL, Observability, Technical Content, and KubeRay.

Ray Core

Reliability & Fault Tolerance

  • Improve system stability under node and network failures
    • This includes making RPCs tolerant to transient errors
  • Add robust support for preemptible instances
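
To illustrate what "tolerant to transient errors" can mean for an RPC caller, here is a minimal sketch of retrying with capped exponential backoff and jitter. It is not Ray's implementation: `TransientError`, `call_with_retries`, and `make_flaky` are hypothetical names introduced only for this example, and the delay values are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call fn(), retrying on TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Double the delay on each attempt, cap it, then add jitter so
            # many callers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


def make_flaky(failures):
    """Return a callable that fails `failures` times, then returns its call count."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated network blip")
        return state["calls"]

    return flaky


if __name__ == "__main__":
    # Two simulated failures, then success on the third call.
    print(call_with_retries(make_flaky(2)))
```

Only errors explicitly classified as transient are retried here; anything else propagates immediately, which keeps real bugs from being hidden behind retries.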

Scheduling & Performance

Developer Experience

  • Introduce ActorMesh for simplified interaction with groups of actors ([WIP] [Prototype] ActorMesh API #54760)
  • Improve static typing across the codebase to enhance developer productivity
  • Address outstanding technical debt in core worker components

Ecosystem Integrations

Ray Data

Reliability

  • Ensure workloads complete successfully despite cluster failures

Performance

  • Enhance training ingest pipelines with advanced sampling and caching support

Connectors

Usability

  • Schema UDFs
  • Enhanced internal query planning

Ray Train

API

  • Finalize Train v2 API

Performance

  • Implement asynchronous checkpointing
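
As a rough picture of the idea behind asynchronous checkpointing, the sketch below snapshots state on the training thread and writes it to disk in a background thread, so the training loop is not blocked on I/O. It uses only the Python standard library and is not the Ray Train API: `AsyncCheckpointer` and `run_demo` are hypothetical names, and JSON stands in for a real model format.

```python
import copy
import json
import os
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot on the caller's thread, write in the background."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = None

    def save(self, step, state):
        self.wait()  # allow at most one write in flight
        snapshot = copy.deepcopy(state)  # later training updates cannot leak in
        path = os.path.join(self.directory, f"step_{step}.json")
        self._pending = threading.Thread(target=self._write, args=(path, snapshot))
        self._pending.start()
        return path

    @staticmethod
    def _write(path, snapshot):
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)  # atomic rename: readers never see a partial file

    def wait(self):
        """Block until the in-flight write, if any, has finished."""
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def run_demo():
    with tempfile.TemporaryDirectory() as d:
        ckpt = AsyncCheckpointer(d)
        state = {"weights": [0.0, 0.0]}
        path = ckpt.save(1, state)
        state["weights"][0] = 99.0  # training continues while the write runs
        ckpt.wait()
        with open(path) as f:
            return json.load(f)


if __name__ == "__main__":
    print(run_demo())
```

The copy is the part that still costs time on the training thread; this sketch also does not surface write errors from the background thread, which a real implementation would need to do.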

LLM

Goal: Run large models (e.g. DeepSeek) at scale via vLLM on Ray Serve:

  • Prefill disaggregation
  • Large-scale DP
  • Custom request routing
  • Elastic expert parallelism

Performance & Efficiency

  • Implement prefill disaggregation to optimize performance for large context models.
  • Develop an intelligent, KV cache-aware router with a pluggable architecture
  • Implement Data Parallel (DP) Attention within Ray Serve
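
One way to picture a KV cache-aware router with a pluggable architecture is a router that takes a scoring function as a parameter. The sketch below is an assumption-laden toy, not the Ray Serve design: `Replica`, `Router`, `prefix_affinity`, and `least_loaded` are hypothetical names, requests are plain lists of token IDs, and the 0.5 load penalty is arbitrary.

```python
from typing import Callable, List, Sequence


class Replica:
    """Hypothetical stand-in for a model replica and what its KV cache holds."""

    def __init__(self, name: str):
        self.name = name
        self.cached_prefixes: List[List[int]] = []
        self.in_flight = 0


# A policy scores one replica for one request; higher is better.
Policy = Callable[[Sequence[int], Replica], float]


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_affinity(tokens: Sequence[int], replica: Replica) -> float:
    """Prefer replicas that likely hold a matching KV prefix, minus a load penalty."""
    best = max((shared_prefix_len(tokens, p) for p in replica.cached_prefixes),
               default=0)
    return best - 0.5 * replica.in_flight


def least_loaded(tokens: Sequence[int], replica: Replica) -> float:
    """Alternative policy that ignores cache contents entirely."""
    return -float(replica.in_flight)


class Router:
    def __init__(self, replicas: Sequence[Replica], policy: Policy = prefix_affinity):
        self.replicas = list(replicas)
        self.policy = policy  # the pluggable part: any Policy can be swapped in

    def route(self, tokens: Sequence[int]) -> Replica:
        chosen = max(self.replicas, key=lambda r: self.policy(tokens, r))
        # Simplification: assume the prompt is cached after routing, and never
        # decrement in_flight (a real router would do so on completion).
        chosen.cached_prefixes.append(list(tokens))
        chosen.in_flight += 1
        return chosen


if __name__ == "__main__":
    router = Router([Replica("a"), Replica("b")])
    for prompt in ([1, 2, 3, 4], [1, 2, 3, 9], [7, 8]):
        print(prompt, "->", router.route(prompt).name)
```

Keeping the policy as a plain function means a different routing strategy can be tried by passing `policy=least_loaded` without touching the router itself.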

Operations

  • Publish updated performance benchmarks

Ecosystem

  • Support SkyRL for reinforcement learning from human feedback (RLHF) workloads

Ray Serve

Serving Flexibility

  • Custom autoscaling and routing patterns
  • Async inference support
  • MCP server patterns
  • Integrate label-based scheduling

Observability

  • Enhanced tracing support

RLlib

  • Ray RL V2 stack GA
  • Algorithm composability enhancements

Observability

API Release

  • Public launch of unified event export API

Optimization

  • Refactor internals to leverage new export API

Technical Content

  • New technical templates
  • More examples & deep‑dives

KubeRay

Upgrades

  • Productionize the incremental upgrade feature for seamless cluster updates

Hardware Support

  • Streamline support for diverse accelerators, including multiple GPU types, Dynamic Resource Allocation (DRA), and MIG

Autoscaling

  • Continue to improve the functionality and reliability of Autoscaler V2

We love hearing from the community! If there is a feature you'd like to see in Ray, let us know by filing a feature request or commenting here. Thank you for supporting Ray!
