
[Roadmap] llm-d 0.3 Release Plan (Target 8/31/25) #146

@robertgshaw2-redhat

Description


Following our 0.2 release, we are excited to continue making progress on our well-lit paths.


Themes - Areas of Focus

1. Commit to the mission

  • Expand hardware platform support
    • Accelerators - AMD and TPU
    • Networking - TCP and RDMA over RoCE
  • Respect our upstreams
    • Remove llm-d image (upstream all changes to vLLM)
    • Continue contributing generally useful features, such as precise prefix-cache-aware routing, to the upstream scheduler
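As a rough illustration of what precise prefix-cache-aware routing means, the sketch below scores each replica by how many of a prompt's token blocks it already holds in cache, and routes to the best match. The block size, hashing scheme, and function names are illustrative assumptions, not vLLM's or the scheduler's actual implementation:

```python
from hashlib import sha256

BLOCK = 16  # hypothetical token-block size; KV cache is keyed per block


def block_hashes(tokens: list[int]) -> list[str]:
    """Rolling hashes of the prompt's token blocks, mirroring how a
    prefix cache keys KV blocks (illustrative, not vLLM's exact scheme)."""
    hashes, parent = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        parent = sha256((parent + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(parent)
    return hashes


def prefix_score(prompt_tokens: list[int], cached: set[str]) -> float:
    """Fraction of the prompt's leading blocks already cached on a replica."""
    hashes = block_hashes(prompt_tokens)
    hit = 0
    for h in hashes:
        if h not in cached:
            break
        hit += 1
    return hit / max(len(hashes), 1)


def pick_replica(prompt_tokens: list[int], replicas: dict[str, set[str]]) -> str:
    """Route to the replica holding the longest matching cached prefix."""
    return max(replicas, key=lambda r: prefix_score(prompt_tokens, replicas[r]))
```

The "precise" part of the roadmap item is that the scheduler tracks actual per-replica cache contents (via events) rather than approximating them from routing history.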

2. "Brighten" the "well-lit" paths

  • Finalize "DeepSeek Inference System on K8s" story
    • Wide EP path to beta
    • Stabilized KVTransferParams
  • Bring "intelligent scheduling" to GA along with IGW
  • Intelligent scheduling reconciles demand against capacity and performance
    • Adaptive SLO targeting preview + alpha APIs in IGW

3. Build new "well-lit" paths

  • Prefix cache bigger than memory
    • GPU -> CPU offload
    • Integrate LMCache for local offloading

Well-Lit Paths

Intelligent Inference Scheduling

  • Objectives-based Scheduling

    • Flow-control and fairness (PRs)
    • API realignment: InferenceModel -> InferenceObjectives migration (#1199)
    • An initial SLO-based scheduling algorithm (proposal, #1161)
  • Pluggability enhancements

    • Data layer pluggability (proposal)
    • Another iteration on the config API, with the goal of improving the UX
  • Production readiness

    • Evaluating and recommending performant canned configurations for all well-lit paths
    • Scale testing
    • A recommended HA best practice for EPP deployment (#692)
    • Graduate InferencePool API to GA
    • Deprecate the upstream filter-based algorithm and migrate to a scoring-based approach
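The difference between the deprecated filter-based algorithm and a scoring-based approach can be sketched as follows: a filter eliminates pods outright, while scorers rank every candidate and compose via weights. The metric names and weighting scheme here are hypothetical illustrations, not the EPP's actual plugin API:

```python
from typing import Callable

# A scorer maps a candidate pod's state to [0, 1]; higher is better.
Scorer = Callable[[dict], float]


def queue_scorer(pod: dict) -> float:
    # Prefer replicas with shorter request queues (illustrative metric).
    return 1.0 / (1.0 + pod["queue_len"])


def kv_util_scorer(pod: dict) -> float:
    # Prefer replicas with lower KV-cache utilization.
    return 1.0 - pod["kv_util"]


def pick(pods: list[dict], scorers: list[tuple[Scorer, float]]) -> dict:
    """Scoring-based selection: every pod remains a candidate and the
    weighted sum of scorer outputs decides, instead of a filter chain
    eliminating pods outright before a final choice."""
    return max(pods, key=lambda p: sum(w * s(p) for s, w in scorers))
```

A practical upside of scoring over filtering is graceful degradation: when every pod is loaded, the scheduler still picks the least-bad one rather than filtering the candidate set down to empty.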

pd-disaggregation

  • KV Transfer Protocol

    • Switch to HTTP handshake
    • Delayed decode?
  • Monitoring

    • NIXL telemetry
    • Metrics exposed by vLLM to Prometheus
  • Hardware support

    • Exploration of TCP based transport
    • Exploration of NVL-72 (Resolve cuda_ipc issues with NIXL/ucx)
    • MI300X support and validation
    • TPU support and validation

wide-ep

  • Achieve DeepSeek inference system performance for Kimi / R1
    • Finalize Dual batch overlap implementation
    • NVIDIA B200 Performance validation
    • Finalize load balancing (either one-pod-per-rank or one-pod-per-node)

kv-cache management (new)

  • Working implementation of KV cache offloading
    • CPU KV Cache Offloading
    • Integration with approximate KV cache awareness
    • Integration with precise KV cache events
    • LMCache integration for <local SSD|something>

autoscaler (new)

  • Initial prototype of SLO-based autoscaling
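One way such a prototype could work, sketched under loose assumptions (the actual design is still an open item here): scale the replica count proportionally to the ratio of observed tail latency to the SLO target, clamped to sane bounds:

```python
import math


def desired_replicas(current: int, observed_p95_ms: float,
                     slo_p95_ms: float, max_replicas: int = 64) -> int:
    """Proportional scaling rule in the spirit of an SLO-based autoscaler
    (a hypothetical sketch, not the prototype's actual algorithm): if
    observed p95 latency is 2x the SLO, ask for 2x the replicas."""
    ratio = observed_p95_ms / slo_p95_ms
    return max(1, min(max_replicas, math.ceil(current * ratio)))
```

A real controller would additionally need smoothing and cooldowns to avoid flapping, since tail latency responds to scaling with a lag.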

Operations

infrastructure

  • Automated CI/CD
    • Well-lit path - intelligent scheduling
    • Well-lit path - p/d disagg
    • Well-lit path - wide ep

Benchmarking

  • Automate creation of the Pareto frontier for wide-ep / disagg cases
  • Blogs highlighting the impact of the three well-lit paths
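Producing the Pareto frontier from benchmark sweeps can be automated with a simple dominance filter. The sketch below assumes each run is summarized as a (throughput, latency) pair, where higher throughput and lower latency are better; the actual benchmark harness and metrics are not specified by this roadmap:

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep the non-dominated (throughput, latency) points: sort by
    throughput descending, then sweep, keeping each point that strictly
    improves on the best latency seen so far."""
    frontier: list[tuple[float, float]] = []
    best_latency = float("inf")
    for tput, lat in sorted(points, key=lambda p: (-p[0], p[1])):
        if lat < best_latency:
            frontier.append((tput, lat))
            best_latency = lat
    return sorted(frontier)
```

Running this over the sweep for each well-lit path configuration yields the curve that the planned blogs would plot.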

Upstreams

  • Push changes upstream
    • Move llm-d image to upstream vLLM
    • Settle on a process for llm-d leveraging GIE upstream image while still having “in development” prototypes
