
[Roadmap] llm-d 0.3 Release Plan (Target 8/31/25) #146

@robertgshaw2-redhat

Description


Following our 0.2 release, we are excited to continue making progress on our well-lit paths.


Themes - Areas of Focus

1. Commit to the mission

  • Expand hardware platform support
    • Accelerators - AMD and TPU
    • Networking - TCP and RDMA over RoCE
  • Respect our upstreams
    • Remove llm-d image (upstream all changes to vLLM)
    • Continue contributing generally useful features, such as precise prefix-cache-aware routing, to the upstream scheduler
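As a rough illustration of what precise prefix-cache-aware routing means, the sketch below scores each replica by how many of a prompt's token blocks it already holds in cache, and routes to the best match. The block size, hashing scheme, and function names are illustrative assumptions, not vLLM's or the scheduler's actual implementation:

```python
from hashlib import sha256

BLOCK = 16  # hypothetical token-block size; KV cache is keyed per block


def block_hashes(tokens: list[int]) -> list[str]:
    """Rolling hashes of the prompt's token blocks, mirroring how a
    prefix cache keys KV blocks (illustrative, not vLLM's exact scheme)."""
    hashes, parent = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        parent = sha256((parent + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(parent)
    return hashes


def prefix_score(prompt_tokens: list[int], cached: set[str]) -> float:
    """Fraction of the prompt's leading blocks already cached on a replica."""
    hashes = block_hashes(prompt_tokens)
    hit = 0
    for h in hashes:
        if h not in cached:
            break
        hit += 1
    return hit / max(len(hashes), 1)


def pick_replica(prompt_tokens: list[int], replicas: dict[str, set[str]]) -> str:
    """Route to the replica holding the longest matching cached prefix."""
    return max(replicas, key=lambda r: prefix_score(prompt_tokens, replicas[r]))
```

The "precise" part of the roadmap item is that the scheduler tracks actual per-replica cache contents (via events) rather than approximating them from routing history.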

2. "Brighten" the "well-lit" paths

  • Finalize "DeepSeek Inference System on K8s" story
    • Wide EP path to beta
    • Stabilized KVTransferParams
  • Bring "intelligent scheduling" to GA along with IGW
  • Intelligent scheduling reconciles demand against capacity and performance
    • Adaptive SLO targeting preview + alpha APIs in IGW

3. Build new "well-lit" paths

  • Prefix cache bigger than memory
    • GPU -> CPU offload
    • Integrate LMCache for local offloading

Well-Lit Paths

Intelligent Inference Scheduling

  • Objectives-based Scheduling

    • Flow-control and fairness (PRs)
    • API realignment: InferenceModel -> InferenceObjectives migration (#1199)
    • An initial SLO-based scheduling algorithm (proposal, #1161)
  • Pluggability enhancements

    • Data layer pluggability (proposal)
    • Another iteration on the config API, with the goal of improving the UX
  • Production readiness

    • Evaluating and recommending performant canned configurations for all well-lit paths
    • Scale testing
    • A recommended HA best practice for EPP deployment (#692)
    • Graduate InferencePool API to GA
    • Deprecate the upstream filter-based algorithm and migrate to a scoring-based approach
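The difference between the deprecated filter-based algorithm and a scoring-based approach can be sketched as follows: a filter eliminates pods outright, while scorers rank every candidate and compose via weights. The metric names and weighting scheme here are hypothetical illustrations, not the EPP's actual plugin API:

```python
from typing import Callable

# A scorer maps a candidate pod's state to [0, 1]; higher is better.
Scorer = Callable[[dict], float]


def queue_scorer(pod: dict) -> float:
    # Prefer replicas with shorter request queues (illustrative metric).
    return 1.0 / (1.0 + pod["queue_len"])


def kv_util_scorer(pod: dict) -> float:
    # Prefer replicas with lower KV-cache utilization.
    return 1.0 - pod["kv_util"]


def pick(pods: list[dict], scorers: list[tuple[Scorer, float]]) -> dict:
    """Scoring-based selection: every pod remains a candidate and the
    weighted sum of scorer outputs decides, instead of a filter chain
    eliminating pods outright before a final choice."""
    return max(pods, key=lambda p: sum(w * s(p) for s, w in scorers))
```

A practical upside of scoring over filtering is graceful degradation: when every pod is loaded, the scheduler still picks the least-bad one rather than filtering the candidate set down to empty.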

pd-disaggregation

  • KV Transfer Protocol

    • Switch to HTTP handshake
    • Delayed decode?
  • Monitoring

    • NIXL telemetry
    • Metrics exposed by vLLM to Prometheus
  • Hardware support

    • Exploration of TCP based transport
    • Exploration of NVL-72 (Resolve cuda_ipc issues with NIXL/ucx)
    • MI300X support and validation
    • TPU support and validation

wide-ep

  • Achieve DeepSeek inference system performance for Kimi / R1
    • Finalize Dual batch overlap implementation
    • NVIDIA B200 Performance validation
    • Finalize load balancing (either one-pod-per-rank or one-pod-per-node)

kv-cache management (new)

  • Working implementation of KV cache offloading
    • CPU KV Cache Offloading
    • Integration with approximate KV cache awareness
    • Integration with precise KV cache events
    • LMCache integration for <local SSD|something>

autoscaler (new)

  • Initial prototype of SLO-based autoscaling
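One way such a prototype could work, sketched under loose assumptions (the actual design is still an open item here): scale the replica count proportionally to the ratio of observed tail latency to the SLO target, clamped to sane bounds:

```python
import math


def desired_replicas(current: int, observed_p95_ms: float,
                     slo_p95_ms: float, max_replicas: int = 64) -> int:
    """Proportional scaling rule in the spirit of an SLO-based autoscaler
    (a hypothetical sketch, not the prototype's actual algorithm): if
    observed p95 latency is 2x the SLO, ask for 2x the replicas."""
    ratio = observed_p95_ms / slo_p95_ms
    return max(1, min(max_replicas, math.ceil(current * ratio)))
```

A real controller would additionally need smoothing and cooldowns to avoid flapping, since tail latency responds to scaling with a lag.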

Operations

infrastructure

  • Automated CI/CD
    • Well-lit path - intelligent scheduling
    • Well-lit path - p/d disagg
    • Well-lit path - wide ep

Benchmarking

  • Automate creation of the Pareto frontier for wide-ep / disagg cases
  • Blogs highlighting the impact of the three well-lit paths
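Producing the Pareto frontier from benchmark sweeps can be automated with a simple dominance filter. The sketch below assumes each run is summarized as a (throughput, latency) pair, where higher throughput and lower latency are better; the actual benchmark harness and metrics are not specified by this roadmap:

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep the non-dominated (throughput, latency) points: sort by
    throughput descending, then sweep, keeping each point that strictly
    improves on the best latency seen so far."""
    frontier: list[tuple[float, float]] = []
    best_latency = float("inf")
    for tput, lat in sorted(points, key=lambda p: (-p[0], p[1])):
        if lat < best_latency:
            frontier.append((tput, lat))
            best_latency = lat
    return sorted(frontier)
```

Running this over the sweep for each well-lit path configuration yields the curve that the planned blogs would plot.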

Upstreams

  • Push changes upstream
    • Move llm-d image to upstream vLLM
    • Settle on a process for llm-d leveraging GIE upstream image while still having “in development” prototypes
