Following our 0.2 release, we are excited to continue making progress on our well-lit paths.
Themes - Areas of Focus
1. Commit to the mission
- Expand hardware platform support
- Accelerators - AMD and TPU
- Networking - TCP and RDMA over RoCE
- Respect our upstreams
- Remove llm-d image (upstream all changes to vLLM)
- Continue upstreaming generally useful features to the upstream scheduler, such as precise prefix-cache-aware routing
2. "Brighten" the "well-lit" paths
- Finalize "DeepSeek Inference System on K8s" story
- Wide EP path to beta
- Stabilized KVTransferParams
- Bring "intelligent scheduling" to GA along with IGW
- Intelligent scheduler reconciles demand with capacity/performance
- Adaptive SLO targeting preview + alpha APIs in IGW
3. Build new "well-lit" paths
- Prefix cache bigger than memory
- GPU -> CPU offload
- Integrate LMCache for local offloading
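Precise prefix-cache-aware routing (mentioned above as a feature to upstream) can be sketched roughly as follows. This is an illustrative, hypothetical implementation, not the actual llm-d or vLLM scheduler code: endpoints are assumed to report the hashes of the KV-cache blocks they hold, and the router scores each endpoint by how many leading blocks of the incoming request are already cached there.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value)

def block_hashes(tokens):
    """Chain-hash fixed-size token blocks so each hash identifies a full prefix."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        prev = sha256(prev + str(tokens[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

def prefix_score(request_tokens, endpoint_blocks):
    """Count how many leading request blocks are already cached on the endpoint."""
    score = 0
    for h in block_hashes(request_tokens):
        if h not in endpoint_blocks:
            break  # prefix match ends at the first missing block
        score += 1
    return score

def pick_endpoint(request_tokens, endpoints):
    """Route to the endpoint with the longest cached prefix (hypothetical router)."""
    return max(endpoints, key=lambda name: prefix_score(request_tokens, endpoints[name]))
```

Chain-hashing means a block hash only matches when the entire prefix up to that block matches, which is what makes the routing "precise" rather than approximate.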
Well-Lit Paths
Intelligent Inference Scheduling
Objectives-based Scheduling
Pluggability enhancements
- Data layer pluggability (proposal)
- Another iteration on the config API, with the goal of improving the UX
Production readiness
- Evaluating and recommending performant canned configurations for all well-lit paths
- Scale testing
- A recommended HA best practice for EPP deployment (#692)
- Graduate InferencePool API to GA
- Deprecate the upstream filter-based algorithm and migrate to a scoring-based approach
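The migration from filter-based to scoring-based selection could look roughly like the sketch below. All names here are hypothetical, not the actual IGW/EPP plugin API: instead of filters that hard-exclude endpoints, each scorer returns a value in [0, 1] and the scheduler combines them with configurable weights.

```python
def queue_scorer(ep):
    # Fewer queued requests -> higher score (illustrative formula).
    return 1.0 / (1.0 + ep["queue_len"])

def kv_scorer(ep):
    # More free KV-cache capacity -> higher score.
    return 1.0 - ep["kv_usage"]

# (scorer, weight) pairs; weights would come from the config API in practice.
SCORERS = [(queue_scorer, 2.0), (kv_scorer, 1.0)]

def pick(endpoints):
    """Select the endpoint with the highest weighted total score."""
    def total(ep):
        return sum(weight * scorer(ep) for scorer, weight in SCORERS)
    return max(endpoints, key=total)
```

A scoring approach degrades gracefully where a filter chain cannot: an endpoint that is weak on one dimension can still win on its combined score rather than being dropped outright.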
pd-disaggregation
KV Transfer Protocol
- Switch to HTTP handshake
- Delayed decode?
Monitoring
- NIXL telemetry
- Metrics exposed by vLLM to prometheus
Hardware support
- Exploration of TCP based transport
- Exploration of NVL-72 (Resolve cuda_ipc issues with NIXL/ucx)
- MI300X support and validation
- TPU support and validation
wide-ep
- Achieve DeepSeek inference system performance for Kimi / R1
- Finalize Dual batch overlap implementation
- NVIDIA B200 Performance validation
- Finalize load balancing (either one-pod-per-rank or one-pod-per-node)
kv-cache management (new)
- Working implementation of KV cache offloading
- CPU KV Cache Offloading
- Integration with approximate KV cache awareness
- Integration with precise KV cache events
- LMCache integration for <local SSD|something>
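The GPU-to-CPU offloading idea above can be sketched as a two-tier cache. This is a minimal illustrative model (plain Python dictionaries standing in for device and host memory, not real CUDA transfers or the LMCache API): blocks evicted from the small "GPU" tier are offloaded to the larger "CPU" tier instead of being dropped, so the effective prefix cache is bigger than device memory.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: small GPU tier, larger CPU tier."""

    def __init__(self, gpu_blocks, cpu_blocks):
        self.gpu = OrderedDict()  # insertion/access order tracks LRU
        self.cpu = OrderedDict()
        self.gpu_cap, self.cpu_cap = gpu_blocks, cpu_blocks

    def put(self, block_id, data):
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_cap:
            victim, vdata = self.gpu.popitem(last=False)  # LRU victim
            self.cpu[victim] = vdata                      # offload, don't drop
            while len(self.cpu) > self.cpu_cap:
                self.cpu.popitem(last=False)              # CPU tier is LRU too

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:          # CPU hit: promote back to GPU
            data = self.cpu.pop(block_id)
            self.put(block_id, data)
            return data
        return None                       # full miss: KV must be recomputed
```

A CPU hit still costs a host-to-device copy, but that is typically far cheaper than recomputing the prefill for the evicted prefix.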
autoscaler (new)
- Initial prototype of SLO-based autoscaling
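One simple shape such a prototype could take is a proportional rule on tail latency. This is only a sketch under assumed semantics, not the planned llm-d autoscaler algorithm: replicas scale by the ratio of observed p95 latency to the SLO target, clamped to configured bounds.

```python
import math

def desired_replicas(current, observed_p95_ms, slo_p95_ms, min_r=1, max_r=32):
    """Proportional SLO-based scaling rule (illustrative).

    If observed latency is 1.5x the target, ask for 1.5x the replicas;
    if it is half the target, scale down accordingly.
    """
    target = current * (observed_p95_ms / slo_p95_ms)
    return max(min_r, min(max_r, math.ceil(target)))
```

A real controller would also need smoothing, cooldown windows, and awareness that LLM latency does not scale linearly with replica count, which is exactly what makes this an interesting prototype area.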
Operations
infrastructure
- Automated CI/CD
- Well-lit path - intelligent scheduling
- Well-lit path - p/d disagg
- Well-lit path - wide ep
Benchmarking
- Automate creation of pareto frontier for wide-ep / disagg cases
- Blogs highlighting impact of 3 well lit paths
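Automating the Pareto frontier comes down to filtering dominated configurations out of a benchmark sweep. A minimal sketch, assuming each run is summarized as a (throughput, latency) point where higher throughput and lower latency are better:

```python
def pareto_frontier(points):
    """Return the (throughput, latency) points not dominated by any other.

    A point is dominated if another point has throughput at least as high
    and latency strictly lower.
    """
    frontier = []
    best_latency = float("inf")
    # Sort by throughput descending, latency ascending: a point is on the
    # frontier iff its latency beats every higher-throughput point seen so far.
    for tput, lat in sorted(points, key=lambda p: (-p[0], p[1])):
        if lat < best_latency:
            frontier.append((tput, lat))
            best_latency = lat
    return sorted(frontier)
```

Running this over sweeps of the wide-EP and P/D-disaggregation configurations would directly produce the frontier curves for the blogs mentioned above.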
Upstreams
- Push changes upstream
- Move llm-d image to upstream vllm
- Settle on a process for llm-d leveraging GIE upstream image while still having “in development” prototypes