Here is the development roadmap for 2024 Q3. Contributions and feedback are welcome.
Server API
- Add APIs for using the inference engine in a single script without launching a separate server. See also examples.
- Support most OpenAI APIs: Batch, completion, chat, embedding
- Support directly taking embedding as inputs. [Feature] Generation Inputs: input_embeds #745
- Support updating the model weights without relaunching the server. @shanyu-sys
- Support Mistral endpoint in the language frontend
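For the "embedding as inputs" item above (#745), a minimal sketch of what the request shape could look like: a hypothetical `GenerateRequest` struct (not the actual SGLang io struct) that accepts either tokenizable text or precomputed per-position embeddings, but never both.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GenerateRequest:
    """Hypothetical request struct: exactly one of `text` or `input_embeds` is set."""
    text: Optional[str] = None
    # One embedding vector per input position, bypassing the tokenizer.
    input_embeds: Optional[List[List[float]]] = None

    def __post_init__(self):
        if (self.text is None) == (self.input_embeds is None):
            raise ValueError("provide exactly one of `text` or `input_embeds`")

# A request carrying two input positions of dimension-2 embeddings.
req = GenerateRequest(input_embeds=[[0.1, 0.2], [0.3, 0.4]])
```

The mutual-exclusion check keeps the server-side scheduling path unambiguous: a request is either tokenized first or fed straight to the embedding layer.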
Performance
- Improve time-to-first-token in streaming mode with better scheduling.
- Implement chunked prefill. @hnyls2002 @vikranth22446
- Implement speculative decoding. See also a prototype.
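The core idea of chunked prefill can be sketched in a few lines: split a long prompt into fixed-size chunks so the scheduler can interleave prefill work with decode steps of other requests, improving time-to-first-token for everyone. The chunk size here is an illustrative tuning knob, not SGLang's actual parameter.

```python
def chunk_prefill(prompt_ids, chunk_size):
    """Split a long prompt into fixed-size chunks.

    Simplified sketch: instead of prefilling all prompt tokens in one batch
    (which blocks other requests), the scheduler runs one chunk per step and
    interleaves decode steps of other requests in between.
    """
    return [prompt_ids[i:i + chunk_size] for i in range(0, len(prompt_ids), chunk_size)]

chunks = chunk_prefill(list(range(10)), 4)  # last chunk may be shorter
```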
Parallelism
- Support sequence parallelism for long context models.
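A hedged sketch of the partitioning step behind sequence parallelism: each worker holds a contiguous shard of a long sequence. Real implementations (e.g. ring attention) also exchange KV or attention statistics across ranks; only the sharding arithmetic is shown here.

```python
def shard_sequence(tokens, world_size, rank):
    """Contiguous, nearly-even sharding of a sequence across `world_size` workers.

    Simplified sketch of the data layout for sequence parallelism; the attention
    communication between shards is omitted.
    """
    base, rem = divmod(len(tokens), world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return tokens[start:end]

parts = [shard_sequence(list(range(11)), 4, r) for r in range(4)]
```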
Quantization
- Support W8A16, W4A16 weight-only integer quantization. @zhyncs
- Support W4A8 quantization with fp8 activation and int4 weight.
- Support fp8/fp4 KV cache quantization; int4/int8 is low priority. We currently support fp8 e5m2 and should also support fp8 e4m3. @ispobock
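To make the W8A16 item concrete, a NumPy sketch of weight-only int8 quantization: weights are stored as per-output-channel symmetric int8, activations stay in higher precision, and the weights are dequantized at matmul time. Production kernels fuse the dequantization into the GEMM; this is only the numerics.

```python
import numpy as np

def quantize_w8(w):
    """Per-output-channel symmetric int8 quantization (W8A16 sketch)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a16_matmul(x, q, scale):
    # Dequantize on the fly; a real kernel fuses this into the GEMM.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal((2, 16)).astype(np.float32)
q, s = quantize_w8(w)
err = np.abs(w8a16_matmul(x, q, s) - x @ w.T).max()  # small quantization error
```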
Observability
- Integrate Grafana / Prometheus
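The Prometheus side of the integration boils down to exposing metrics in the text exposition format, which Grafana then visualizes. A stdlib-only sketch of that format is below; an actual integration would use the `prometheus_client` library, and the metric name shown is hypothetical.

```python
def prometheus_text(counters):
    """Render counters in Prometheus text exposition format.

    Simplified sketch; a real integration would register Counter/Histogram
    objects with prometheus_client and serve them on a /metrics endpoint.
    """
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

text = prometheus_text({"sglang_requests_total": 42})  # hypothetical metric name
```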
Model Coverage
- Support interleaved window attention (gemma). @Ying1123
- [Feat] Add window attention for gemma-2 #1056
- [Fix] Compatibility of window attention and cuda graph #1090
- [Fix] Window attention compatible with RadixAttention and chunked prefill #1112
- Save memory from interleaved attention #1151 (delayed to Q4, which is dependent on new memory manager)
- Language Models
- Vision Language Models
- Embedding models
- Add llama embedding modules - step 1/3 #983
- Add io struct for embedding models - step 2/3 #987
- Add e5-mistral embedding model - step 3/3 #988
- Add openai embedding API #997
- Support embedding input as a list #1014
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model #1186
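For the interleaved window attention items above, a small sketch of the masking logic: a sliding-window layer lets each query attend only to the last `window` positions, while a global layer uses the full causal mask. Gemma-2 alternates the two across layers, which is what makes it "interleaved".

```python
def attention_mask(seq_len, window=None):
    """Causal attention mask; with `window` set, keys more than window-1
    positions back are also masked out (sliding-window attention).

    Simplified sketch: a gemma-2-style model alternates window=None (global)
    and window=W layers.
    """
    return [
        [1 if j <= i and (window is None or i - j < window) else 0
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

m = attention_mask(5, window=2)  # each row attends to at most 2 positions
```

The interaction with RadixAttention, CUDA graphs, and chunked prefill (#1090, #1112) comes from the fact that the windowed layers keep a different effective KV range than the global ones.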
Hardware Coverage
Language API
- Function calling. Add a `tools` argument in `sgl.gen`. See also guidance and Claudette. For OpenAI models, we can translate to their function calling API (https://platform.openai.com/docs/guides/function-calling). For local models, we can use SGLang primitives (regex, select) and constrained decoding to implement a similar workflow, or interrupt the decoding process and replace it with function calls. @Yiyun-Liang
- Support sending a full serialized SGL program to the server.
- Constrained decoding
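One way the regex-based approach to local function calling could work, as a sketch: build a regex from the declared tool names and use it to constrain decoding so the model can only emit a well-formed call to a known tool. The pattern shape here is hypothetical, not SGLang's actual grammar.

```python
import json
import re

def tool_call_regex(tool_names):
    """Regex constraining output to a JSON tool call with a declared name.

    Hypothetical sketch: such a pattern could be passed to constrained
    decoding (e.g. a regex argument to sgl.gen) so only valid calls decode.
    """
    names = "|".join(re.escape(n) for n in tool_names)
    return r'\{"name": "(?:%s)", "arguments": \{.*\}\}' % names

pattern = tool_call_regex(["get_weather", "search"])
out = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(out)  # constrained output parses as JSON
```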
LoRA Support
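The math behind LoRA support fits in a few lines: a frozen weight W gets a trainable low-rank update (alpha/r) * B @ A, so serving only needs to store small A/B pairs per adapter. A NumPy sketch, with illustrative shapes:

```python
import numpy as np

def lora_forward(x, w, a, b, alpha):
    """y = x @ (W + (alpha/r) B A)^T  --  LoRA adds a rank-r update B @ A
    on top of the frozen weight W; only A and B are trained (sketch)."""
    r = a.shape[0]
    delta = (alpha / r) * (b @ a)  # (out_dim, in_dim) rank-r update
    return x @ (w + delta).T

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))   # frozen base weight
a = rng.standard_normal((2, 8))   # rank-2 down-projection
b = np.zeros((4, 2))              # B starts at zero (standard LoRA init),
x = rng.standard_normal((3, 8))   # so the adapter begins as a no-op
```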
Usage examples
- Add more usage examples (e.g., parallel JSON decoding, auto parallel decoding, Self-Discover: Large Language Models Self-Compose Reasoning Structures).
Others
- Setup CI. @zhyncs @hnyls2002 @merrymercy @Ying1123
- Documentation website.
- Compiler mode optimizations for the language. (Delayed to Q4)