Here is the development roadmap for 2024 Q3. Contributions and feedback are welcome.
Server API
- Add APIs for using the inference engine in a single script without launching a separate server. See also examples.
- Support most OpenAI APIs: Batch, completion, chat, embedding
- Support directly taking embedding as inputs. [Feature] Generation Inputs: input_embeds #745
- Support updating the model weights without relaunching the server. @shanyu-sys
- Support Mistral endpoint in the language frontend
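For the "embedding as inputs" item above (#745), a minimal sketch of what the request shape could look like: a hypothetical `GenerateRequest` struct (not the actual SGLang io struct) that accepts either tokenizable text or precomputed per-position embeddings, but never both.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GenerateRequest:
    """Hypothetical request struct: exactly one of `text` or `input_embeds` is set."""
    text: Optional[str] = None
    # One embedding vector per input position, bypassing the tokenizer.
    input_embeds: Optional[List[List[float]]] = None

    def __post_init__(self):
        if (self.text is None) == (self.input_embeds is None):
            raise ValueError("provide exactly one of `text` or `input_embeds`")

# A request carrying two input positions of dimension-2 embeddings.
req = GenerateRequest(input_embeds=[[0.1, 0.2], [0.3, 0.4]])
```

The mutual-exclusion check keeps the server-side scheduling path unambiguous: a request is either tokenized first or fed straight to the embedding layer.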
Performance
- Improve time-to-first-token in streaming mode with better scheduling.
- Implement chunked prefill. @hnyls2002 @vikranth22446
- Implement speculative decoding. See also a prototype.
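The core idea of chunked prefill can be sketched in a few lines: split a long prompt into fixed-size chunks so the scheduler can interleave prefill work with decode steps of other requests, improving time-to-first-token for everyone. The chunk size here is an illustrative tuning knob, not SGLang's actual parameter.

```python
def chunk_prefill(prompt_ids, chunk_size):
    """Split a long prompt into fixed-size chunks.

    Simplified sketch: instead of prefilling all prompt tokens in one batch
    (which blocks other requests), the scheduler runs one chunk per step and
    interleaves decode steps of other requests in between.
    """
    return [prompt_ids[i:i + chunk_size] for i in range(0, len(prompt_ids), chunk_size)]

chunks = chunk_prefill(list(range(10)), 4)  # last chunk may be shorter
```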
Parallelism
- Support sequence parallelism for long context models.
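A hedged sketch of the partitioning step behind sequence parallelism: each worker holds a contiguous shard of a long sequence. Real implementations (e.g. ring attention) also exchange KV or attention statistics across ranks; only the sharding arithmetic is shown here.

```python
def shard_sequence(tokens, world_size, rank):
    """Contiguous, nearly-even sharding of a sequence across `world_size` workers.

    Simplified sketch of the data layout for sequence parallelism; the attention
    communication between shards is omitted.
    """
    base, rem = divmod(len(tokens), world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return tokens[start:end]

parts = [shard_sequence(list(range(11)), 4, r) for r in range(4)]
```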
Quantization
- Support W8A16, W4A16 weight-only integer quantization. @zhyncs
- Support W4A8 quantization with fp8 activation and int4 weight.
- Support fp8/fp4 KV cache quantization; int4/int8 is low priority. We currently support fp8 e5m2 and should also support fp8 e4m3. @ispobock
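To make the W8A16 item concrete, a NumPy sketch of weight-only int8 quantization: weights are stored as per-output-channel symmetric int8, activations stay in higher precision, and the weights are dequantized at matmul time. Production kernels fuse the dequantization into the GEMM; this is only the numerics.

```python
import numpy as np

def quantize_w8(w):
    """Per-output-channel symmetric int8 quantization (W8A16 sketch)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a16_matmul(x, q, scale):
    # Dequantize on the fly; a real kernel fuses this into the GEMM.
    return x @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal((2, 16)).astype(np.float32)
q, s = quantize_w8(w)
err = np.abs(w8a16_matmul(x, q, s) - x @ w.T).max()  # small quantization error
```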
Observability
- Integrate Grafana / Prometheus
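The Prometheus side of the integration boils down to exposing metrics in the text exposition format, which Grafana then visualizes. A stdlib-only sketch of that format is below; an actual integration would use the `prometheus_client` library, and the metric name shown is hypothetical.

```python
def prometheus_text(counters):
    """Render counters in Prometheus text exposition format.

    Simplified sketch; a real integration would register Counter/Histogram
    objects with prometheus_client and serve them on a /metrics endpoint.
    """
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

text = prometheus_text({"sglang_requests_total": 42})  # hypothetical metric name
```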
Model Coverage
- Support interleaved window attention (gemma). @Ying1123
- [Feat] Add window attention for gemma-2 #1056
- [Fix] Compatibility of window attention and cuda graph #1090
- [Fix] Window attention compatible with RadixAttention and chunked prefill #1112
- Save memory from interleaved attention #1151 (delayed to Q4, which is dependent on new memory manager)
- Language Models
- Vision Language Models
- Embedding models
- Add llama embedding modules - step 1/3 #983
- Add io struct for embedding models - step 2/3 #987
- Add e5-mistral embedding model - step 3/3 #988
- Add openai embedding API #997
- Support embedding input as a list #1014
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model #1186
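For the interleaved window attention items above, a small sketch of the masking logic: a sliding-window layer lets each query attend only to the last `window` positions, while a global layer uses the full causal mask. Gemma-2 alternates the two across layers, which is what makes it "interleaved".

```python
def attention_mask(seq_len, window=None):
    """Causal attention mask; with `window` set, keys more than window-1
    positions back are also masked out (sliding-window attention).

    Simplified sketch: a gemma-2-style model alternates window=None (global)
    and window=W layers.
    """
    return [
        [1 if j <= i and (window is None or i - j < window) else 0
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

m = attention_mask(5, window=2)  # each row attends to at most 2 positions
```

The interaction with RadixAttention, CUDA graphs, and chunked prefill (#1090, #1112) comes from the fact that the windowed layers keep a different effective KV range than the global ones.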
Hardware Coverage
Language API
- Function calling. Add a `tools` argument in `sgl.gen`. See also guidance and Claudette. For OpenAI models, we can translate to their function calling API (https://platform.openai.com/docs/guides/function-calling). For local models, we can use SGLang primitives (regex, select) and constrained decoding to implement a similar workflow, or interrupt the decoding process and replace it with function calls. @Yiyun-Liang
- Support sending a full serialized SGL program to the server.
- Constrained decoding
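One way the regex-based approach to local function calling could work, as a sketch: build a regex from the declared tool names and use it to constrain decoding so the model can only emit a well-formed call to a known tool. The pattern shape here is hypothetical, not SGLang's actual grammar.

```python
import json
import re

def tool_call_regex(tool_names):
    """Regex constraining output to a JSON tool call with a declared name.

    Hypothetical sketch: such a pattern could be passed to constrained
    decoding (e.g. a regex argument to sgl.gen) so only valid calls decode.
    """
    names = "|".join(re.escape(n) for n in tool_names)
    return r'\{"name": "(?:%s)", "arguments": \{.*\}\}' % names

pattern = tool_call_regex(["get_weather", "search"])
out = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(out)  # constrained output parses as JSON
```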
LoRA Support
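The math behind LoRA support fits in a few lines: a frozen weight W gets a trainable low-rank update (alpha/r) * B @ A, so serving only needs to store small A/B pairs per adapter. A NumPy sketch, with illustrative shapes:

```python
import numpy as np

def lora_forward(x, w, a, b, alpha):
    """y = x @ (W + (alpha/r) B A)^T  --  LoRA adds a rank-r update B @ A
    on top of the frozen weight W; only A and B are trained (sketch)."""
    r = a.shape[0]
    delta = (alpha / r) * (b @ a)  # (out_dim, in_dim) rank-r update
    return x @ (w + delta).T

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))   # frozen base weight
a = rng.standard_normal((2, 8))   # rank-2 down-projection
b = np.zeros((4, 2))              # B starts at zero (standard LoRA init),
x = rng.standard_normal((3, 8))   # so the adapter begins as a no-op
```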
Usage examples
- Add more usage examples (e.g., parallel JSON decoding, auto parallel decoding, Self-Discover: Large Language Models Self-Compose Reasoning Structures).
Others
- Setup CI. @zhyncs @hnyls2002 @merrymercy @Ying1123
- Documentation website.
- Compiler mode optimizations for the language. (Delayed to Q4)