[Feature] Generation Inputs: input_embeds #745

### Motivation

I propose adding `input_embeds` as an optional input to the generation parameters.

#### Why is this important

There are now many Vision Language Models (VLMs), and most of them share a similar architecture: a vision tower, a projector, and an LLM. In that design, the vision tower and projector only prepare the embeddings for the "image" tokens. So why not let model developers prepare `input_embeds` for the LLM themselves?
Many newer models, such as PaliGemma and Florence, let the user work with bounding boxes and segmentation masks, which makes it quite complicated to add all the different processors and conversation templates to the codebase.
By allowing the user to provide `input_embeds` instead of a list of messages or text prompts, you reduce your own headache in the future.
Another point is that VLM developers can focus on caching image embeddings while building on top of SGLang, allowing even higher throughput.
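
To make the idea concrete, here is a minimal sketch of what a model developer might do on their side before sending a request: splice projected image embeddings into the text token embeddings, and cache the image embeddings so a repeated image is only encoded once. Everything here is hypothetical and for illustration only: `IMAGE_TOKEN_ID`, `merge_image_embeds`, `ImageEmbedCache`, and the `encode` callable (which stands in for the vision tower plus projector) are not part of SGLang.

```python
import hashlib
from typing import Callable, Dict, List

Embedding = List[float]

# Hypothetical placeholder id marking the positions reserved for image tokens.
IMAGE_TOKEN_ID = -1


def merge_image_embeds(
    input_ids: List[int],
    token_embeds: List[Embedding],
    image_embeds: List[Embedding],
) -> List[Embedding]:
    """Replace each image-placeholder position with the next image embedding."""
    if len(input_ids) != len(token_embeds):
        raise ValueError("input_ids and token_embeds must have the same length")
    num_slots = sum(1 for tok in input_ids if tok == IMAGE_TOKEN_ID)
    if num_slots != len(image_embeds):
        raise ValueError("number of image placeholders must match image_embeds")
    image_iter = iter(image_embeds)
    return [
        next(image_iter) if tok == IMAGE_TOKEN_ID else emb
        for tok, emb in zip(input_ids, token_embeds)
    ]


class ImageEmbedCache:
    """Cache projected image embeddings keyed by a hash of the raw image bytes."""

    def __init__(self, encode: Callable[[bytes], List[Embedding]]):
        # `encode` is an assumption: it stands in for vision tower + projector.
        self._encode = encode
        self._store: Dict[str, List[Embedding]] = {}
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[Embedding]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(image_bytes)
        return self._store[key]
```

The merged list has the same nested-list shape as the proposed `input_embeds` field, so it could be passed to the engine directly once such an input is accepted.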

vLLM users requested this feature a long time ago, and the topic gained a lot of positive attention from the community:
- https://github.com/vllm-project/vllm/pull/1265

This feature would help make SGLang the main framework for all VLMs.

I am happy to help implement this if you can point me to the right places in the codebase. Thank you for your time and consideration 🤗

#### Proposed usages

```python
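# Proposed usage (not an existing parameter): pass precomputed embeddings
# in place of `messages`. `client` is assumed to be an OpenAI-compatible client.
response = client.chat.completions.create(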
    model="default",
    input_embeds=[...],
    temperature=0.8,
    max_tokens=64,
)
```
```python
# Proposed usage: `backend` is assumed to be an SGLang runtime backend object.
backend.run(input_embeds=input_embeds)
```
```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can either specify text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The embeddings for input_ids; if specified, input_ids should also be provided
    input_embeds: Optional[Union[List[List[List[float]]], List[List[float]]]] = None
    # The image input. It can be a file name, a url, or base64 encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params.
    sampling_params: Optional[Union[List[Dict], Dict]] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # The start location of the prompt for return_logprob.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # The number of top logprobs to return.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
```
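
Since the proposed `input_embeds` field accepts either a single sequence (a 2D list) or a batch (a 3D list), the server would need to tell the two apart and check that the shapes are consistent. A rough sketch of such a check is below. `normalize_input_embeds` is a hypothetical helper written for this issue, not an existing SGLang function.

```python
from typing import List, Union

Sequence2D = List[List[float]]


def normalize_input_embeds(
    input_embeds: Union[List[Sequence2D], Sequence2D],
) -> List[Sequence2D]:
    """Return a batch (3D list) whether a single sequence or a batch was given."""
    if not input_embeds or not input_embeds[0]:
        raise ValueError("input_embeds must be non-empty")
    # A single sequence holds numbers at depth 2; a batch holds lists there.
    is_batch = isinstance(input_embeds[0][0], list)
    batch = input_embeds if is_batch else [input_embeds]
    hidden_size = len(batch[0][0])
    for seq in batch:
        if not seq:
            raise ValueError("each sequence must contain at least one embedding")
        if any(len(vec) != hidden_size for vec in seq):
            raise ValueError("all embedding vectors must share one hidden size")
    return batch
```

A real implementation would probably also compare each sequence length against the matching `input_ids` entry and the hidden size against the model config, but those details depend on the codebase.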



### Related resources

- https://github.com/vllm-project/vllm/issues/416
- https://github.com/vllm-project/vllm/pull/1265

