Feature request
vLLM's max_model_len should be set to the sum of max_prompt_len and max_completion_length whenever the user provides both parameters.
Motivation
max_model_len should be set to the sum of max_prompt_len and max_completion_length. Otherwise, an unnecessarily large context window can hurt performance.
This parameter determines the maximum sequence length that vLLM can handle for a single request. For example, with max_model_len = 2048, vLLM must be able to hold up to 2048 key/value cache entries per sequence; raising it to 4096 doubles that worst-case footprint and significantly increases GPU memory consumption.
Even if the actual prompts are short (bounded by max_prompt_len), vLLM sizes its GPU memory reservations around max_model_len to accommodate the worst case. A higher max_model_len therefore reduces the number of concurrent requests that fit in memory, hurting throughput; a back-of-the-envelope sketch follows below.
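To make the memory argument concrete, here is an illustrative calculation for a hypothetical 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16). All model dimensions are assumptions chosen only to show how the per-sequence ceiling scales with max_model_len, not measurements from vLLM:

```python
# Illustrative KV-cache arithmetic for a hypothetical 7B-class model
# (32 layers, 32 KV heads, head_dim 128, fp16). Numbers are assumptions,
# chosen only to show how the per-sequence ceiling scales with max_model_len.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

# One key plus one value vector per head, per layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 512 KiB


def worst_case_kv_gib(max_model_len: int) -> float:
    """Upper bound on the KV-cache memory one sequence can consume, in GiB."""
    return max_model_len * kv_bytes_per_token / 2**30


print(worst_case_kv_gib(2048))  # ~1.0 GiB per sequence
print(worst_case_kv_gib(4096))  # ~2.0 GiB -- doubling max_model_len doubles the ceiling
```

vLLM's PagedAttention allocates KV blocks on demand, but the worst-case length still bounds how many long-running sequences can coexist before preemption, which is the throughput effect described above.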
Your contribution
I can modify the code to check whether max_prompt_len and max_completion_length are both provided and, if so, set max_model_len to their sum. A minimal sketch of the intended logic is shown below.
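This is a sketch of what that check could look like, assuming the trainer has both length limits at hand and passes max_model_len through to vllm.LLM; the helper name and the exact plumbing are hypothetical, not the current TRL code:

```python
from typing import Optional

from vllm import LLM


def resolve_max_model_len(
    max_prompt_len: Optional[int],
    max_completion_length: Optional[int],
) -> Optional[int]:
    """Return the tightest max_model_len vLLM needs, or None to keep vLLM's default.

    When both limits are set, no sequence can exceed their sum, so reserving a
    longer context window only wastes KV-cache memory.
    """
    if max_prompt_len is not None and max_completion_length is not None:
        return max_prompt_len + max_completion_length
    return None


# Hypothetical usage: cap the context at 512 prompt + 256 completion tokens.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_model_len=resolve_max_model_len(512, 256),  # 768 instead of the model's full window
)
```

If only one of the two limits is set, the helper returns None so that vLLM keeps deriving max_model_len from the model config, preserving today's behaviour.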