
vLLM max_model_len should be set as the sum of max_prompt_len and max_completion_length #3113

@toslali-ibm

Description


Feature request

vLLM's max_model_len should be set to the sum of max_prompt_len and max_completion_length when both parameters are provided by the user.

Motivation

max_model_len should be set to the sum of max_prompt_len and max_completion_length; otherwise, an unnecessarily large value can negatively impact performance.

This parameter determines the maximum sequence length that vLLM can handle for a single request. For example, with max_model_len = 2048, vLLM provisions KV-cache capacity for up to 2048 tokens per sequence. Increasing it to 4096 doubles the per-sequence KV-cache memory, significantly increasing GPU memory consumption.

Even if the actual input prompt is short (controlled by max_prompt_len), vLLM reserves GPU memory based on max_model_len to accommodate the worst case. A higher max_model_len therefore reduces the number of concurrent requests that fit into memory, lowering throughput. A rough estimate of the effect is sketched below.
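
As a back-of-the-envelope illustration of why this matters, the sketch below estimates per-sequence KV-cache memory as a function of max_model_len. The layer/head/dtype figures are assumptions (a 7B-class model in fp16), not values taken from this issue:

```python
def kv_cache_bytes_per_seq(max_model_len, num_layers=32, num_kv_heads=32,
                           head_dim=128, dtype_bytes=2):
    # The leading 2 accounts for the K and V tensors stored per token per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * max_model_len

print(kv_cache_bytes_per_seq(2048) / 2**30)  # ~1.0 GiB per sequence
print(kv_cache_bytes_per_seq(4096) / 2**30)  # ~2.0 GiB -- doubles with max_model_len
```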

Your contribution

I can modify the code to check whether max_prompt_len and max_completion_length are provided and, if so, set max_model_len to their sum.
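
A minimal sketch of the proposed check, assuming the vLLM engine is constructed somewhere in the trainer/server setup. The `args` attribute names mirror the parameters mentioned above, and `build_vllm_engine` is a hypothetical helper, not existing code:

```python
from vllm import LLM


def build_vllm_engine(model_name, args):
    # Hypothetical helper: args.max_prompt_len / args.max_completion_length
    # mirror the user-facing parameters named in this issue. Fall back to
    # vLLM's default behavior (None) when either is missing.
    max_model_len = None
    if args.max_prompt_len is not None and args.max_completion_length is not None:
        max_model_len = args.max_prompt_len + args.max_completion_length

    # Capping max_model_len at prompt + completion length keeps vLLM from
    # provisioning KV-cache capacity for sequences longer than training
    # can ever produce.
    return LLM(model=model_name, max_model_len=max_model_len)
```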
