Feature request
vLLM's max_model_len should be set to the sum of max_prompt_len and max_completion_length whenever the user provides both parameters.
Motivation
max_model_len should be set to the sum of max_prompt_len and max_completion_length. Otherwise, an unnecessarily large context window can hurt performance.
This parameter determines the maximum sequence length that vLLM can handle for a single request. For example, with max_model_len = 2048, vLLM must be able to hold up to 2048 key/value cache entries per sequence; raising it to 4096 doubles that worst-case footprint and significantly increases GPU memory consumption.
Even if the actual prompts are short (bounded by max_prompt_len), vLLM sizes its GPU memory reservations around max_model_len to accommodate the worst case. A higher max_model_len therefore reduces the number of concurrent requests that fit in memory, hurting throughput; a back-of-the-envelope sketch follows below.
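To make the memory argument concrete, here is an illustrative calculation for a hypothetical 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16). All model dimensions are assumptions chosen only to show how the per-sequence ceiling scales with max_model_len, not measurements from vLLM:

```python
# Illustrative KV-cache arithmetic for a hypothetical 7B-class model
# (32 layers, 32 KV heads, head_dim 128, fp16). Numbers are assumptions,
# chosen only to show how the per-sequence ceiling scales with max_model_len.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

# One key plus one value vector per head, per layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 512 KiB


def worst_case_kv_gib(max_model_len: int) -> float:
    """Upper bound on the KV-cache memory one sequence can consume, in GiB."""
    return max_model_len * kv_bytes_per_token / 2**30


print(worst_case_kv_gib(2048))  # ~1.0 GiB per sequence
print(worst_case_kv_gib(4096))  # ~2.0 GiB -- doubling max_model_len doubles the ceiling
```

vLLM's PagedAttention allocates KV blocks on demand, but the worst-case length still bounds how many long-running sequences can coexist before preemption, which is the throughput effect described above.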
Your contribution
I can modify the code to check whether max_prompt_len and max_completion_length are both provided and, if so, set max_model_len to their sum. A minimal sketch of the intended logic is shown below.
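This is a sketch of what that check could look like, assuming the trainer has both length limits at hand and passes max_model_len through to vllm.LLM; the helper name and the exact plumbing are hypothetical, not the current TRL code:

```python
from typing import Optional

from vllm import LLM


def resolve_max_model_len(
    max_prompt_len: Optional[int],
    max_completion_length: Optional[int],
) -> Optional[int]:
    """Return the tightest max_model_len vLLM needs, or None to keep vLLM's default.

    When both limits are set, no sequence can exceed their sum, so reserving a
    longer context window only wastes KV-cache memory.
    """
    if max_prompt_len is not None and max_completion_length is not None:
        return max_prompt_len + max_completion_length
    return None


# Hypothetical usage: cap the context at 512 prompt + 256 completion tokens.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_model_len=resolve_max_model_len(512, 256),  # 768 instead of the model's full window
)
```

If only one of the two limits is set, the helper returns None so that vLLM keeps deriving max_model_len from the model config, preserving today's behaviour.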