2.2.3 Backend: vLLM
Handle: `vllm`
URL: http://localhost:33911
A high-throughput and memory-efficient inference and serving engine for LLMs
```bash
# [Optional] pre-build the vLLM image
harbor build vllm

# Start the vLLM service
harbor up vllm
```
- Harbor builds a custom `vllm` image with `bitsandbytes` support
- `vllm` will be connected to `webui`, `aider`, `boost`, `chatui` and some other services when running together (see the example below)
- Official Docker images require specific CUDA versions - beware
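For illustration, here's how vLLM could be started together with one of those services. This sketch assumes `harbor up` accepts multiple service handles:

```bash
# Start vLLM together with the Web UI
harbor up vllm webui
```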
Once you've found a model you want to run, you can configure it with Harbor:
```bash
# Quickly look up some of the compatible quants
harbor hf find awq
harbor hf find gptq

# This propagates the settings
# to the relevant configuration files
harbor vllm model google/gemma-2-2b-it

# To run a gated model, ensure that you've
# also set your Huggingface API token
harbor hf token <your-token>
```
You can configure specific portions of `vllm` via the Harbor CLI:
```bash
# See original CLI help
harbor run vllm --help

# Get/Set the extra arguments
harbor vllm args
harbor vllm args '--dtype bfloat16 --code-revision 3.5'

# Select attention backend
harbor vllm attention ROCM_FLASH
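
# Set the host port for the vLLM service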
harbor config set vllm.host.port 4090
```
Version and update
```bash
# Get/set desired vLLM version
harbor vllm version # v0.9.1

# Command accepts a docker tag
harbor vllm version latest

# Customize docker image
harbor config set vllm.image custom/vllm

# Force-pull new version of the base image
# if you have set version to "latest"
docker pull $(harbor config get vllm.image):$(harbor config get vllm.version)
```
You can specify more options directly via the `.env` file.
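For example (a minimal sketch - the variable names below assume Harbor's usual mapping of config keys such as `vllm.model` to `HARBOR_VLLM_MODEL`-style entries; check your own `.env` for the exact names):

```bash
# Hypothetical .env entries, assuming the HARBOR_<SECTION>_<KEY> naming convention
HARBOR_VLLM_MODEL="google/gemma-2-2b-it"
HARBOR_VLLM_EXTRA_ARGS="--max-model-len 4096"
```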
Below are some steps to take if you're running out of VRAM (no magic, though).
You can limit the context length to reduce the memory footprint. This can be done via the `--max-model-len` flag.
```bash
harbor vllm args --max-model-len 2048
```
vLLM supports many different quantization formats. You would typically configure this via the `--load-format` and `--quantization` flags. For example:
```bash
harbor vllm args --load-format bitsandbytes --quantization bitsandbytes
```
vLLM supports partial offloading to the CPU, similar to llama.cpp and some other backends. This can be configured via the `--cpu-offload-gb` flag.
```bash
harbor vllm args --cpu-offload-gb 4
```
When loading a model, VRAM usage can spike while CUDA graphs are computed. This can be disabled via the `--enforce-eager` flag.
```bash
harbor vllm args --enforce-eager
```
You can also reduce the fraction of VRAM allocated to the model executor. The value ranges from 0 to 1.0 and defaults to 0.9.
```bash
harbor vllm args --gpu-memory-utilization 0.8
```
You can move inference to the CPU entirely by setting the `--device cpu` flag.
```bash
harbor vllm args --device cpu
```
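These flags can be combined. As an illustrative starting point for a VRAM-constrained setup (the values are arbitrary and should be tuned for your GPU and model):

```bash
# Shorter context, lower executor memory budget, no CUDA graphs
harbor vllm args '--max-model-len 4096 --gpu-memory-utilization 0.8 --enforce-eager'
```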