2.2.3 Backend: vLLM
Handle: `vllm`
URL: http://localhost:33911
A high-throughput and memory-efficient inference and serving engine for LLMs
```bash
# [Optional] pre-build the vLLM image
harbor build vllm

# Start the vLLM service
harbor up vllm
```
- Harbor builds a custom `vllm` image with `bitsandbytes` support
- `vllm` will be connected to `webui`, `aider`, `boost`, `chatui` and some other services when running together (see the example below)
- Official Docker images require specific CUDA versions - beware
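For illustration, here's how vLLM could be started together with one of those services. This sketch assumes `harbor up` accepts multiple service handles:

```bash
# Start vLLM together with the Web UI
harbor up vllm webui
```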
Once you've found a model you want to run, you can configure it with Harbor:
```bash
# Quickly look up some of the compatible quants
harbor hf find awq
harbor hf find gptq

# This propagates the settings
# to the relevant configuration files
harbor vllm model google/gemma-2-2b-it

# To run a gated model, ensure that you've
# also set your Huggingface API token
harbor hf token <your-token>
```
You can configure specific portions of `vllm` via the Harbor CLI:
```bash
# See original CLI help
harbor run vllm --help

# Get/Set the extra arguments
harbor vllm args
harbor vllm args '--dtype bfloat16 --code-revision 3.5'

# Select attention backend
harbor vllm attention ROCM_FLASH
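
# Set the host port for the vLLM service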
harbor config set vllm.host.port 4090
```
Version and update
```bash
# Get/set desired vLLM version
harbor vllm version # v0.9.1

# Command accepts a docker tag
harbor vllm version latest

# Customize docker image
harbor config set vllm.image custom/vllm

# Force-pull new version of the base image
# if you have set version to "latest"
docker pull $(harbor config get vllm.image):$(harbor config get vllm.version)
```
You can specify more options directly via the `.env` file.
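For example (a minimal sketch - the variable names below assume Harbor's usual mapping of config keys such as `vllm.model` to `HARBOR_VLLM_MODEL`-style entries; check your own `.env` for the exact names):

```bash
# Hypothetical .env entries, assuming the HARBOR_<SECTION>_<KEY> naming convention
HARBOR_VLLM_MODEL="google/gemma-2-2b-it"
HARBOR_VLLM_EXTRA_ARGS="--max-model-len 4096"
```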
Below are some steps to take if you're running out of VRAM (no magic, though).
You can limit the context length to reduce the memory footprint. This can be done via the `--max-model-len` flag.
```bash
harbor vllm args --max-model-len 2048
```
vLLM supports many different quantization formats. You would typically configure this via the `--load-format` and `--quantization` flags. For example:
```bash
harbor vllm args --load-format bitsandbytes --quantization bitsandbytes
```
vLLM supports partial offloading to the CPU, similar to llama.cpp and some other backends. This can be configured via the `--cpu-offload-gb` flag.
```bash
harbor vllm args --cpu-offload-gb 4
```
When loading a model, VRAM usage can spike while CUDA graphs are computed. This can be disabled via the `--enforce-eager` flag.
```bash
harbor vllm args --enforce-eager
```
You can also reduce the fraction of VRAM allocated to the model executor. The value ranges from 0 to 1.0 and defaults to 0.9.
```bash
harbor vllm args --gpu-memory-utilization 0.8
```
You can move inference to the CPU entirely by setting the `--device cpu` flag.
```bash
harbor vllm args --device cpu
```
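These flags can be combined. As an illustrative starting point for a VRAM-constrained setup (the values are arbitrary and should be tuned for your GPU and model):

```bash
# Shorter context, lower executor memory budget, no CUDA graphs
harbor vllm args '--max-model-len 4096 --gpu-memory-utilization 0.8 --enforce-eager'
```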