This repository contains a script for training SmolVLM using only HuggingFace.
[Phi3-Vision Finetuning]
[Qwen2-VL Finetuning]
[Llama3.2-Vision Finetuning]
[Molmo Finetune]
[Pixtral Finetune]
[Gemma3 Finetune]
- [2025/01/24] Add option for using DoRA.
- [2025/01/24] Fixed error in LoRA.
- [2025/01/24] 🔥 Supports mixed-modality data.
- Fine-tuning SmolVLM
- Deepspeed
- LoRA/QLoRA
- Full-finetuning
- Enable finetuning vision_model while using LoRA.
- Disable/enable Flash Attention 2
- Multi-image and video training
To simplify the setup process for training, you can use the provided pre-built environments.
The settings are done in the conda env named train.
You can find more information about the image here.
docker pull john119/vlm
docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash
- Ubuntu 22.04
- Nvidia-Driver 550.120
- Cuda version 12.4
Install the required packages using environment.yaml.
pip install -r requirements.txt --index-url https://download.pytorch.org/whl/cu126
pip install flash-attn --no-build-isolation
pip install pillow-avif-plugin
pip install num2words
conda env create -f environment.yaml
conda activate train
pip install flash-attn --no-build-isolation
pip install pillow-avif-plugin
pip install num2words
Note: You should install flash-attn after installing the other packages.
The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.
When using a multi-image dataset, the image tokens should all be <image>, and the image file names should be given in a list.
Please see the examples below and format your data accordingly.
Example for single image dataset
[
{
"id": "000000033471",
"image": "000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
{
"from": "human",
"value": "What feature can be seen on the back of the bus?"
},
{
"from": "gpt",
"value": "The back of the bus features an advertisement."
},
{
"from": "human",
"value": "Is the bus driving down the street or pulled off to the side?"
},
{
"from": "gpt",
"value": "The bus is driving down the street, which is crowded with people and other vehicles."
}
]
}
...
]
Example for multi image dataset
[
{
"id": "000000033471",
"image": ["000000033471.jpg", "000000033472.jpg"],
"conversations": [
{
"from": "human",
"value": "<image>\n<image>\nIs the perspective of the camera different?"
},
{
"from": "gpt",
"value": "Yes, the perspective of the camera is different."
}
]
}
...
]
Example for video dataset
[
{
"id": "sample1",
"video": "sample1.mp4",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is going on in this video?"
},
{
"from": "gpt",
"value": "A man is walking down the road."
}
]
}
...
]
Note: SmolVLM treats a video as a sequence of images.
Note: Mixed-modality datasets (e.g. some samples in a batch have images while others don't) are only supported with zero2.
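Before training, it can help to sanity-check the dataset against the format described above. The following is a minimal sketch, not part of this repository: the helper names check_entry and check_dataset are hypothetical, and the rule "one <image> token per image file" is an assumption inferred from the examples above. Video entries are not checked here.

```python
import json
import os


def check_entry(entry, image_folder):
    """Return a list of problems found in one LLaVA-style entry (empty if none)."""
    problems = []
    images = entry.get("image", [])
    if isinstance(images, str):  # single-image entries store a plain string
        images = [images]

    # Every referenced image should exist under the image folder.
    for name in images:
        if not os.path.isfile(os.path.join(image_folder, name)):
            problems.append(f"missing image file: {name}")

    # Assumption: one <image> placeholder per image file, as in the examples.
    placeholders = sum(
        turn.get("value", "").count("<image>")
        for turn in entry.get("conversations", [])
        if turn.get("from") == "human"
    )
    if placeholders != len(images):
        problems.append(
            f"{placeholders} <image> token(s) for {len(images)} image file(s)"
        )
    return problems


def check_dataset(data_path, image_folder):
    """Map entry id (or index) to its problems, skipping clean entries."""
    with open(data_path, encoding="utf-8") as f:
        entries = json.load(f)
    report = {}
    for index, entry in enumerate(entries):
        problems = check_entry(entry, image_folder)
        if problems:
            report[entry.get("id", index)] = problems
    return report
```

Calling check_dataset with the same values you pass to --data_path and --image_folder returns an empty dictionary when no problems are found.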
To run the training script, use the following command:
bash scripts/finetune.sh
If you want to train only the language model with LoRA and perform full training for the vision model:
bash scripts/finetune_lora.sh
If you want to train both the language model and the vision model with LoRA:
bash scripts/finetune_lora_vision.sh
IMPORTANT: If you want to tune the embed_token with LoRA, you need to tune lm_head together.
Training arguments
- --deepspeed (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
- --data_path (str): Path to the LLaVA formatted training data (a JSON file). (Required)
- --image_folder (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
- --model_id (str): Path to the SmolVLM model. (Required)
- --output_dir (str): Output directory for model checkpoints.
- --num_train_epochs (int): Number of training epochs (default: 1).
- --per_device_train_batch_size (int): Training batch size per GPU per forward step.
- --gradient_accumulation_steps (int): Gradient accumulation steps (default: 4).
- --freeze_vision_tower (bool): Option to freeze vision_model (default: False).
- --freeze_llm (bool): Option to freeze LLM (default: False).
- --tune_connector (bool): Option to tune projector (default: True).
- --num_lora_modules (int): Number of target modules to add LoRA (-1 means all layers).
- --vision_lr (float): Learning rate for vision_model.
- --connector_lr (float): Learning rate for merger (projector).
- --learning_rate (float): Learning rate for language module.
- --bf16 (bool): Option for using bfloat16.
- --fp16 (bool): Option for using fp16.
- --min_pixels (int): Option for minimum input tokens.
- --max_pixels (int): Option for maximum input tokens.
- --lora_enable (bool): Option for enabling LoRA (default: False).
- --vision_lora (bool): Option for including vision_tower in the LoRA module. lora_enable should be True to use this option. (default: False)
- --use_dora (bool): Option for using DoRA instead of LoRA. lora_enable should be True to use this option. (default: False)
- --lora_namespan_exclude (str): Exclude modules with namespans to add LoRA.
- --max_seq_length (int): Maximum sequence length (default: 32K).
- --bits (int): Quantization bits (default: 16).
- --disable_flash_attn2 (bool): Disable Flash Attention 2.
- --report_to (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
- --logging_dir (str): Logging directory (default: "./tf-logs").
- --lora_rank (int): LoRA rank (default: 16).
- --lora_alpha (int): LoRA alpha (default: 16).
- --lora_dropout (float): LoRA dropout (default: 0.05).
- --logging_steps (int): Logging steps (default: 1).
- --dataloader_num_workers (int): Number of data loader workers (default: 4).
Note: The learning rate of vision_model should be 5x to 10x smaller than that of the language_model.
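To illustrate how separate learning rates for the vision model, the connector, and the language model can be expressed, here is a minimal sketch that builds optimizer-style parameter groups. It is not the implementation used by the training script: build_param_groups is a hypothetical helper, and matching on the substrings "vision_model" and "connector" in parameter names is an assumption about how the modules are named.

```python
def build_param_groups(named_params, learning_rate, vision_lr, connector_lr):
    """Split (name, parameter) pairs into groups with their own learning rate.

    The returned list uses the {"params": [...], "lr": ...} layout that
    PyTorch optimizers accept as per-parameter options.
    """
    groups = {"vision": [], "connector": [], "language": []}
    for name, param in named_params:
        if "vision_model" in name:      # assumed naming of the vision encoder
            groups["vision"].append(param)
        elif "connector" in name:       # assumed naming of the projector
            groups["connector"].append(param)
        else:                           # everything else is the language model
            groups["language"].append(param)

    lrs = {
        "vision": vision_lr,
        "connector": connector_lr,
        "language": learning_rate,
    }
    # Drop empty groups, e.g. when a module has no trainable parameters.
    return [
        {"params": params, "lr": lrs[key]}
        for key, params in groups.items()
        if params
    ]
```

With a real model you would pass only the trainable entries of model.named_parameters() and hand the result to an optimizer such as torch.optim.AdamW. Following the note above, vision_lr would be set to roughly one fifth to one tenth of learning_rate.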
You can train the model using a video dataset. However, SmolVLM processes videos as a sequence of images, so you'll need to select specific frames and treat them as multiple images for training. You can also set the LoRA configs and train with LoRA.
bash scripts/finetune_video.sh
Note: Training with video works the same way as multi-image training, so you should adjust max_pixels (for maximum resolution) and fps based on the available VRAM.
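Since a video is handled as a list of frames, the number of frames sampled directly drives memory use. The sketch below shows one simple way to pick frame indices at a target fps with an optional cap. It is a hypothetical illustration, not the sampling logic of this repository: sample_frame_indices and its max_frames parameter are assumptions.

```python
def sample_frame_indices(total_frames, video_fps, target_fps, max_frames=None):
    """Pick frame indices so the video is read at roughly target_fps."""
    if total_frames <= 0 or video_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames, video_fps and target_fps must be positive")

    # Never sample more densely than the video itself provides.
    step = max(video_fps / target_fps, 1.0)

    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1

    # Optionally thin the selection out evenly to respect a frame budget.
    if max_frames is not None and len(indices) > max_frames:
        stride = len(indices) / max_frames
        indices = [indices[int(k * stride)] for k in range(max_frames)]
    return indices
```

Lowering the target fps or setting a frame cap reduces the number of images per sample, which is usually the first thing to try when video training runs out of VRAM.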
If you run out of VRAM, you can use zero3_offload instead of zero3. However, using zero3 is preferred.
bash scripts/merge_lora.sh
Note: Remember to replace the paths in finetune.sh or finetune_lora.sh with your specific paths. (Also in merge_lora.sh when using LoRA.)
Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
You can run unset LD_LIBRARY_PATH to resolve this error.
See this issue for details.
- Add feature for controlling image size.
- Add support for SmolVLM2.
- Add DPO Training.
- Handle interleaved dataset.
- Handle mixed-modality dataset.
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
If you find this repository useful in your project, please consider giving a ⭐ and citing:
@misc{SmolVLM-Finetuning,
author = {Yuwon Lee},
title = {SmolVLM-Finetune},
year = {2025},
publisher = {GitHub},
url = {https://github.com/2U1/SmolVLM-Finetune}
}
This project is based on
- LLaVA-NeXT: An amazing open-source project of LMM.
- Mipha: Open-source project of SMM with amazing capabilities.
- SmolVLM: Awesome pretrained MLLM based on SmolLM2.