Fine-tuning SmolVLM

This repository contains a script for training SmolVLM using only HuggingFace.

Other projects

[Phi3-Vision Finetuning]
[Qwen2-VL Finetuning]
[Llama3.2-Vision Finetuning]
[Molmo Finetune]
[Pixtral Finetune]
[Gemma3 Finetune]

Update

[2025/01/24] Add option for using DoRA.
[2025/01/24] Fixed error in LoRA.
[2025/01/24] 🔥Supports mixed-modality data.

Supported Features

Deepspeed
LoRA/QLoRA
Full-finetuning
Enable finetuning vision_model while using LoRA.
Disable/enable Flash Attention 2
Multi-image and video training

Docker

To simplify environment setup for training, you can use the provided pre-built Docker image.
Everything is set up in the conda env named train.

You can find more information about the image here.

docker pull john119/vlm
docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash

Installation

Environments

Ubuntu 22.04
Nvidia-Driver 550.120
Cuda version 12.4

Install the required packages using either requirements.txt or environment.yaml.

Using `requirements.txt`

pip install -r requirements.txt --index-url https://download.pytorch.org/whl/cu126
pip install flash-attn --no-build-isolation
pip install pillow-avif-plugin
pip install num2words

Using `environment.yaml`

conda env create -f environment.yaml
conda activate train
pip install flash-attn --no-build-isolation
pip install pillow-avif-plugin
pip install num2words

Note: You should install flash-attn after installing the other packages.
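
After installation, a quick sanity check can confirm that the packages are visible to the Python interpreter. The snippet below is only a minimal sketch: the list of import names is an assumption for illustration (for example, `pillow-avif-plugin` is imported as `pillow_avif` and `flash-attn` as `flash_attn`), and it only checks that each package can be found, not that it works with your GPU.

```python
import importlib.util

# Import names (not pip names) of packages installed in this section.
# This list is an assumption for illustration; adjust it to your environment.
PACKAGES = ["torch", "transformers", "flash_attn", "pillow_avif", "num2words"]


def check_packages(names):
    """Return {import_name: bool} without actually importing the packages."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:15s} {'ok' if found else 'MISSING'}")
```

If `flash_attn` is reported as missing, training can still run with `--disable_flash_attn2`.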

Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

When using a multi-image dataset, every image token should be <image>, and the image file names should be given in a list. Please see the examples below and format your data accordingly.

Example for single image dataset

[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]

Example for multi image dataset

[
  {
    "id": "000000033471",
    "image": ["000000033471.jpg", "000000033472.jpg"],
    "conversations": [
      {
        "from": "human",
        "value": "<image>\n<image>\nIs the perspective of the camera different?"
      },
      {
        "from": "gpt",
        "value": "Yes, the perspective of the camera is different."
      }
    ]
  }
  ...
]

Example for video dataset

[
  {
    "id": "sample1",
    "video": "sample1.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nWhat is going on in this video?"
      },
      {
        "from": "gpt",
        "value": "A man is walking down the road."
      }
    ]
  }
  ...
]

Note: SmolVLM treats a video as a sequence of images.
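
Before training, it can help to check the dataset against the format shown above. The following is a minimal sketch, not part of this repository: `validate_entry` and `validate_dataset` are hypothetical helper names. It assumes that the number of `<image>` (or `<video>`) tokens in the human turns equals the number of listed files, as in the examples, and that video files live under the same folder as images.

```python
import json
import os


def validate_entry(entry, image_folder=None):
    """Return a list of human-readable problems found in one dataset entry."""
    problems = []

    # "image" and "video" may each be a single file name or a list of names.
    images = entry.get("image", [])
    if isinstance(images, str):
        images = [images]
    videos = entry.get("video", [])
    if isinstance(videos, str):
        videos = [videos]

    # Count placeholder tokens across all human turns.
    human_text = "".join(
        turn.get("value", "")
        for turn in entry.get("conversations", [])
        if turn.get("from") == "human"
    )
    if human_text.count("<image>") != len(images):
        problems.append("number of <image> tokens does not match number of image files")
    if human_text.count("<video>") != len(videos):
        problems.append("number of <video> tokens does not match number of video files")

    # Optionally check that every referenced file exists (assumed to be under image_folder).
    if image_folder is not None:
        for name in images + videos:
            if not os.path.isfile(os.path.join(image_folder, name)):
                problems.append(f"missing file: {name}")
    return problems


def validate_dataset(data_path, image_folder=None):
    """Load the JSON file and return {entry id: problems} for entries with problems."""
    with open(data_path, encoding="utf-8") as f:
        entries = json.load(f)
    report = {}
    for index, entry in enumerate(entries):
        problems = validate_entry(entry, image_folder)
        if problems:
            report[entry.get("id", index)] = problems
    return report
```

Calling `validate_dataset("data.json", "images/")` (both paths are placeholders) returns an empty dict when no problems are found.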

Training

Note: Mixed-modality datasets (e.g. some samples in a batch have images while others don't) are only supported with zero2.
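
To find out whether a dataset is mixed in this sense, you can count entries by modality. This is a minimal sketch under one assumption: text-only entries simply omit both the "image" and "video" keys. `modality_counts` and `is_mixed` are hypothetical helpers, not functions from this repository.

```python
from collections import Counter


def modality_counts(entries):
    """Count entries by modality: 'video', 'image', or 'text' (no visual input)."""
    counts = Counter()
    for entry in entries:
        if "video" in entry:
            counts["video"] += 1
        elif "image" in entry:
            counts["image"] += 1
        else:
            counts["text"] += 1
    return counts


def is_mixed(entries):
    """True if the dataset contains both text-only and visual entries."""
    counts = modality_counts(entries)
    return counts["text"] > 0 and (counts["image"] + counts["video"]) > 0
```

If `is_mixed` returns True for your data, keep the DeepSpeed config on zero2.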

To run the training script, use the following command:

Full Finetuning

bash scripts/finetune.sh

Finetune with LoRA

If you want to train only the language model with LoRA and perform full training for the vision model:

bash scripts/finetune_lora.sh

If you want to train both the language model and the vision model with LoRA:

bash scripts/finetune_lora_vision.sh

IMPORTANT: If you want to tune embed_token with LoRA, you need to tune lm_head together with it.

Training arguments

--deepspeed (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
--data_path (str): Path to the LLaVA formatted training data (a JSON file). (Required)
--image_folder (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
--model_id (str): Path to the SmolVLM model. (Required)
--output_dir (str): Output directory for model checkpoints
--num_train_epochs (int): Number of training epochs (default: 1).
--per_device_train_batch_size (int): Training batch size per GPU per forward step.
--gradient_accumulation_steps (int): Gradient accumulation steps (default: 4).
--freeze_vision_tower (bool): Option to freeze vision_model (default: False).
--freeze_llm (bool): Option to freeze LLM (default: False).
--tune_connector (bool): Option to tune projector (default: True).
--num_lora_modules (int): Number of target modules to add LoRA (-1 means all layers).
--vision_lr (float): Learning rate for vision_model.
--connector_lr (float): Learning rate for the connector (projector).
--learning_rate (float): Learning rate for language module.
--bf16 (bool): Option for using bfloat16.
--fp16 (bool): Option for using fp16.
--min_pixels (int): Option for minimum input tokens.
--max_pixels (int): Option for maximum input tokens.
--lora_enable (bool): Option for enabling LoRA (default: False)
--vision_lora (bool): Option for applying LoRA to the vision_tower as well. lora_enable must be True to use this option. (default: False)
--use_dora (bool): Option for using DoRA instead of LoRA. lora_enable must be True to use this option. (default: False)
--lora_namespan_exclude (str): Name spans of modules to exclude from LoRA.
--max_seq_length (int): Maximum sequence length (default: 32K).
--bits (int): Quantization bits (default: 16).
--disable_flash_attn2 (bool): Disable Flash Attention 2.
--report_to (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
--logging_dir (str): Logging directory (default: "./tf-logs").
--lora_rank (int): LoRA rank (default: 16).
--lora_alpha (int): LoRA alpha (default: 16).
--lora_dropout (float): LoRA dropout (default: 0.05).
--logging_steps (int): Logging steps (default: 1).
--dataloader_num_workers (int): Number of data loader workers (default: 4).
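
The batch-size arguments above combine into the effective (global) batch size seen by each optimizer step. A minimal sketch of that arithmetic for data-parallel training follows; `num_gpus` is not a script argument, it stands for the number of processes you launch.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    """Samples contributing to one optimizer step in data-parallel training."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


if __name__ == "__main__":
    # Example values only: batch size 2 per GPU, 4 accumulation steps, 8 GPUs.
    print(effective_batch_size(2, 4, 8))
```

When you change the number of GPUs, adjusting `--gradient_accumulation_steps` keeps the effective batch size constant.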

Note: The learning rate of the vision_model should be 5x to 10x smaller than that of the language model.
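
As a small worked example of this rule of thumb, the helper below derives a range for `--vision_lr` from `--learning_rate`. It is only a sketch; `suggested_vision_lr_range` is a hypothetical name and the factors 5 and 10 come from the note above.

```python
def suggested_vision_lr_range(learning_rate, min_factor=5.0, max_factor=10.0):
    """Return (low, high) bounds for the vision learning rate."""
    return learning_rate / max_factor, learning_rate / min_factor


if __name__ == "__main__":
    low, high = suggested_vision_lr_range(1e-4)
    print(f"vision_lr between {low:.1e} and {high:.1e}")
```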

Train with video dataset

You can train the model using a video dataset. However, SmolVLM processes videos as a sequence of images, so you'll need to select specific frames and treat them as multiple images for training. You can also set the LoRA configs to train on video with LoRA.

bash scripts/finetune_video.sh

Note: Video training works the same way as multi-image training, so you should adjust max_pixels (maximum resolution) and fps based on the available VRAM.
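
To illustrate how fps affects the number of images per sample, here is a minimal sketch of frame selection at a target fps with an optional cap. It is not the sampling code used by this repository; `sample_frame_indices` and its parameters are hypothetical.

```python
def sample_frame_indices(total_frames, native_fps, target_fps, max_frames=None):
    """Pick frame indices at roughly target_fps, optionally capped at max_frames."""
    if total_frames <= 0:
        return []

    # Distance (in source frames) between two sampled frames; never below 1.
    step = max(native_fps / target_fps, 1.0)

    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step

    # If there are too many frames, keep max_frames of them spread evenly.
    if max_frames is not None and len(indices) > max_frames:
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices
```

Each selected frame is processed like one image, so halving the fps roughly halves the visual tokens for a video sample.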

If you run out of VRAM, you can use zero3_offload instead of zero3. However, using zero3 is preferred.

Merge LoRA Weights

bash scripts/merge_lora.sh

Note: Remember to replace the paths in finetune.sh or finetune_lora.sh with your specific paths. (Also in merge_lora.sh when using LoRA.)

Issue for libcudnn error

Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8

You can run unset LD_LIBRARY_PATH to work around this error. See this issue for details.

TODO

Known Issues

libcudnn issue

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Citation

If you find this repository useful in your project, please consider giving a ⭐ and citing:

@misc{SmolVLM-Finetuning,
  author = {Yuwon Lee},
  title = {SmolVLM-Finetune},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/2U1/SmolVLM-Finetune}
}

Acknowledgement

This project is based on

LLaVA-NeXT: An amazing open-source project of LMM.
Mipha: Open-source project of SMM with amazing capabilities.
SmolVLM: Awesome pretrained MLLM based on SmolLM2.
