
Long-term Vision-Language Tracking

ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking, Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang, arXiv:2508.05221 [arXiv] [Code]

Abstract:

Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse fixed language descriptions with vision features or simply modify them using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations of the target during tracking, but these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, built on the pre-trained vision-language model Qwen2.5-VL. Both SFT (supervised fine-tuning) and reinforcement learning with GRPO are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them, together with vision features, into a unified tracking backbone network. We then adopt a tracking head to predict the location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy.

How to Download the TNLLT Dataset?

Currently, the dataset can be downloaded from BaiduYun:

  • BaiduYun Drive:
    Full Dataset:
    URL: https://pan.baidu.com/s/1Bsr3PENWaa9k_yCNpkUh_Q?pwd=1d6b
    Code: 1d6b

    Example Sequence:
    URL: https://pan.baidu.com/s/1onjHTESlh-V1vgR2AYgnLw?pwd=ed76
    Code: ed76
  • Dropbox:
    To be released

After downloading all parts of the full dataset, restore and extract the archive:
1. cat TNLLT_part_* > TNLLT_restored.tar.gz
2. md5sum -c TNLLT.tar.gz.md5 (optional, verifies the concatenated archive before decompression)
3. gunzip TNLLT_restored.tar.gz
4. tar -xvf TNLLT_restored.tar

Tutorial for the Evaluation Toolkit:

  1. Clone this GitHub repository:
git clone https://github.com/Event-AHU/Open_VLTrack.git
  2. Download the annotations from [Annos (word:bsf7)]:
unzip annos.zip and put the contents into Open_VLTrack/TNLLT_Evaluation_Toolkit/annos

Note: If there is a nested 'annos' folder after decompression, it should be removed.

  3. Download the benchmark results from [Benchmark-Results (word:s48i)]:
unzip tracking_results.zip and put the contents into Open_VLTrack/TNLLT_Evaluation_Toolkit/tracking_results

Note: If there is a nested 'tracking_results' folder after decompression, it should be removed.

  4. Open Matlab and run the script:
run_tracker_performance_evaluation.m

Note: In run_tracker_performance_evaluation.m, you can

  1. Change flag (line 25) to select precision (1), normalized precision (2), or success rate (3).
  2. Uncomment lines 167-194 of run_tracker_performance_evaluation.m to draw the per-attribute performance plots.
  3. Change the color and line style of the plots in utils/plot_draw_save.m.

  5. Wait for the final result plots.

Tutorial for the Supervised Fine-Tuning

  • Clone the repository and install the dependencies:
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
  • Prepare the SFT dataset:
SFT Dataset:
URL: https://pan.baidu.com/s/126Gn8R629OC1UVstSIkQDA?pwd=6arr
Code: 6arr
  • Download the dataset and unzip it into your root path (e.g., /your_root_path).
  • Put [ReasoningData] into LLaMA-Factory/data.
  • Use [script] to transform the image paths in [ReasoningData]; a sketch of what such a conversion looks like is shown after this list.
  • The reference configuration used during training is given in [training_args].
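
Purely as an illustration of the path-transformation step, a minimal sketch of such a script might look like the following; the file name reasoning_data.json and the "images" key are assumptions, and the real conversion is the one provided by the [script] link above:

    # Hypothetical sketch: rewrite image paths in a LLaMA-Factory-style JSON dataset.
    # File names and JSON keys are assumptions for illustration only.
    import json

    OLD_ROOT = "/original_root_path"   # path prefix assumed to be baked into the annotations
    NEW_ROOT = "/your_root_path"       # where you actually unzipped the SFT dataset

    with open("reasoning_data.json", "r", encoding="utf-8") as f:
        samples = json.load(f)

    for sample in samples:
        # Each sample is assumed to carry a list of image paths under "images".
        sample["images"] = [p.replace(OLD_ROOT, NEW_ROOT, 1) for p in sample.get("images", [])]

    with open("reasoning_data_fixed.json", "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)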

Tutorial for the Reinforcement Learning

Installation

  • Please refer to the official [EasyR1 repo] for RL environment configuration guidelines.
  • Then refer to the official [DUTrack repo] for tracking environment configuration guidelines.
  • You may also refer to our environment configuration in [environment.yml].

Pre-execution checklist

  • First, refer to the official [DUTrack repo] to ensure that the testing process runs correctly.
  • Next, move demo3.py to the root directory of DUTrack and make sure the two functions inside it run correctly. These functions are used to compute rewards during reinforcement learning (they are called in ReasoningTrack/Reinforcement Learning/examples/reward_function/track.py, line 107); a sketch of this kind of reward is shown below.
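
For orientation only, here is a hypothetical sketch of the kind of box-overlap reward such functions typically compute; the actual reward logic is the one defined in demo3.py and examples/reward_function/track.py:

    # Hypothetical sketch of a box-IoU style tracking reward, shown only to
    # illustrate the kind of quantity such reward functions compute.
    def box_iou(pred, gt):
        """IoU between two boxes given as (x, y, w, h)."""
        px1, py1, pw, ph = pred
        gx1, gy1, gw, gh = gt
        px2, py2 = px1 + pw, py1 + ph
        gx2, gy2 = gx1 + gw, gy1 + gh
        ix1, iy1 = max(px1, gx1), max(py1, gy1)
        ix2, iy2 = min(px2, gx2), min(py2, gy2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = pw * ph + gw * gh - inter
        return inter / union if union > 0 else 0.0

    def tracking_reward(pred_boxes, gt_boxes):
        """Average IoU over a sequence: higher means the generated description
        led the tracker to better localizations."""
        ious = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
        return sum(ious) / len(ious) if ious else 0.0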

GRPO Training

bash run_full.sh

Merge Checkpoint

python3 scripts/model_merger.py --local_dir your_checkpoint_path/global_step_1/actor

[TIP] If you encounter issues with connecting to Hugging Face, consider using export HF_ENDPOINT=https://hf-mirror.com.

Evaluation

Use vllm for inference

You can download our two-stage trained models from Baidu Cloud: [SFT_stage (word:434y)] [RL_stage (word:xt79)].

bash startvllm.sh

Change the parameters in startvllm.sh to your own paths.

After starting the vllm service, you can run ReasoningTrack/Evaluation/i2d.py to test whether the service is functioning correctly (Note: change the input path for the images in i2d.py to match your local setup; test images are provided in Evaluation/sample001). A minimal sketch of issuing such a request by hand is shown after the example output below.

The output should be like:

descript: a ship
tag: ['no']
think: ['Okay, let\'s see. The initial text is "a ship". I need to check if this needs updating based on the two frames provided.\n\nLooking at Frame 1: There\'s a red ship with a green flag sailing on water. The background has cliffs and some structures in the distance. The sun is shining, creating a bright and open feel. The ship is moving towards the right side of the frame.\n\nNow, Frame 2: The same ship is still present, but the camera angle seems slightly different. Maybe it\'s zoomed in or shifted a bit. The position relative to the background elements like the cliffs and structures has changed slightly. The lighting might be a bit different due to the angle, but the overall scene is similar. The ship\'s orientation appears to have adjusted slightly, perhaps leaning forward or turning.\n\nComparing both frames, the main subject (the ship) hasn\'t changed much in terms of appearance—still red with green sails. The background elements are consistent, just in different positions. Since the core object (the ship) remains the same, the description doesn\'t need an update. The changes are minor positional adjustments and possible lighting differences, which don\'t fundamentally alter the object\'s identity or characteristics.']
ans: ['a ship']
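
For reference, the sketch below shows how such a request could be issued by hand, assuming startvllm.sh exposes an OpenAI-compatible endpoint at http://localhost:8000/v1; the model name, prompt text, and frame file names are placeholders, and i2d.py contains the actual request logic:

    # Minimal sketch of querying a vllm OpenAI-compatible endpoint with two frames.
    # Model name, port, prompt, and image paths are placeholders.
    import base64
    from openai import OpenAI

    def to_data_url(path: str) -> str:
        with open(path, "rb") as f:
            return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="ReasoningTrack",  # placeholder: use the model name served by startvllm.sh
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": 'The initial text is "a ship". Decide whether it needs updating '
                         "based on the two frames, and give your reasoning and answer."},
                {"type": "image_url", "image_url": {"url": to_data_url("Evaluation/sample001/frame1.jpg")}},
                {"type": "image_url", "image_url": {"url": to_data_url("Evaluation/sample001/frame2.jpg")}},
            ],
        }],
    )
    print(response.choices[0].message.content)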

Tutorial for Embedding Large Language Models in the Tracking Process

  1. First, modify the i2d.py file under DUTrack/lib/models/dutrack/i2d.py to support text generation with Qwen2.5-VL. (The modified code can be found in ReasoningTrack/Evaluation/i2d.py; adapt it to your vllm service.)
  2. Then, modify dutrack.py under DUTrack/lib/test/tracker/dutrack.py to call the descriptgenRefiner function. (The key point is obtaining the correct image path during tracking; you may refer to our implementation in ReasoningTrack/Evaluation/dutrack.py.) A sketch of where this call might sit is shown after this list.
  3. If you wish to apply it to other tracking frameworks, the modifications are similar.
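
As a rough illustration of step 2, the sketch below shows where a description-refinement call could sit in a tracker's per-frame loop. The descriptgenRefiner name comes from this repository, but the class, signatures, and refresh schedule here are assumptions for illustration, not the actual DUTrack integration:

    # Hypothetical sketch of hooking a description refiner into a tracker loop.
    # See ReasoningTrack/Evaluation/dutrack.py for the actual integration.
    class LanguageRefreshingTracker:
        def __init__(self, tracker, refiner, refresh_interval: int = 30):
            self.tracker = tracker            # underlying vision-language tracker (e.g., DUTrack)
            self.refiner = refiner            # callable wrapping the vllm service, e.g. descriptgenRefiner
            self.refresh_interval = refresh_interval
            self.description = None

        def initialize(self, frame_path: str, init_box, init_text: str):
            self.description = init_text
            self.first_frame_path = frame_path
            self.tracker.initialize(frame_path, init_box, init_text)

        def track(self, frame_idx: int, frame_path: str):
            # Periodically ask the language model whether the description should change.
            if frame_idx % self.refresh_interval == 0:
                self.description = self.refiner(self.first_frame_path, frame_path, self.description)
                self.tracker.update_language(self.description)   # assumed tracker hook
            return self.tracker.track(frame_path)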

Some testing examples

  • For evaluation on the TNLLT and TNL2k datasets, refer to the official implementations: [TNLLT] [TNL2k].
  • GOT10k-test
    python tracking/test.py dutrack dutrack_256_full --dataset got10k_test --threads 16 --num_gpus 2
    python lib/test/utils/transform_got10k.py --tracker_name dutrack --cfg_name dutrack_256_full
  • OTB99
    python tracking/test.py dutrack dutrack_256_full --dataset otb_lang --threads 1 --num_gpus 1
    python tracking/analysis_results.py # need to modify tracker configs and names

Acknowledgement
