ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking, Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang, arXiv:2508.05221 [arXiv] [Code]
Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language description with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking; however, these works fail to provide insight into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, built on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and GRPO-based reinforcement learning are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features, then adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, building a solid foundation for the vision-language tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy.
Currently, the dataset can be downloaded from BaiduYun:
- Baiduyun Drive:
Full Dataset:
URL: https://pan.baidu.com/s/1Bsr3PENWaa9k_yCNpkUh_Q?pwd=1d6b
Code: 1d6b
Example Sequence:
URL: https://pan.baidu.com/s/1onjHTESlh-V1vgR2AYgnLw?pwd=ed76
Code: ed76
- Dropbox:
To be released
After downloading all parts, restore and extract the archive:
1. cat TNLLT_part_* > TNLLT_restored.tar.gz
2. gunzip TNLLT_restored.tar.gz
3. md5sum -c TNLLT.tar.gz.md5 (optional)
4. tar -xvf TNLLT_restored.tar
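If md5sum is not available (e.g. on Windows), the optional checksum in step 3 can be reproduced with a small Python sketch; this assumes the .md5 file uses the standard "checksum  filename" format written by md5sum:

```python
# Optional alternative to step 3: verify the restored archive against the provided
# .md5 file (assumes the standard "checksum  filename" format written by md5sum).
import hashlib
import sys

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    md5_file = sys.argv[1] if len(sys.argv) > 1 else "TNLLT.tar.gz.md5"
    with open(md5_file) as f:
        expected, filename = f.read().split(maxsplit=1)
    filename = filename.strip().lstrip("*")  # md5sum may prefix binary-mode names with '*'
    actual = md5_of(filename)
    print("OK" if actual == expected else f"MISMATCH: {actual} != {expected}")
```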
- Clone this GitHub repository:
git clone https://github.com/Event-AHU/Open_VLTrack.git
- Download the annotations from [Annos (word:bsf7)]:
unzip annos.zip and put it into Open_VLTrack/TNLLT_Evaluation_Toolkit/annos
Note: If decompression produces a nested 'annos' folder, remove it.
- Download the benchmark results from [Benchmark-Results (word:s48i)]:
unzip tracking_results.zip and put it into Open_VLTrack/TNLLT_Evaluation_Toolkit/tracking_results
Note: If decompression produces a nested 'tracking_results' folder, remove it. (A helper sketch for these unzip-and-flatten steps is given after this setup list.)
- Open Matlab and run the script:
run_tracker_performance_evaluation.m
Note: In the file run_tracker_performance_evaluation.m, you can:
- Change the flag (line 25) for precision (1), normalized precision (2), or success rate (3).
- Uncomment lines 167-194 of run_tracker_performance_evaluation.m for the per-attribute performance plot.
- In the file utils/plot_draw_save.m, you can change the color and line style of the plot.
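As a convenience for the annos.zip and tracking_results.zip steps above, the sketch below unzips an archive and flattens a nested folder if one is produced; the destination paths are assumptions and should be adjusted to where you cloned the toolkit:

```python
# Helper for the unzip steps above: extracts an archive into the toolkit and
# flattens a nested folder (e.g. annos/annos -> annos) if decompression creates one.
# Destination paths are assumptions; adjust them to your local clone.
import shutil
import zipfile
from pathlib import Path

def unzip_and_flatten(zip_path: str, dest_dir: str) -> None:
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # If the archive contained a top-level folder with the same name as the
    # destination, lift its contents one level up and remove the empty folder.
    nested = dest / dest.name
    if nested.is_dir():
        for item in nested.iterdir():
            shutil.move(str(item), str(dest / item.name))
        nested.rmdir()

if __name__ == "__main__":
    root = "Open_VLTrack/TNLLT_Evaluation_Toolkit"
    unzip_and_flatten("annos.zip", f"{root}/annos")
    unzip_and_flatten("tracking_results.zip", f"{root}/tracking_results")
```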
- Clone the repository and install the dependencies:
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
- Prepare the SFT dataset:
SFT Dataset:
URL: https://pan.baidu.com/s/126Gn8R629OC1UVstSIkQDA?pwd=6arr
Code: 6arr
- Download the dataset and unzip it into the /your_root_path path.
- Put [ReasoningData] into LLaMA_Factory/data.
- Use [script] to transform the image paths in [ReasoningData] (a minimal illustrative sketch is given after this list).
- The reference configuration used during training is given in [training_args].
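The repository's [script] is the authoritative way to do this; purely for reference, the following is a minimal sketch of such a path rewrite, assuming [ReasoningData] is a JSON file whose samples carry an "images" field of relative paths (the field name and file names here are assumptions, not the actual data schema):

```python
# Illustrative sketch of rewriting image paths in a ReasoningData-style JSON file.
# The "images" field and the file names are assumptions; prefer the official [script].
import json
from pathlib import Path

def rewrite_image_paths(json_in: str, json_out: str, new_root: str) -> None:
    """Prefix every relative image path with the directory the dataset was unzipped to."""
    data = json.loads(Path(json_in).read_text(encoding="utf-8"))
    for sample in data:
        sample["images"] = [str(Path(new_root) / p) for p in sample.get("images", [])]
    Path(json_out).write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")

if __name__ == "__main__":
    rewrite_image_paths("LLaMA_Factory/data/ReasoningData.json",
                        "LLaMA_Factory/data/ReasoningData.json",
                        "/your_root_path")
```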
- Please refer to the official [EasyR1 repo] for RL env configuration guidelines.
- Then refer to the official [DUTrack repo] for tracking environment configuration guidelines.
- You may also refer to our environment configuration in [environment.yml].
- First, refer to the official [DUTrack repo] to ensure that the testing process runs correctly.
- Next, move demo3.py to the root directory of DUTrack and make sure the two functions within it run correctly; these functions are used to calculate rewards in reinforcement learning (see ReasoningTrack/Reinforcement Learning/examples/reward_function/track.py, line 107).
- Launch the training run:
bash run_full.sh
- Merge the trained checkpoint:
python3 scripts/model_merger.py --local_dir your_checkpoint_path/global_step_1/actor
[TIP] If you encounter issues connecting to Hugging Face, consider using:
export HF_ENDPOINT=https://hf-mirror.com
You can download our two-stage trained models from Baidu Cloud: [SFT_stage (word:434y)] [RL_stage (word:xt79)]
Change the parameters in startvllm.sh to your own paths, then start the service:
bash startvllm.sh
After starting the vLLM service, you can run ReasoningTrack/Evaluation/i2d.py to test whether the service is functioning correctly (Note: change the input image paths in i2d.py to match your setup; test images are provided in Evaluation/sample001).
The output should be like:
descript: a ship
tag: ['no']
think: ['Okay, let\'s see. The initial text is "a ship". I need to check if this needs updating based on the two frames provided.\n\nLooking at Frame 1: There\'s a red ship with a green flag sailing on water. The background has cliffs and some structures in the distance. The sun is shining, creating a bright and open feel. The ship is moving towards the right side of the frame.\n\nNow, Frame 2: The same ship is still present, but the camera angle seems slightly different. Maybe it\'s zoomed in or shifted a bit. The position relative to the background elements like the cliffs and structures has changed slightly. The lighting might be a bit different due to the angle, but the overall scene is similar. The ship\'s orientation appears to have adjusted slightly, perhaps leaning forward or turning.\n\nComparing both frames, the main subject (the ship) hasn\'t changed much in terms of appearance—still red with green sails. The background elements are consistent, just in different positions. Since the core object (the ship) remains the same, the description doesn\'t need an update. The changes are minor positional adjustments and possible lighting differences, which don\'t fundamentally alter the object\'s identity or characteristics.']
ans: ['a ship']
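i2d.py is the reference for this check; purely for orientation, a minimal standalone request against the service might look like the sketch below, assuming startvllm.sh exposes vLLM's OpenAI-compatible endpoint on localhost:8000 (the port, served model name, image path, and prompt are placeholders):

```python
# Minimal standalone check of the vLLM service. Assumes the OpenAI-compatible server
# is reachable at localhost:8000; the model name, image path, and prompt are
# placeholders. i2d.py remains the reference implementation.
import base64
import requests

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("Evaluation/sample001/frame_0001.jpg")  # any test frame
payload = {
    "model": "Qwen2.5-VL",  # must match the model name served by startvllm.sh
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the target object in this frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```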
- Firstly, we modify the i2d.py file under DUTrack/lib/models/dutrack/i2d.py to support text generation with Qwen2.5-VL. (The modified code can be found under ReasoningTrack/Evaluation/i2d.py; adapt it to your vLLM service.)
- Then, we modify dutrack.py under DUTrack/lib/test/tracker/dutrack.py to call the descriptgenRefiner function. (The key point of the modification lies in obtaining the correct image path during the tracking process; you may refer to our implementation in ReasoningTrack/Evaluation/dutrack.py.)
- If you wish to apply it to other tracking frameworks, the modifications are similar; a conceptual sketch follows this list.
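ReasoningTrack/Evaluation/dutrack.py is the working reference; conceptually, the modification amounts to the pattern below, where the tracker resolves the on-disk path of the current frame every few frames and asks descriptgenRefiner for an updated description (everything except the descriptgenRefiner name is illustrative, not DUTrack's actual API, and the update interval is a placeholder):

```python
# Conceptual sketch of wiring descriptgenRefiner into a tracking loop. Everything
# except the descriptgenRefiner name is illustrative, not DUTrack's actual API;
# the working modification lives in ReasoningTrack/Evaluation/dutrack.py.
import os

def descriptgenRefiner(init_text, first_frame_path, current_frame_path):
    """Placeholder for the real refiner in Evaluation/i2d.py, which queries the
    vLLM service and returns an updated target description."""
    return init_text

def run_sequence(frame_dir, init_text, update_interval=30):  # interval is a placeholder
    frame_paths = sorted(
        os.path.join(frame_dir, f)
        for f in os.listdir(frame_dir)
        if f.lower().endswith((".jpg", ".png"))
    )
    text = init_text
    for idx, frame_path in enumerate(frame_paths):
        # Key point: during tracking we must know the on-disk path of the current
        # frame so that it can be passed to the description refiner.
        if idx > 0 and idx % update_interval == 0:
            text = descriptgenRefiner(init_text, frame_paths[0], frame_path)
        # ... run the tracker on this frame with the (possibly updated) text ...
    return text
```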
- For evaluation on the TNLLT and TNL2k datasets, refer to the official implementations: [TNLLT] [TNL2k].
- GOT10k-test
python tracking/test.py dutrack dutrack_256_full --dataset got10k_test --threads 16 --num_gpus 2
python lib/test/utils/transform_got10k.py --tracker_name dutrack --cfg_name dutrack_256_full
- OTB99
python tracking/test.py dutrack dutrack_256_full --dataset otb_lang --threads 1 --num_gpus 1
python tracking/analysis_results.py # need to modify tracker configs and names
- This evaluation_toolkit code is modified from the evaluation toolkit of [LaSOT].
- This work is built upon the [LLaMA-Factory] and [EasyR1].
- This work utilizes models from [Qwen2.5-VL] and [DUTrack].