ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking, Xiao Wang, Liye Jin, Xufeng Lou, Shiao Wang, Lan Chen, Bo Jiang, Zhipeng Zhang, arXiv:2508.05221 [arXiv] [Code]
Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language description with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to variations in the target during tracking; however, these works fail to provide insight into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address these issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, built on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and GRPO-based reinforcement learning are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features, then adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, building a solid foundation for the vision-language tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets fully validate the effectiveness of our proposed reasoning-based natural language generation strategy.
Currently, the dataset can be downloaded from BaiduYun:
- Baiduyun Drive:
Full Dataset:
URL: https://pan.baidu.com/s/1Bsr3PENWaa9k_yCNpkUh_Q?pwd=1d6b
Code: 1d6b
Example Sequence:
URL: https://pan.baidu.com/s/1onjHTESlh-V1vgR2AYgnLw?pwd=ed76
Code: ed76
- Dropbox:
To be released
After downloading all parts, restore and extract the archive:
1. cat TNLLT_part_* > TNLLT_restored.tar.gz
2. gunzip TNLLT_restored.tar.gz
3. md5sum -c TNLLT.tar.gz.md5 (optional)
4. tar -xvf TNLLT_restored.tar
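If md5sum is not available (e.g. on Windows), the optional checksum in step 3 can be reproduced with a small Python sketch; this assumes the .md5 file uses the standard "checksum  filename" format written by md5sum:

```python
# Optional alternative to step 3: verify the restored archive against the provided
# .md5 file (assumes the standard "checksum  filename" format written by md5sum).
import hashlib
import sys

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    md5_file = sys.argv[1] if len(sys.argv) > 1 else "TNLLT.tar.gz.md5"
    with open(md5_file) as f:
        expected, filename = f.read().split(maxsplit=1)
    filename = filename.strip().lstrip("*")  # md5sum may prefix binary-mode names with '*'
    actual = md5_of(filename)
    print("OK" if actual == expected else f"MISMATCH: {actual} != {expected}")
```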
- Clone this GitHub repository:
git clone https://github.com/Event-AHU/Open_VLTrack.git
- Download the annotations from [Annos (word:bsf7)]:
unzip annos.zip and put it into Open_VLTrack/TNLLT_Evaluation_Toolkit/annos
Note: If decompression produces a nested 'annos' folder, remove it.
- Download the benchmark results from [Benchmark-Results (word:s48i)]:
unzip tracking_results.zip and put it into Open_VLTrack/TNLLT_Evaluation_Toolkit/tracking_results
Note: If decompression produces a nested 'tracking_results' folder, remove it. (A helper sketch for these unzip-and-flatten steps is given after this setup list.)
- Open Matlab and run the script:
run_tracker_performance_evaluation.m
Note: In the file run_tracker_performance_evaluation.m, you can:
- Change the flag (line 25) for precision (1), normalized precision (2), or success rate (3).
- Uncomment lines 167-194 of run_tracker_performance_evaluation.m for the per-attribute performance plot.
- In the file utils/plot_draw_save.m, you can change the color and line style of the plot.
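As a convenience for the annos.zip and tracking_results.zip steps above, the sketch below unzips an archive and flattens a nested folder if one is produced; the destination paths are assumptions and should be adjusted to where you cloned the toolkit:

```python
# Helper for the unzip steps above: extracts an archive into the toolkit and
# flattens a nested folder (e.g. annos/annos -> annos) if decompression creates one.
# Destination paths are assumptions; adjust them to your local clone.
import shutil
import zipfile
from pathlib import Path

def unzip_and_flatten(zip_path: str, dest_dir: str) -> None:
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # If the archive contained a top-level folder with the same name as the
    # destination, lift its contents one level up and remove the empty folder.
    nested = dest / dest.name
    if nested.is_dir():
        for item in nested.iterdir():
            shutil.move(str(item), str(dest / item.name))
        nested.rmdir()

if __name__ == "__main__":
    root = "Open_VLTrack/TNLLT_Evaluation_Toolkit"
    unzip_and_flatten("annos.zip", f"{root}/annos")
    unzip_and_flatten("tracking_results.zip", f"{root}/tracking_results")
```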
- Clone the repository and install the dependencies:
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
- Prepare the SFT dataset:
SFT Dataset:
URL: https://pan.baidu.com/s/126Gn8R629OC1UVstSIkQDA?pwd=6arr
Code: 6arr
- Download the dataset and unzip it into the /your_root_path path.
- Put [ReasoningData] into LLaMA_Factory/data.
- Use [script] to transform the image paths in [ReasoningData] (a minimal illustrative sketch is given after this list).
- The reference configuration used during training is given in [training_args].
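The repository's [script] is the authoritative way to do this; purely for reference, the following is a minimal sketch of such a path rewrite, assuming [ReasoningData] is a JSON file whose samples carry an "images" field of relative paths (the field name and file names here are assumptions, not the actual data schema):

```python
# Illustrative sketch of rewriting image paths in a ReasoningData-style JSON file.
# The "images" field and the file names are assumptions; prefer the official [script].
import json
from pathlib import Path

def rewrite_image_paths(json_in: str, json_out: str, new_root: str) -> None:
    """Prefix every relative image path with the directory the dataset was unzipped to."""
    data = json.loads(Path(json_in).read_text(encoding="utf-8"))
    for sample in data:
        sample["images"] = [str(Path(new_root) / p) for p in sample.get("images", [])]
    Path(json_out).write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")

if __name__ == "__main__":
    rewrite_image_paths("LLaMA_Factory/data/ReasoningData.json",
                        "LLaMA_Factory/data/ReasoningData.json",
                        "/your_root_path")
```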
- Please refer to the official [EasyR1 repo] for RL env configuration guidelines.
- Then refer to the official [DUTrack repo] for tracking environment configuration guidelines.
- You may also refer to our environment configuration in [environment.yml].
- First, refer to the official [DUTrack repo] to ensure that the testing process runs correctly.
- Next, move demo3.py to the root directory of DUTrack and make sure the two functions within it run correctly; these functions are used to calculate rewards in reinforcement learning (see ReasoningTrack/Reinforcement Learning/examples/reward_function/track.py, line 107).
- Launch the training run:
bash run_full.sh
- Merge the trained checkpoint:
python3 scripts/model_merger.py --local_dir your_checkpoint_path/global_step_1/actor
[TIP] If you encounter issues connecting to Hugging Face, consider using:
export HF_ENDPOINT=https://hf-mirror.com
You can download our two-stage trained models from Baidu Cloud: [SFT_stage (word:434y)] [RL_stage (word:xt79)]
Change the parameters in startvllm.sh to your own paths, then start the service:
bash startvllm.sh
After starting the vLLM service, you can run ReasoningTrack/Evaluation/i2d.py to test whether the service is functioning correctly (Note: change the input image paths in i2d.py to match your setup; test images are provided in Evaluation/sample001).
The output should be like:
descript: a ship
tag: ['no']
think: ['Okay, let\'s see. The initial text is "a ship". I need to check if this needs updating based on the two frames provided.\n\nLooking at Frame 1: There\'s a red ship with a green flag sailing on water. The background has cliffs and some structures in the distance. The sun is shining, creating a bright and open feel. The ship is moving towards the right side of the frame.\n\nNow, Frame 2: The same ship is still present, but the camera angle seems slightly different. Maybe it\'s zoomed in or shifted a bit. The position relative to the background elements like the cliffs and structures has changed slightly. The lighting might be a bit different due to the angle, but the overall scene is similar. The ship\'s orientation appears to have adjusted slightly, perhaps leaning forward or turning.\n\nComparing both frames, the main subject (the ship) hasn\'t changed much in terms of appearance—still red with green sails. The background elements are consistent, just in different positions. Since the core object (the ship) remains the same, the description doesn\'t need an update. The changes are minor positional adjustments and possible lighting differences, which don\'t fundamentally alter the object\'s identity or characteristics.']
ans: ['a ship']
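i2d.py is the reference for this check; purely for orientation, a minimal standalone request against the service might look like the sketch below, assuming startvllm.sh exposes vLLM's OpenAI-compatible endpoint on localhost:8000 (the port, served model name, image path, and prompt are placeholders):

```python
# Minimal standalone check of the vLLM service. Assumes the OpenAI-compatible server
# is reachable at localhost:8000; the model name, image path, and prompt are
# placeholders. i2d.py remains the reference implementation.
import base64
import requests

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("Evaluation/sample001/frame_0001.jpg")  # any test frame
payload = {
    "model": "Qwen2.5-VL",  # must match the model name served by startvllm.sh
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the target object in this frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```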
- Firstly, we modify the i2d.py file under DUTrack/lib/models/dutrack/i2d.py to support text generation with Qwen2.5-VL. (The modified code can be found under ReasoningTrack/Evaluation/i2d.py; adapt it to your vLLM service.)
- Then, we modify dutrack.py under DUTrack/lib/test/tracker/dutrack.py to call the descriptgenRefiner function. (The key point of the modification lies in obtaining the correct image path during the tracking process; you may refer to our implementation in ReasoningTrack/Evaluation/dutrack.py.)
- If you wish to apply it to other tracking frameworks, the modifications are similar; a conceptual sketch follows this list.
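ReasoningTrack/Evaluation/dutrack.py is the working reference; conceptually, the modification amounts to the pattern below, where the tracker resolves the on-disk path of the current frame every few frames and asks descriptgenRefiner for an updated description (everything except the descriptgenRefiner name is illustrative, not DUTrack's actual API, and the update interval is a placeholder):

```python
# Conceptual sketch of wiring descriptgenRefiner into a tracking loop. Everything
# except the descriptgenRefiner name is illustrative, not DUTrack's actual API;
# the working modification lives in ReasoningTrack/Evaluation/dutrack.py.
import os

def descriptgenRefiner(init_text, first_frame_path, current_frame_path):
    """Placeholder for the real refiner in Evaluation/i2d.py, which queries the
    vLLM service and returns an updated target description."""
    return init_text

def run_sequence(frame_dir, init_text, update_interval=30):  # interval is a placeholder
    frame_paths = sorted(
        os.path.join(frame_dir, f)
        for f in os.listdir(frame_dir)
        if f.lower().endswith((".jpg", ".png"))
    )
    text = init_text
    for idx, frame_path in enumerate(frame_paths):
        # Key point: during tracking we must know the on-disk path of the current
        # frame so that it can be passed to the description refiner.
        if idx > 0 and idx % update_interval == 0:
            text = descriptgenRefiner(init_text, frame_paths[0], frame_path)
        # ... run the tracker on this frame with the (possibly updated) text ...
    return text
```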
- For evaluation on the TNLLT and TNL2k datasets, refer to the official implementations: [TNLLT] [TNL2k].
- GOT10k-test
python tracking/test.py dutrack dutrack_256_full --dataset got10k_test --threads 16 --num_gpus 2
python lib/test/utils/transform_got10k.py --tracker_name dutrack --cfg_name dutrack_256_full
- OTB99
python tracking/test.py dutrack dutrack_256_full --dataset otb_lang --threads 1 --num_gpus 1
python tracking/analysis_results.py # need to modify tracker configs and names
- This evaluation_toolkit code is modified from the evaluation toolkit of [LaSOT].
- This work is built upon the [LLaMA-Factory] and [EasyR1].
- This work utilizes models from [Qwen2.5-VL] and [DUTrack].