
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown #4689

@mn8821236

Description


I have run this training several times, and the same thing happens every time; the error can appear suddenly.

The error appeared mid-training:
{'loss': 0.66634569, 'token_acc': 0.74117086, 'grad_norm': 2.37827325, 'learning_rate': 4.992e-05, 'memory(GiB)': 93.91, 'train_speed(iter/s)': 0.137518, 'epoch': 0.07, 'global_step/max_steps': '270/3659', 'percentage': '7.38%', 'elapsed_time': '32m 43s', 'remaining_time': '6h 50m 39s'}
Train: 12%|█▏ | 435/3659 [52:14<6:55:44, 7.74s/it]/root/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
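For context, the warning above comes from Python's `multiprocessing.resource_tracker`: it fires at interpreter shutdown when a named POSIX semaphore (e.g. the one backing a `multiprocessing.Lock`) is still registered, typically because a process died abruptly without releasing it. A minimal, hypothetical sketch (not the training code) showing a worker killed by SIGSEGV, mirroring the dmesg segfaults below:

```python
import multiprocessing as mp
import os
import signal


def crashing_worker():
    # Simulate the abrupt death seen in the dmesg segfault reports.
    os.kill(os.getpid(), signal.SIGSEGV)


def run_worker():
    # mp.Lock() is backed by a named POSIX semaphore, which the
    # resource_tracker process registers; if the owning process exits
    # without cleanup, the "leaked semaphore" warning is printed.
    lock = mp.Lock()
    p = mp.Process(target=crashing_worker)
    p.start()
    p.join()
    return p.exitcode  # negative value = killed by that signal number


if __name__ == "__main__":
    print(run_worker())
```

So the semaphore warning itself is usually a symptom of a crashed process, not the root cause.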

Checking the dmesg log shows the error below. I have tried training several times and changed the PyTorch version, but the problem still occurs.
[Mon Jun 23 19:39:31 2025] python3.10[3078129]: segfault at 60 ip 00007f24b8cf2616 sp 00007ffc0f212680 error 4 in libc10_cuda.so[7f24b8cc8000+5c000] likely on CPU 15 (core 8, socket 1)
[Mon Jun 23 21:56:10 2025] python3.10[3079645]: segfault at b0 ip 00007f6172185616 sp 00007fffe5199880 error 4 in libc10_cuda.so[7f617215b000+5c000] likely on CPU 15 (core 8, socket 1)
[Mon Jun 23 23:09:57 2025] python3.10[3083743]: segfault at 60 ip 00007f47f1fc3616 sp 00007ffee908ea40 error 4 in libc10_cuda.so[7f47f1f99000+5c000] likely on CPU 10 (core 0, socket 1)
[Mon Jun 23 23:51:43 2025] python3.10[3089239]: segfault at 60 ip 00007f23b7fd9616 sp 00007fffb403ec60 error 4 in libc10_cuda.so[7f23b7faf000+5c000] likely on CPU 39 (core 12, socket 1)
[Tue Jun 24 04:37:33 2025] python3.10[3099720]: segfault at 60 ip 00007f9383b8f24f sp 00007ffe32ea83f0 error 4 in libc10_cuda.so[7f9383b63000+4f000] likely on CPU 18 (core 11, socket 1)
[Tue Jun 24 13:17:54 2025] python3.10[3108834]: segfault at 60 ip 00007fb47cb4924f sp 00007ffd0ccc1020 error 4 in libc10_cuda.so[7fb47cb1d000+4f000] likely on CPU 31 (core 1, socket 1)
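To turn those dmesg lines into a symbol name, one can compute the faulting offset inside `libc10_cuda.so` (instruction pointer minus mapping base) and feed it to `addr2line`. A sketch using the first dmesg line; the library path in the comment is an assumption based on the conda env in the traceback:

```shell
# dmesg: segfault at 60 ip 00007f24b8cf2616 ... in libc10_cuda.so[7f24b8cc8000+5c000]
# offset into the .so = instruction pointer - mapping base
printf 'offset: 0x%x\n' $(( 0x00007f24b8cf2616 - 0x7f24b8cc8000 ))
# symbolize it (path is an assumption; point it at your torch install):
# addr2line -f -C -e /root/miniconda3/envs/swift/lib/python3.10/site-packages/torch/lib/libc10_cuda.so 0x2a616
```

This would narrow down which c10 CUDA routine is crashing, which is more actionable for maintainers than the raw addresses.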

System environment
NVIDIA-SMI 570.124.04 Driver Version: 570.124.04 CUDA Version: 12.8
python3.10
transformers==4.51.3
torch==2.8.0.dev20250623+cu128
There are 3 GPUs.
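For anyone trying to reproduce, a quick generic check (a sketch, not part of the report) that prints the same environment details the issue lists, so setups can be compared:

```python
import sys


def env_summary():
    # Collect the details this report mentions: Python, torch build, GPU count.
    info = {"python": sys.version.split()[0]}  # reported: 3.10
    try:
        import torch
        info["torch"] = torch.__version__          # reported: 2.8.0.dev20250623+cu128
        info["gpus"] = torch.cuda.device_count()   # reported: 3
    except ImportError:
        info["torch"] = "not installed"
    return info


if __name__ == "__main__":
    print(env_summary())
```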

Launch command
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
export OMP_NUM_THREADS=1

CUDA_VISIBLE_DEVICES=0,1 nohup bash -c "MAX_PIXELS=501760 swift sft \
  --model /home/data/swift/Qwen/Qwen2.5-VL-7B-Instruct \
  --custom_dataset_info /home/data/swift/qwen_train/dataset_info.json \
  --dataset price_grounding \
  --train_type lora \
  --torch_dtype bfloat16 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 6 \
  --learning_rate 5e-5 \
  --lora_rank 8 \
  --lora_alpha 32 \
  --target_modules all-linear \
  --freeze_vit true \
  --gradient_accumulation_steps 8 \
  --eval_steps 500 \
  --save_steps 500 \
  --save_total_limit 2 \
  --logging_steps 10 \
  --max_length 2048 \
  --output_dir output \
  --warmup_ratio 0.05 \
  --dataloader_num_workers 0 \
  --dataset_num_proc 1 \
  --bf16 true \
  --gradient_checkpointing true" > training.log 2>&1 &
