Description
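I have run this training several times and it fails the same way every time: the process crashes abruptly partway through.
Error during training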
{'loss': 0.66634569, 'token_acc': 0.74117086, 'grad_norm': 2.37827325, 'learning_rate': 4.992e-05, 'memory(GiB)': 93.91, 'train_speed(iter/s)': 0.137518, 'epoch': 0.07, 'global_step/max_steps': '270/3659', 'percentage': '7.38%', 'elapsed_time': '32m 43s', 'remaining_time': '6h 50m 39s'}
Train: 12%|█▏ | 435/3659 [52:14<6:55:44, 7.74s/it]/root/miniconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
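The log above ends with only a resource_tracker warning and no Python traceback, so it does not show which Python call was running when the process died. One way to capture that is the standard-library `faulthandler` module. The following is a minimal sketch, assuming it can be executed early in the training process; the helper name `enable_crash_traces` and the file name `faulthandler.log` are hypothetical. Setting the environment variable `PYTHONFAULTHANDLER=1` before launching is an equivalent option that needs no code change and writes the traceback to stderr.

```python
import faulthandler


def enable_crash_traces(log_path="faulthandler.log"):
    """Dump the Python tracebacks of all threads to log_path on a fatal signal."""
    # faulthandler needs a real file descriptor, so open a regular file.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this reference so the file stays open until a crash.
    return log_file


# Hypothetical usage: call once at process start, before training begins.
trace_file = enable_crash_traces()
```

If the process then segfaults, the file should contain the Python-level stack at the moment of the crash, which may help narrow down which operation ends up in libc10_cuda.so.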
The dmesg log shows the segfaults below. I have retried the training several times and changed the PyTorch version, but the problem persists.
[Mon Jun 23 19:39:31 2025] python3.10[3078129]: segfault at 60 ip 00007f24b8cf2616 sp 00007ffc0f212680 error 4 in libc10_cuda.so[7f24b8cc8000+5c000] likely on CPU 15 (core 8, socket 1)
[Mon Jun 23 21:56:10 2025] python3.10[3079645]: segfault at b0 ip 00007f6172185616 sp 00007fffe5199880 error 4 in libc10_cuda.so[7f617215b000+5c000] likely on CPU 15 (core 8, socket 1)
[Mon Jun 23 23:09:57 2025] python3.10[3083743]: segfault at 60 ip 00007f47f1fc3616 sp 00007ffee908ea40 error 4 in libc10_cuda.so[7f47f1f99000+5c000] likely on CPU 10 (core 0, socket 1)
[Mon Jun 23 23:51:43 2025] python3.10[3089239]: segfault at 60 ip 00007f23b7fd9616 sp 00007fffb403ec60 error 4 in libc10_cuda.so[7f23b7faf000+5c000] likely on CPU 39 (core 12, socket 1)
[Tue Jun 24 04:37:33 2025] python3.10[3099720]: segfault at 60 ip 00007f9383b8f24f sp 00007ffe32ea83f0 error 4 in libc10_cuda.so[7f9383b63000+4f000] likely on CPU 18 (core 11, socket 1)
[Tue Jun 24 13:17:54 2025] python3.10[3108834]: segfault at 60 ip 00007fb47cb4924f sp 00007ffd0ccc1020 error 4 in libc10_cuda.so[7fb47cb1d000+4f000] likely on CPU 31 (core 1, socket 1)
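To check whether these crashes hit the same place, the instruction pointer can be compared with the start of the libc10_cuda.so mapping printed in brackets. The following is a minimal sketch using only the standard library; the function name `crash_offsets` is hypothetical, and the offset is relative to the mapping reported by the kernel, which is not necessarily the file offset inside the library. Applied to the six lines above, the first four give offset 0x2a616 (mapping size 0x5c000) and the last two give 0x2c24f (mapping size 0x4f000), which suggests two different builds each crashing at one consistent location.

```python
import re

# Matches kernel segfault lines such as:
#   segfault at 60 ip 00007f... sp 00007f... error 4 in lib.so[base+size]
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) .*? in "
    r"(?P<lib>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def crash_offsets(dmesg_text):
    """Return (library, mapping size, ip - mapping start) per segfault line."""
    results = []
    for match in SEGFAULT_RE.finditer(dmesg_text):
        ip = int(match["ip"], 16)
        base = int(match["base"], 16)
        size = int(match["size"], 16)
        results.append((match["lib"], hex(size), hex(ip - base)))
    return results


# Two of the lines from the dmesg output above, one per mapping size.
sample = """\
[Mon Jun 23 19:39:31 2025] python3.10[3078129]: segfault at 60 ip 00007f24b8cf2616 sp 00007ffc0f212680 error 4 in libc10_cuda.so[7f24b8cc8000+5c000] likely on CPU 15 (core 8, socket 1)
[Tue Jun 24 04:37:33 2025] python3.10[3099720]: segfault at 60 ip 00007f9383b8f24f sp 00007ffe32ea83f0 error 4 in libc10_cuda.so[7f9383b63000+4f000] likely on CPU 18 (core 11, socket 1)
"""

for lib, size, offset in crash_offsets(sample):
    print(lib, size, offset)
# libc10_cuda.so 0x5c000 0x2a616
# libc10_cuda.so 0x4f000 0x2c24f
```

The faulting addresses (0x60 and 0xb0) are small, and error 4 denotes a user-mode read of an unmapped page, so this looks like a read through a null or near-null pointer inside the library.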
System environment
NVIDIA-SMI 570.124.04 Driver Version: 570.124.04 CUDA Version: 12.8
python3.10
transformers==4.51.3
torch==2.8.0.dev20250623+cu128
The machine has 3 GPUs.
Launch command
export TOKENIZERS_PARALLELISM=false
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
export OMP_NUM_THREADS=1
CUDA_VISIBLE_DEVICES=0,1 nohup bash -c "MAX_PIXELS=501760 swift sft \
--model /home/data/swift/Qwen/Qwen2.5-VL-7B-Instruct \
--custom_dataset_info /home/data/swift/qwen_train/dataset_info.json \
--dataset price_grounding \
--train_type lora \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 6 \
--learning_rate 5e-5 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--freeze_vit true \
--gradient_accumulation_steps 8 \
--eval_steps 500 \
--save_steps 500 \
--save_total_limit 2 \
--logging_steps 10 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 0 \
--dataset_num_proc 1 \
--bf16 true \
--gradient_checkpointing true" > training.log 2>&1 &