Open
Labels
high priority; oncall: profiler (profiler-related issues: cpu, gpu, kineto); triage review
Description
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Modify trainer.py in Hugging Face Transformers so that the train function is wrapped with torch.profiler.profile and calls my_profiler.step() after each training step.
- Run single-node, multi-GPU training of a Hugging Face Transformers model with DeepSpeed enabled.
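The exact change to trainer.py is not included in this report, so the following is only a minimal sketch of what such a wrapper might look like. The names `train_with_profiler`, `make_profiler`, and `train_step` are hypothetical, and the schedule values are assumptions chosen for illustration.

```python
def train_with_profiler(train_step, num_steps, prof):
    """Run num_steps training steps inside a profiler context.

    prof is expected to behave like torch.profiler.profile: a context
    manager that also exposes a step() method.
    """
    with prof:
        for step in range(num_steps):
            train_step(step)
            # Signal a step boundary to the profiler. This is the call
            # that appears in the traceback of this report.
            prof.step()


def make_profiler():
    """Build a torch.profiler.profile object (schedule values are assumed)."""
    # Imported lazily so the loop above can be read and exercised
    # without PyTorch installed.
    import torch
    from torch.profiler import ProfilerActivity, profile, schedule

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    return profile(
        activities=activities,
        schedule=schedule(wait=1, warmup=1, active=2),
    )
```

With PyTorch installed, a call such as `train_with_profiler(my_step_fn, 8, make_profiler())` would exercise the same `step()` path, where `my_step_fn` stands in for one training iteration.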
Expected behavior
Training should run with the profiler enabled. Instead, it raises the following error after some training steps:
  File "./run_mlm.py", line 898, in <module>
    main()
  File "./run_mlm.py", line 849, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1989, in train
    my_profiler.step()
  File "/opt/conda/lib/python3.8/site-packages/torch/profiler/profiler.py", line 280, in step
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/profiler.py", line 1137, in parse_kineto_results
    if filter_name(kineto_event.name()):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 4: invalid start byte
I'm confused by this error message, since I can't get any details about 'kineto_event' (for example, which event name contains the invalid byte).
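To illustrate what the message means, here is a small standalone sketch that reproduces the same class of error outside the profiler. It does not touch the profiler internals; the byte string `sample` and the helper `describe_decode_failure` are hypothetical and only assume that the event name contained the byte 0xb9 at offset 4, as the error text states.

```python
def describe_decode_failure(raw: bytes) -> str:
    """Describe why raw is not valid UTF-8, or report that it is valid."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # backslashreplace keeps the offending bytes visible as \xNN
        # instead of raising, which helps when inspecting a bad name.
        preview = raw.decode("utf-8", errors="backslashreplace")
        return (
            f"{exc.reason} at byte {exc.start} "
            f"(0x{raw[exc.start]:02x}); preview: {preview!r}"
        )
    return "valid UTF-8"


# Hypothetical event name: four ASCII bytes followed by 0xb9.
# 0xb9 is a UTF-8 continuation byte, so it cannot start a character.
sample = b"aten\xb9"

if __name__ == "__main__":
    print(describe_decode_failure(sample))
```

Decoding `sample` strictly raises the same "invalid start byte" error at position 4, which suggests that the string returned for the event name contained bytes that are not valid UTF-8 text.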
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
- PyTorch version: 1.10.0a0+ecc3718
- Is debug build: False
- CUDA used to build PyTorch: 11.4
- ROCM used to build PyTorch: N/A
- OS: Ubuntu 20.04.2 LTS (x86_64)
- GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
- Clang version: Could not collect
- CMake version: version 3.21.0
- Libc version: glibc-2.31
- Python version: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0] (64-bit runtime)
- Python platform: Linux-5.11.0-31-lowlatency-x86_64-with-glibc2.10
- Is CUDA available: True
- CUDA runtime version: 11.4.48
- GPU models and configuration:
  - GPU 0: NVIDIA A100-SXM4-40GB
  - GPU 1: NVIDIA A100-SXM4-40GB
  - GPU 2: NVIDIA A100-SXM4-40GB
  - GPU 3: NVIDIA A100-SXM4-40GB
  - GPU 4: NVIDIA A100-SXM4-40GB
  - GPU 5: NVIDIA A100-SXM4-40GB
  - GPU 6: NVIDIA A100-SXM4-40GB
  - GPU 7: NVIDIA A100-SXM4-40GB
- Nvidia driver version: 470.57.02
- cuDNN version: Probably one of the following:
  - /usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.2
  - /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.2
  - /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.2
  - /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.2
  - /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.2
  - /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.2
  - /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.2
- HIP runtime version: N/A
- MIOpen runtime version: N/A
- Versions of relevant libraries:
  - [pip3] numpy==1.21.1
  - [pip3] nvidia-dlprof-pytorch-nvtx==1.3.0
  - [pip3] pytorch-quantization==2.1.0
  - [pip3] pytorch-transformers==1.1.0
  - [pip3] torch==1.10.0a0+ecc3718
  - [pip3] torchtext==0.11.0a0
  - [pip3] torchvision==0.11.0a0
  - [conda] magma-cuda110 2.5.2 5 local
  - [conda] mkl 2019.5 281 conda-forge
  - [conda] mkl-include 2019.5 281 conda-forge
  - [conda] numpy 1.21.1 py38h9894fe3_0 conda-forge
  - [conda] nvidia-dlprof-pytorch-nvtx 1.3.0 pypi_0 pypi
  - [conda] pytorch-quantization 2.1.0 pypi_0 pypi
  - [conda] pytorch-transformers 1.1.0 pypi_0 pypi
  - [conda] torch 1.10.0a0+ecc3718 pypi_0 pypi
  - [conda] torchtext 0.11.0a0 pypi_0 pypi
  - [conda] torchvision 0.11.0a0 pypi_0 pypi
Additional context
cc @ezyang @gchanan @zou3519 @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb @ilia-cher @gdankel @bitfort @orionr