Profiler UTF-8 decode issue #64345

@Gforky


🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Modify trainer.py in Hugging Face Transformers so that the train function is wrapped with profiler.profile and the profiler's step() is called during training.
  2. Run single-node, multi-GPU training of a Hugging Face Transformers model with DeepSpeed enabled.

Expected behavior

Training should run without errors. Instead, the following error is raised after some training steps:

  File "./run_mlm.py", line 898, in <module>
    main()
  File "./run_mlm.py", line 849, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1989, in train
    my_profiler.step()
  File "/opt/conda/lib/python3.8/site-packages/torch/profiler/profiler.py", line 280, in step
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/profiler.py", line 1137, in parse_kineto_results
    if filter_name(kineto_event.name()):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 4: invalid start byte

I'm confused by this error message, since I can't get any details about 'kineto_event'.
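The message itself says that the bytes behind the event name are not valid UTF-8: byte 0xb9 is a continuation byte, so it cannot start a character. The sketch below reproduces the same message in plain Python and shows one way to inspect such bytes without raising. The byte string `raw_name` and the helper `describe_decode_failure` are invented for illustration; they are not taken from the profiler, and the real contents of the failing event name are unknown.

```python
# Hypothetical event name: these bytes are made up so that an invalid
# byte (0xb9) sits at position 4, matching the reported error message.
raw_name = b"void\xb9kernel"


def describe_decode_failure(data: bytes) -> str:
    """Return the UnicodeDecodeError message for data, if decoding fails."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return "decoded cleanly"


message = describe_decode_failure(raw_name)

# Decoding with errors="backslashreplace" never raises; invalid bytes are
# kept visible as \xNN escapes, which helps identify the offending name.
escaped = raw_name.decode("utf-8", errors="backslashreplace")

print(message)
print(escaped)
```

If the raw bytes of the failing name could be captured, decoding them this way would at least show which event triggers the error.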

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch version: 1.10.0a0+ecc3718
  • Is debug build: False
  • CUDA used to build PyTorch: 11.4
  • ROCM used to build PyTorch: N/A
  • OS: Ubuntu 20.04.2 LTS (x86_64)
  • GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
  • Clang version: Could not collect
  • CMake version: version 3.21.0
  • Libc version: glibc-2.31
  • Python version: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0] (64-bit runtime)
  • Python platform: Linux-5.11.0-31-lowlatency-x86_64-with-glibc2.10
  • Is CUDA available: True
  • CUDA runtime version: 11.4.48
  • GPU models and configuration:
      • GPU 0: NVIDIA A100-SXM4-40GB
      • GPU 1: NVIDIA A100-SXM4-40GB
      • GPU 2: NVIDIA A100-SXM4-40GB
      • GPU 3: NVIDIA A100-SXM4-40GB
      • GPU 4: NVIDIA A100-SXM4-40GB
      • GPU 5: NVIDIA A100-SXM4-40GB
      • GPU 6: NVIDIA A100-SXM4-40GB
      • GPU 7: NVIDIA A100-SXM4-40GB
  • Nvidia driver version: 470.57.02
  • cuDNN version: Probably one of the following:
      • /usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.2
      • /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.2
      • /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.2
      • /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.2
      • /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.2
      • /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.2
      • /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.2
  • HIP runtime version: N/A
  • MIOpen runtime version: N/A
  • Versions of relevant libraries:
      • [pip3] numpy==1.21.1
      • [pip3] nvidia-dlprof-pytorch-nvtx==1.3.0
      • [pip3] pytorch-quantization==2.1.0
      • [pip3] pytorch-transformers==1.1.0
      • [pip3] torch==1.10.0a0+ecc3718
      • [pip3] torchtext==0.11.0a0
      • [pip3] torchvision==0.11.0a0
      • [conda] magma-cuda110 2.5.2 5 local
      • [conda] mkl 2019.5 281 conda-forge
      • [conda] mkl-include 2019.5 281 conda-forge
      • [conda] numpy 1.21.1 py38h9894fe3_0 conda-forge
      • [conda] nvidia-dlprof-pytorch-nvtx 1.3.0 pypi_0 pypi
      • [conda] pytorch-quantization 2.1.0 pypi_0 pypi
      • [conda] pytorch-transformers 1.1.0 pypi_0 pypi
      • [conda] torch 1.10.0a0+ecc3718 pypi_0 pypi
      • [conda] torchtext 0.11.0a0 pypi_0 pypi
      • [conda] torchvision 0.11.0a0 pypi_0 pypi

Additional context

cc @ezyang @gchanan @zou3519 @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb @ilia-cher @gdankel @bitfort @orionr
