Closed
Description
I'm trying to build sgl-kernel from source, but the build fails during the sgl-attn stage. The detailed error output is at the end.
Environment
hardware: H20 * 8
software:
  host driver version: 570.158.01 (the same error occurs with 535.161.07)
  docker image:
    cuda: 12.6
    nvcc:
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2024 NVIDIA Corporation
      Built on Wed_Aug_14_10:10:22_PDT_2024
      Cuda compilation tools, release 12.6, V12.6.68
      Build cuda_12.6.r12.6/compiler.34714021_0
I have compiled sgl-kernel successfully in full before, with CUDA driver 535.161.07 and CUDA 12.4. Because of the greenctx_stream feature, we updated the Docker image to CUDA 12.6 to compile the newest sgl-kernel, but hit this error during the sgl-attn compilation stage.
The error log gives me no clue about the cause. It does not seem to be related to the CUDA driver version, since both drivers listed above produce the same error. I don't understand what causes it, or why the problem did not occur with CUDA 12.4.
I hope you can offer some advice to help solve the problem. Thanks.
Specific error output
[311/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ptxas info : (C7517) warpgroup.wait is injected in around line 399385 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7517) warpgroup.wait is injected in around line 354007 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7517) warpgroup.wait is injected in around line 325475 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
Segmentation fault (core dumped)
[312/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
Segmentation fault (core dumped)
[313/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_sm90.cu.o
ninja: build stopped: subcommand failed.
*** CMake build failed
× Failed to build `/cfs/xtchen/repositories/sglang/sgl-kernel`
├─▶ The build backend returned an error
╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
hint: This usually indicates a problem with the package or the build environment.
make: *** [Makefile:29:build] Error 2
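
To make triage easier, the failing translation units can be pulled out of a log like the one above. This is only a minimal sketch: it assumes the ninja log format shown here (a `FAILED: <object path>` line per failed target, and the literal text `Segmentation fault (core dumped)` for each crash). The function names `failed_targets` and `count_segfaults` are hypothetical helpers, not part of sgl-kernel or ninja.

```python
import re

# Matches ninja's "FAILED: <output path>" lines.
# The format is an assumption taken from the log pasted above.
FAILED_RE = re.compile(r"^FAILED:\s+(\S+)", re.MULTILINE)


def failed_targets(log_text: str) -> list[str]:
    """Return the file name of every object ninja reported as FAILED, in log order."""
    return [path.rsplit("/", 1)[-1] for path in FAILED_RE.findall(log_text)]


def count_segfaults(log_text: str) -> int:
    """Count how many compiler invocations ended with a segmentation fault."""
    return log_text.count("Segmentation fault (core dumped)")
```

Reading the saved build output into a string and passing it to `failed_targets` should list the two instantiation files that failed here (`flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o` and `flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o`), which narrows down which nvcc commands to rerun individually.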