Building sgl-kernel fails on the latest code with nvcc/CUDA 12.6 #8097

@XiaotaoChen

Description

I'm trying to build sgl-kernel from source, but the build fails during the sgl-attn stage. The full error output is at the end.

Environment

hardware: H20 * 8

software:
host driver version: 570.158.01 (the same error occurs with 535.161.07)

docker images: 
cuda: 12.6
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

I previously compiled sgl-kernel successfully with CUDA driver 535.161.07 and CUDA 12.4. Because of the greenctx_stream feature, we updated the Docker image to CUDA 12.6 to compile the latest sgl-kernel, but hit this error while compiling sgl-attn.
From the error log I can't tell what causes the failure: the nvcc invocations simply end with "Segmentation fault (core dumped)". It doesn't seem to be related to the CUDA driver version, since both drivers listed above give the same result. I don't understand why this happens with CUDA 12.6 when there was no such problem with CUDA 12.4.
Any advice or help with solving this would be appreciated. Thanks.

Error output

[311/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ptxas info    : (C7517) warpgroup.wait is injected in around line 399385 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7517) warpgroup.wait is injected in around line 354007 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7517) warpgroup.wait is injected in around line 325475 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
Segmentation fault (core dumped)
[312/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
Segmentation fault (core dumped)
[313/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_sm90.cu.o
ninja: build stopped: subcommand failed.

*** CMake build failed
  × Failed to build `/cfs/xtchen/repositories/sglang/sgl-kernel`
  ├─▶ The build backend returned an error
  ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
      hint: This usually indicates a problem with the package or the build environment.
make: *** [Makefile:29:build] Error 2
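
To make a log like the one above easier to triage, here is a minimal sketch that lists which build targets failed and whether the compiler crashed for each of them. It assumes only the line shapes visible in this log (`[N/M] ...` step lines, `FAILED: <target>` lines, and a `Segmentation fault` line); the function name `summarize_failures` and the sample log are hypothetical, not part of sgl-kernel or ninja.

```python
import re

# Line shapes taken from the log above; anything else is ignored.
FAILED_RE = re.compile(r"^FAILED: (?P<target>\S+)")
STEP_RE = re.compile(r"^\[\d+/\d+\] ")


def summarize_failures(log_text):
    """Return (target, crashed) pairs for every FAILED target in a build log.

    `crashed` is True when a "Segmentation fault" line appears between the
    FAILED line and the next "[N/M]" step line (or the end of the log).
    """
    results = []
    current = None   # target of the failure block being read, if any
    crashed = False
    for line in log_text.splitlines():
        failed = FAILED_RE.match(line)
        if failed:
            if current is not None:
                results.append((current, crashed))
            current = failed.group("target")
            crashed = False
        elif STEP_RE.match(line):
            # A new build step closes the current failure block.
            if current is not None:
                results.append((current, crashed))
                current = None
        elif current is not None and "Segmentation fault" in line:
            crashed = True
    if current is not None:
        results.append((current, crashed))
    return results


# Hypothetical, shortened log used only to illustrate the call.
sample = "\n".join([
    "[1/2] Building CUDA object example_a.cu.o",
    "FAILED: example_a.cu.o",
    "ccache /usr/local/cuda/bin/nvcc -c example_a.cu -o example_a.cu.o",
    "Segmentation fault (core dumped)",
    "[2/2] Building CUDA object example_b.cu.o",
    "ninja: build stopped: subcommand failed.",
])
for target, crashed in summarize_failures(sample):
    print(target, "compiler crash" if crashed else "compile error")
```

Applied to the full log, this should show that both failed targets are `sm90` instantiations under `sgl-attn/hopper/instantiations` and that both end in a compiler crash rather than a source-level error.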
