Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the issue lacks the corresponding environment info and a minimal reproducible demo, it will be difficult for us to reproduce and resolve it, which reduces the likelihood of a response.
- 4. If what you are raising is a question rather than a bug, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English; otherwise the issue will be closed.
Describe the bug
test_cublas_grouped_gemm.py
test_cublas_grouped_gemm.py ....................................................................................................................................................... [ 60%]
....................................................FFFFFFF........................................ [100%]
======================================================================================== FAILURES =========================================================================================
____________________________________________________________________ test_grouped_gemm_accuracy[8192-2-16-out_dtype1] _____________________________________________________________________
out_dtype = torch.bfloat16, M = 16, N = 2, K = 8192

    @pytest.mark.skipif(
        skip_condition, reason="CUDA not available or CUDA version lower than 12.5"
    )
    @pytest.mark.parametrize("out_dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("M", [1, 16, 32, 256, 1024])
    @pytest.mark.parametrize("N", [2, 16, 128, 256, 4096])
    @pytest.mark.parametrize("K", [3, 16, 32, 512, 8192])
    def test_grouped_gemm_accuracy(out_dtype, M, N, K):
        a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
        b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
        expected = torch.matmul(a, b.t()).to(out_dtype)
        a_array = [a]
        b_array = [b]
        c_array = [torch.empty((M, N), device="cuda", dtype=out_dtype)]
        result_torch = torch_grouped_gemm(a_array, b_array, out_dtype)[0]
        cublas_grouped_gemm(a_array, b_array, c_array, out_dtype)
        torch.testing.assert_close(result_torch, expected)
>       torch.testing.assert_close(c_array[0], expected)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 3 / 32 (9.4%)
E       Greatest absolute difference: 2.375 at index (7, 1) (up to 1e-05 allowed)
E       Greatest relative difference: 0.11865234375 at index (7, 1) (up to 0.016 allowed)

test_cublas_grouped_gemm.py:36: AssertionError
____________________________________________________________________ test_grouped_gemm_accuracy[8192-2-32-out_dtype0] _____________________________________________________________________
out_dtype = torch.float16, M = 32, N = 2, K = 8192

    @pytest.mark.skipif(
        skip_condition, reason="CUDA not available or CUDA version lower than 12.5"
    )
    @pytest.mark.parametrize("out_dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("M", [1, 16, 32, 256, 1024])
    @pytest.mark.parametrize("N", [2, 16, 128, 256, 4096])
    @pytest.mark.parametrize("K", [3, 16, 32, 512, 8192])
    def test_grouped_gemm_accuracy(out_dtype, M, N, K):
        a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
        b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
        expected = torch.matmul(a, b.t()).to(out_dtype)
        a_array = [a]
        b_array = [b]
        c_array = [torch.empty((M, N), device="cuda", dtype=out_dtype)]
        result_torch = torch_grouped_gemm(a_array, b_array, out_dtype)[0]
        cublas_grouped_gemm(a_array, b_array, c_array, out_dtype)
        torch.testing.assert_close(result_torch, expected)
>       torch.testing.assert_close(c_array[0], expected)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 10 / 64 (15.6%)
E       Greatest absolute difference: 2.0 at index (1, 1) (up to 1e-05 allowed)
E       Greatest relative difference: 0.0033245086669921875 at index (29, 0) (up to 0.001 allowed)

test_cublas_grouped_gemm.py:36: AssertionError
____________________________________________________________________ test_grouped_gemm_accuracy[8192-2-32-out_dtype1] _____________________________________________________________________
out_dtype = torch.bfloat16, M = 32, N = 2, K = 8192

    @pytest.mark.skipif(
        skip_condition, reason="CUDA not available or CUDA version lower than 12.5"
    )
    @pytest.mark.parametrize("out_dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("M", [1, 16, 32, 256, 1024])
    @pytest.mark.parametrize("N", [2, 16, 128, 256, 4096])
    @pytest.mark.parametrize("K", [3, 16, 32, 512, 8192])
    def test_grouped_gemm_accuracy(out_dtype, M, N, K):
        a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
        b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
        expected = torch.matmul(a, b.t()).to(out_dtype)
        a_array = [a]
        b_array = [b]
        c_array = [torch.empty((M, N), device="cuda", dtype=out_dtype)]
        result_torch = torch_grouped_gemm(a_array, b_array, out_dtype)[0]
        cublas_grouped_gemm(a_array, b_array, c_array, out_dtype)
        torch.testing.assert_close(result_torch, expected)
>       torch.testing.assert_close(c_array[0], expected)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 3 / 64 (4.7%)
E       Greatest absolute difference: 4.5 at index (11, 0) (up to 1e-05 allowed)
E       Greatest relative difference: 0.1259765625 at index (27, 0) (up to 0.016 allowed)

test_cublas_grouped_gemm.py:36: AssertionError
____________________________________________________________________ test_grouped_gemm_accuracy[8192-2-256-out_dtype0] ____________________________________________________________________
out_dtype = torch.float16, M = 256, N = 2, K = 8192

    @pytest.mark.skipif(
        skip_condition, reason="CUDA not available or CUDA version lower than 12.5"
    )
    @pytest.mark.parametrize("out_dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("M", [1, 16, 32, 256, 1024])
    @pytest.mark.parametrize("N", [2, 16, 128, 256, 4096])
    @pytest.mark.parametrize("K", [3, 16, 32, 512, 8192])
    def test_grouped_gemm_accuracy(out_dtype, M, N, K):
        a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
        b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
        expected = torch.matmul(a, b.t()).to(out_dtype)
        a_array = [a]
        b_array = [b]
        c_array = [torch.empty((M, N), device="cuda", dtype=out_dtype)]
        result_torch = torch_grouped_gemm(a_array, b_array, out_dtype)[0]
        cublas_grouped_gemm(a_array, b_array, c_array, out_dtype)
        torch.testing.assert_close(result_torch, expected)
>       torch.testing.assert_close(c_array[0], expected)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 63 / 512 (12.3%)
E       Greatest absolute difference: 1.5 at index (5, 0) (up to 1e-05 allowed)
E       Greatest relative difference: 0.09844970703125 at index (31, 0) (up to 0.001 allowed)

test_cublas_grouped_gemm.py:36: AssertionError
____________________________________________________________________ test_grouped_gemm_accuracy[8192-2-256-out_dtype1] ____________________________________________________________________
out_dtype = torch.bfloat16, M = 256, N = 2, K = 8192

    @pytest.mark.skipif(
        skip_condition, reason="CUDA not available or CUDA version lower than 12.5"
    )
    @pytest.mark.parametrize("out_dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("M", [1, 16, 32, 256, 1024])
    @pytest.mark.parametrize("N", [2, 16, 128, 256, 4096])
    @pytest.mark.parametrize("K", [3, 16, 32, 512, 8192])
    def test_grouped_gemm_accuracy(out_dtype, M, N, K):
        a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
        b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
        expected = torch.matmul(a, b.t()).to(out_dtype)
        a_array = [a]
        b_array = [b]
        c_array = [torch.empty((M, N), device="cuda", dtype=out_dtype)]
        result_torch = torch_grouped_gemm(a_array, b_array, out_dtype)[0]
        cublas_grouped_gemm(a_array, b_array, c_array, out_dtype)
        torch.testing.assert_close(result_torch, expected)
>       torch.testing.assert_close(c_array[0], expected)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 31 / 512 (6.1%)
E       Greatest absolute difference: 40.0 at index (148, 1) (up to 1e-05 allowed)
E       Greatest relative difference: 0.392578125 at index (48, 0) (up to 0.016 allowed)

test_cublas_grouped_gemm.py:36: AssertionError
___________________________________________________________________ test_grouped_gemm_accuracy[8192-2-1024-out_dtype0] ____________________________________________________________________
out_dtype = torch.float16, M = 1024, N = 2, K = 8192

    @pytest.mark.skipif(
        skip_condition, reason="CUDA not available or CUDA version lower than 12.5"
    )
    @pytest.mark.parametrize("out_dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("M", [1, 16, 32, 256, 1024])
    @pytest.mark.parametrize("N", [2, 16, 128, 256, 4096])
    @pytest.mark.parametrize("K", [3, 16, 32, 512, 8192])
    def test_grouped_gemm_accuracy(out_dtype, M, N, K):
        a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
        b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
        expected = torch.matmul(a, b.t()).to(out_dtype)
        a_array = [a]
        b_array = [b]
        c_array = [torch.empty((M, N), device="cuda", dtype=out_dtype)]
        result_torch = torch_grouped_gemm(a_array, b_array, out_dtype)[0]
        cublas_grouped_gemm(a_array, b_array, c_array, out_dtype)
        torch.testing.assert_close(result_torch, expected)
>       torch.testing.assert_close(c_array[0], expected)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 280 / 2048 (13.7%)
E       Greatest absolute difference: 4.0 at index (143, 0) (up to 1e-05 allowed)
E       Greatest relative difference: 0.377197265625 at index (65, 0) (up to 0.001 allowed)

test_cublas_grouped_gemm.py:36: AssertionError
___________________________________________________________________ test_grouped_gemm_accuracy[8192-2-1024-out_dtype1] ____________________________________________________________________
out_dtype = torch.bfloat16, M = 1024, N = 2, K = 8192

    @pytest.mark.skipif(
        skip_condition, reason="CUDA not available or CUDA version lower than 12.5"
    )
    @pytest.mark.parametrize("out_dtype", [torch.float16, torch.bfloat16])
    @pytest.mark.parametrize("M", [1, 16, 32, 256, 1024])
    @pytest.mark.parametrize("N", [2, 16, 128, 256, 4096])
    @pytest.mark.parametrize("K", [3, 16, 32, 512, 8192])
    def test_grouped_gemm_accuracy(out_dtype, M, N, K):
        a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
        b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
        expected = torch.matmul(a, b.t()).to(out_dtype)
        a_array = [a]
        b_array = [b]
        c_array = [torch.empty((M, N), device="cuda", dtype=out_dtype)]
        result_torch = torch_grouped_gemm(a_array, b_array, out_dtype)[0]
        cublas_grouped_gemm(a_array, b_array, c_array, out_dtype)
        torch.testing.assert_close(result_torch, expected)
>       torch.testing.assert_close(c_array[0], expected)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 118 / 2048 (5.8%)
E       Greatest absolute difference: 30.0 at index (893, 0) (up to 1e-05 allowed)
E       Greatest relative difference: 5.0625 at index (347, 1) (up to 0.016 allowed)

test_cublas_grouped_gemm.py:36: AssertionError
================================================================================= short test summary info =================================================================================
FAILED test_cublas_grouped_gemm.py::test_grouped_gemm_accuracy[8192-2-16-out_dtype1] - AssertionError: Tensor-likes are not close!
FAILED test_cublas_grouped_gemm.py::test_grouped_gemm_accuracy[8192-2-32-out_dtype0] - AssertionError: Tensor-likes are not close!
FAILED test_cublas_grouped_gemm.py::test_grouped_gemm_accuracy[8192-2-32-out_dtype1] - AssertionError: Tensor-likes are not close!
FAILED test_cublas_grouped_gemm.py::test_grouped_gemm_accuracy[8192-2-256-out_dtype0] - AssertionError: Tensor-likes are not close!
FAILED test_cublas_grouped_gemm.py::test_grouped_gemm_accuracy[8192-2-256-out_dtype1] - AssertionError: Tensor-likes are not close!
FAILED test_cublas_grouped_gemm.py::test_grouped_gemm_accuracy[8192-2-1024-out_dtype0] - AssertionError: Tensor-likes are not close!
FAILED test_cublas_grouped_gemm.py::test_grouped_gemm_accuracy[8192-2-1024-out_dtype1] - AssertionError: Tensor-likes are not close!
============================================================================== 7 failed, 243 passed in 8.62s ==============================================================================
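For reference, the failing K = 8192, N = 2 cases can be reduced to a short standalone script based on the test body shown above. This is only a sketch: the `from sgl_kernel import cublas_grouped_gemm` import path is an assumption and may need to be adjusted to match the local build.

```python
# Standalone sketch of one failing case from the log above
# (out_dtype=torch.bfloat16, M=16, N=2, K=8192).
import torch
from sgl_kernel import cublas_grouped_gemm  # assumed import location

out_dtype, M, N, K = torch.bfloat16, 16, 2, 8192
a = torch.randn((M, K), device="cuda", dtype=out_dtype) * 5
b = torch.randn((N, K), device="cuda", dtype=out_dtype) * 5
expected = torch.matmul(a, b.t()).to(out_dtype)

c = torch.empty((M, N), device="cuda", dtype=out_dtype)
cublas_grouped_gemm([a], [b], [c], out_dtype)

# Fails on the affected setup with mismatches similar to those reported above.
torch.testing.assert_close(c, expected)
```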
test_int8_gemm.py
All tests in this file failed.
test_per_token_group_quant_8bit.py
============================================================================ 900 failed, 300 passed in 16.52s =============================================================================
Reproduction
N/A
Environment
N/A