
Conversation

ikawrakow
Owner

Follow up of #415. This should fix SER issues on CUDA.

@ubergarm
Contributor

Interestingly, I recompiled main with CUDA (after you merged #415 into main) and can no longer reproduce the error.

fwiw this command is working both with and without this PR:

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /mnt/raid/hf/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --alias ubergarm/DeepSeek-R1-IQ2_K_R4 \
    --ctx-size 131072 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    -ser 6,1 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080

I don't have enough VRAM to fully offload any R1/V3 models, so I'm not sure how best to test this other than fully offloading V2-Lite, which you have probably already done.
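Something like this is what I have in mind for a fully offloaded V2-Lite run (untested sketch: the model path, quantization, and context size are placeholders; the remaining flags are carried over from the command above, dropping --override-tensor exps=CPU so all experts stay on the GPU):

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /path/to/DeepSeek-V2-Lite-IQ4_K.gguf \
    --ctx-size 32768 \
    -ctk f16 \
    -mla 3 -fa \
    -fmoe \
    -ser 6,1 \
    --n-gpu-layers 100 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080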

@ikawrakow
Owner Author

On CUDA it is more difficult to trigger the bug. I used Qwen3-30B-A3B quantized with IQ5_K. I only have a 16 GB GPU, so I had to leave the last 19 layers of experts on the CPU. I used llama-cli like this:

./bin/llama-cli -m ../ncuda/junk.bin -t 16 -ngl 100 -c 20000 -cnv -p " " -rtr -fa -s 1234 -ot "blk\.29\.ffn=CPU,blk\.[3-4][0-9]\.ffn=CPU" -ser 6,1

and prompted with

Encoded text:\noyfjdnisdr rtqwainr acxz mynzbhhx\nDecoded text:\nThink step by step\n\nEncoded text:\nsudlcg jncgpxoydflx ky lraebdtvlxmy nzbnkyaibh ttemgsdfqu gkdx pvsunvaauyacairrlxyy\nDecoded text:\n<think>

(and I guess the same can be done with the server).
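For the server route, a rough equivalent of the llama-cli invocation above might be the following (untested sketch: -cnv, -p and -s have no direct server counterpart, and the prompt would instead be sent through the API; the other flags are carried over as-is):

./bin/llama-server -m ../ncuda/junk.bin -t 16 -ngl 100 -c 20000 -rtr -fa -ot "blk\.29\.ffn=CPU,blk\.[3-4][0-9]\.ffn=CPU" -ser 6,1 --host 127.0.0.1 --port 8080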

The thinking goes well for a while, but eventually it starts spitting out GGGGG.
The PR fixes that.

Interestingly enough, after the fix it does solve the puzzle with -ser 6,1, but fails with -ser 7,1.

I don't think partial offload is required, and the bug will likely trigger more quickly if all layers are on the GPU. I found it easier to debug with a "thinking" model because little interaction is needed to make the model generate many tokens one by one.

@ikawrakow
Owner Author

Oops, it is still failing with DeepSeek-Lite. Converting to draft.

@ikawrakow ikawrakow marked this pull request as draft May 13, 2025 15:59
@ikawrakow ikawrakow marked this pull request as ready for review May 13, 2025 17:02
@ikawrakow ikawrakow merged commit b90d6ed into main May 14, 2025