
Conversation

ikawrakow
Owner

Follow up of #415. This should fix SER issues on CUDA.

@ubergarm
Contributor

Interestingly, I recompiled main with CUDA (after you merged #415 into main) and can no longer reproduce the error.

fwiw this command is working both with and without this PR:

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /mnt/raid/hf/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --alias ubergarm/DeepSeek-R1-IQ2_K_R4 \
    --ctx-size 131072 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    -ser 6,1 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080

I don't have enough VRAM to fully offload any R1/V3 models, so I'm not sure how best to test this other than fully offloading V2-Lite, which you have probably already done.
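Something like this is what I have in mind for a fully offloaded V2-Lite run (untested sketch: the model path, quantization, and context size are placeholders; the remaining flags are carried over from the command above, dropping --override-tensor exps=CPU so all experts stay on the GPU):

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /path/to/DeepSeek-V2-Lite-IQ4_K.gguf \
    --ctx-size 32768 \
    -ctk f16 \
    -mla 3 -fa \
    -fmoe \
    -ser 6,1 \
    --n-gpu-layers 100 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080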

@ikawrakow
Owner Author

On CUDA it is more difficult to trigger the bug. I used Qwen3-30B-A3B quantized with IQ5_K. I only have a 16 GB GPU, so I had to leave the last 19 layers of experts on the CPU. I used llama-cli like this:

./bin/llama-cli -m ../ncuda/junk.bin -t 16 -ngl 100 -c 20000 -cnv -p " " -rtr -fa -s 1234 -ot "blk\.29\.ffn=CPU,blk\.[3-4][0-9]\.ffn=CPU" -ser 6,1

and prompted with

Encoded text:\noyfjdnisdr rtqwainr acxz mynzbhhx\nDecoded text:\nThink step by step\n\nEncoded text:\nsudlcg jncgpxoydflx ky lraebdtvlxmy nzbnkyaibh ttemgsdfqu gkdx pvsunvaauyacairrlxyy\nDecoded text:\n<think>

(and I guess the same can be done with the server).
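For the server route, a rough equivalent of the llama-cli invocation above might be the following (untested sketch: -cnv, -p and -s have no direct server counterpart, and the prompt would instead be sent through the API; the other flags are carried over as-is):

./bin/llama-server -m ../ncuda/junk.bin -t 16 -ngl 100 -c 20000 -rtr -fa -ot "blk\.29\.ffn=CPU,blk\.[3-4][0-9]\.ffn=CPU" -ser 6,1 --host 127.0.0.1 --port 8080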

The thinking goes well for a while, but eventually it starts spitting out GGGGG.
The PR fixes that.

Interestingly enough, after the fix it does solve the puzzle with -ser 6,1, but fails with -ser 7,1.

I don't think partial offload is required, and the bug will likely trigger more quickly if all layers are on the GPU. I found it easier to debug with a "thinking" model because little interaction is needed to make the model generate many tokens one by one.

@ikawrakow
Owner Author

Oops, it is still failing with DeepSeek-Lite. Converting to draft.

@ikawrakow ikawrakow marked this pull request as draft May 13, 2025 15:59
@ikawrakow ikawrakow marked this pull request as ready for review May 13, 2025 17:02
@ikawrakow ikawrakow merged commit b90d6ed into main May 14, 2025