
[Bug] Cannot run bitsandbytes llama models  #2600

@merrymercy

Description


The issue is the same as #2556, but for llama models. We should be able to fix it with a similar approach.

The following command crashes.

python3 -m sglang.bench_one_batch --model unsloth/llama-3-8b-bnb-4bit --load-format bitsandbytes
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 470, in <module>
[rank0]:     main(server_args, bench_args)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 434, in main
[rank0]:     work_func(server_args, port_args, bench_args, 0)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 369, in latency_test
[rank0]:     model_runner, tokenizer = load_model(server_args, port_args, tp_rank)
[rank0]:   File "/root/sglang/python/sglang/bench_one_batch.py", line 121, in load_model
[rank0]:     model_runner = ModelRunner(
[rank0]:   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 158, in __init__
[rank0]:     self.load_model()
[rank0]:   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 258, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/root/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
[rank0]:     return loader.load_model(
[rank0]:   File "/root/sglang/python/sglang/srt/model_loader/loader.py", line 1029, in load_model
[rank0]:     self._load_weights(model_config, model)
[rank0]:   File "/root/sglang/python/sglang/srt/model_loader/loader.py", line 960, in _load_weights
[rank0]:     model.load_weights(qweight_iterator)
[rank0]:   File "/root/sglang/python/sglang/srt/models/llama.py", line 442, in load_weights
[rank0]:     param = params_dict[name]
[rank0]: KeyError: 'model.layers.0.mlp.down_proj.qweight'
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:01<?, ?it/s]
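The crash happens because the bitsandbytes checkpoint yields weight names with a `.qweight` suffix (e.g. `model.layers.0.mlp.down_proj.qweight`) that `load_weights` looks up directly in `params_dict`, where the parameter is registered under a different name. A minimal sketch of the kind of name fallback that could resolve this, assuming the model registers the parameter under `.weight` (the function and mapping here are hypothetical, not SGLang's actual API; the real fix should mirror whatever #2556 did):

```python
def resolve_param_name(name: str, params_dict: dict) -> str:
    """Map a bitsandbytes-style checkpoint name onto a registered parameter name.

    Hypothetical helper: if the checkpoint exposes "...qweight" but the model
    registered "...weight", fall back to the latter instead of raising KeyError.
    """
    if name in params_dict:
        return name
    if name.endswith(".qweight"):
        candidate = name[: -len(".qweight")] + ".weight"
        if candidate in params_dict:
            return candidate
    # Preserve the original name so the caller's KeyError still surfaces
    # genuinely unknown parameters.
    return name


# Illustration with the exact name from the traceback above:
params = {"model.layers.0.mlp.down_proj.weight": object()}
resolved = resolve_param_name("model.layers.0.mlp.down_proj.qweight", params)
```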
