[Quantization][1/N] MoE support BNB-Inflight Quantization #20061
Conversation
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Summary of Changes

Hello @jeejeelee, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for BitsAndBytes (BNB) quantization for Mixture-of-Experts (MoE) models within vLLM. It extends the existing BNB framework to properly handle the unique weight structures and loading requirements of MoE layers, aiming to reduce memory footprint and enable efficient inference for quantized MoE models.

Highlights

- MoE Quantization Method: A new `BitsAndBytesMoEMethod` has been introduced to handle 4-bit quantization specifically for Mixture-of-Experts (MoE) layers. This method manages the creation of quantized weights and the forward pass, including on-the-fly dequantization.
- Optimized MoE Weight Loading: The weight-loading process for MoE layers has been enhanced to support BitsAndBytes. This includes modifying the `_load_w13` function to allow full loading of pre-sharded weights and implementing a `maybe_fuse_moe_quant_states` function to consolidate individual expert quantization states (e.g., `w1` and `w3` into a fused `w13`) for improved memory efficiency.
- Model Integration for MoE: MoE models, specifically Qwen2 and Qwen3, have been updated to expose their expert weight structures through new `get_expert_mapping` methods and `packed_modules_mapping`. This enables the BitsAndBytes loader to correctly identify, map, and process the unique weight configurations of MoE layers.
- General Quantization Utility: A `calculate_quant_ratio` utility function has been extracted and made globally accessible within the quantization module, promoting code reusability.
Code Review
The pull request introduces support for MoE models with BitsAndBytes quantization. The changes include modifications to the fused MoE layer, bitsandbytes loader, and model definitions for Qwen2 and Qwen3 MoE models. The code appears well-structured and addresses the objective of adding MoE support. However, there are opportunities to improve error handling, code clarity, and documentation, as highlighted in the review comments.
```python
self._load_w13(
    shard_id=shard_id,
    shard_dim=shard_dim,
    loaded_weight=loaded_weight,
    expert_data=expert_data,
    tp_rank=self.tp_rank,
    load_full=full_load,
)
```
This block of code handles the BitsAndBytes case, but there is no explicit error handling if `self._load_w13` fails. Consider adding a try-except block to catch potential exceptions and log an error message, which would help in debugging.
Suggested change:

```python
try:
    self._load_w13(
        shard_id=shard_id,
        shard_dim=shard_dim,
        loaded_weight=loaded_weight,
        expert_data=expert_data,
        tp_rank=self.tp_rank,
        load_full=full_load,
    )
except Exception as e:
    logger.error(f"Error loading w13 for shard_id {shard_id}: {e}")
```
```python
raise AttributeError(
    f"MoE Model {type(model).__name__} does not support "
    "BitsAndBytes quantization yet. "
    "No 'get_expert_mapping' found.")
```
This `AttributeError` is raised when `get_expert_mapping` is not found. Consider providing a more specific error message that guides the user on how to resolve the issue, such as suggesting they check the model documentation or use a different model.
Suggested change:

```python
raise AttributeError(
    f"MoE Model {type(model).__name__} does not support "
    "BitsAndBytes quantization yet. Ensure the model has a "
    "'get_expert_mapping' method and that MoE is correctly configured.")
```
```python
if shard_id not in ("w1", "w2", "w3"):
    raise ValueError(f"shard_id must be ['w1','w2','w3'] but "
                     f"got {shard_id}.")
```
Consider adding a check to ensure that the `quant_state` is not None before attempting to dequantize it. This can prevent potential errors if the `quant_state` is unexpectedly missing.
```python
if quant_states:
    dequantize_dq(quant_states)
    set_weight_attrs(param, {"bnb_quant_state": quant_states})
else:
    logger.warning(f"No quant state found for {param_name}")
```
```python
layer_prefix = name.split("experts")[0]
weight_qual_name = layer_prefix + exp[1] + "weight"
```
The function `dequantize_dq` is called without checking whether `quant_states` is None. Add a check to ensure that `quant_states` is not None before calling `dequantize_dq` to prevent potential errors.
Suggested change:

```python
if quant_states:
    dequantize_dq(quant_states)
    set_weight_attrs(param, {"bnb_quant_state": quant_states})
else:
    logger.warning(f"No quant state found for {param_name}")
```
```python
quant_state = dequantize_dq(quant_states_dict[weight_qual_name])
if shard_id == "w1":
    w1_states_lst.append(quant_state)
elif shard_id == "w2":
    w2_states_lst.append(quant_state)
else:
    w3_states_lst.append(quant_state)
del quant_states_dict[weight_qual_name]
assert (len(w1_states_lst) == len(w2_states_lst) ==
        len(w3_states_lst))
w13_absmax_lst = []
w2_absmax_lst = []
w13_shape_lst = []
w2_shape_lst = []
for w1_qs, w2_qs, w3_qs in zip(w1_states_lst, w2_states_lst,
                               w3_states_lst):
    assert w1_qs.shape == w3_qs.shape
    assert w1_qs.blocksize == w2_qs.blocksize == w3_qs.blocksize
    assert w1_qs.dtype == w2_qs.dtype == w3_qs.dtype
    # w1 and w3 are interleaved in storage
    w13_absmax_lst.append(w1_qs.absmax)
    w13_absmax_lst.append(w3_qs.absmax)
    w2_absmax_lst.append(w2_qs.absmax)
    w13_shape_lst.append(w1_qs.shape)
    w13_shape_lst.append(w3_qs.shape)
    w2_shape_lst.append(w2_qs.shape)
# FIXME dimension is dirty
w13_dim0 = 0
w13_dim1 = w13_shape_lst[0][1]
for shape in w13_shape_lst:
    w13_dim0 += shape[0]
w2_dim0 = 0
w2_dim1 = w2_shape_lst[0][1]
for shape in w2_shape_lst:
    w2_dim0 += shape[0]
w13_absmax = torch.cat(w13_absmax_lst)
w2_absmax = torch.cat(w2_absmax_lst)
# Create fused quantization state for w13.
w13_qs = QuantState(
    absmax=w13_absmax,
    shape=(w13_dim0, w13_dim1),
    code=w1_states_lst[0].code,
    blocksize=w1_states_lst[0].blocksize,
    dtype=w1_states_lst[0].dtype,
)
# Create fused quantization state for w2.
w2_qs = QuantState(
    absmax=w2_absmax,
    shape=(w2_dim0, w2_dim1),
    code=w2_states_lst[0].code,
    blocksize=w2_states_lst[0].blocksize,
    dtype=w2_states_lst[0].dtype,
)
# The weight suffixes .w13_weight and .w2_weight are consistent
# with the param in BitsAndBytesMoEMethod.
w13_weight_name = name + ".w13_weight"
w2_weight_name = name + ".w2_weight"
quant_states_dict[w13_weight_name] = w13_qs
quant_states_dict[w2_weight_name] = w2_qs
```
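The fusion arithmetic in the block above (per-expert `w1`/`w3` entries interleaved, fused `w13` dim0 equal to the sum of the interleaved per-expert dim0s) can be checked with a small dependency-free sketch. All shapes below are made-up illustrations, and `fuse_w13_shape` is a hypothetical helper, not part of the PR:

```python
# Toy model of the shape fusion in maybe_fuse_moe_quant_states:
# w1 and w3 shapes are interleaved per expert, and the fused dim0
# is the sum of all interleaved dim0s. Shapes are illustrative.

def fuse_w13_shape(w1_shapes, w3_shapes):
    interleaved = []
    for s1, s3 in zip(w1_shapes, w3_shapes):
        assert s1 == s3, "w1 and w3 must have matching shapes"
        interleaved.extend([s1, s3])
    dim0 = sum(s[0] for s in interleaved)
    return (dim0, interleaved[0][1])

# Two experts, each with a (4, 8) w1 and w3 shard -> fused (16, 8).
print(fuse_w13_shape([(4, 8), (4, 8)], [(4, 8), (4, 8)]))  # (16, 8)
```

The same summation is what the real code performs over `QuantState.shape` entries before concatenating the `absmax` tensors with `torch.cat`.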
This entire function `maybe_fuse_moe_quant_states` could benefit from more comprehensive error handling. For instance, the assertions and value checks could be wrapped in try-except blocks to provide more informative error messages and prevent the function from crashing. Additionally, consider logging warnings or errors when certain conditions are not met, instead of just returning.
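One way to act on this suggestion is to replace the bare length assertion with an informative error. The helper name and message wording below are illustrative, not the actual vLLM code:

```python
def validate_expert_states(w1_states, w2_states, w3_states):
    """Raise a descriptive error instead of a bare AssertionError
    when the per-expert quant-state lists are out of sync."""
    n1, n2, n3 = len(w1_states), len(w2_states), len(w3_states)
    if not (n1 == n2 == n3):
        raise ValueError(
            "Expected one w1/w2/w3 quant state per expert, but got "
            f"w1={n1}, w2={n2}, w3={n3}; the checkpoint may be missing "
            "expert shards.")
    return n1
```

A message like this points directly at the malformed checkpoint instead of surfacing as an opaque `AssertionError` deep in the loader.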
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 8ffcd06 to f862320.
I will temporarily remove Mixtral's BNB support. The reason is that Granite-series MoE models use Mixtral's
This change seems to cause a failure on TPU.

Sorry, I was not aware of this failure; please feel free to fix it.

Could you fix forward or roll back?

#20822 should fix this TPU failure as well.

Thank you :).
Closes #305. The upstream changes that are breaking tests with the vLLM main branch are twofold: 1. [This PR](vllm-project/vllm#20061) introduces [an import](https://github.com/vllm-project/vllm/blob/8020e98c9f033e76c97eb8261f772d59eba49c9a/vllm/model_executor/model_loader/bitsandbytes_loader.py#L23) which is incompatible with torch 2.5.1. 2. Previously, when `model_config.task` was set to `auto`, the task could be automatically resolved in vLLM: after the post-init method of `ModelConfig`, the `task` field was automatically replaced from `auto` to e.g. `generate` or `embed`. With the changes in [this PR](vllm-project/vllm@020f58a), the task in `model_config` can no longer be automatically resolved when it is `generate`. This doesn't look intended, but rather a side effect of adding new supported tasks. A cleaner (but longer) way would be to open a PR against vLLM to fix that and reintroduce the automatic resolution of the `auto` task field. This PR addresses both breaking changes: for 1, it installs pytorch==2.7.0 in GitHub workflows when the vLLM main branch is used; for 2, it infers the task through `model_config.supported_tasks` rather than `model_config.task`. Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Main changes

Supported Model
- mistralai/Mixtral-8x7B-Instruct-v0.1

Uncertain
I can generate reasonable results locally using the above models, but I'm unable to achieve complete alignment with the results generated by transformers.