
Conversation

Collaborator

@jeejeelee jeejeelee commented Jun 25, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Main changes

  1. Expose the MoE model's make_expert_params_mapping, in order to obtain the mapping between loaded weights and model parameters, see: get_expert_mapping
  2. Pre-dequantize the weights of the fused experts so that the current fused MoE kernel can be used directly, see: 4bit moe inference
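The mapping from change 1 can be pictured with a small pure-Python sketch. The tuple layout (param name, checkpoint weight name, expert id, shard id) and the `w13`/`w2` fused-parameter names follow the PR description; the function body below is illustrative, not vLLM's actual implementation:

```python
def make_expert_params_mapping(gate_proj: str, down_proj: str, up_proj: str,
                               num_experts: int):
    """Sketch: return (param_name, ckpt_weight_name, expert_id, shard_id)
    tuples linking per-expert checkpoint weights to fused model params."""
    mapping = []
    for expert_id in range(num_experts):
        # w1 (gate) and w3 (up) land in the fused w13 param; w2 (down) is separate.
        for ckpt_name, param_name, shard_id in (
                (gate_proj, "experts.w13_weight", "w1"),
                (up_proj, "experts.w13_weight", "w3"),
                (down_proj, "experts.w2_weight", "w2"),
        ):
            mapping.append((param_name,
                            f"experts.{expert_id}.{ckpt_name}.",
                            expert_id, shard_id))
    return mapping

mapping = make_expert_params_mapping("gate_proj", "down_proj", "up_proj", 2)
# Two experts x three projections -> six entries.
```

The loader can then walk this list to route each 4-bit checkpoint shard into the right slice of the fused parameter.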

Supported Model

Uncertain

I can generate reasonable results locally using the above models, but I'm unable to achieve complete alignment with the results generated by transformers.
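When chasing this kind of residual mismatch against transformers, comparing per-token logits is usually more informative than comparing decoded text. A framework-agnostic sketch (plain Python lists stand in for real logit tensors; the values are placeholders):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flat logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Quantization/dequantization error typically shows up as a similarity
# slightly below 1.0 rather than an outright different distribution.
vllm_logits = [0.12, -1.30, 2.41]   # placeholder values
hf_logits = [0.11, -1.28, 2.44]     # placeholder values
sim = cosine_similarity(vllm_logits, hf_logits)
```

A similarity very close to (but not exactly) 1.0 is consistent with small numeric drift from 4-bit dequantization rather than a weight-mapping bug.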

@jeejeelee jeejeelee marked this pull request as draft June 25, 2025 07:02
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @jeejeelee, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for BitsAndBytes (BNB) quantization for Mixture-of-Experts (MoE) models within vLLM. It extends the existing BNB framework to properly handle the unique weight structures and loading requirements of MoE layers, aiming to reduce memory footprint and enable efficient inference for quantized MoE models.

Highlights

  • MoE Quantization Method: A new BitsAndBytesMoEMethod has been introduced to handle 4-bit quantization specifically for Mixture-of-Experts (MoE) layers. This method manages the creation of quantized weights and the forward pass, including on-the-fly dequantization.
  • Optimized MoE Weight Loading: The weight loading process for MoE layers has been enhanced to support BitsAndBytes. This includes modifying the _load_w13 function to allow full loading of pre-sharded weights and implementing a maybe_fuse_moe_quant_states function to consolidate individual expert quantization states (e.g., w1 and w3 into a fused w13) for improved memory efficiency.
  • Model Integration for MoE: MoE models, specifically Qwen2 and Qwen3, have been updated to expose their expert weight structures through new get_expert_mapping methods and packed_modules_mapping. This enables the BitsAndBytes loader to correctly identify, map, and process the unique weight configurations of MoE layers.
  • General Quantization Utility: A calculate_quant_ratio utility function has been extracted and made globally accessible within the quantization module, promoting code reusability.
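The fusion bookkeeping behind maybe_fuse_moe_quant_states can be sketched shape-only in plain Python. The real code fuses bitsandbytes QuantState objects (including their absmax tensors); the helper below is an illustrative reduction to the shape arithmetic only:

```python
def fuse_w13_shape(w1_shapes, w3_shapes):
    """Sketch: compute the fused w13 shape from per-expert w1/w3 shapes.
    Shards are stacked (interleaved) along dim 0; dim 1 must be shared."""
    assert len(w1_shapes) == len(w3_shapes)
    dim1 = w1_shapes[0][1]
    dim0 = 0
    for s1, s3 in zip(w1_shapes, w3_shapes):
        assert s1 == s3, "w1 and w3 must have matching shapes"
        assert s1[1] == dim1, "all shards must share dim 1"
        dim0 += s1[0] + s3[0]
    return (dim0, dim1)

# Two experts, each with 4-row w1 and w3 shards of width 8:
fused = fuse_w13_shape([(4, 8), (4, 8)], [(4, 8), (4, 8)])  # → (16, 8)
```

The absmax vectors are concatenated in the same interleaved order, so a single fused QuantState can dequantize the whole stacked tensor at once.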
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the qwen Related to Qwen models label Jun 25, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces support for MoE models with BitsAndBytes quantization. The changes include modifications to the fused MoE layer, bitsandbytes loader, and model definitions for Qwen2 and Qwen3 MoE models. The code appears well-structured and addresses the objective of adding MoE support. However, there are opportunities to improve error handling, code clarity, and documentation, as highlighted in the review comments.

Comment on lines +1177 to +1184
self._load_w13(
    shard_id=shard_id,
    shard_dim=shard_dim,
    loaded_weight=loaded_weight,
    expert_data=expert_data,
    tp_rank=self.tp_rank,
    load_full=full_load,
)
Contributor

medium

This block of code handles the BitsAndBytes case, but there's no explicit error handling if self._load_w13 fails. Consider adding a try-except block to catch potential exceptions and log an error message, which would help in debugging.

Suggested change
self._load_w13(
    shard_id=shard_id,
    shard_dim=shard_dim,
    loaded_weight=loaded_weight,
    expert_data=expert_data,
    tp_rank=self.tp_rank,
    load_full=full_load,
)
try:
    self._load_w13(
        shard_id=shard_id,
        shard_dim=shard_dim,
        loaded_weight=loaded_weight,
        expert_data=expert_data,
        tp_rank=self.tp_rank,
        load_full=full_load,
    )
except Exception as e:
    logger.error(f"Error loading w13 for shard_id {shard_id}: {e}")

Comment on lines 411 to 414
raise AttributeError(
    f"MoE Model {type(model).__name__} does not support "
    "BitsAndBytes quantization yet. "
    "No 'get_expert_mapping' found.")
Contributor

medium

This AttributeError is raised when get_expert_mapping is not found. Consider providing a more specific error message that guides the user on how to resolve the issue, such as suggesting they check the model documentation or use a different model.

                    raise AttributeError(
                        f"MoE Model {type(model).__name__} does not support "
                        "BitsAndBytes quantization yet. Ensure the model has a 'get_expert_mapping' method and that MoE is correctly configured.")

Comment on lines 656 to 658
if shard_id not in ("w1", "w2", "w3"):
raise ValueError(f"shard_id must be ['w1','w2','w3'] but "
f"got {shard_id}.")
Contributor

medium

Consider adding a check to ensure that the quant_state is not None before attempting to dequantize it. This can prevent potential errors if the quant_state is unexpectedly missing.

                if quant_states:
                    dequantize_dq(quant_states)
                    set_weight_attrs(param, {"bnb_quant_state": quant_states})
                else:
                    logger.warning(f"No quant state found for {param_name}")

Comment on lines 659 to 660
layer_prefix = name.split("experts")[0]
weight_qual_name = layer_prefix + exp[1] + "weight"
Contributor

medium

The function dequantize_dq is called without checking if quant_states is None. Add a check to ensure that quant_states is not None before calling dequantize_dq to prevent potential errors.

                if quant_states:
                    dequantize_dq(quant_states)
                    set_weight_attrs(param, {"bnb_quant_state": quant_states})
                else:
                    logger.warning(f"No quant state found for {param_name}")

Comment on lines 661 to 720
quant_state = dequantize_dq(quant_states_dict[weight_qual_name])
if shard_id == "w1":
    w1_states_lst.append(quant_state)
elif shard_id == "w2":
    w2_states_lst.append(quant_state)
else:
    w3_states_lst.append(quant_state)
del quant_states_dict[weight_qual_name]
assert (len(w1_states_lst) == len(w2_states_lst) ==
        len(w3_states_lst))
w13_absmax_lst = []
w2_absmax_lst = []
w13_shape_lst = []
w2_shape_lst = []
for w1_qs, w2_qs, w3_qs in zip(w1_states_lst, w2_states_lst,
                               w3_states_lst):
    assert w1_qs.shape == w3_qs.shape
    assert w1_qs.blocksize == w2_qs.blocksize == w3_qs.blocksize
    assert w1_qs.dtype == w2_qs.dtype == w3_qs.dtype
    # w1 and w3 are interleaved in storage
    w13_absmax_lst.append(w1_qs.absmax)
    w13_absmax_lst.append(w3_qs.absmax)
    w2_absmax_lst.append(w2_qs.absmax)
    w13_shape_lst.append(w1_qs.shape)
    w13_shape_lst.append(w3_qs.shape)
    w2_shape_lst.append(w2_qs.shape)
# FIXME dimension is dirty
w13_dim0 = 0
w13_dim1 = w13_shape_lst[0][1]
for shape in w13_shape_lst:
    w13_dim0 += shape[0]

w2_dim0 = 0
w2_dim1 = w2_shape_lst[0][1]
for shape in w2_shape_lst:
    w2_dim0 += shape[0]
w13_absmax = torch.cat(w13_absmax_lst)
w2_absmax = torch.cat(w2_absmax_lst)
# Create fused quantization state for w13.
w13_qs = QuantState(
    absmax=w13_absmax,
    shape=(w13_dim0, w13_dim1),
    code=w1_states_lst[0].code,
    blocksize=w1_states_lst[0].blocksize,
    dtype=w1_states_lst[0].dtype,
)
# Create fused quantization state for w2.
w2_qs = QuantState(
    absmax=w2_absmax,
    shape=(w2_dim0, w2_dim1),
    code=w2_states_lst[0].code,
    blocksize=w2_states_lst[0].blocksize,
    dtype=w2_states_lst[0].dtype,
)
# The weight suffixes .w13_weight and .w2_weight are consistent
# with the param in BitsAndBytesMoEMethod.
w13_weight_name = name + ".w13_weight"
w2_weight_name = name + ".w2_weight"
quant_states_dict[w13_weight_name] = w13_qs
quant_states_dict[w2_weight_name] = w2_qs
Contributor

medium

This entire function maybe_fuse_moe_quant_states could benefit from more comprehensive error handling. For instance, the assertions and value checks could be wrapped in try-except blocks to provide more informative error messages and prevent the function from crashing. Additionally, consider logging warnings or errors when certain conditions are not met, instead of just returning.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they run only fastcheck CI, a small and essential subset of tests meant to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify bot added the ci/build label Jun 27, 2025
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee
Collaborator Author

I will temporarily remove Mixtral's BNB support. The reason is that Granite-series MoE models use Mixtral's load_weights, which causes errors. This issue will be resolved in a subsequent PR.

@jeejeelee jeejeelee enabled auto-merge (squash) July 11, 2025 06:12
@jeejeelee jeejeelee merged commit 8020e98 into vllm-project:main Jul 11, 2025
78 checks passed
@QiliangCui
Contributor

This change seems to cause a failure on TPU.

INFO 07-11 09:33:22 [__init__.py:253] Automatically detected platform tpu.
INFO 07-11 09:33:22 [tpu.py:196] tpu_commons not found, using vLLM's TpuPlatform
INFO 07-11 09:33:24 [importing.py:43] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 07-11 09:33:24 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
INFO 07-11 09:33:26 [api_server.py:1641] vLLM API server version 0.9.2rc2.dev168+g8020e98c9
INFO 07-11 09:33:26 [cli_args.py:325] non-default args: {'model': 'RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8', 'seed': 42, 'max_model_len': 2048, 'download_dir': '/mnt/disks/persist', 'enable_prefix_caching': False, 'max_num_batched_tokens': 1024, 'max_num_seqs': 128, 'disable_log_requests': True}
INFO 07-11 09:33:33 [config.py:852] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 07-11 09:33:33 [config.py:1500] Using max model len 2048
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/workspace/vllm/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/workspace/vllm/vllm/entrypoints/cli/serve.py", line 57, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1677, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1697, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 162, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 184, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
  File "/workspace/vllm/vllm/engine/arg_utils.py", line 1104, in create_engine_config
    model_config = self.create_model_config()
  File "/workspace/vllm/vllm/engine/arg_utils.py", line 976, in create_model_config
    return ModelConfig(
  File "/usr/local/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/workspace/vllm/vllm/config.py", line 631, in __post_init__
    self._verify_quantization()
  File "/workspace/vllm/vllm/config.py", line 935, in _verify_quantization
    method = me_quant.get_quantization_config(name)
  File "/workspace/vllm/vllm/model_executor/layers/quantization/__init__.py", line 94, in get_quantization_config
    from .bitsandbytes import BitsAndBytesConfig
  File "/workspace/vllm/vllm/model_executor/layers/quantization/bitsandbytes.py", line 8, in <module>
    from vllm.model_executor.layers.fused_moe import fused_experts
ImportError: cannot import name 'fused_experts' from 'vllm.model_executor.layers.fused_moe' (/workspace/vllm/vllm/model_executor/layers/fused_moe/__init__.py
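The traceback shows the failure happening at import time: merely loading the quantization config module executes a module-level import of the fused-MoE kernel, which is not available on TPU. A common mitigation pattern is to defer such platform-dependent imports to call time. The sketch below is illustrative only (placeholder module name, not the actual fix):

```python
# Illustrative pattern: keep module import cheap and safe, defer heavy
# platform-dependent imports until the code path actually runs.
def load_quant_config():
    """Reading quantization config touches no GPU/TPU kernels."""
    return {"quant_method": "bitsandbytes", "bits": 4}

def apply_fused_moe(x):
    # The import happens only when this function executes, so platforms
    # without the fused kernel can still import the enclosing module.
    try:
        from nonexistent_fused_moe_kernel import fused_experts  # placeholder name
    except ImportError as e:
        raise RuntimeError(
            "fused MoE kernel unavailable on this platform") from e
    return fused_experts(x)
```

With this layout, `get_quantization_config("bitsandbytes")` would succeed everywhere, and only an actual BNB MoE forward pass would fail on unsupported hardware.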

@jeejeelee jeejeelee deleted the moe-support-bnb branch July 11, 2025 14:51
@QiliangCui
Contributor

The CI test "TPU V1 Benchmark Test" failed and gives the same error.


@jeejeelee
Collaborator Author

Sorry, I was not aware of this failure; please feel free to fix it.

@QiliangCui
Contributor

Sorry, I was not aware of this failure; please feel free to fix it.

Could you fix forward or rollback?

@Isotr0py
Member

#20822 should fix this TPU failure as well.

@QiliangCui
Contributor

#20822 should fix this TPU failure as well.

Thank you :).

xuechendi added a commit to xuechendi/vllm-fork that referenced this pull request Jul 11, 2025
…import bitsandbytes

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Chen-zexi pushed a commit to Chen-zexi/vllm that referenced this pull request Jul 13, 2025
joerunde pushed a commit to vllm-project/vllm-spyre that referenced this pull request Jul 14, 2025
Closes #305 

The changes in upstream that are breaking tests with the vLLM:main branch
are twofold:
1. [This PR](vllm-project/vllm#20061)
introduces [an
import](https://github.com/vllm-project/vllm/blob/8020e98c9f033e76c97eb8261f772d59eba49c9a/vllm/model_executor/model_loader/bitsandbytes_loader.py#L23)
which is incompatible with torch 2.5.1
2. Previously, when `model_config.task` was set to `auto`, the task
could be automatically resolved in vLLM: after the post-init method of
`ModelConfig`, the `task` field was automatically replaced from `auto`
to e.g. `generate` or `embed`. With the changes in [This
PR](vllm-project/vllm@020f58a),
the task in `model_config` can no longer be automatically resolved
when it is `generate`. This doesn't look intentional, but rather
a side effect of adding new supported tasks. A cleaner (but longer) way
would be a PR to vLLM to fix that and reintroduce the automatic
resolution of the `auto` task field.

This PR addresses both of the breaking changes:
* for 1, it installs pytorch==2.7.0 in GitHub workflows when the vLLM:main
branch is used
* for 2, it infers the task through `model_config.supported_tasks`
rather than `model_config.task`
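The second workaround can be sketched as a small helper. Field names mirror the description above; the preference order is illustrative, not vLLM's actual resolution logic:

```python
def resolve_task(model_config):
    """Prefer an explicitly resolved task; otherwise pick from supported_tasks."""
    task = getattr(model_config, "task", "auto")
    if task != "auto":
        return task
    supported = getattr(model_config, "supported_tasks", [])
    # Deterministic preference order (an assumption for this sketch).
    for candidate in ("generate", "embed", "classify", "reward"):
        if candidate in supported:
            return candidate
    raise ValueError("cannot resolve task from supported_tasks")

class _Cfg:  # minimal stand-in for vLLM's ModelConfig
    task = "auto"
    supported_tasks = ["embed", "generate"]

resolved = resolve_task(_Cfg())  # → "generate"
```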

---------

Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
patrickvonplaten pushed a commit to patrickvonplaten/vllm that referenced this pull request Jul 15, 2025
…ct#20061)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
…ct#20061)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
…ct#20061)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
…ct#20061)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
taneem-ibrahim pushed a commit to taneem-ibrahim/vllm that referenced this pull request Aug 14, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
…ct#20061)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Labels
ci/build qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants