[Quantization][1/N] MoE support BNB-Inflight Quantization #20061
Conversation
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Summary of Changes

Hello @jeejeelee, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for BitsAndBytes (BNB) quantization for Mixture-of-Experts (MoE) models within vLLM. It extends the existing BNB framework to properly handle the unique weight structures and loading requirements of MoE layers, aiming to reduce memory footprint and enable efficient inference for quantized MoE models.

Highlights

- MoE Quantization Method: A new `BitsAndBytesMoEMethod` has been introduced to handle 4-bit quantization specifically for Mixture-of-Experts (MoE) layers. This method manages the creation of quantized weights and the forward pass, including on-the-fly dequantization.
- Optimized MoE Weight Loading: The weight-loading process for MoE layers has been enhanced to support BitsAndBytes. This includes modifying the `_load_w13` function to allow full loading of pre-sharded weights and implementing a `maybe_fuse_moe_quant_states` function to consolidate individual expert quantization states (e.g., `w1` and `w3` into a fused `w13`) for improved memory efficiency.
- Model Integration for MoE: MoE models, specifically Qwen2 and Qwen3, have been updated to expose their expert weight structures through new `get_expert_mapping` methods and `packed_modules_mapping`. This enables the BitsAndBytes loader to correctly identify, map, and process the unique weight configurations of MoE layers.
- General Quantization Utility: A `calculate_quant_ratio` utility function has been extracted and made globally accessible within the quantization module, promoting code reusability.
Code Review
The pull request introduces support for MoE models with BitsAndBytes quantization. The changes include modifications to the fused MoE layer, bitsandbytes loader, and model definitions for Qwen2 and Qwen3 MoE models. The code appears well-structured and addresses the objective of adding MoE support. However, there are opportunities to improve error handling, code clarity, and documentation, as highlighted in the review comments.
```python
self._load_w13(
    shard_id=shard_id,
    shard_dim=shard_dim,
    loaded_weight=loaded_weight,
    expert_data=expert_data,
    tp_rank=self.tp_rank,
    load_full=full_load,
)
```
This block of code handles the BitsAndBytes case, but there is no explicit error handling if `self._load_w13` fails. Consider adding a try-except block to catch potential exceptions and log an error message, which would help in debugging.
Suggested change:

```python
try:
    self._load_w13(
        shard_id=shard_id,
        shard_dim=shard_dim,
        loaded_weight=loaded_weight,
        expert_data=expert_data,
        tp_rank=self.tp_rank,
        load_full=full_load,
    )
except Exception as e:
    logger.error(f"Error loading w13 for shard_id {shard_id}: {e}")
```
```python
raise AttributeError(
    f"MoE Model {type(model).__name__} does not support "
    "BitsAndBytes quantization yet. "
    "No 'get_expert_mapping' found.")
```
This `AttributeError` is raised when `get_expert_mapping` is not found. Consider providing a more specific error message that guides the user on how to resolve the issue, such as suggesting they check the model documentation or use a different model.
Suggested change:

```python
raise AttributeError(
    f"MoE Model {type(model).__name__} does not support "
    "BitsAndBytes quantization yet. Ensure the model has a "
    "'get_expert_mapping' method and that MoE is correctly configured.")
```
```python
if shard_id not in ("w1", "w2", "w3"):
    raise ValueError(f"shard_id must be ['w1','w2','w3'] but "
                     f"got {shard_id}.")
```
Consider adding a check to ensure that the `quant_state` is not None before attempting to dequantize it. This can prevent potential errors if the `quant_state` is unexpectedly missing.
```python
if quant_states:
    dequantize_dq(quant_states)
    set_weight_attrs(param, {"bnb_quant_state": quant_states})
else:
    logger.warning(f"No quant state found for {param_name}")
```
```python
layer_prefix = name.split("experts")[0]
weight_qual_name = layer_prefix + exp[1] + "weight"
```
The function `dequantize_dq` is called without checking whether `quant_states` is None. Add a check to ensure that `quant_states` is not None before calling `dequantize_dq` to prevent potential errors.
Suggested change:

```python
if quant_states:
    dequantize_dq(quant_states)
    set_weight_attrs(param, {"bnb_quant_state": quant_states})
else:
    logger.warning(f"No quant state found for {param_name}")
```
```python
quant_state = dequantize_dq(quant_states_dict[weight_qual_name])
if shard_id == "w1":
    w1_states_lst.append(quant_state)
elif shard_id == "w2":
    w2_states_lst.append(quant_state)
else:
    w3_states_lst.append(quant_state)
del quant_states_dict[weight_qual_name]
assert (len(w1_states_lst) == len(w2_states_lst) ==
        len(w3_states_lst))
w13_absmax_lst = []
w2_absmax_lst = []
w13_shape_lst = []
w2_shape_lst = []
for w1_qs, w2_qs, w3_qs in zip(w1_states_lst, w2_states_lst,
                               w3_states_lst):
    assert w1_qs.shape == w3_qs.shape
    assert w1_qs.blocksize == w2_qs.blocksize == w3_qs.blocksize
    assert w1_qs.dtype == w2_qs.dtype == w3_qs.dtype
    # w1 and w3 are interleaved in storage
    w13_absmax_lst.append(w1_qs.absmax)
    w13_absmax_lst.append(w3_qs.absmax)
    w2_absmax_lst.append(w2_qs.absmax)
    w13_shape_lst.append(w1_qs.shape)
    w13_shape_lst.append(w3_qs.shape)
    w2_shape_lst.append(w2_qs.shape)
# FIXME dimension is dirty
w13_dim0 = 0
w13_dim1 = w13_shape_lst[0][1]
for shape in w13_shape_lst:
    w13_dim0 += shape[0]
w2_dim0 = 0
w2_dim1 = w2_shape_lst[0][1]
for shape in w2_shape_lst:
    w2_dim0 += shape[0]
w13_absmax = torch.cat(w13_absmax_lst)
w2_absmax = torch.cat(w2_absmax_lst)
# Create fused quantization state for w13.
w13_qs = QuantState(
    absmax=w13_absmax,
    shape=(w13_dim0, w13_dim1),
    code=w1_states_lst[0].code,
    blocksize=w1_states_lst[0].blocksize,
    dtype=w1_states_lst[0].dtype,
)
# Create fused quantization state for w2.
w2_qs = QuantState(
    absmax=w2_absmax,
    shape=(w2_dim0, w2_dim1),
    code=w2_states_lst[0].code,
    blocksize=w2_states_lst[0].blocksize,
    dtype=w2_states_lst[0].dtype,
)
# The weight suffixes .w13_weight and .w2_weight are consistent
# with the param in BitsAndBytesMoEMethod.
w13_weight_name = name + ".w13_weight"
w2_weight_name = name + ".w2_weight"
quant_states_dict[w13_weight_name] = w13_qs
quant_states_dict[w2_weight_name] = w2_qs
```
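The fusion arithmetic in the block above (per-expert `w1`/`w3` entries interleaved, fused `w13` dim0 equal to the sum of the interleaved per-expert dim0s) can be checked with a small dependency-free sketch. All shapes below are made-up illustrations, and `fuse_w13_shape` is a hypothetical helper, not part of the PR:

```python
# Toy model of the shape fusion in maybe_fuse_moe_quant_states:
# w1 and w3 shapes are interleaved per expert, and the fused dim0
# is the sum of all interleaved dim0s. Shapes are illustrative.

def fuse_w13_shape(w1_shapes, w3_shapes):
    interleaved = []
    for s1, s3 in zip(w1_shapes, w3_shapes):
        assert s1 == s3, "w1 and w3 must have matching shapes"
        interleaved.extend([s1, s3])
    dim0 = sum(s[0] for s in interleaved)
    return (dim0, interleaved[0][1])

# Two experts, each with a (4, 8) w1 and w3 shard -> fused (16, 8).
print(fuse_w13_shape([(4, 8), (4, 8)], [(4, 8), (4, 8)]))  # (16, 8)
```

The same summation is what the real code performs over `QuantState.shape` entries before concatenating the `absmax` tensors with `torch.cat`.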
This entire function `maybe_fuse_moe_quant_states` could benefit from more comprehensive error handling. For instance, the assertions and value checks could be wrapped in try-except blocks to provide more informative error messages and prevent the function from crashing. Additionally, consider logging warnings or errors when certain conditions are not met, instead of just returning.
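One way to act on this suggestion is to replace the bare length assertion with an informative error. The helper name and message wording below are illustrative, not the actual vLLM code:

```python
def validate_expert_states(w1_states, w2_states, w3_states):
    """Raise a descriptive error instead of a bare AssertionError
    when the per-expert quant-state lists are out of sync."""
    n1, n2, n3 = len(w1_states), len(w2_states), len(w3_states)
    if not (n1 == n2 == n3):
        raise ValueError(
            "Expected one w1/w2/w3 quant state per expert, but got "
            f"w1={n1}, w2={n2}, w3={n3}; the checkpoint may be missing "
            "expert shards.")
    return n1
```

A message like this points directly at the malformed checkpoint instead of surfacing as an opaque `AssertionError` deep in the loader.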
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 8ffcd06 to f862320.
I will temporarily remove Mixtral's BNB support. The reason is that Granite-series MoE models use Mixtral's
This change seems to cause a failure on TPU.

Sorry, I was not aware of this failure; please feel free to fix it.

Could you fix forward or roll back?

#20822 should fix this TPU failure as well.

Thank you :).
Closes #305. The upstream changes that are breaking tests with the vLLM main branch are twofold: 1. [This PR](vllm-project/vllm#20061) introduces [an import](https://github.com/vllm-project/vllm/blob/8020e98c9f033e76c97eb8261f772d59eba49c9a/vllm/model_executor/model_loader/bitsandbytes_loader.py#L23) which is incompatible with torch 2.5.1. 2. Previously, when `model_config.task` was set to `auto`, the task could be automatically resolved in vLLM: after the post-init method of `ModelConfig`, the `task` field was automatically replaced from `auto` to e.g. `generate` or `embed`. With the changes in [this PR](vllm-project/vllm@020f58a), the task in `model_config` can no longer be automatically resolved when it is `generate`. This doesn't look intended, but rather a side effect of adding new supported tasks. A cleaner (but longer) way would be to open a PR against vLLM to fix that and reintroduce the automatic resolution of the `auto` task field. This PR addresses both breaking changes: for 1, it installs pytorch==2.7.0 in GitHub workflows when the vLLM main branch is used; for 2, it infers the task through `model_config.supported_tasks` rather than `model_config.task`. Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Main changes

Supported Model
- mistralai/Mixtral-8x7B-Instruct-v0.1

Uncertain
I can generate reasonable results locally using the above models, but I'm unable to achieve complete alignment with the results generated by transformers.