[New Model]: Support Qwen3 Embedding & Reranker #19260
Conversation
Hello @noooop, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR aims to add initial support for Qwen3 Embedding and Reranker models within vLLM. Based on the PR title and description, the primary goal is to enable these specific Qwen3 model types. The author notes that the current implementation, particularly for the Reranker task, is a "dirty fix" due to underlying architectural challenges: both Embedding and Reranker models share the same `Qwen3ForCausalLM` architecture in Hugging Face, which vLLM isn't currently designed to handle cleanly for these distinct tasks. The PR focuses on implementing a specific scoring mechanism for the Reranker task by leveraging the existing causal language model structure.
Highlights
- Qwen3 Model Support: Adds initial support for Qwen3 Embedding and Reranker models.
- Reranker Scoring Logic: Implements a specific scoring method for the Reranker task by extracting logits for predefined true/false tokens (2152 and 9693) and calculating a score based on their probabilities.
- Cross-Encoding Interface: The `Qwen3ForCausalLM` class now implements the `SupportsCrossEncoding` interface, indicating its capability for tasks like reranking.
- Known Limitations: The author explicitly mentions several existing issues with this approach, including the challenge of supporting both Embedding and Reranker tasks with the same underlying architecture and the current scoring method being a workaround.
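The scoring method described in the highlights can be sketched as a standalone snippet (an illustration only, not vLLM's actual implementation; the token IDs 2152 and 9693 are the ones hardcoded in this PR):

```python
import math

# Token IDs for "false" and "true" as hardcoded in this PR.
FALSE_TOKEN_ID = 2152
TRUE_TOKEN_ID = 9693

def rerank_score(logits: list[float]) -> float:
    """Compute a relevance score from the last-token logits.

    The score is the probability mass assigned to the "true" token
    in a two-way softmax over the "true" and "false" token logits.
    """
    false_logit = logits[FALSE_TOKEN_ID]
    true_logit = logits[TRUE_TOKEN_ID]
    # Two-way softmax: P(true) = e^t / (e^t + e^f)
    return math.exp(true_logit) / (math.exp(true_logit) + math.exp(false_logit))

# Toy vocabulary-sized logit vector where "true" is strongly preferred.
logits = [0.0] * 20000
logits[TRUE_TOKEN_ID] = 4.0
logits[FALSE_TOKEN_ID] = 1.0
print(round(rerank_score(logits), 4))  # → 0.9526
```

With a logit gap of 3.0 in favor of "true", the two-way softmax is equivalent to a sigmoid over the difference, giving a score of about 0.95.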
Changelog
- vllm/model_executor/models/qwen3.py
  - Added imports for pooling-related classes (`LastPool`, `PoolingMetadata`, `PoolerOutput`, `PoolingSequenceGroupOutput`) and the `SupportsCrossEncoding` interface.
  - Updated the `Qwen3ForCausalLM` class definition to inherit from `SupportsCrossEncoding`.
  - In the `Qwen3ForCausalLM` constructor, initialized a `LastPool` layer if the model task is set to "score".
  - Added a `pooler` method to `Qwen3ForCausalLM` which implements the custom scoring logic for the reranker task using specific token logits (2152 and 9693).
Code Review
This PR adds support for Qwen3 Embedding and Reranker models by introducing specific handling for the 'score' task within the `Qwen3ForCausalLM` architecture. The approach for reranking is acknowledged as a workaround and involves using logits of specific 'true'/'false' tokens. The changes are targeted and address the issues outlined in the PR description. However, the hardcoded token IDs are a concern for maintainability and generalizability.
Summary of Findings
- Hardcoded Token IDs: The `pooler` method in vllm/model_executor/models/qwen3.py uses hardcoded token IDs (2152 for false, 9693 for true). This is a maintainability concern and should ideally be made configurable or derived from the model/tokenizer configuration.
- Clarity on Reranker Logic: The choice of `LastPool` in the `__init__` method for the 'score' task, and how it relates to the subsequent logit-based scoring in the `pooler` method, could benefit from more explicit comments or documentation, especially given its characterization as a "dirty fix".
Merge Readiness
This pull request makes a good effort to support Qwen3 Reranker models within the existing `Qwen3ForCausalLM` architecture, acknowledging the current limitations. The approach taken is a pragmatic workaround.
However, the use of hardcoded token IDs is a significant concern that should be addressed. Ideally, these should be configurable or automatically derived. At a minimum, their origin and necessity for being hardcoded should be clearly documented in the code.
Given these points, and the author's own acknowledgement of this being a "dirty fix", I recommend addressing the hardcoded token ID issue and potentially clarifying the `LastPool` rationale before merging. I am unable to approve pull requests, but I suggest further discussion on these points. Other reviewers should assess the overall architectural implications of this workaround.
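The reviewer's suggestion to derive the token IDs at load time instead of hardcoding them could look like the following (a hypothetical sketch; `SimpleVocab` is a stand-in for a real tokenizer such as Hugging Face's `AutoTokenizer`, and any object exposing `convert_tokens_to_ids` would work):

```python
class SimpleVocab:
    """Minimal stand-in for a tokenizer, used only for this illustration."""

    def __init__(self, vocab: dict[str, int]):
        self._vocab = vocab

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab[token]

def resolve_classifier_token_ids(tokenizer, false_token: str = "no",
                                 true_token: str = "yes") -> tuple[int, int]:
    """Look up the IDs of the false/true classifier tokens from the tokenizer
    instead of hardcoding them in the model code."""
    return (tokenizer.convert_tokens_to_ids(false_token),
            tokenizer.convert_tokens_to_ids(true_token))

# Toy vocabulary mapping the two classifier tokens to the IDs used in the PR.
tok = SimpleVocab({"no": 2152, "yes": 9693})
print(resolve_classifier_token_ids(tok))  # → (2152, 9693)
```

This keeps the model code free of magic numbers and makes the false/true tokens configurable per checkpoint.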
quick review
tests/models/language/pooling/test_qwen3_reranker.py was written a bit hastily; I am refactoring the tests for score, and the next PR will fix it.
The current code is a bit hacky. Wait until next week to see whether the official team can convert the model into Qwen3ForSequenceClassification format.
Hello! Thank you for your useful work here @noooop. I converted the model to a sequence classification model in the meantime for me to be able to do some testing - it might come in handy for you as well: https://huggingface.co/tomaarsen/Qwen3-Reranker-0.6B-seq-cls I can also move it to the
quick review
thanks
fixed by #19686
Signed-off-by: minpeter <kali2005611@gmail.com>
@noooop Could you update the vLLM usage instructions in the ModelScope and GitHub READMEs?
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Summary
Usage
curl
Caution
Please use the query_template and document_template to format the query and document for better reranker results.
Without the templates, the results are almost random. PTAL #19344
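For reference, formatting with these templates might look like the sketch below. The exact strings follow the Qwen3-Reranker model card's published usage; treat them as assumptions and check the model card if they have changed:

```python
# System prompt and chat-template framing used by Qwen3-Reranker
# (taken from the model card; verify against the current model card).
prefix = (
    '<|im_start|>system\nJudge whether the Document meets the requirements '
    'based on the Query and the Instruct provided. Note that the answer can '
    'only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
)
suffix = '<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'

query_template = prefix + '<Instruct>: {instruction}\n<Query>: {query}\n'
document_template = '<Document>: {document}' + suffix

query = query_template.format(
    instruction='Given a web search query, retrieve relevant passages that '
                'answer the query',
    query='What is the capital of France?',
)
document = document_template.format(document='Paris is the capital of France.')
print(query + document)
```

The concatenated string is what the reranker actually scores, so skipping the templates changes the input distribution the model was trained on.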
For models that have been converted into `Qwen3ForSequenceClassification`, such as tomaarsen/Qwen3-Reranker-0.6B-seq-cls:
/score
expected output
/rerank
expected output
For the official model:
Why do we need hf_overrides:
Qwen3-Reranker is a language model that performs reranking using the logits of the "no" and "yes" tokens. vLLM converts it to Qwen3ForSequenceClassification when loaded, for better performance.
- "architectures": ["Qwen3ForSequenceClassification"] manually routes the model to Qwen3ForSequenceClassification.
- "classifier_from_token": ["no", "yes"] builds the classification output from the logits of those two tokens.
- "is_original_qwen3_reranker": True marks the checkpoint as the original (unconverted) Qwen3-Reranker.
If you start vllm serve correctly and use the corresponding model name, then calling the official model works the same as calling a model that has been converted into `Qwen3ForSequenceClassification`.

/score
expected output
/rerank
expected output
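Put together, serving the official checkpoint with the overrides above might look like this launch command (a sketch; the flag spelling follows vLLM's CLI, but verify it against your vLLM version):

```shell
vllm serve Qwen/Qwen3-Reranker-0.6B --task score \
  --hf-overrides '{"architectures": ["Qwen3ForSequenceClassification"],
                   "classifier_from_token": ["no", "yes"],
                   "is_original_qwen3_reranker": true}'
```

Models already converted to `Qwen3ForSequenceClassification` need no overrides and can be served directly by name.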
Offline + formatting query & document:
requests demo + formatting query & document:
expected output
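A minimal sketch of such a requests-style demo against the /score endpoint (the model name, port, and payload texts are assumptions; vLLM's /score endpoint takes `text_1`/`text_2` pairs). The actual POST is left commented out so the snippet runs without a live server:

```python
import json
import urllib.request

# Build the /score request payload. In a real demo, text_1 and text_2 would
# be the query and document already formatted with the reranker templates.
payload = {
    "model": "tomaarsen/Qwen3-Reranker-0.6B-seq-cls",
    "text_1": "formatted query string",
    "text_2": "formatted document string",
}
body = json.dumps(payload).encode()
req = urllib.request.Request(
    "http://localhost:8000/score",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Uncomment when a vllm serve instance is running on localhost:8000:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
print(body.decode())
```

The response contains one score per text pair; with the converted model, no hf_overrides are needed on the server side.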
Legacy
For Embedding
After merging
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/discussions/2
embeddings-benchmark/mteb#2769 (comment)
Qwen3-Embedding can already output results close to SentenceTransformers.
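"Close to SentenceTransformers" is typically quantified as cosine similarity between the two backends' embeddings of the same input; a minimal helper, with toy vectors standing in for real model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a vLLM embedding and a SentenceTransformers
# embedding of the same sentence; "close" outputs give similarity near 1.0.
v_vllm = [0.1, 0.2, 0.3]
v_st = [0.1001, 0.1999, 0.3002]
print(round(cosine_similarity(v_vllm, v_st), 4))
```

Comparing per-input similarities like this is how discrepancies between the two implementations were tracked down in the linked discussions.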
For Reranker
- format_instruction: should format_instruction be handled by users or by vLLM, and where should this piece of code be placed? I temporarily added a process_inputs callback function for LLM.score, but the online version (OpenAI-Compatible Server) doesn't know how to handle it. Please format the string using the method above.

FIX #19229
FIX #19252
FIX #19366