
Conversation

Contributor

@noooop noooop commented Jun 6, 2025

Summary

  • Qwen3 Embedding
    • Qwen/Qwen3-Embedding-0.6B
    • Qwen/Qwen3-Embedding-4B
    • Qwen/Qwen3-Embedding-8B
  • Qwen3 Reranker
    • Qwen/Qwen3-Reranker-0.6B
    • Qwen/Qwen3-Reranker-4B
    • Qwen/Qwen3-Reranker-8B
    • tomaarsen/Qwen3-Reranker-0.6B-seq-cls

Usage

  • Qwen3 Embedding
vllm serve Qwen/Qwen3-Embedding-0.6B

curl

curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "Qwen/Qwen3-Embedding-0.6B",
    "encoding_format": "float"
  }'
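
The same request can be made via the OpenAI Python client. A minimal sketch, assuming the openai package is installed and the server above is running (vLLM accepts any placeholder API key unless one was configured with --api-key):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input="Follow the white rabbit.",
    encoding_format="float",
)
print(resp.data[0].embedding[:8])  # first few dimensions of the embedding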
  • Qwen3 Reranker

Caution

Please use the query_template and document_template to format the query and document for better reranker results.
Without the templates, the results are almost random; see the offline example below. PTAL #19344

For models that have been converted to Qwen3ForSequenceClassification, such as tomaarsen/Qwen3-Reranker-0.6B-seq-cls:

vllm serve tomaarsen/Qwen3-Reranker-0.6B-seq-cls

/score

curl http://127.0.0.1:8000/score \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "text_1": "ping",
    "text_2": "pong",
    "model": "tomaarsen/Qwen3-Reranker-0.6B-seq-cls"
  }'

expected output

{"id":"score-ee337e20e932467a83792d220614a7cd","object":"list","created":1749527048,"model":"tomaarsen/Qwen3-Reranker-0.6B-seq-cls","data":[{"index":0,"object":"score","score":0.06634521484375}],"usage":{"prompt_tokens":2,"total_tokens":2,"completion_tokens":0,"prompt_tokens_details":null}}

/rerank

curl http://127.0.0.1:8000/rerank \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "ping",
    "documents": ["pong"],
    "model": "tomaarsen/Qwen3-Reranker-0.6B-seq-cls"
  }'

expected output

{"id":"rerank-fe06b692387444b7a56e282944f285f9","model":"tomaarsen/Qwen3-Reranker-0.6B-seq-cls","usage":{"total_tokens":2},"results":[{"index":0,"document":{"text":"pong"},"relevance_score":0.06634521484375}]}

For the official model:

vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'

Why do we need hf_overrides:
Qwen3-Reranker is a language model that performs reranking using the logits of the "no" and "yes" tokens.
vLLM converts it to Qwen3ForSequenceClassification at load time for better performance.

  • First, "architectures": ["Qwen3ForSequenceClassification"] manually routes the model to Qwen3ForSequenceClassification.
  • Then, "classifier_from_token": ["no", "yes"] extracts the vectors corresponding to those tokens from lm_head.
  • Finally, the two vectors are converted into a single vector; this conversion logic is enabled by "is_original_qwen3_reranker": true.

Once vllm serve is started correctly, calling the official model (with the corresponding model name) works the same as calling a model that has already been converted to Qwen3ForSequenceClassification.
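
To illustrate the conversion step, here is a minimal sketch of the idea (not the actual vLLM code; the tensor sizes are illustrative, and the token ids 2152 for "no" and 9693 for "yes" follow the discussion below):

import torch

def convert_to_classifier_weight(lm_head_weight: torch.Tensor,
                                 no_id: int, yes_id: int) -> torch.Tensor:
    # P(yes | {no, yes}) = softmax([h·w_no, h·w_yes])[1]
    #                    = sigmoid(h · (w_yes - w_no)),
    # so a single score head with weight (w_yes - w_no) plus a sigmoid
    # reproduces the original two-token scoring.
    return lm_head_weight[yes_id] - lm_head_weight[no_id]

hidden = torch.randn(1, 1024)        # last-token hidden state (illustrative size)
lm_head = torch.randn(151669, 1024)  # vocab_size x hidden_size
w = convert_to_classifier_weight(lm_head, no_id=2152, yes_id=9693)
print(torch.sigmoid(hidden @ w))     # relevance score in (0, 1)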

/score

curl http://127.0.0.1:8000/score \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "text_1": "ping",
    "text_2": "pong",
    "model": "Qwen/Qwen3-Reranker-0.6B"
  }'

expected output

{"id":"score-7dbe101346ea4aeea4b85aa7971ddf8f","object":"list","created":1749527323,"model":"Qwen/Qwen3-Reranker-0.6B","data":[{"index":0,"object":"score","score":0.0673828125}],"usage":{"prompt_tokens":2,"total_tokens":2,"completion_tokens":0,"prompt_tokens_details":null}}

/rerank

curl http://127.0.0.1:8000/rerank \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "ping",
    "documents": ["pong"],
    "model": "Qwen/Qwen3-Reranker-0.6B"
  }'

expected output

{"id":"rerank-43ddc0f96f174ae4a5eef07d51a8defd","model":"Qwen/Qwen3-Reranker-0.6B","usage":{"total_tokens":2},"results":[{"index":0,"document":{"text":"pong"},"relevance_score":0.0673828125}]}

Offline usage, formatting the query & document:

from vllm import LLM

model_name = "Qwen/Qwen3-Reranker-0.6B"

# What is the difference between the official original version and one
# that has been converted into a sequence classification model?
# Qwen3-Reranker is a language model that performs reranking using the
# logits of the "no" and "yes" tokens.
# That requires computing logits over all 151669 vocabulary tokens, which
# is extremely inefficient, not to mention incompatible with the vLLM
# score API.
# A method for converting the original model into a sequence classification
# model was proposed. See:
# https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline this way are not only more efficient and
# compatible with the vLLM score API, but also have more concise init
# parameters, for example:
# model = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score")

# If you want to load the official original version, the init parameters
# are as follows.

model = LLM(
    model=model_name,
    task="score",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)

# Why do we need hf_overrides for the official original version:
# vLLM converts it to Qwen3ForSequenceClassification at load time for
# better performance.
# - First, `"architectures": ["Qwen3ForSequenceClassification"]` manually
#   routes the model to Qwen3ForSequenceClassification.
# - Then, `"classifier_from_token": ["no", "yes"]` extracts the vectors
#   corresponding to those tokens from lm_head.
# - Finally, the two vectors are converted into a single vector; this
#   conversion logic is enabled by `"is_original_qwen3_reranker": True`.

# Please use the query_template and document_template to format the query and
# document for better reranker results.

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

if __name__ == "__main__":
    instruction = (
        "Given a web search query, retrieve relevant passages that answer the query"
    )

    queries = [
        "What is the capital of China?",
        "Explain gravity",
    ]

    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    ]

    queries = [
        query_template.format(prefix=prefix, instruction=instruction, query=query)
        for query in queries
    ]
    documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]

    outputs = model.score(queries, documents)

    print([output.outputs.score for output in outputs])

requests demo, formatting the query & document:

import requests

url = "http://127.0.0.1:8000/score"
MODEL_NAME = "tomaarsen/Qwen3-Reranker-0.6B-seq-cls"

# Please use the query_template and document_template to format the query and
# document for better reranker results.

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

instruction = (
    "Given a web search query, retrieve relevant passages that answer the query"
)

queries = [
    "What is the capital of China?",
    "Explain gravity",
]

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

queries = [
    query_template.format(prefix=prefix, instruction=instruction, query=query)
    for query in queries
]
documents = [
    document_template.format(doc=doc, suffix=suffix) for doc in documents
]

response = requests.post(url,
                         json={
                             "model": MODEL_NAME,
                             "text_1": queries,
                             "text_2": documents,
                             "truncate_prompt_tokens": -1,
                         }).json()

print(response)

expected output

{'id': 'score-14f698f021b9434482ec3d94a5757e11', 'object': 'list', 'created': 1749786173, 'model': 'tomaarsen/Qwen3-Reranker-0.6B-seq-cls', 'data': [{'index': 0, 'object': 'score', 'score': 0.99951171875}, {'index': 1, 'object': 'score', 'score': 0.99951171875}], 'usage': {'prompt_tokens': 189, 'total_tokens': 189, 'completion_tokens': 0, 'prompt_tokens_details': None}}

Legacy

For Embedding:
After merging
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/discussions/2
and embeddings-benchmark/mteb#2769 (comment),
Qwen3-Embedding can already produce results close to those of SentenceTransformers.

For Reranker

  • Qwen3ForCausalLM: Qwen3 Embedding & Reranker both use the same architecture, Qwen3ForCausalLM; vLLM currently has no way to let a single architecture support Embedding and Reranker at the same time.
  • SupportsCrossEncoding: For the Reranker, the biggest problem is that the score task treats a Qwen3ForCausalLM model like an embedding model, computing embeddings and cosine distance. This is definitely not what is wanted.
  • Qwen3ForSequenceClassification: Perhaps ultimately we need something like --hf-overrides '{"architectures": ["Qwen3ForSequenceClassification"]}' to get the Qwen3 Reranker to run correctly.
  • classifier_from_token: A more efficient approach is to extract token_false_id = 2152 and token_true_id = 9693 and treat scoring as a 2-class classification task rather than the current 151669-class one. We need a new interface (classifier_from_token) to implement this; see the sketch after this list.
  • converted to a single-label classification model: The 2-way classifier is actually just a 1-way head classifier. https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
  • format_instruction: Should format_instruction be handled by users or by vLLM, and where should this code live? I temporarily added a process_inputs callback for LLM.score, but the online version (OpenAI-Compatible Server) has no way to handle it. Please format the strings using the templates above.
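
For reference, a sketch of the original two-token scoring that the classifier_from_token conversion replaces (following the official usage pattern; the dummy logits here are purely illustrative):

import torch
import torch.nn.functional as F

token_false_id, token_true_id = 2152, 9693  # "no" / "yes" in the Qwen3 vocab

def score_from_logits(last_token_logits: torch.Tensor) -> torch.Tensor:
    # last_token_logits: [batch, vocab_size] logits at the final position
    pair = torch.stack(
        [last_token_logits[:, token_false_id],
         last_token_logits[:, token_true_id]],
        dim=1,
    )
    # probability assigned to "yes" among {"no", "yes"}
    return F.log_softmax(pair, dim=1)[:, 1].exp()

logits = torch.randn(2, 151669)  # dummy logits for illustration
print(score_from_logits(logits))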

FIX #19229
FIX #19252
FIX #19366

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @noooop, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR aims to add initial support for Qwen3 Embedding and Reranker models within vLLM. Based on the PR title and description, the primary goal is to enable these specific Qwen3 model types. The author notes that the current implementation, particularly for the Reranker task, is a "dirty fix" due to underlying architectural challenges where both Embedding and Reranker models share the same Qwen3ForCausalLM architecture in Hugging Face, which vLLM isn't currently designed to handle cleanly for these distinct tasks. The PR focuses on implementing a specific scoring mechanism for the Reranker task by leveraging the existing causal language model structure.

Highlights

  • Qwen3 Model Support: Adds initial support for Qwen3 Embedding and Reranker models.
  • Reranker Scoring Logic: Implements a specific scoring method for the Reranker task by extracting logits for predefined true/false tokens (2152 and 9693) and calculating a score based on their probabilities.
  • Cross-Encoding Interface: The Qwen3ForCausalLM class now implements the SupportsCrossEncoding interface, indicating its capability for tasks like reranking.
  • Known Limitations: The author explicitly mentions several existing issues with this approach, including the challenge of supporting both Embedding and Reranker tasks with the same underlying architecture and the current scoring method being a workaround.

Changelog

  • vllm/model_executor/models/qwen3.py
    • Added imports for pooling-related classes (LastPool, PoolingMetadata, PoolerOutput, PoolingSequenceGroupOutput) and the SupportsCrossEncoding interface.
    • Updated the Qwen3ForCausalLM class definition to inherit from SupportsCrossEncoding.
    • In the Qwen3ForCausalLM constructor, initialized a LastPool layer if the model task is set to "score".
    • Added a pooler method to Qwen3ForCausalLM which implements the custom scoring logic for the reranker task using specific token logits (2152 and 9693).


github-actions bot commented Jun 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds support for Qwen3 Embedding and Reranker models by introducing specific handling for the 'score' task within the Qwen3ForCausalLM architecture. The approach for reranking is acknowledged as a workaround and involves using logits of specific 'true'/'false' tokens. The changes are targeted and address the issues outlined in the PR description. However, the hardcoded token IDs are a concern for maintainability and generalizability.

Summary of Findings

  • Hardcoded Token IDs: The pooler method in vllm/model_executor/models/qwen3.py uses hardcoded token IDs (2152 for false, 9693 for true). This is a maintainability concern and should ideally be made configurable or derived from model/tokenizer configuration.
  • Clarity on Reranker Logic: The choice of LastPool in the __init__ method for the 'score' task, and how it relates to the subsequent logit-based scoring in the pooler method, could benefit from more explicit comments or documentation, especially given its characterization as a "dirty fix".

Merge Readiness

This pull request makes a good effort to support Qwen3 Reranker models within the existing Qwen3ForCausalLM architecture, acknowledging the current limitations. The approach taken is a pragmatic workaround.

However, the use of hardcoded token IDs is a significant concern that should be addressed. Ideally, these should be configurable or automatically derived. At a minimum, their origin and necessity for being hardcoded should be clearly documented in the code.

Given these points, and the author's own acknowledgement of this being a "dirty fix", I recommend addressing the hardcoded token ID issue and potentially clarifying the LastPool rationale before merging. I am unable to approve pull requests, but I suggest further discussion on these points. Other reviewers should assess the overall architectural implications of this workaround.

@DarkLight1337 DarkLight1337 self-assigned this Jun 6, 2025
@mergify mergify bot added the frontend label Jun 6, 2025
@noooop
Contributor Author

noooop commented Jun 6, 2025

@DarkLight1337

quick review

@noooop noooop marked this pull request as ready for review June 6, 2025 12:17
@noooop noooop requested a review from ywang96 as a code owner June 6, 2025 12:17
@noooop
Contributor Author

noooop commented Jun 6, 2025

tests/models/language/pooling/test_qwen3_reranker.py is a bit hasty; I am refactoring the tests for score, and the next PR will fix it.

@noooop
Contributor Author

noooop commented Jun 6, 2025

The current code is a bit hacky.

Let's wait until next week to see whether the official team can convert the model into the Qwen3ForSequenceClassification format.

@tomaarsen

tomaarsen commented Jun 6, 2025

Hello!

Thank you for your useful work here @noooop. I converted the model to a sequence classification model in the meantime for me to be able to do some testing - it might come in handy for you as well: https://huggingface.co/tomaarsen/Qwen3-Reranker-0.6B-seq-cls

I can also move it to the cross-encoder organization, but I'd like to avoid that until I get models with these larger templates working more nicely with Sentence Transformers, i.e. without having to do all kinds of pre-processing manually.

  • Tom Aarsen

@lovetian1991

quick review

@TPLink32
Thanks.
New problem with embedding:
INFO 06-17 16:37:49 [engine.py:317] Added request embd-be953f7ccae14a96b5f9f8e13b16b4d7-61.
INFO 06-17 16:37:49 [engine.py:317] Added request embd-be953f7ccae14a96b5f9f8e13b16b4d7-62.
INFO 06-17 16:37:49 [engine.py:317] Added request embd-be953f7ccae14a96b5f9f8e13b16b4d7-63.
ERROR 06-17 16:37:49 [engine.py:165] AttributeError("'PlaceholderBlockSpaceManager' object has no attribute 'remove_seq_from_computed_blocks_tracker'")
ERROR 06-17 16:37:49 [engine.py:165] Traceback (most recent call last):
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 163, in start
ERROR 06-17 16:37:49 [engine.py:165] self.run_engine_loop()
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 226, in run_engine_loop
ERROR 06-17 16:37:49 [engine.py:165] request_outputs = self.engine_step()
ERROR 06-17 16:37:49 [engine.py:165] ^^^^^^^^^^^^^^^^^^
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 252, in engine_step
ERROR 06-17 16:37:49 [engine.py:165] raise e
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 235, in engine_step
ERROR 06-17 16:37:49 [engine.py:165] return self.engine.step()
ERROR 06-17 16:37:49 [engine.py:165] ^^^^^^^^^^^^^^^^^^
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 1296, in step
ERROR 06-17 16:37:49 [engine.py:165] ) = self.scheduler[virtual_engine].schedule()
ERROR 06-17 16:37:49 [engine.py:165] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1553, in schedule
ERROR 06-17 16:37:49 [engine.py:165] scheduler_outputs: SchedulerOutputs = self._schedule()
ERROR 06-17 16:37:49 [engine.py:165] ^^^^^^^^^^^^^^^^
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1512, in _schedule
ERROR 06-17 16:37:49 [engine.py:165] return self._schedule_default()
ERROR 06-17 16:37:49 [engine.py:165] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1277, in _schedule_default
ERROR 06-17 16:37:49 [engine.py:165] prefills = self._schedule_prefills(budget,
ERROR 06-17 16:37:49 [engine.py:165] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1195, in _schedule_prefills
ERROR 06-17 16:37:49 [engine.py:165] self.remove_seq_from_computed_blocks_tracker(
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1715, in remove_seq_from_computed_blocks_tracker
ERROR 06-17 16:37:49 [engine.py:165] self._remove_seq_from_computed_blocks_tracker(seq)
ERROR 06-17 16:37:49 [engine.py:165] File "/data/code/llm_test/vllm_env/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1722, in _remove_seq_from_computed_blocks_tracker
ERROR 06-17 16:37:49 [engine.py:165] self.block_manager.remove_seq_from_computed_blocks_tracker(seq)
ERROR 06-17 16:37:49 [engine.py:165] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-17 16:37:49 [engine.py:165] AttributeError: 'PlaceholderBlockSpaceManager' object has no attribute 'remove_seq_from_computed_blocks_tracker'

@noooop
Contributor Author

noooop commented Jun 17, 2025

ERROR 06-17 16:37:49 [engine.py:165] AttributeError: 'PlaceholderBlockSpaceManager' object has no attribute 'remove_seq_from_computed_blocks_tracker'

fixed by #19686

minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: minpeter <kali2005611@gmail.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
Thachnh pushed a commit to deepinfra/vllm that referenced this pull request Jul 1, 2025
@noooop noooop deleted the Qwen3_gte branch July 10, 2025 04:46
@umie0128

@noooop Could you update the vLLM usage instructions in the ModelScope and GitHub READMEs?

Labels
documentation, frontend, ready
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • [Feature]: support to qwen3 embedding and rerank via vllm serve command
  • [Bug]: Support Qwen3 Reranker
  • [Feature]: Support Qwen3 Embedding & Reranker