Conversation

mgoin
Member

@mgoin mgoin commented Jun 6, 2025

Purpose

Update to the latest stable release of FlashInfer. This is the first stable release with Blackwell support, so it is fairly important to standardize on. However, there are no pre-built wheels yet. We can wait to see if wheels will be published, or build our own. @huydhn could you help me with this?

I updated the instructions in the Dockerfile to match the new method for building AOT kernels, based on https://docs.flashinfer.ai/installation.html#install-from-source.
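
In short, the relevant Dockerfile step now builds FlashInfer from the v0.2.6 tag instead of installing a pre-built wheel. A simplified sketch (the exact conditionals, arch list, and environment variables are in the diff):

    # Sketch of the new install step: AOT-compile FlashInfer v0.2.6 from source
    export FLASHINFER_ENABLE_AOT=1                            # per the FlashInfer install docs
    export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0a 10.0+PTX'   # illustrative; the list now includes 10.0+PTX for Blackwell
    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6"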

Test Plan

Test Result

Signed-off-by: mgoin <mgoin64@gmail.com>

github-actions bot commented Jun 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Hello @mgoin, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR, authored by @mgoin, aims to update the FlashInfer library dependency to version 0.2.6. According to the description, this version is significant as it's the first stable release with Blackwell support. The changes primarily involve modifying the Dockerfile used for CI builds to fetch and build FlashInfer from the v0.2.6 tag on GitHub, moving away from a conditional installation that previously attempted to use a pre-built wheel for CUDA 12.8. The PR also updates the supported CUDA architecture list in the build process.

Highlights

  • FlashInfer Version Update: Updates the FlashInfer dependency to the latest stable release, v0.2.6, which includes initial support for Blackwell GPUs.
  • CI Build Logic Change: Modifies the docker/Dockerfile to consistently build FlashInfer from source using the v0.2.6 tag from the GitHub repository, removing the previous conditional logic that would attempt to use a pre-built wheel for CUDA 12.8.
  • CUDA Architecture Support: Updates the TORCH_CUDA_ARCH_LIST environment variable in the Dockerfile (line 264) to include 10.0+PTX, likely to enable support for newer architectures like Blackwell (SM 10.0).

Changelog

  • docker/Dockerfile
    • Commented out the conditional block (lines 261-263, 270) that checked for CUDA 12.8 and attempted to install a specific pre-built FlashInfer wheel.
    • Updated the TORCH_CUDA_ARCH_LIST environment variable (line 264) to include 10.0+PTX.
    • Modified the uv pip install command (line 269) to fetch FlashInfer directly from the v0.2.6 tag on GitHub instead of a previous commit hash.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the ci/build label Jun 6, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates FlashInfer to version 0.2.6, which is an important step for Blackwell support. The changes in the Dockerfile correctly target the new version and update the TORCH_CUDA_ARCH_LIST to include compute capability 10.0.

The PR description clearly explains the rationale for building from source (lack of pre-built wheels for v0.2.6 currently).

I have a couple of suggestions to improve the clarity and maintainability of the Dockerfile.

Summary of Findings

  • Outdated/Misleading Comment: A comment on line 260 regarding pre-built wheels for FlashInfer is outdated given the change to build v0.2.6 from source.
  • Clarity of Commented-Out Code: The block of previously active code for FlashInfer installation (lines 261-263, 270) is now commented out. Replacing it with a concise explanatory comment or removing it would improve Dockerfile readability.

Merge Readiness

The core change to update FlashInfer to v0.2.6 and enable Blackwell support is well-implemented. However, there are a few medium severity issues related to comments and commented-out code that affect the Dockerfile's clarity and maintainability.

I recommend addressing these suggestions to improve the codebase. As I am an AI, I am not authorized to approve pull requests. Please ensure these changes are reviewed and approved by a human maintainer before merging.

fi; \
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@21ea1d2545f74782b91eb8c08fd503ac4c0743fc" ; \
fi \
# FlashInfer already has a wheel for PyTorch 2.7.0 and CUDA 12.8. This is enough for CI use
Contributor

Severity: medium

The comment # FlashInfer already has a wheel for PyTorch 2.7.0 and CUDA 12.8. This is enough for CI use now precedes logic that unconditionally builds FlashInfer v0.2.6 from source (for non-arm64 platforms). This comment seems to refer to the previous state with FlashInfer v0.2.5, where a pre-built wheel was used for CUDA 12.8.

Could we update this comment to more accurately reflect the current strategy of building v0.2.6 from source? This would prevent potential confusion for future readers.

    # Building FlashInfer v0.2.6 from source as pre-built wheels for this version are not yet available.

Comment on lines 261 to 263
# if [[ "$CUDA_VERSION" == 12.8* ]]; then \
# uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl; \
# else \
Contributor

Severity: medium

This block of commented-out code previously handled the conditional installation of FlashInfer v0.2.5 (either via a pre-built wheel for CUDA 12.8 or by building from a specific commit for other versions). The PR description clarifies that v0.2.6 wheels are not yet available, hence the shift to building from the v0.2.6 tag.

To improve Dockerfile readability and reduce clutter, would it be better to replace these commented-out lines with a single, more concise comment explaining the current situation or a TODO for future wheel availability? For example:

-    # if [[ "$CUDA_VERSION" == 12.8* ]]; then \
-    #     uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl; \
-    # else \
+    # TODO: Re-evaluate using pre-built wheels for FlashInfer v0.2.6 if/when they become available.

This would make the Dockerfile's intent clearer.

    # TODO: Re-evaluate using pre-built wheels for FlashInfer v0.2.6 if/when they become available.

export FLASHINFER_ENABLE_SM90=0; \
fi; \
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6" ; \
# fi \
Contributor

Severity: medium

This commented-out # fi corresponds to the if block (lines 261-263) that is also now commented out.

If the preceding commented block (lines 261-263) is removed or replaced by a more concise comment as suggested, this line should also be removed to maintain consistency and clarity in the Dockerfile.

@huydhn
Contributor

huydhn commented Jun 7, 2025

Yes, I can help build and publish that wheel on download.pytorch.org

@mgoin mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 9, 2025
@mgoin mgoin changed the title from [CI] Update FlashInfer to 0.2.6 to [CI] Update FlashInfer to 0.2.6.post1 Jun 9, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
@huydhn
Contributor

huydhn commented Jun 10, 2025

Sorry for the delay, I have the wheel built for 0.2.6.post1 ready at https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl. The wheel is built with FLASHINFER_LOCAL_VERSION=cu128torch2.7 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0a 10.0a' FLASHINFER_ENABLE_AOT=1 python -m build --no-isolation --wheel to match what you have in the PR. Let me know if it works for you.
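
If it works for you, the CUDA 12.8 path in the Dockerfile could presumably go back to installing the pre-built wheel instead of compiling from source, something like this (untested sketch, mirroring the previous conditional block):

    # Sketch: on CUDA 12.8, install the published 0.2.6.post1 wheel; otherwise keep building from source as in this PR
    if [[ "$CUDA_VERSION" == 12.8* ]]; then
        uv pip install --system "https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl"
    fi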

@mgoin
Member Author

mgoin commented Jun 10, 2025

Thank you @huydhn ! Will update now

@davefojtik

Can we please get official Flashinfer AOT wheels for the cu126torch2.7 combination too? It should be supported, right?

@houseroad
Collaborator

Maybe @huydhn could take a look at the CUDA 12.6 + torch 2.7 combination for the FlashInfer wheel.
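
Presumably that would be the same recipe as above with the cu126 toolchain, e.g. (untested, just adapting @huydhn's command; the arch list may need adjusting to what CUDA 12.6 supports):

    # Untested adaptation of the cu128 build command for a CUDA 12.6 / torch 2.7 wheel
    FLASHINFER_LOCAL_VERSION=cu126torch2.7 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0a' FLASHINFER_ENABLE_AOT=1 \
        python -m build --no-isolation --wheel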

@cyril23

cyril23 commented Jun 18, 2025

This pull request broke SM 120 Blackwell compatibility (RTX 50xx, RTX PRO).

You can no longer use -e VLLM_USE_FLASHINFER_SAMPLER=1 (which is the default); you have to fall back to -e VLLM_USE_FLASHINFER_SAMPLER=0, which gives you less performance and this warning:

WARNING 06-18 08:55:01 [topk_topp_sampler.py:52] FlashInfer is available, but it is not enabled. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please set VLLM_USE_FLASHINFER_SAMPLER=1.

I did 2 builds, both with --build-arg torch_cuda_arch_list='12.0' (SM 120 compatible only), and pushed them to Docker Hub:

  1. wurstdeploy/vllm:azure10thjunesolo120, which is based on the last commit of 10th June (da9b523) and still uses the old FlashInfer version:
git checkout -b 10thjune da9b523ce1fd5c27bfd18921ba0388bf2e8e4618
DOCKER_BUILDKIT=1 sudo docker build --build-arg max_jobs=64   --build-arg USE_SCCACHE=0 --build-arg GIT_REPO_CHECK=1   --build-arg CUDA_VERSION=12.8.1   --build-arg torch_cuda_arch_list='12.0'   --build-arg RUN_WHEEL_CHECK=false   --tag wurstdeploy/vllm:azure10thjunesolo120 --target vllm-openai   --progress plain -f docker/Dockerfile .

# this is still SM 120 compatible, you can run via
sudo docker run --runtime nvidia --gpus all     -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  wurstdeploy/vllm:azure10thjunesolo120    --model Qwen/Qwen3-0.6B
  2. wurstdeploy/vllm:azure11thjunesolo120, which is based on the last commit of 11th June (42f52cc) and already includes your commit 497a91e and therefore the updated FlashInfer version:
git checkout -b 11thjune 42f52cc95bf34a2e15f4cdbc8474503a9bcc970f
DOCKER_BUILDKIT=1 sudo docker build --build-arg max_jobs=64   --build-arg USE_SCCACHE=0 --build-arg GIT_REPO_CHECK=1   --build-arg CUDA_VERSION=12.8.1   --build-arg torch_cuda_arch_list='12.0'   --build-arg RUN_WHEEL_CHECK=false   --tag wurstdeploy/vllm:azure11thjunesolo120 --target vllm-openai   --progress plain -f docker/Dockerfile .

# this is not fully SM 120 compatible anymore:
sudo docker run --runtime nvidia --gpus all     -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  wurstdeploy/vllm:azure11thjunesolo120    --model Qwen/Qwen3-0.6B

INFO 06-18 08:53:41 [monitor.py:34] torch.compile takes 18.01 s in total
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Process EngineCore_0:
ERROR 06-18 08:53:41 [core.py:515] EngineCore failed to start.
ERROR 06-18 08:53:41 [core.py:515] Traceback (most recent call last):
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-18 08:53:41 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-18 08:53:41 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-18 08:53:41 [core.py:515]     self._initialize_kv_caches(vllm_config)
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-18 08:53:41 [core.py:515]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-18 08:53:41 [core.py:515]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-18 08:53:41 [core.py:515]     output = self.collective_rpc("determine_available_memory")
ERROR 06-18 08:53:41 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-18 08:53:41 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-18 08:53:41 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2680, in run_method
ERROR 06-18 08:53:41 [core.py:515]     return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-18 08:53:41 [core.py:515]     return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 205, in determine_available_memory
ERROR 06-18 08:53:41 [core.py:515]     self.model_runner.profile_run()
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2015, in profile_run
ERROR 06-18 08:53:41 [core.py:515]     sampler_output = self._dummy_sampler_run(hidden_states)
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-18 08:53:41 [core.py:515]     return func(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1913, in _dummy_sampler_run
ERROR 06-18 08:53:41 [core.py:515]     raise e
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1903, in _dummy_sampler_run
ERROR 06-18 08:53:41 [core.py:515]     sampler_output = self.sampler(logits=logits,
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-18 08:53:41 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-18 08:53:41 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 52, in forward
ERROR 06-18 08:53:41 [core.py:515]     sampled = self.sample(logits, sampling_metadata)
ERROR 06-18 08:53:41 [core.py:515]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 118, in sample
ERROR 06-18 08:53:41 [core.py:515]     random_sampled = self.topk_topp_sampler(
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-18 08:53:41 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-18 08:53:41 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 104, in forward_cuda
ERROR 06-18 08:53:41 [core.py:515]     return flashinfer_sample(logits, k, p, generators)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 290, in flashinfer_sample
ERROR 06-18 08:53:41 [core.py:515]     next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
ERROR 06-18 08:53:41 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 901, in top_k_top_p_sampling_from_logits
ERROR 06-18 08:53:41 [core.py:515]     masked_logits = top_k_mask_logits(logits, top_k)
ERROR 06-18 08:53:41 [core.py:515]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 1221, in top_k_mask_logits
ERROR 06-18 08:53:41 [core.py:515]     return get_sampling_module().top_k_mask_logits(
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/sampling.py", line 352, in top_k_mask_logits
ERROR 06-18 08:53:41 [core.py:515]     module.top_k_mask_logits.default(
ERROR 06-18 08:53:41 [core.py:515]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
ERROR 06-18 08:53:41 [core.py:515]     return self._op(*args, **kwargs)
ERROR 06-18 08:53:41 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-18 08:53:41 [core.py:515] RuntimeError: TopKMaskLogits failed with error code no kernel image is available for execution on the device


# you can only run it without Flashinfer, i.e. -e VLLM_USE_FLASHINFER_SAMPLER=0:
sudo docker run --runtime nvidia --gpus all     -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  wurstdeploy/vllm:azure11thjunesolo120    --model Qwen/Qwen3-0.6B
> WARNING 06-18 08:55:01 [topk_topp_sampler.py:52] FlashInfer is available, but it is not enabled. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please set VLLM_USE_FLASHINFER_SAMPLER=1.
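
My guess (completely untested) is that the FlashInfer AOT build baked into the image would need SM 120 in its arch list, e.g. extending the hard-coded list in docker/Dockerfile before the FlashInfer install step:

    # Untested guess: also compile FlashInfer's AOT kernels for SM 120 (RTX 50xx / RTX PRO)
    export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0a 10.0+PTX 12.0'
    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"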

minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>