[Bug]: FP8 kvcache causes RuntimeError in v1 engine

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.119-19-0013_plus-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.77
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090
GPU 4: NVIDIA GeForce RTX 4090
GPU 5: NVIDIA GeForce RTX 4090
GPU 6: NVIDIA GeForce RTX 4090
GPU 7: NVIDIA GeForce RTX 4090

Nvidia driver version: 535.171.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          112
On-line CPU(s) list:             0-111
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  Red Hat
Model name:                      Intel(R) Xeon(R) Platinum 8476C
BIOS Model name:                 3.0
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              28
Socket(s):                       2
Stepping:                        6
BogoMIPS:                        5200.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq movdiri movdir64b fsrm arch_capabilities
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       2.6 MiB (56 instances)
L1i cache:                       1.8 MiB (56 instances)
L2 cache:                        112 MiB (56 instances)
L3 cache:                        195 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-111
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cudnn-frontend==1.7.0
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-dali-cuda120==1.42.0
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-modelopt==0.17.0
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvimgcodec-cu12==0.3.0.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] nvidia-pyindex==1.0.9
[pip3] onnx==1.16.2
[pip3] optree==0.13.0
[pip3] pynvml==11.4.1
[pip3] pytorch-triton==3.0.0+dedb7bdf3
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torch_tensorrt==2.5.0a0
[pip3] torchprofile==0.0.4
[pip3] torchvision==0.20.1
[pip3] transformers==4.47.1
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.5
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PIX     PIX     SYS     SYS     SYS     SYS     0-111   0               N/A
GPU1    PIX      X      PIX     PIX     SYS     SYS     SYS     SYS     0-111   0               N/A
GPU2    PIX     PIX      X      PIX     SYS     SYS     SYS     SYS     0-111   0               N/A
GPU3    PIX     PIX     PIX      X      SYS     SYS     SYS     SYS     0-111   0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     PIX     PIX     0-111   0               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      PIX     PIX     0-111   0               N/A
GPU6    SYS     SYS     SYS     SYS     PIX     PIX      X      PIX     0-111   0               N/A
GPU7    SYS     SYS     SYS     SYS     PIX     PIX     PIX      X      0-111   0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.6.3.3
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.22.3
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.6.2.004
PYTORCH_VERSION=2.5.0a0+e000cf0
PYTORCH_BUILD_NUMBER=0
CUDNN_FRONTEND_VERSION=1.7.0
CUDNN_VERSION=9.5.0.50
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=114410972
CUDA_DRIVER_VERSION=560.35.03
PYTORCH_BUILD_VERSION=2.5.0a0+e000cf0
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=24.10
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
```

</details>


### Model Input Dumps

_No response_

### 🐛 Describe the bug

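I hit a `RuntimeError` when running the vLLM v1 engine with an FP8 KV cache (`--kv-cache-dtype fp8`). The error is `query and key must have the same dtype`, and it is raised from the flash attention call (`flash_attn_varlen_func`), so the problem may be in the v1 flash attention backend.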
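To make the suspected failure mode concrete, here is a minimal sketch of the kind of check that would produce this message. It is not vLLM or flash-attn code: the function names are hypothetical, dtypes are plain strings, and `bfloat16` as the activation dtype is an assumption about what `--dtype auto` resolves to for this model.

```python
# Illustrative only: dtype names are plain strings, not real tensor dtypes.
ACTIVATION_DTYPE = "bfloat16"  # assumption: what `--dtype auto` picks for this model
KV_CACHE_DTYPE = "fp8"         # taken from `--kv-cache-dtype fp8`


def check_same_dtype(query_dtype: str, key_dtype: str) -> None:
    """Hypothetical stand-in for the kernel-side check behind the reported message."""
    if query_dtype != key_dtype:
        raise RuntimeError("query and key must have the same dtype")


def attention_inputs_ok(query_dtype: str, key_dtype: str) -> bool:
    """Return True if the pair of dtypes would pass the check above."""
    try:
        check_same_dtype(query_dtype, key_dtype)
    except RuntimeError:
        return False
    return True


# Matching dtypes pass; a higher-precision query against an FP8 key cache does not.
print(attention_inputs_ok(ACTIVATION_DTYPE, ACTIVATION_DTYPE))
print(attention_inputs_ok(ACTIVATION_DTYPE, KV_CACHE_DTYPE))
```

If this is what happens, the query stays in the model dtype while the key passed to the kernel comes from the FP8 cache, which would explain why the error only appears with `--kv-cache-dtype fp8`.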

**Steps to reproduce**
```bash
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server  \
    --model meta-llama/Llama-3.1-8B-Instruct --dtype auto  \
    --served-model-name llama3-8b --kv-cache-dtype fp8 --port 9314 \
    --max-num-seqs 128 --gpu-memory-utilization 0.9 --max_num_batched_tokens 5048 --max-model-len 5048
```
**Full traceback**
```text
ERROR 12-19 09:04:11 core.py:270] query and key must have the same dtype
ERROR 12-19 09:04:11 core.py:270] Traceback (most recent call last):
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 264, in run_engine_core
ERROR 12-19 09:04:11 core.py:270]     engine_core.run_busy_loop()
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 302, in run_busy_loop
ERROR 12-19 09:04:11 core.py:270]     outputs = self.step()
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 125, in step
ERROR 12-19 09:04:11 core.py:270]     output = self.model_executor.execute_model(scheduler_output)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/uniproc_executor.py", line 72, in execute_model
ERROR 12-19 09:04:11 core.py:270]     output = self.worker.execute_model(scheduler_output)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-19 09:04:11 core.py:270]     return func(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 203, in execute_model
ERROR 12-19 09:04:11 core.py:270]     output = self.model_runner.execute_model(scheduler_output)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-19 09:04:11 core.py:270]     return func(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 509, in execute_model
ERROR 12-19 09:04:11 core.py:270]     hidden_states = self.model(
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-19 09:04:11 core.py:270]     return self._call_impl(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-19 09:04:11 core.py:270]     return forward_call(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 568, in forward
ERROR 12-19 09:04:11 core.py:270]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 205, in __call__
ERROR 12-19 09:04:11 core.py:270]     model_output = self.forward(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 338, in forward
ERROR 12-19 09:04:11 core.py:270]     def forward(
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-19 09:04:11 core.py:270]     return self._call_impl(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-19 09:04:11 core.py:270]     return forward_call(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
ERROR 12-19 09:04:11 core.py:270]     return fn(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 784, in call_wrapped
ERROR 12-19 09:04:11 core.py:270]     return self._wrapped_call(self, *args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 361, in __call__
ERROR 12-19 09:04:11 core.py:270]     raise e
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 348, in __call__
ERROR 12-19 09:04:11 core.py:270]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-19 09:04:11 core.py:270]     return self._call_impl(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-19 09:04:11 core.py:270]     return forward_call(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "<eval_with_key>.66", line 240, in forward
ERROR 12-19 09:04:11 core.py:270]     submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3, l_kv_caches_0_);  getitem = getitem_1 = getitem_2 = l_kv_caches_0_ = submod_1 = None
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 784, in call_wrapped
ERROR 12-19 09:04:11 core.py:270]     return self._wrapped_call(self, *args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 361, in __call__
ERROR 12-19 09:04:11 core.py:270]     raise e
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 348, in __call__
ERROR 12-19 09:04:11 core.py:270]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 12-19 09:04:11 core.py:270]     return self._call_impl(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 12-19 09:04:11 core.py:270]     return forward_call(*args, **kwargs)
ERROR 12-19 09:04:11 core.py:270]   File "<eval_with_key>.2", line 5, in forward
ERROR 12-19 09:04:11 core.py:270]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, l_kv_caches_0_, 'decoder', 'model.layers.0.self_attn.attn');  query_2 = key_2 = value = output_1 = l_kv_caches_0_ = unified_attention_with_output = None
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
ERROR 12-19 09:04:11 core.py:270]     return self._op(*args, **(kwargs or {}))
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 287, in unified_attention_with_output
ERROR 12-19 09:04:11 core.py:270]     self.impl.forward(query,
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 172, in forward
ERROR 12-19 09:04:11 core.py:270]     flash_attn_varlen_func(
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 270, in flash_attn_varlen_func
ERROR 12-19 09:04:11 core.py:270]     out, softmax_lse = _flash_attn_varlen_forward(
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 91, in _flash_attn_varlen_forward
ERROR 12-19 09:04:11 core.py:270]     out, softmax_lse = torch.ops.vllm_flash_attn_c.varlen_fwd(
ERROR 12-19 09:04:11 core.py:270]   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
ERROR 12-19 09:04:11 core.py:270]     return self._op(*args, **(kwargs or {}))
ERROR 12-19 09:04:11 core.py:270] RuntimeError: query and key must have the same dtype
Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 271, in run_engine_core
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 264, in run_engine_core
    engine_core.run_busy_loop()
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 302, in run_busy_loop
    outputs = self.step()
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 125, in step
    output = self.model_executor.execute_model(scheduler_output)
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/uniproc_executor.py", line 72, in execute_model
    output = self.worker.execute_model(scheduler_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 203, in execute_model
    output = self.model_runner.execute_model(scheduler_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 509, in execute_model
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 568, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 205, in __call__
    model_output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 338, in forward
    def forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 784, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 361, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 348, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.66", line 240, in forward
    submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3, l_kv_caches_0_);  getitem = getitem_1 = getitem_2 = l_kv_caches_0_ = submod_1 = None
  File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 784, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 361, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 348, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.2", line 5, in forward
    unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, l_kv_caches_0_, 'decoder', 'model.layers.0.self_attn.attn');  query_2 = key_2 = value = output_1 = l_kv_caches_0_ = unified_attention_with_output = None
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 287, in unified_attention_with_output
    self.impl.forward(query,
  File "/usr/local/lib/python3.10/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 172, in forward
    flash_attn_varlen_func(
  File "/usr/local/lib/python3.10/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 270, in flash_attn_varlen_func
    out, softmax_lse = _flash_attn_varlen_forward(
  File "/usr/local/lib/python3.10/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 91, in _flash_attn_varlen_forward
    out, softmax_lse = torch.ops.vllm_flash_attn_c.varlen_fwd(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: query and key must have the same dtype
```
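For anyone triaging from the log, the following is a small sketch that strips the logger prefix from each line and reports the deepest frame that lives under a `vllm/` directory. The helper names and regular expressions are my own and assume the prefix format shown in the traceback above (`ERROR MM-DD HH:MM:SS file.py:NNN] `).

```python
import re

# Logger prefix as it appears in the traceback, e.g.
# "ERROR 12-19 09:04:11 core.py:270] ".
LOG_PREFIX = re.compile(r"^ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+:\d+\] ?")

# A standard Python traceback frame line: File "<path>", line <n>, in <func>
FRAME = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def strip_log_prefix(line: str) -> str:
    """Remove the logger prefix from one log line, if present."""
    return LOG_PREFIX.sub("", line, count=1)


def innermost_vllm_frame(log_text: str):
    """Return (path, line, func) of the last frame whose path contains '/vllm/'."""
    last = None
    for raw in log_text.splitlines():
        match = FRAME.match(strip_log_prefix(raw))
        if match and "/vllm/" in match.group("path"):
            last = (match.group("path"), int(match.group("line")), match.group("func"))
    return last
```

Run against the traceback above, this should point at `_flash_attn_varlen_forward` in `vllm/vllm_flash_attn/flash_attn_interface.py`, the last vLLM frame before the call into `torch.ops.vllm_flash_attn_c.varlen_fwd`.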


### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
