[Bug]: PixtralHF accuracy on MMMU regressed since 0.6.4.post1 #11816

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
Your output of `python collect_env.py` here
```

</details>


### Model Input Dumps

_No response_

### 🐛 Describe the bug

It seems that pixtral_hf accuracy has regressed since the last known-good result from 0.6.4.post1.

[Reference results on the HF model card](https://huggingface.co/neuralmagic/pixtral-12b-FP8-dynamic#multimodal-benchmarks); we will look at `MMMU (CoT)` ~= 51%. Evals were run using [mistral-evals](https://github.com/mistralai/mistral-evals).

vLLM 0.6.4.post1, server and eval:
```console
> uv pip install vllm==0.6.4.post1
> vllm serve nm-testing/pixtral-12b-FP8-dynamic --max-num-seqs 30 --max-model-len 30000 --limit-mm-per-prompt image=5 --port 9000

> python -m eval.run eval_vllm --model_name nm-testing/pixtral-12b-FP8-dynamic --url http://0.0.0.0:9000 --output_dir output/ --eval_name "mmmu"
...
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.5044444444444445,
    "anywhere_in_answer_relaxed_correctness": 0.5044444444444445
}
================================================================================
```

vLLM 0.6.5, server and eval:
```console
> uv pip install vllm==0.6.5
> vllm serve nm-testing/pixtral-12b-FP8-dynamic --max-num-seqs 30 --max-model-len 30000 --limit-mm-per-prompt image=5 --port 9000

> python -m eval.run eval_vllm --model_name nm-testing/pixtral-12b-FP8-dynamic --url http://0.0.0.0:9000 --output_dir output/ --eval_name "mmmu"
...
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.0011111111111111111,
    "anywhere_in_answer_relaxed_correctness": 0.3466666666666667
}
================================================================================
```

vLLM using https://github.com/vllm-project/vllm/pull/11741, server and eval:
```console
> uv pip install vllm==0.6.5
> vllm serve nm-testing/pixtral-12b-FP8-dynamic --max-num-seqs 30 --max-model-len 30000 --limit-mm-per-prompt image=5 --port 9000

> python -m eval.run eval_vllm --model_name nm-testing/pixtral-12b-FP8-dynamic --url http://0.0.0.0:9000 --output_dir output/ --eval_name "mmmu"
...
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.0011111111111111111,
    "anywhere_in_answer_relaxed_correctness": 0.3466666666666667
}
================================================================================
```
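
To make the size of the regression explicit, here is a minimal sketch that compares the metrics from the 0.6.4.post1 and 0.6.5 runs above. The numbers are copied verbatim from those outputs; the helper names (`metric_deltas`, `regressed`) and the `tolerance` threshold are hypothetical and are not part of mistral-evals or vLLM.

```python
# Metrics copied verbatim from the eval outputs reported in this issue.
baseline = {  # vLLM 0.6.4.post1
    "explicit_prompt_relaxed_correctness": 0.5044444444444445,
    "anywhere_in_answer_relaxed_correctness": 0.5044444444444445,
}
candidate = {  # vLLM 0.6.5
    "explicit_prompt_relaxed_correctness": 0.0011111111111111111,
    "anywhere_in_answer_relaxed_correctness": 0.3466666666666667,
}


def metric_deltas(baseline, candidate):
    """Return candidate - baseline for every metric present in both runs."""
    return {k: candidate[k] - baseline[k] for k in baseline if k in candidate}


def regressed(baseline, candidate, tolerance=0.02):
    """Return the sorted names of metrics that dropped by more than `tolerance`.

    The 0.02 default is an arbitrary, illustrative threshold.
    """
    deltas = metric_deltas(baseline, candidate)
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


if __name__ == "__main__":
    for name, delta in metric_deltas(baseline, candidate).items():
        print(f"{name}: {delta:+.4f}")
    print("regressed:", regressed(baseline, candidate))
```

With these inputs, `explicit_prompt_relaxed_correctness` drops by about 50 percentage points and `anywhere_in_answer_relaxed_correctness` by about 16. The two metrics were identical on 0.6.4.post1 but diverge on 0.6.5.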

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
