[multimodal dtensor] Inconsistent logprobs for multimodal models

**Describe the bug**

When running GRPO for VLMs (Qwen2.5-VL, LLaVA, etc.), the logprobs generated by vLLM and those computed by Hugging Face differ by a margin greater than 1.05. Despite this mismatch, the policy still converges across the different VLMs.
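The issue does not state how the 1.05 margin is measured. One plausible reading, assumed here, is a multiplicative probability error: the mean of `exp(|logprob_a - logprob_b|)` over aligned tokens, where 1.0 means the two backends agree exactly. The sketch below illustrates that reading only; the function name and the sample values are hypothetical and are not taken from the repository or from this run.

```python
import math


def mean_multiplicative_prob_error(logprobs_a, logprobs_b):
    """Mean of exp(|a - b|) over aligned per-token logprobs.

    Assumption: this is one possible definition of the "margin" in the
    report, not necessarily the metric the training code logs.
    A value of 1.0 means both backends assign identical probabilities.
    """
    if len(logprobs_a) != len(logprobs_b):
        raise ValueError("logprob sequences must be aligned token-for-token")
    if not logprobs_a:
        raise ValueError("need at least one token")
    ratios = [math.exp(abs(a - b)) for a, b in zip(logprobs_a, logprobs_b)]
    return sum(ratios) / len(ratios)


# Hypothetical values for illustration only, not measurements from this issue.
generation_logprobs = [-0.10, -1.20, -0.35]  # e.g. from the generation backend
training_logprobs = [-0.10, -1.25, -0.30]    # e.g. from the training backend

error = mean_multiplicative_prob_error(generation_logprobs, training_logprobs)
print(f"mean multiplicative error: {error:.4f}")
```

Under this definition, a per-token logprob gap of about 0.049 already corresponds to a ratio of 1.05, so a small systematic difference in how image tokens are processed by the two backends could be enough to exceed the margin.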

**Steps/Code to reproduce bug**

Run `uv run examples/run_vlm_grpo.py` from PR #712 

**Expected behavior**

The logprobs produced by vLLM and by Hugging Face for the same tokens should agree to within the 1.05 margin.

**Environment overview (please complete the following information)**

 - Environment location: [local / cluster]
 - Method of install: [pip install or from source]. 



[multimodal dtensor] Inconsistent logprobs for multimodal models #793
