[multimodal dtensor] Inconsistent logprobs for multimodal models #793

@rohitrango

Describe the bug

When running GRPO for VLMs (Qwen2.5VL, LLaVA, etc.), the logprobs generated by vLLM and those computed by Hugging Face differ by a margin higher than 1.05, even though the policy still converges across different VLMs.
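
For anyone trying to quantify the mismatch, here is a minimal sketch of one way to compare the two sets of per-token logprobs. It assumes the 1.05 margin refers to a multiplicative probability error, i.e. exp(|logprob_vllm - logprob_hf|) averaged over the unmasked tokens; the issue does not say which metric is used, so treat that as an assumption. The function name `mult_prob_error` and its arguments are hypothetical and not taken from the codebase.

```python
import math


def mult_prob_error(gen_logprobs, train_logprobs, mask=None):
    """Mean of exp(|gen - train|) over unmasked tokens.

    Hypothetical helper: a value of 1.0 means the two backends agree
    exactly; 1.05 means token probabilities differ by about 5% on average.
    """
    if len(gen_logprobs) != len(train_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if mask is None:
        # Assumption: with no mask given, every token counts.
        mask = [1] * len(gen_logprobs)
    if len(mask) != len(gen_logprobs):
        raise ValueError("mask must have the same length as the logprobs")

    total = 0.0
    count = 0
    for gen, train, keep in zip(gen_logprobs, train_logprobs, mask):
        if keep:
            total += math.exp(abs(gen - train))
            count += 1
    if count == 0:
        raise ValueError("mask excludes every token")
    return total / count


if __name__ == "__main__":
    # Toy values for illustration only, not measurements from this issue.
    vllm_lp = [-0.10, -1.20, -0.05]
    hf_lp = [-0.12, -1.10, -0.05]
    print(mult_prob_error(vllm_lp, hf_lp))
```

Passing a mask that keeps only the response tokens (and drops prompt and image placeholder tokens) would show whether the gap is concentrated in a particular part of the sequence.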

Steps/Code to reproduce bug

Run uv run examples/run_vlm_grpo.py from PR #712

Expected behavior

A clear and concise description of what you expected to happen.

Environment overview (please complete the following information)

  • Environment location: [local / cluster]
  • Method of install: [pip install or from source].
