parthchadha
Contributor

(enabling sp converts internal buffers to DTensor, which caused the broadcast to fail)

What does this PR do?

Fixes the following error, seen with some models when sp is enabled:

(DTensorPolicyWorker pid=343603)   File "/app/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/tensor/_dispatch.py", line 393, in unwrap_to_op_info [repeated 6x across cluster]
(DTensorPolicyWorker pid=343603)     assert compute_mesh is not None, ( [repeated 6x across cluster]
(DTensorPolicyWorker pid=343603)            ^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(DTensorPolicyWorker pid=343603) AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default! [repeated 6x across cluster]
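
The trace suggests that c10d broadcast was handed a DTensor buffer. The diff itself is not shown in this thread, so the following is only a minimal sketch of one way to avoid that, assuming the buffers can be unwrapped to their local tensors before the collective. The helper names `to_local_if_dtensor` and `broadcast_buffers` are hypothetical, and `broadcast_fn` stands in for `torch.distributed.broadcast` so the sketch runs without a process group:

```python
# Sketch only: the PR diff is not shown in this thread, so the helper names
# below are hypothetical and not taken from the NeMo-RL source.

def to_local_if_dtensor(tensor):
    """Return tensor.to_local() when the object exposes it (as DTensor does),
    otherwise return the object unchanged."""
    to_local = getattr(tensor, "to_local", None)
    return to_local() if callable(to_local) else tensor


def broadcast_buffers(buffers, broadcast_fn, src=0):
    """Broadcast every buffer from rank `src`, unwrapping DTensors first.

    `broadcast_fn` stands in for torch.distributed.broadcast(tensor, src=...).
    """
    for buf in buffers:
        broadcast_fn(to_local_if_dtensor(buf), src=src)


# Torch-free demonstration with stand-in objects.
class FakeDTensor:
    """Stand-in for a DTensor that wraps a local shard."""

    def __init__(self, local):
        self._local = local

    def to_local(self):
        return self._local


sent = []  # records what the stand-in broadcast received
broadcast_buffers(
    [FakeDTensor("local_shard"), "plain_tensor"],
    broadcast_fn=lambda t, src: sent.append((t, src)),
)
```

In a real worker one would presumably pass `torch.distributed.broadcast` as `broadcast_fn` and iterate over `model.buffers()`; whether the local shard or the full tensor is the right thing to broadcast depends on how the buffers are placed, which this thread does not state.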

Issues

Closes #621.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

…o dtensor)

Signed-off-by: Parth Chadha <pchadha@nvidia.com>
parthchadha requested review from SahilJain314 and terrykong, and removed the request for SahilJain314, on July 8, 2025 23:19
Contributor

terrykong left a comment

nice!

terrykong added this pull request to the merge queue on Jul 8, 2025
Merged via the queue into main with commit 3b070a4 Jul 9, 2025
13 of 14 checks passed
terrykong deleted the pchadha/fix-qwen-sp-bug branch on July 9, 2025 01:27
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Development

Successfully merging this pull request may close these issues.

FSDP2+TP2 demo script does not work
2 participants