@SahilJain314 SahilJain314 commented May 20, 2025

What does this PR do?

Creates a NamedSharding abstraction that lets us represent the sharding of computation on the workers as a tensor with named axes. Moves all actors to this new method.
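To make the idea concrete, here is a minimal sketch of what "sharding as a tensor with named axes" could look like. It is an illustration only, not the code in this PR: the class name `NamedShardingSketch`, its methods, and the axis names `data_parallel` and `tensor_parallel` are all hypothetical.

```python
import numpy as np


class NamedShardingSketch:
    """Hypothetical illustration: worker ranks laid out in a tensor with named axes."""

    def __init__(self, layout, names):
        self.layout = np.asarray(layout)
        if self.layout.ndim != len(names):
            raise ValueError("need exactly one name per layout axis")
        if len(set(names)) != len(names):
            raise ValueError("axis names must be unique")
        self.names = list(names)

    def axis_size(self, name):
        """Number of shards along the named axis."""
        return self.layout.shape[self.names.index(name)]

    def ranks(self, **coords):
        """Ranks matching the fixed coordinates; unspecified axes are kept whole."""
        unknown = set(coords) - set(self.names)
        if unknown:
            raise KeyError(f"unknown axes: {sorted(unknown)}")
        index = tuple(coords.get(name, slice(None)) for name in self.names)
        return np.atleast_1d(self.layout[index]).flatten().tolist()

    def coords_of(self, rank):
        """Named coordinates of a single rank."""
        positions = np.argwhere(self.layout == rank)
        if len(positions) != 1:
            raise ValueError(f"rank {rank} must appear exactly once in the layout")
        return dict(zip(self.names, (int(i) for i in positions[0])))


# Four workers arranged as 2 data-parallel groups x 2 tensor-parallel shards.
sharding = NamedShardingSketch(
    np.arange(4).reshape(2, 2), ["data_parallel", "tensor_parallel"]
)
print(sharding.ranks(data_parallel=0))    # ranks in the first data-parallel group
print(sharding.ranks(tensor_parallel=1))  # ranks holding the second tensor shard
print(sharding.coords_of(3))              # named position of rank 3
```

With a layout like this, questions such as "which workers share a tensor-parallel group" become a slice along a named axis rather than hand-written rank arithmetic.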

Convergence with tp=1 and tp=2 against the past baseline:
[image]

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
…{Dict, List, Tuple} to primitive dict, list tuple

… tied worker groups

@SahilJain314 SahilJain314 force-pushed the sahilj/named_sharding branch from 7ab88ba to 9cd192b Compare May 20, 2025 00:54
@SahilJain314 SahilJain314 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 20, 2025
@SahilJain314 SahilJain314 marked this pull request as draft May 20, 2025 17:30
@SahilJain314 SahilJain314 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 21, 2025
@github-actions github-actions bot added documentation Improvements or additions to documentation CI Relating to CI labels May 21, 2025
@SahilJain314 SahilJain314 changed the base branch from main to sahilj/type_fix May 21, 2025 20:52
Base automatically changed from sahilj/type_fix to main May 21, 2025 22:33
@github-actions github-actions bot removed documentation Improvements or additions to documentation CI Relating to CI labels May 21, 2025
@SahilJain314 SahilJain314 enabled auto-merge May 22, 2025 00:35
@SahilJain314 SahilJain314 added this pull request to the merge queue May 22, 2025
Merged via the queue into main with commit f04ef67 May 22, 2025
13 of 14 checks passed
@SahilJain314 SahilJain314 deleted the sahilj/named_sharding branch May 22, 2025 01:40
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jun 10, 2025

3 participants