Conversation

SahilJain314 (Contributor) commented Jul 11, 2025

Enables context parallelism and sequence packing for Megatron-Core-based training.

[Image: Qwen 1.5B with TP + PP + CP (2,2,2) highlighted]

SahilJain314 and others added 30 commits May 11, 2025 18:39
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
…{Dict, List, Tuple} to primitive dict, list tuple

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

wip

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix it

Signed-off-by: Terry Kong <terryk@nvidia.com>

patthing fix

Signed-off-by: Terry Kong <terryk@nvidia.com>

wip

Signed-off-by: Terry Kong <terryk@nvidia.com>

doesn't look like i needed that

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix

Signed-off-by: Terry Kong <terryk@nvidia.com>

revert stuff

Signed-off-by: Terry Kong <terryk@nvidia.com>

make it better

Signed-off-by: Terry Kong <terryk@nvidia.com>

go

Signed-off-by: Terry Kong <terryk@nvidia.com>

cleanup

Signed-off-by: Terry Kong <terryk@nvidia.com>

mix it up

Signed-off-by: Terry Kong <terryk@nvidia.com>

touch up

Signed-off-by: Terry Kong <terryk@nvidia.com>

clean

Signed-off-by: Terry Kong <terryk@nvidia.com>

better

Signed-off-by: Terry Kong <terryk@nvidia.com>

clean up

Signed-off-by: Terry Kong <terryk@nvidia.com>

add it in

Signed-off-by: Terry Kong <terryk@nvidia.com>

mcore extra

Signed-off-by: Terry Kong <terryk@nvidia.com>

instructions

Signed-off-by: Terry Kong <terryk@nvidia.com>

works

Signed-off-by: Terry Kong <terryk@nvidia.com>

revert to 3.10, 3.12 didn't seem necessary

Signed-off-by: Terry Kong <terryk@nvidia.com>

ci has to recursively clone

Signed-off-by: Terry Kong <terryk@nvidia.com>

bump build workflow

Signed-off-by: Terry Kong <terryk@nvidia.com>

add megatron.core import

Signed-off-by: Terry Kong <terryk@nvidia.com>

potential fix for unit test on CI

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix the test

Signed-off-by: Terry Kong <terryk@nvidia.com>

this should fix test (it was a collision of namespace)

Signed-off-by: Terry Kong <terryk@nvidia.com>

remove fp8 from test

Signed-off-by: Terry Kong <terryk@nvidia.com>

add shallow

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix base build

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix instructions

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix the messed up indenting

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix

Signed-off-by: Terry Kong <terryk@nvidia.com>

try nesting

Signed-off-by: Terry Kong <terryk@nvidia.com>

okay, got it to work

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix up the readme

Signed-off-by: Terry Kong <terryk@nvidia.com>

ok

Signed-off-by: Terry Kong <terryk@nvidia.com>

touchup

Signed-off-by: Terry Kong <terryk@nvidia.com>

add the lock file back

Signed-off-by: Terry Kong <terryk@nvidia.com>

got

Signed-off-by: Terry Kong <terryk@nvidia.com>
… tied worker groups

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
…session scope

Signed-off-by: Terry Kong <terryk@nvidia.com>
terrykong added the label CI:L1 (Run doctests, unit tests, and functional tests) and removed the label CI:L0 (Run doctests and unit tests) on Jul 19, 2025
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
SahilJain314 force-pushed the sahilj/cp-rebase branch 2 times, most recently from 5816d64 to 35da884, on July 21, 2025 19:19
SahilJain314 removed and re-added the label CI:L1 (Run doctests, unit tests, and functional tests) on Jul 21, 2025
parthchadha previously approved these changes Jul 21, 2025
xxman-google and others added 10 commits July 21, 2025 13:57
Signed-off-by: Xuehan <xxman@google.com>
)

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Co-authored-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
terrykong (Contributor) commented

Closing, since the newer PR #704 makes this one obsolete.

terrykong closed this Jul 23, 2025
Labels
CI:L1 (Run doctests, unit tests, and functional tests)
documentation (Improvements or additions to documentation)
r0.3.0 (Release r0.3.0)

9 participants