
Conversation

@ISEEKYAN (Contributor) commented Apr 8, 2025

Support context parallel for the mcore backend.
Changes on:

  • configs
  • model loader
  • checkpoint
  • single control dispatcher
  • forward preprocess and postprocess
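
The forward preprocess step has to shard each sequence across the context-parallel ranks. Below is a minimal pure-Python sketch of the load-balanced split scheme Megatron-Core uses under a causal mask; the function name and list-based types here are illustrative, not verl's actual helpers:

```python
def shard_for_cp_rank(seq, cp_size, cp_rank):
    # Cut the sequence into 2 * cp_size equal chunks; rank r keeps chunk r
    # plus its mirror chunk (2 * cp_size - 1 - r), so every rank gets one
    # "cheap" early chunk and one "expensive" late chunk under causal attention.
    assert len(seq) % (2 * cp_size) == 0
    n = len(seq) // (2 * cp_size)
    chunks = [seq[i * n:(i + 1) * n] for i in range(2 * cp_size)]
    return chunks[cp_rank] + chunks[2 * cp_size - 1 - cp_rank]

# An 8-token sequence with cp_size=2: chunks are [0,1] [2,3] [4,5] [6,7];
# rank 0 keeps [0,1,6,7] and rank 1 keeps [2,3,4,5].
tokens = list(range(8))
assert shard_for_cp_rank(tokens, 2, 0) == [0, 1, 6, 7]
assert shard_for_cp_rank(tokens, 2, 1) == [2, 3, 4, 5]
```

The postprocess step undoes this split, gathering each rank's two chunks back into their original positions.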

@ISEEKYAN ISEEKYAN marked this pull request as ready for review April 8, 2025 06:35
@ccclyu (Collaborator) commented Apr 8, 2025

Thanks a ton for the quick support! Have you done any benchmarking or testing of training efficiency with context parallel?

@ISEEKYAN (Contributor, Author) commented Apr 8, 2025

> Thanks a ton for the quick support! Have you done any benchmarking or testing of training efficiency with context parallel?

I tried 1 node with 8 H100s, comparing tp4dp2cp1 with tp4dp1cp2. cp2 (gray line) is slower than cp1 in this test. The result is reasonable: this is not a memory-limited situation, so cp2 simply means less data parallelism and more communication. CP becomes useful when the sequence length is larger. So far I have focused on whether the functions are implemented correctly, and I have not had time for further performance testing.

(Screenshot, 2025-04-08 15:09:42)
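
As a sanity check on those two layouts, the data-parallel size follows from the other parallel dimensions (a trivial sketch; pp is 1 in this single-node run):

```python
def data_parallel_size(world_size, tp, pp, cp):
    # dp is whatever remains after tensor, pipeline, and context parallelism
    # have claimed their share of the GPUs.
    assert world_size % (tp * pp * cp) == 0, "parallel dims must divide world size"
    return world_size // (tp * pp * cp)

# 8 H100s: tp4 cp1 -> dp2, tp4 cp2 -> dp1 (the two layouts compared above)
assert data_parallel_size(8, tp=4, pp=1, cp=1) == 2
assert data_parallel_size(8, tp=4, pp=1, cp=2) == 1
```

Raising cp therefore halves dp here, which is why throughput drops when memory is not the bottleneck.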

@eric-haibin-lin (Collaborator) left a comment


@ccclyu please try it out and provide feedback~

  #TODO: support ep
- return os.path.join(checkpoint_path, f"optim", f"distrib_optim_pp{pp_rank}_tp{tp_rank}.pt")
+ return os.path.join(checkpoint_path, f"optim", f"distrib_optim_pp{pp_rank}_tp{tp_rank}_cp{cp_rank}_dp{dp_rank}.pt")
@ISEEKYAN (Contributor, Author) commented:

@ETOgaosion Because the optimizer states are distributed across all GPUs, the dp rank also needs to be saved separately.
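
A self-contained sketch of the resulting naming scheme (the helper name is ours for illustration; the filename pattern matches the changed line quoted above). With Megatron's distributed optimizer, each dp rank holds a distinct shard of the optimizer state, so dp, and now cp, must appear in the filename to keep shards from colliding:

```python
import os

def distrib_optim_ckpt_path(checkpoint_path, pp_rank, tp_rank, cp_rank, dp_rank):
    # One optimizer-state shard file per (pp, tp, cp, dp) coordinate.
    return os.path.join(
        checkpoint_path,
        "optim",
        f"distrib_optim_pp{pp_rank}_tp{tp_rank}_cp{cp_rank}_dp{dp_rank}.pt",
    )

# Example: the shard for pp=0, tp=1, cp=0, dp=3 gets a unique filename.
path = distrib_optim_ckpt_path("ckpt", 0, 1, 0, 3)
assert path.endswith("distrib_optim_pp0_tp1_cp0_dp3.pt")
```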

A Collaborator replied:

Got it~ Will sync this to the docs later.

@ETOgaosion ETOgaosion merged commit 9f405b4 into volcengine:main Apr 10, 2025
21 of 22 checks passed
yanfeng98 pushed a commit to yanfeng98/fork-verl that referenced this pull request Apr 11, 2025
support context parallel for mcore backend.
Changes on:
* configs
* model loader
* checkpoint
* single control dispatcher
* forward preprocess and postprocess

---------

Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
yhyang201 pushed a commit to yhyang201/verl that referenced this pull request Apr 26, 2025
4 participants