
Conversation

@ISEEKYAN (Contributor) commented Apr 8, 2025

Support context parallel for the mcore backend.
Changes on:

  • configs
  • model loader
  • checkpoint
  • single control dispatcher
  • forward preprocess and postprocess
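
The forward preprocess step has to shard each sequence across the context-parallel ranks. Below is a minimal pure-Python sketch of the load-balanced split scheme Megatron-Core uses under a causal mask; the function name and list-based types here are illustrative, not verl's actual helpers:

```python
def shard_for_cp_rank(seq, cp_size, cp_rank):
    # Cut the sequence into 2 * cp_size equal chunks; rank r keeps chunk r
    # plus its mirror chunk (2 * cp_size - 1 - r), so every rank gets one
    # "cheap" early chunk and one "expensive" late chunk under causal attention.
    assert len(seq) % (2 * cp_size) == 0
    n = len(seq) // (2 * cp_size)
    chunks = [seq[i * n:(i + 1) * n] for i in range(2 * cp_size)]
    return chunks[cp_rank] + chunks[2 * cp_size - 1 - cp_rank]

# An 8-token sequence with cp_size=2: chunks are [0,1] [2,3] [4,5] [6,7];
# rank 0 keeps [0,1,6,7] and rank 1 keeps [2,3,4,5].
tokens = list(range(8))
assert shard_for_cp_rank(tokens, 2, 0) == [0, 1, 6, 7]
assert shard_for_cp_rank(tokens, 2, 1) == [2, 3, 4, 5]
```

The postprocess step undoes this split, gathering each rank's two chunks back into their original positions.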

@ISEEKYAN ISEEKYAN marked this pull request as ready for review April 8, 2025 06:35
@ccclyu (Collaborator) commented Apr 8, 2025

Thanks a ton for the quick support! Have you done any benchmarking or testing of training efficiency with context parallel?

@ISEEKYAN (Contributor, Author) commented Apr 8, 2025

> Thanks a ton for the quick support! Have you done any benchmarking or testing of training efficiency with context parallel?

I tried 1 node with 8 H100s, comparing tp4dp2cp1 with tp4dp1cp2. cp2 (gray line) is slower than cp1 in this test. The result is reasonable: this is not a memory-limited situation, so cp2 simply means less data parallelism and more communication. CP becomes useful when the sequence length is larger. So far I have focused on whether the functions are implemented correctly, and I have not had time for further performance testing.

(Screenshot, 2025-04-08 15:09:42)
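
As a sanity check on those two layouts, the data-parallel size follows from the other parallel dimensions (a trivial sketch; pp is 1 in this single-node run):

```python
def data_parallel_size(world_size, tp, pp, cp):
    # dp is whatever remains after tensor, pipeline, and context parallelism
    # have claimed their share of the GPUs.
    assert world_size % (tp * pp * cp) == 0, "parallel dims must divide world size"
    return world_size // (tp * pp * cp)

# 8 H100s: tp4 cp1 -> dp2, tp4 cp2 -> dp1 (the two layouts compared above)
assert data_parallel_size(8, tp=4, pp=1, cp=1) == 2
assert data_parallel_size(8, tp=4, pp=1, cp=2) == 1
```

Raising cp therefore halves dp here, which is why throughput drops when memory is not the bottleneck.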

@eric-haibin-lin (Collaborator) left a comment


@ccclyu please try it out and provide feedback~

  #TODO: support ep
- return os.path.join(checkpoint_path, f"optim", f"distrib_optim_pp{pp_rank}_tp{tp_rank}.pt")
+ return os.path.join(checkpoint_path, f"optim", f"distrib_optim_pp{pp_rank}_tp{tp_rank}_cp{cp_rank}_dp{dp_rank}.pt")
@ISEEKYAN (Contributor, Author) commented:

@ETOgaosion Because the optimizer states are distributed across all GPUs, the dp rank also needs to be saved separately.
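
A self-contained sketch of the resulting naming scheme (the helper name is ours for illustration; the filename pattern matches the changed line quoted above). With Megatron's distributed optimizer, each dp rank holds a distinct shard of the optimizer state, so dp, and now cp, must appear in the filename to keep shards from colliding:

```python
import os

def distrib_optim_ckpt_path(checkpoint_path, pp_rank, tp_rank, cp_rank, dp_rank):
    # One optimizer-state shard file per (pp, tp, cp, dp) coordinate.
    return os.path.join(
        checkpoint_path,
        "optim",
        f"distrib_optim_pp{pp_rank}_tp{tp_rank}_cp{cp_rank}_dp{dp_rank}.pt",
    )

# Example: the shard for pp=0, tp=1, cp=0, dp=3 gets a unique filename.
path = distrib_optim_ckpt_path("ckpt", 0, 1, 0, 3)
assert path.endswith("distrib_optim_pp0_tp1_cp0_dp3.pt")
```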

A Collaborator replied:

Got it~ Will sync this to the docs later.

@ETOgaosion ETOgaosion merged commit 9f405b4 into volcengine:main Apr 10, 2025
21 of 22 checks passed
yanfeng98 pushed a commit to yanfeng98/fork-verl that referenced this pull request Apr 11, 2025
support context parallel for mcore backend.
Changes on:
* configs
* model loader
* checkpoint
* single control dispatcher
* forward preprocess and postprocess

---------

Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
yhyang201 pushed a commit to yhyang201/verl that referenced this pull request Apr 26, 2025
4 participants