[single controller] feat: mitigate pickle cost #1862

vermouth1992 · 2025-06-05T08:04:21Z

Checklist Before Starting

Search for similar PR(s).

What does this PR do?

Call ray.put on all the args in advance to avoid the duplicate serialization cost in Megatron dispatch.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

Issue Number: Fixes issue # or discussion # if any.
Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks.
Add [BREAKING] to the PR title if it breaks any API.
Update the documentation about your changes in the docs.
New CI unit test(s) are added to cover the code path.
Rely on existing unit tests on CI that cover the code path.

eric-haibin-lin

Why does it avoid serialization cost?

ann-qin-lu · 2025-06-05T22:19:32Z

Changes LGTM. Tested on a 40-node setup (TP=8, PP=10, CP=2, DP=2); after this fix, the training time was reduced by 50% :)

hiyouga · 2025-06-06T06:06:41Z

@vermouth1992 will this feature reduce the FSDP dispatch time for VLMs?

vermouth1992 · 2025-06-06T06:11:26Z

No, it won't reduce the FSDP dispatch time, since there is no redundancy in the FSDP dispatch method. Eventually, we will have to use URI-based methods for images/videos.
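To make the "redundancy" point concrete, here is a minimal sketch in plain Python. It does not use verl or Ray. It assumes a Megatron-style dispatch in which every rank of a data-parallel group receives the same shard, and an FSDP-style dispatch in which each worker receives its own shard. The rank-to-shard mappings and the helper `count_serializations` are hypothetical and exist only for illustration.

```python
import pickle


def count_serializations(worker_to_shard, shards, dedup):
    """Return (payload per worker, number of pickle.dumps calls).

    worker_to_shard[i] is the index of the shard sent to worker i.
    With dedup=True, each distinct shard is pickled once and reused.
    """
    calls = 0
    cache = {}
    payloads = []
    for shard_idx in worker_to_shard:
        if dedup and shard_idx in cache:
            # Reuse the bytes that were already produced for this shard.
            payloads.append(cache[shard_idx])
            continue
        blob = pickle.dumps(shards[shard_idx])
        calls += 1
        cache[shard_idx] = blob
        payloads.append(blob)
    return payloads, calls


# Two data-parallel shards (hypothetical toy data).
shards = [list(range(1000)), list(range(1000, 2000))]

# Assumed Megatron-like layout: 8 workers, 4 per data-parallel group,
# so each shard is sent to 4 workers.
megatron_like = [0, 0, 0, 0, 1, 1, 1, 1]
# Assumed FSDP-like layout: one distinct shard per worker.
fsdp_like = [0, 1]

naive_payloads, naive_calls = count_serializations(megatron_like, shards, dedup=False)
dedup_payloads, dedup_calls = count_serializations(megatron_like, shards, dedup=True)
_, fsdp_calls = count_serializations(fsdp_like, shards, dedup=False)

print(naive_calls, dedup_calls, fsdp_calls)
```

Under these assumptions, deduplication only helps when several workers receive the same shard; when every worker gets a distinct shard there is nothing to share.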

ccclyu · 2025-06-06T06:16:09Z

ray.put serializes each argument once and returns an ObjectRef. Subsequent calls pass only the lightweight references. If these lines were removed and the arguments were large, each dispatch to the remote workers would cause Ray to serialize and transfer the full objects repeatedly—one copy per worker. For large payloads, that duplicate serialization and network transfer can significantly increase memory usage and slow down the dispatch, or even exceed available resources. Using ray.put avoids that overhead by storing the large data once and reusing the references.

Quote from OpenAI Codex agent that helped me understand the code logic LOL cc: @eric-haibin-lin

hiyouga · 2025-06-06T06:17:57Z

@vermouth1992 sure, I'll refer to this PR and migrate it into verl

GHGmc2 · 2025-06-06T06:41:51Z

> @vermouth1992 sure, I'll refer to this PR and migrate it into verl

Glad to hear that. I believe this can fix the performance issue with multi-modal datasets described in #1418.

vermouth1992 added 2 commits June 5, 2025 16:03

mitigate pickle cost

a9202f6

fix

f0d8fd2

eric-haibin-lin reviewed Jun 5, 2025

eric-haibin-lin approved these changes Jun 5, 2025

vermouth1992 merged commit f1fd0f0 into main Jun 6, 2025
37 checks passed

vermouth1992 deleted the chi/pickle branch June 6, 2025 01:35
