Skip to content

Conversation

vermouth1992
Copy link
Collaborator

@vermouth1992 vermouth1992 commented Jun 5, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

ray put all the args in advance to avoid duplicate serialization cost for megatron dispatch.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

Copy link
Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it avoid serialization cost?

@ann-qin-lu
Copy link
Contributor

Changes LGTM. Tested with 40 nodes setup (TP=8, PP=10, CP=2, DP=2) and after this fix, the training time reduced by 50% :)

@vermouth1992 vermouth1992 merged commit f1fd0f0 into main Jun 6, 2025
37 checks passed
@vermouth1992 vermouth1992 deleted the chi/pickle branch June 6, 2025 01:35
@hiyouga
Copy link
Collaborator

hiyouga commented Jun 6, 2025

@vermouth1992 will this feature reduce the time of VLMs fsdp dispatch?

@vermouth1992
Copy link
Collaborator Author

vermouth1992 commented Jun 6, 2025

No, it won't reduce fsdp dispatch as there is no redundancy in fsdp dispatch method. Eventually, we have to use uri based methods for images/videos.

@ccclyu
Copy link
Collaborator

ccclyu commented Jun 6, 2025

ray.put serializes each argument once and returns an ObjectRef. Subsequent calls pass only the lightweight references. If these lines were removed and the arguments were large, each dispatch to the remote workers would cause Ray to serialize and transfer the full objects repeatedly—one copy per worker. For large payloads, that duplicate serialization and network transfer can significantly increase memory usage and slow down the dispatch, or even exceed available resources. Using ray.put avoids that overhead by storing the large data once and reusing the references.

Quote from OpenAI Codex agent that helped me understand the code logic LOL cc: @eric-haibin-lin

@hiyouga
Copy link
Collaborator

hiyouga commented Jun 6, 2025

@vermouth1992 sure, I'll refer to this PR and migrate it into verl

@GHGmc2
Copy link
Contributor

GHGmc2 commented Jun 6, 2025

@vermouth1992 sure, I'll refer to this PR and migrate it into verl

Glad to hear that, I believe this can fix the performance issue from multi-modal datasets as described here: #1418

yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

ray put all the args in advance to avoid duplicate serialization cost
for megatron dispatch.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluatuion results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants