Conversation

Contributor

@mingruimingrui commented Apr 23, 2025

What does this PR do?

This PR implements fused losses for alignment (#710).
It reduces the memory required for loss calculation to a small constant amount.

ChangeLog:

  • Added the option `use_fused_kernels`
  • Added a monkey patch so that `model.forward` returns `last_hidden_state` instead of computing logits
  • Added `FusedLinearForPPO` to `verl/utils/experimental/torch_functional.py`

Usage

Simply add the following option

actor_rollout_ref.model.use_fused_kernels=True

Before submitting

  • Did you read the Contribute Guide and finish the code format check?
  • Did you make sure to update the documentation in the docs with your changes, especially for breaking configs etc.?
  • Did you write any test cases if necessary? Please add CI tests for your new feature.

Additional Info:

  • The current implementation uses chunking to reduce memory consumption to a constant value.
    • It works by splitting the loss calculation into chunks of 512 tokens, computing the log_probs / entropy values / gradients for each chunk, and accumulating them.
    • However, the current implementation can be slow, because it processes each chunk sequentially in a Python for loop.
    • In the future we should consider converting the fused functions to Triton or some other JIT solution.
  • Compared to a FusedPPOLossFunction, optimizing the hidden_states -> entropy & log_probs step is much better for algorithm developers: the memory-heavy part is optimized away for them, and they are free to combine the values in their own custom loss functions.
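The chunking idea above can be sketched as follows. This is a minimal illustration, not the actual `FusedLinearForPPO` implementation; the function name, signature, and chunk handling here are hypothetical:

```python
# Hypothetical sketch of the chunked log_prob / entropy computation.
# Instead of materializing the full [num_tokens, vocab_size] logits tensor,
# only one chunk of logits is alive at a time.
import torch
import torch.nn.functional as F

def chunked_logprobs_entropy(hidden_states, lm_weight, input_ids, chunk_size=512):
    """hidden_states: [num_tokens, hidden_dim]
    lm_weight:     [vocab_size, hidden_dim] (the lm_head weight)
    input_ids:     [num_tokens] target token ids
    Returns per-token log_probs and entropies, each of shape [num_tokens].
    """
    log_probs_out, entropy_out = [], []
    for start in range(0, hidden_states.size(0), chunk_size):
        h = hidden_states[start:start + chunk_size]   # [c, hidden_dim]
        logits = h @ lm_weight.t()                    # [c, vocab]; only this chunk is alive
        logp = F.log_softmax(logits.float(), dim=-1)
        ids = input_ids[start:start + chunk_size]
        # log-prob of the observed token for each position in the chunk
        log_probs_out.append(logp.gather(-1, ids.unsqueeze(-1)).squeeze(-1))
        # entropy H = -sum_v p(v) log p(v)
        entropy_out.append(-(logp.exp() * logp).sum(-1))
    return torch.cat(log_probs_out), torch.cat(entropy_out)
```

The sequential Python loop is exactly the slowness mentioned above; a Triton kernel would fuse the matmul, softmax, gather, and entropy reduction per chunk.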

@hiyouga self-requested a review April 23, 2025 07:21
@ETOgaosion
Collaborator

@mingruimingrui Could you also add the e2e integration test? Simply enable `actor_rollout_ref.model.use_fused_kernels=True` in `tests/e2e/ppo_trainer/run_function_reward.sh`.

And we should set it to `True` by default (`use_fused_kernels: True`) in `verl/trainer/config/ppo_trainer.yaml` to reduce memory.

@ETOgaosion merged commit eb077f6 into volcengine:main May 16, 2025
37 of 38 checks passed
@mingruimingrui deleted the feat/memory-optimized-loss branch May 18, 2025 09:43
@eric-haibin-lin mentioned this pull request May 19, 2025
feifeibear added a commit to feifeibear/verl that referenced this pull request May 20, 2025
@plutoZZZZ mentioned this pull request May 20, 2025
vermouth1992 pushed a commit that referenced this pull request May 20, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?
Currently, the `e2e_prime` test encounters the error `AttributeError:
'NoneType' object has no attribute 'squeeze'`, which is caused by #1212.

In PR #1568, the parameter `use_fused_kernels` in `ppo_trainer.yaml`
was set to `false`, but the corresponding parameter in
`prime_trainer.yaml` was not updated. This is preventing the CI from
passing. Until the root cause of the `use_fused_kernels` issue is fully
resolved, we should temporarily set `use_fused_kernels` to `false` in
`prime_trainer.yaml`.
### High-Level Design

Not needed

### Specific Changes

- Set `use_fused_kernels` to `False` by default in `prime_trainer.yaml`
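
For reference, the change amounts to a config fragment along these lines (a sketch; the exact surrounding keys in `prime_trainer.yaml` are assumed to mirror `ppo_trainer.yaml`):

```yaml
# prime_trainer.yaml (sketch — surrounding structure assumed)
actor_rollout_ref:
  model:
    use_fused_kernels: False
```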

### API

Not needed

### Usage Example

Not needed

### Test

Not needed

### Additional Info.

Not needed

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.