Open
Description
With async rollout, the generate stage is already pipelined per sample. However, the whole batch must still finish generating before the reward stage can start. When compute_reward_async is enabled, reward calculation runs in parallel with old_log_prob and value computation, but in practice it is often the slowest of the three. The GPUs then sit idle until the rewards are ready and advantages can be computed (the original issue illustrates this with a figure).
To start reward calculation sooner and avoid GPU idle time, it’s better to integrate compute_score into the generate pipeline:
- compute_score will be executed by a Ray actor.
- The reward manager gets Ray futures from compute_score, then calculates reward_tensor and reward_extra_info from the scores.
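
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: a standard-library `ThreadPoolExecutor` stands in for the Ray actor so the snippet runs without a cluster, and `generate_with_scores`, `reward_manager`, the exact-match `compute_score` body, and the choice to place each score on the last response token are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_score(response: str, ground_truth: str) -> float:
    """Hypothetical scorer: 1.0 on exact match, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


def generate_with_scores(prompts, ground_truths, executor):
    """Submit compute_score as soon as each sample finishes generating.

    The executor is a stand-in for a Ray actor; each submit() returns a
    future immediately, so scoring overlaps with the remaining generation.
    """
    responses, futures = [], []
    for prompt, truth in zip(prompts, ground_truths):
        response = prompt.upper()  # placeholder for the real rollout
        responses.append(response)
        futures.append(executor.submit(compute_score, response, truth))
    return responses, futures


def reward_manager(futures, response_lengths, max_len):
    """Resolve the futures, then build reward_tensor and reward_extra_info."""
    scores = [f.result() for f in futures]
    # Assumption: token-level rewards are zero except at the last response token.
    reward_tensor = [[0.0] * max_len for _ in scores]
    for row, score, length in zip(reward_tensor, scores, response_lengths):
        if length > 0:
            row[min(length, max_len) - 1] = score
    reward_extra_info = {"score": scores}
    return reward_tensor, reward_extra_info


def run_demo():
    prompts = ["ab", "cde"]
    truths = ["AB", "xyz"]
    with ThreadPoolExecutor(max_workers=2) as executor:
        responses, futures = generate_with_scores(prompts, truths, executor)
        lengths = [len(r) for r in responses]
        return reward_manager(futures, lengths, max_len=4)
```

With Ray, `executor.submit(...)` would presumably become a remote call on the scoring actor and `f.result()` a `ray.get` on the returned object refs; the point of the sketch is only that the reward manager consumes futures created during generation instead of starting scoring after the batch completes.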