Async pipeline in generate and compute score #1584

@chenhaiq

With async rollout, the generate stage already runs as its own pipeline. However, we still have to wait for the whole batch to complete before moving to the reward stage. When compute_reward_async is enabled, reward calculation can run in parallel with old_log_prob and value computation, but in practice it is often the slower of the two paths. The GPUs therefore sit idle until rewards are ready and advantages can be computed, as shown in the figure below:

Image

To start reward calculation sooner and avoid GPU idle time, it’s better to integrate compute_score into the generate pipeline:

  1. compute_score will be executed by a Ray actor.
  2. The reward manager gets Ray futures from compute_score, then calculates reward_tensor and reward_extra_info from the scores.
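
The two steps above can be sketched roughly as follows. This is only an illustration of the control flow, not the proposed implementation: a standard-library `ThreadPoolExecutor` stands in for the Ray actor, and its `Future` objects stand in for Ray futures. The scorer body, `generate_one`, `generate_with_scoring`, `collect_rewards`, and the choice to place each score on the last response token are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def compute_score(response: str, ground_truth: str) -> dict:
    """Hypothetical scorer: exact match, plus one extra field."""
    score = 1.0 if response.strip() == ground_truth else 0.0
    return {"score": score, "length": len(response)}


def generate_with_scoring(prompts, ground_truths, generate_one, pool):
    """Submit compute_score as soon as each response exists.

    The loop here is sequential for simplicity; with async rollout each
    sample would finish on its own schedule and be submitted at that point.
    """
    responses, futures = [], []
    for prompt, truth in zip(prompts, ground_truths):
        response = generate_one(prompt)
        responses.append(response)
        # Stand-in for dispatching compute_score to a Ray actor.
        futures.append(pool.submit(compute_score, response, truth))
    return responses, futures


def collect_rewards(token_counts, futures, max_len):
    """Reward-manager side: resolve futures into reward_tensor and extras."""
    reward_tensor = [[0.0] * max_len for _ in futures]
    reward_extra_info = defaultdict(list)
    for i, future in enumerate(futures):
        result = future.result()  # blocks only if this score is not ready yet
        # Assumption: the scalar score is placed on the last response token.
        reward_tensor[i][token_counts[i] - 1] = result["score"]
        for key, value in result.items():
            reward_extra_info[key].append(value)
    return reward_tensor, dict(reward_extra_info)


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for the rollout engine."""
    return prompt.upper()


prompts = ["a b", "c"]
ground_truths = ["A B", "x"]

with ThreadPoolExecutor(max_workers=2) as pool:
    responses, futures = generate_with_scoring(
        prompts, ground_truths, fake_generate, pool
    )
    token_counts = [len(r.split()) for r in responses]
    reward_tensor, reward_extra_info = collect_rewards(
        token_counts, futures, max_len=2
    )
```

Because scoring starts while later samples are still being generated, most futures should already be resolved by the time the reward manager collects them, which is where the idle time would be recovered.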
