Conversation

@ETOgaosion commented May 21, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

  1. Add dynamic batch size support to Megatron, to rebalance the workloads.
  2. Fix missing critic metrics.

High-Level Design

Follow FSDP's dynamic batch size design.

Specific Changes

Use the `rearrange_micro_batches` API, but keep it compatible with Megatron VPP constraints.

```py
vpp_size = mpu.get_virtual_pipeline_model_parallel_world_size()
if vpp_size is not None and vpp_size > 1:
    microbatch_group_size_per_vp_stage = self.tf_config.microbatch_group_size_per_vp_stage
    micro_batches, indices = rearrange_micro_batches(batch=mini_batch.batch, num_batches_devided_by=microbatch_group_size_per_vp_stage, max_token_len=max_token_len)
    assert len(micro_batches) % self.tf_config.microbatch_group_size_per_vp_stage == 0, f"micro_batches {micro_batches} must be divisible by microbatch_group_size_per_vp_stage {microbatch_group_size_per_vp_stage} for megatron backend"
else:
    micro_batches, indices = rearrange_micro_batches(batch=mini_batch.batch, max_token_len=max_token_len)
```

@vermouth1992 please check whether it makes sense.

Megatron's constraint when using the interleaved pipeline schedule:

```py
# If the final micro-batch group has fewer micro-batches than pipeline-parallel size,
# the pipeline will have dependency bubbles.
final_microbatch_group_size = num_microbatches % config.microbatch_group_size_per_vp_stage
if 0 < final_microbatch_group_size < pipeline_parallel_size:
    msg = 'The remainder of M (the total micro-batches) divided by N (number of '
    msg += 'contiguous micro-batches in a virtual pipeline stage) should be 0, '
    msg += 'or larger than or equal to the pipeline-parallel size, but it is '
    msg += f'{final_microbatch_group_size}. '
    msg += 'Otherwise, it introduces dependency bubbles in the pipeline '
    msg += 'and reduces throughput.'
    raise RuntimeError(msg)
```
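
Concretely, the dynamic-batch path must produce a number of micro-batches that is a multiple of `microbatch_group_size_per_vp_stage`, which is what the `num_batches_devided_by` argument above enforces. A small numeric check of the constraint, with purely illustrative values:

```py
# Illustrative values only, not taken from a real run.
pipeline_parallel_size = 4
microbatch_group_size_per_vp_stage = 4

for num_microbatches in (8, 10, 12):
    final_group = num_microbatches % microbatch_group_size_per_vp_stage
    ok = not (0 < final_group < pipeline_parallel_size)
    print(num_microbatches, final_group, "ok" if ok else "dependency bubbles")
# 8  -> 0 -> ok
# 10 -> 2 -> dependency bubbles (the case the divisibility assert above rules out)
# 12 -> 0 -> ok
```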

API

Megatron's `forward_backward_batch` has a changed input signature, and its output has become a dict containing the original `output` and the `indices` needed for `compute_old_log_probs`.
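
For consumers of the dict, here is a hypothetical sketch of how the `indices` could be used to put the concatenated per-micro-batch results back into the original mini-batch order (the dict layout and the inversion below are assumptions for illustration, not the exact verl code):

```py
import torch

def restore_original_order(result: dict) -> torch.Tensor:
    # Assumed layout: {"output": [tensor per micro-batch], "indices": [list[int] per micro-batch]},
    # where each index list records which original samples landed in that micro-batch.
    log_probs = torch.cat(result["output"], dim=0)  # packed (dynamic-batch) order
    flat_indices = [i for micro in result["indices"] for i in micro]
    # Invert the packing permutation so row i again corresponds to original sample i.
    reverse_idx = torch.empty(len(flat_indices), dtype=torch.long)
    reverse_idx[torch.tensor(flat_indices)] = torch.arange(len(flat_indices))
    return log_probs[reverse_idx]
```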

Usage Example

```bash
    actor_rollout_ref.actor.use_dynamic_bsz=${USE_DYNAMIC_BSZ} \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${ppo_max_token_len_per_gpu} \
    critic.ppo_max_token_len_per_gpu=${forward_max_token_len_per_gpu} \
```

Other models will directly copy this config.

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@ETOgaosion requested review from ccclyu and vermouth1992 May 21, 2025 15:20
@ccclyu left a comment

High-level comment: the same logic appears in actor, critic, and reward, and it might be better to redesign it into a standalone module. Not urgent for this PR.
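
As a rough sketch of what such a standalone helper could look like (purely hypothetical, assembled from the snippets in this PR rather than verl's actual module layout; `rearrange_micro_batches`, `tf_config`, and `mpu` are passed in only to keep the sketch self-contained):

```py
def split_micro_batches_with_vpp_constraint(batch, max_token_len, tf_config, mpu, rearrange_micro_batches):
    """Hypothetical shared helper for actor/critic/reward workers: token-balanced
    micro-batching that also respects Megatron's VPP grouping constraint."""
    vpp_size = mpu.get_virtual_pipeline_model_parallel_world_size()
    if vpp_size is not None and vpp_size > 1:
        group_size = tf_config.microbatch_group_size_per_vp_stage
        micro_batches, indices = rearrange_micro_batches(
            batch=batch,
            num_batches_devided_by=group_size,  # spelling follows the parameter discussed below
            max_token_len=max_token_len,
        )
        assert len(micro_batches) % group_size == 0, (
            f"{len(micro_batches)} micro-batches not divisible by group size {group_size}"
        )
    else:
        micro_batches, indices = rearrange_micro_batches(batch=batch, max_token_len=max_token_len)
    return micro_batches, indices
```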

```py
    return a + (a % b)


def rearrange_micro_batches(batch, max_token_len, dp_group=None, num_batches_devided_by=None, same_micro_num_in_dp=True, min_num_micro_batch=None):
```
@ccclyu commented May 26, 2025

Should this be a typo? `num_batches_devided_by` -> `num_batches_divided_by`. It appears in multiple places.

@ETOgaosion (author) replied:

Thanks for reviewing~ Maybe I already fixed it in a previous commit?

@ccclyu commented May 27, 2025

No, it should be `divided`, not `devided`. The previous commit did not fix this; there are multiple places still using `devided`.

"""
Split a batch into micro-batches by total token count, with optional DP sync and padding.

Args:
batch (TensorDict): must include "attention_mask" (B*S); other fields are sliced similarly.
max_token_len (int): max sum of attention_mask per micro-batch.
dp_group (optional): torch.distributed group for data-parallel sync.
vpp_size (optional): virtual pipeline parallel size, for megatron.
A collaborator commented:

Is `vpp_size` in the `rearrange_micro_batches` definition?

@ETOgaosion (author) replied:

I used to pass vpp_size, which was an error; it uses num_batches_devided_by instead~

```
@@ -212,14 +212,19 @@ def ceildiv(a, b):
    return -(a // -b)


def rearrange_micro_batches(batch, max_token_len, dp_group=None, same_micro_num_in_dp=True, min_num_micro_batch=None):
def roundup_divisible(a, b):
    return a + (a % b)
```
A collaborator commented:

I think the implementation is not correct. How about the following one? You can double-check it.

return ((a + b - 1) // b) * b

@ETOgaosion (author) commented May 26, 2025

Yeah, this is correct. I tried to use %, which has faulty logic.
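
A quick self-contained check of the two formulas (illustrative names, not code from this PR):

```py
def roundup_buggy(a, b):
    return a + (a % b)             # (7, 4) -> 7 + 3 = 10, which is not a multiple of 4

def roundup_correct(a, b):
    return ((a + b - 1) // b) * b  # (7, 4) -> 8; (8, 4) -> 8; (9, 4) -> 12

print([roundup_buggy(a, 4) for a in (7, 8, 9)])    # [10, 8, 10]
print([roundup_correct(a, 4) for a in (7, 8, 9)])  # [8, 8, 12]
```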

@ETOgaosion (author) commented:

Regarding the repeated content: yes, I also feel there is common logic shared by these models; we can use abstraction to reduce the repetition.

@ETOgaosion enabled auto-merge (squash) May 28, 2025 02:48
@ETOgaosion merged commit 432f9e9 into volcengine:main May 28, 2025
37 checks passed
ETOgaosion added a commit to Jianbing-D/verl that referenced this pull request Jun 8, 2025
…e workloads (volcengine#1617)

wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
…e workloads (volcengine#1617)
