Fix Configuration for Micro Batch Size in Megatron's Ref Policy #1700
Conversation
verl/workers/megatron_workers.py
Outdated
else:
    if self.config.ref.get("log_prob_micro_batch_size_per_gpu", None):
        self.config.ref.ppo_micro_batch_size_per_gpu = self.config.ref.log_prob_micro_batch_size_per_gpu
    elif self.config.ref.get("ppo_micro_batch_size_per_gpu", None):
Thanks for the contribution!
I think there is a typo here, so we may not need to consider ppo_micro_batch_size_per_gpu at all; you can simply check the key above~
Fixed: deleted the ppo_micro_batch_size_per_gpu branch.
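For illustration, here is a minimal standalone sketch of the simplified branch suggested above: only log_prob_micro_batch_size_per_gpu is consulted and mapped onto the ppo_micro_batch_size_per_gpu key that megatron_actor.py reads. The function name and the OmegaConf-based usage are assumptions made for this sketch, not the exact code in megatron_workers.py.

```python
from omegaconf import OmegaConf


def resolve_ref_micro_batch_size(ref_config):
    """Map ref.log_prob_micro_batch_size_per_gpu onto the key megatron_actor.py expects.

    Illustrative sketch only; the real logic lives inside the Megatron worker
    and differs in detail (e.g. it also handles the non-per-GPU setting).
    """
    if ref_config.get("log_prob_micro_batch_size_per_gpu", None):
        ref_config.ppo_micro_batch_size_per_gpu = ref_config.log_prob_micro_batch_size_per_gpu
    else:
        raise ValueError("ref.log_prob_micro_batch_size_per_gpu must be set for the Megatron ref policy")
    return ref_config


if __name__ == "__main__":
    # Toy usage: the per-GPU log-prob setting is copied to the PPO key.
    cfg = OmegaConf.create({"log_prob_micro_batch_size_per_gpu": 4})
    print(resolve_ref_micro_batch_size(cfg))
```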
What does this PR do?
Fix Configuration for Micro Batch Size in Megatron's Ref Policy
High-Level Design
This pull request addresses an issue with the micro batch size configuration in the ref policy of Megatron. The default ppo_megatron_trainer.yaml only includes two configurations: log_prob_micro_batch_size and log_prob_micro_batch_size_per_gpu.
https://github.com/volcengine/verl/blob/54c9b7364c2d188b2ba4107404cfa3c2b446df19/verl/trainer/config/ppo_megatron_trainer.yaml#L119-L120
However, in megatron_workers.py the required configuration is ref.log_prob_micro_batch_size_per_gpu
https://github.com/volcengine/verl/blob/54c9b7364c2d188b2ba4107404cfa3c2b446df19/verl/workers/megatron_workers.py#L517-L518
and in megatron_actor.py the required configuration is ref.ppo_micro_batch_size_per_gpu
https://github.com/volcengine/verl/blob/54c9b7364c2d188b2ba4107404cfa3c2b446df19/verl/workers/actor/megatron_actor.py#L271-L274
neither of which is directly related to ppo_micro_batch_size.
To resolve this, I modified the configuration calculations in the ref-policy setup and added raise ValueError statements so that the necessary parameters must be defined before use. This prevents runtime errors from missing micro batch size settings and makes the training setup more robust. A rough sketch of the presence check follows at the end of this description.
Changes Made:
Modified the configuration calculations in megatron_workers.py.
Added raise ValueError statements to check for the presence of log_prob_micro_batch_size_per_gpu and ppo_micro_batch_size_per_gpu.
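As a rough illustration of the presence checks described above, the following standalone sketch validates that at least one of the two per-GPU keys is set before the ref policy is built. It assumes a plain dict-like ref config and a hypothetical helper name; the actual change lives in megatron_workers.py and may differ in detail.

```python
def validate_ref_micro_batch_config(ref_config: dict) -> None:
    """Raise early if neither per-GPU micro batch size key is configured.

    Illustrative only: mirrors the checks described in this PR, not the
    exact code added to megatron_workers.py.
    """
    has_log_prob_key = ref_config.get("log_prob_micro_batch_size_per_gpu") is not None
    has_ppo_key = ref_config.get("ppo_micro_batch_size_per_gpu") is not None
    if not (has_log_prob_key or has_ppo_key):
        raise ValueError(
            "Please set ref.log_prob_micro_batch_size_per_gpu "
            "(or ref.ppo_micro_batch_size_per_gpu) for the Megatron ref policy."
        )


# Example: only the global log_prob_micro_batch_size is set, so the check fails.
try:
    validate_ref_micro_batch_config({"log_prob_micro_batch_size": 8})
except ValueError as err:
    print(f"Config rejected: {err}")
```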