
Conversation

Qsingle
Copy link

@Qsingle Qsingle commented Jul 2, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Add initial support for Gemma3

High-Level Design

Abstract the preprocessing procedure for LMMs, making it easy to add new multi-modal models.

Specific Changes

  • Add the preprocessor API in verl.utils.dataset.
  • Modify verl.utils.dataset.rl_dataset to support the preprocessor API.

API

Add a preprocessor abstraction for multi-modality models; a rough registration sketch is shown below.
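For illustration, a minimal sketch of how a new model's preprocessor would plug into the registry, inferred from the diff context shown later in this thread; the import path, registry name, and base-class API here are assumptions rather than the final interface:

from PIL import Image

# Assumed location of the new API added by this PR
from verl.utils.dataset import PREPROCESSOR_REGISTER, BasicPreprocessor


@PREPROCESSOR_REGISTER.register()
class MyVLMPreprocessor(BasicPreprocessor):
    """Hypothetical preprocessor for a new multi-modal model."""

    def process_image(self, image, **kwargs):
        # Normalize whatever the dataset provides (PIL image, local path, URL, base64)
        # into an RGB PIL image that the model's processor can consume.
        if not isinstance(image, Image.Image):
            image = Image.open(image)
        return image.convert("RGB")

The idea is that rl_dataset can then look up the registered preprocessor for a model instead of hard-coding per-model branches.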

Usage Example

Provide usage example(s) for easier usage.

set -x
ENGINE=${1:-vllm}
export NCCL_DEBUG=WARN


python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/geo3k/train.parquet \
    data.val_files=$HOME/data/geo3k/test.parquet \
    data.train_batch_size=4 \
    data.max_prompt_length=4096 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.image_key=images \
    actor_rollout_ref.model.path=google/gemma-3-4b-it \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.actor.optim.lr=2e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=$ENGINE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_colon' \
    trainer.experiment_name='gemma3_12b_it_colon' \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.save_freq=3000 \
    trainer.val_before_train=False \
    trainer.test_freq=-1 \
    trainer.total_epochs=15 $@

Test

  • To verify the model: currently, my GPU does not have enough memory to verify training of a model like Gemma3 (out of memory on the GPU), but I have checked that support for the Qwen2VL series is not broken by this PR.

The training curve for InternVL2.5-1B: [training-curve image]

The training curve for InternVL3-1B: [training-curve image]

Additional Info.

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

Qsingle and others added 2 commits July 2, 2025 17:42
add support for gemma3 and internvl for grpo training
@Weiyun1025
Copy link

2025-07-04 17:06:43,418	INFO cli.py:88 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 355, in generate_sequences
    response_attention_mask = get_response_mask(
  File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 242, in get_response_mask
    eos_mask = torch.isin(response_id, torch.tensor(eos_token, device=response_id.device)).int()
RuntimeError: Could not infer dtype of NoneType

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I encountered this error while using this PR to train InternVL3. Do you have any suggestions?
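For reference, the failure reproduces whenever eos_token_id resolves to None before reaching get_response_mask, because torch.tensor(None) cannot infer a dtype. A minimal standalone repro, independent of verl:

import torch

response_id = torch.tensor([[1, 2, 3, 4]])
eos_token = None  # e.g. generation_config.eos_token_id was never populated

# Raises: RuntimeError: Could not infer dtype of NoneType
torch.isin(response_id, torch.tensor(eos_token, device=response_id.device))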

@xylcbd
Copy link
Contributor

xylcbd commented Jul 4, 2025

LoRA training will report an error; it needs to be fixed like this:

# verl/utils/fsdp_utils.py, around line 90
default_transformer_cls_names_to_wrap = getattr(module, "_no_split_modules", None)
if re.match("internvl", module.__class__.__name__, re.IGNORECASE) or (
    module.__class__.__name__ == "PeftModelForCausalLM"
    and re.match("internvl", module.base_model.model.__class__.__name__, re.IGNORECASE)
):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "LlamaDecoderLayer":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
elif re.match("gemma3", module.__class__.__name__, re.IGNORECASE) or (
    module.__class__.__name__ == "PeftModelForCausalLM"
    and re.match("gemma3", module.base_model.model.__class__.__name__, re.IGNORECASE)
):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "SiglipMultiheadAttentionPoolingHead":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
fsdp_transformer_layer_cls_to_wrap = _get_attr(
    "transformer_layer_cls_to_wrap", default_transformer_cls_names_to_wrap
)

@Weiyun1025
Copy link

        meta_info = {
            "eos_token_id": self.generation_config.eos_token_id
            if getattr(self.generation_config, "eos_token_id", None) is not None
            else self.tokenizer.eos_token_id,
            "pad_token_id": self.generation_config.pad_token_id
            if getattr(self.generation_config, "pad_token_id", None) is not None
            else self.tokenizer.pad_token_id,
        }

Seems that fsdp_workers.py should be modified to set the correct eos_token_id when eos_token_id is not set in the generation_config.

@Qsingle
Copy link
Author

Qsingle commented Jul 5, 2025

        meta_info = {
            "eos_token_id": self.generation_config.eos_token_id
            if getattr(self.generation_config, "eos_token_id", None) is not None
            else self.tokenizer.eos_token_id,
            "pad_token_id": self.generation_config.pad_token_id
            if getattr(self.generation_config, "pad_token_id", None) is not None
            else self.tokenizer.pad_token_id,
        }

Seems that fsdp_workers.py should be modified to set the correct eos_token_id when eos_token_id is not set in the generation_config.

Could you provide the script you used to train the InternVL3?

@Weiyun1025
Copy link

ray job submit --address=${RAY_ADDRESS} \
    -- python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=[${CURRENT_PATH}/verl_data_with_gt/math_pkg_250701.json_geo3k_acc.parquet] \
    data.val_files=${CURRENT_PATH}/verl_data/geo3k/test.parquet \
    data.train_batch_size=${ROLLOUT_BATCH_SIZE} \
    data.max_prompt_length=18432 \
    data.max_response_length=32768 \
    data.filter_overlong_prompts=True \
    data.filter_overlong_prompts_workers=8 \
    data.truncation='error' \
    data.image_key=images \
    data.trust_remote_code=True \
    actor_rollout_ref.model.path=${CURRENT_PATH}/pretrained/InternVL3-1B-64K \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${MICRO_ROLLOUT_BATCH_SIZE} \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${TENSOR_PARALLEL} \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=${N_SAMPLES_PER_PROMPT} \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
    actor_rollout_ref.actor.loss_agg_mode=token-mean \
    algorithm.use_kl_in_reward=False \
    algorithm.kl_ctrl.kl_coef=0.0 \
    trainer.critic_warmup=0 \
    trainer.default_local_dir=${OUTPUT_PATH} \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name=${PROJECT_NAME} \
    trainer.experiment_name=${TASK_NAME} \
    trainer.n_gpus_per_node=${NPROC_PER_NODE} \
    trainer.nnodes=${WORLD_SIZE} \
    trainer.save_freq=20 \
    trainer.test_freq=5000 \
    trainer.val_before_train=False \
    trainer.rollout_data_dir=${OUTPUT_PATH}/rollouts \
    trainer.total_epochs=100 2>&1 | tee ${JOBLOG}

BTW, I encountered another error when setting actor_rollout_ref.actor.ulysses_sequence_parallel_size=2. It seems there are still some issues with the SP adaptation for InternVL3.

(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/ray/base.py", line 710, in func
(TaskRunner pid=5245)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/base/decorator.py", line 549, in inner
(TaskRunner pid=5245)     return func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/fsdp_workers.py", line 802, in compute_log_prob
(TaskRunner pid=5245)     output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 81, in f
(TaskRunner pid=5245)     return self.log(decorated_function, *args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 94, in log
(TaskRunner pid=5245)     output = func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 364, in compute_log_prob
(TaskRunner pid=5245)     entropy, log_probs = self._forward_micro_batch(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 197, in _forward_micro_batch
(TaskRunner pid=5245)     log_probs = logprobs_from_logits(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 87, in logprobs_from_logits
(TaskRunner pid=5245)     output = logprobs_from_logits_flash_attn(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 97, in logprobs_from_logits_flash_attn
(TaskRunner pid=5245)     output = cross_entropy_loss(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 319, in cross_entropy_loss
(TaskRunner pid=5245)     return CrossEntropyLoss.apply(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
(TaskRunner pid=5245)     return super().apply(*args, **kwargs)  # type: ignore[misc]
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 170, in forward
(TaskRunner pid=5245)     assert labels.shape == (n_rows,)
(TaskRunner pid=5245) AssertionError
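For reference, the assertion at the bottom of the trace only checks that labels has one entry per logits row. A toy illustration of the mismatch (the link to Ulysses sequence-parallel slicing is an assumption based on when the error appears):

import torch

logits = torch.randn(8, 32000)            # n_rows = 8 rows of logits after SP slicing
labels = torch.randint(0, 32000, (10,))   # label length no longer matches the sliced logits
n_rows = logits.shape[0]
assert labels.shape == (n_rows,)          # AssertionError, as in the traceback above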


__all__ = ["Gemma3Preprocessor"]

@PREPROCESSOR_REGISTER.register()
Copy link
Collaborator


BTW, I am thinking of moving all model-related code to the same folder, one per model. #2338 (review)
Given the complexity of multimodal structures, I think it's worth an RFC for the overall approach and design.

Copy link
Author


Yeah, I think it is a good strategy for the Multi-modality framework.

@eric-haibin-lin eric-haibin-lin self-assigned this Jul 7, 2025
@Qsingle
Copy link
Author

Qsingle commented Jul 7, 2025

(quoting the InternVL3 training script and sequence-parallel traceback from the comment above)
Thanks for your feedback. I will try to resolve this problem.


It looks like "model_init_kwargs" isn't used.

Copy link
Author


Sorry, I forgot to use it in the current version.


Hello, I have one question here. I didn't see any monkey-patch code for the InternVL model here. Does that mean InternVL does not require custom code, or that sequence parallel is not applicable for InternVL for now?
Thanks a lot!

Copy link
Author


InternVL does not have a special design that requires monkey patching. However, the vision model of InternVL does incur a high memory cost. For example, InternVL-Chat-V1.5, a 26B model, requires about 50 GB of memory for the model parameters in BF16 format, and, considering the additional overhead during training, around 100-150 GB in total. The special requirements for the vision encoder may need some discussion.

@sailfish009
Copy link

IMHO, verl also seems to need an approach like unsloth or rl2: something simple and lightweight. I think sources like uvg are worth referring to in a limited GPU memory environment. I was able to run InternVL3-1B with batch size 1 by combining the three patches above.

@ZZYuting
Copy link

Has anyone successfully merged the trained FSDP model into a Hugging Face model? I tried using
python -m verl.model_merger merge --backend fsdp --local_dir checkpoints/xx/global_step_1/actor --target_dir /path/to/merged_hf_model, but it failed because "InternvlChat" is not supported. Any suggestions?

@Qsingle
Copy link
Author

Qsingle commented Jul 20, 2025

Has anyone successfully merged the trained FSDP model into a Hugging Face model? I tried using
python -m verl.model_merger merge --backend fsdp --local_dir checkpoints/xx/global_step_1/actor --target_dir /path/to/merged_hf_model, but it failed because "InternvlChat" is not supported. Any suggestions?

Yeah, some code modifications are necessary to provide support.

@ZZYuting
Copy link

ZZYuting commented Jul 20, 2025

I have fixed the merge problem; we need to modify verl/model_merger/base_model_merger.py:

class BaseModelMerger(ABC):
        ......
        elif "ForConditionalGeneration" in self.model_config.architectures[0]:
            return AutoModelForVision2Seq
+       elif "InternVLChatModel" in self.model_config.architectures[0]:
+           return AutoModel

        raise NotImplementedError(f"Unknown architecture {self.model_config.architectures}")
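Note that the new branch also needs AutoModel available in that file; assuming it is not already imported there, something like:

from transformers import AutoModel  # used by the new InternVLChatModel branch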

Besides, I also found that with the latest transformers, we need to modify the tokenizer in verl/utils/tokenizer.py, since the model definition differs slightly:

tokenizer.context_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.context_image_token) #for transformers >= 4.52.2
tokenizer.start_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.start_image_token) #for transformers >= 4.53.2
tokenizer.end_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.end_image_token) #for transformers >= 4.53.2

@SStoica12
Copy link

SStoica12 commented Jul 23, 2025

Thank you for your work!
I just found this, but I would like to ask some questions about your integration with InternVL. I am an aspiring researcher looking into integrating InternVL3 (and maybe 2.5 as well) into EasyR1, which is built off of verl.

  1. Why do you define your processors the way you do? For example, why do you not use the Qwen2VLImageProcessor and CLIPImageProcessor for Qwen and InternVL, respectively? You seem to have defined your own. In addition, it seems that the preprocessor for InternVL does not take the CLIPFeatureExtractor into account when you process the image or the video (my understanding is that you just extract the video or image?). Wouldn't we want to use the CLIPFeatureExtractor, since that is what InternVL's preprocessor_config.json uses: https://huggingface.co/OpenGVLab/InternVL3-2B/blob/main/preprocessor_config.json?
  2. Why did you not change the reliance of dp_actor.py and dp_critic.py on position_ids (e.g., position_ids is still passed to the ulysses_pad_and_slice_inputs function in dp_actor.py and dp_critic.py), even though InternVL does not use position ids?
  3. Why do you only add image_flags for the InternVL model in dp_actor?

Thank you very much.

@dle666
Copy link

dle666 commented Jul 23, 2025

I encountered an error while running internvl3: "Only support config type of {'deepseek_v3', 'minicpmo', 'qwen2_5_vl', 'qwen3_moe', 'qwen3', 'minicpmv', 'llama', 'qwen2', 'qwen2_vl'}, but got internvl_chat. MFU will always be zero."

Could you please provide me with some guidance?

@ZZYuting
Copy link

I encountered an error while running internvl3: "Only support config type of {'deepseek_v3', 'minicpmo', 'qwen2_5_vl', 'qwen3_moe', 'qwen3', 'minicpmv', 'llama', 'qwen2', 'qwen2_vl'}, but got internvl_chat. MFU will always be zero."

Could you please provide me with some guidance?

I think this is only a warning for printing some information such as MFU; in my case, it works well.

@Qsingle
Copy link
Author

Qsingle commented Jul 28, 2025

Thank you for your work! I just found this, but I would like to ask some questions about your integration with InternVL. I am an aspiring researcher looking into integrating InternVL3 (and maybe 2.5 as well) into EasyR1, which is built off of verl.

  1. Why do you define your processors the way you do? For example, why do you not use the Qwen2VLImageProcessor and CLIPImageProcessor for Qwen and InternVL, respectively? You seem to have defined your own. In addition, it seems that the preprocessor for InternVL does not take the CLIPFeatureExtractor into account when you process the image or the video (my understanding is that you just extract the video or image?). Wouldn't we want to use the CLIPFeatureExtractor, since that is what InternVL's preprocessor_config.json uses: https://huggingface.co/OpenGVLab/InternVL3-2B/blob/main/preprocessor_config.json?
  2. Why did you not change the reliance of dp_actor.py and dp_critic.py on position_ids (e.g., position_ids is still passed to the ulysses_pad_and_slice_inputs function in dp_actor.py and dp_critic.py), even though InternVL does not use position ids?
  3. Why do you only add image_flags for the InternVL model in dp_actor?

Thank you very much.

  1. I think the preprocessing method for different models may be different. Using a per-model preprocessor wraps the preprocess method and can be easily customised. For InternVL, the processor is based on GotOcr2ImageProcessor, and InternVLVideoProcessor was added in the latest version of transformers.
  2. The position_ids in dp_actor.py are necessary; they are important when using flash_attention.
  3. The image_flags are used as the mask for image tokens in InternVL, but in the HuggingFace version they may not be necessary. A toy illustration is below.
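A toy illustration of the masking role described in point 3 (shapes and names are illustrative only, not InternVL's exact code):

import torch

# image_flags marks which image tiles in the batch are real (1) vs. padded/absent (0),
# so vision embeddings of padded tiles are dropped before being merged into the LM inputs.
vit_embeds = torch.randn(4, 256, 1024)      # 4 image tiles, 256 patch tokens each, hidden size 1024
image_flags = torch.tensor([1, 1, 0, 1])    # the third tile is padding
real_embeds = vit_embeds[image_flags == 1]  # shape (3, 256, 1024): only real tiles are kept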

@dle666
Copy link

dle666 commented Aug 1, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

@dle666
Copy link

dle666 commented Aug 1, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

This is the error I encountered:
ValueError: Unrecognized configuration class <class 'transformers.models.gemma3.configuration_gemma3.Gemma3Config'> for this kind of AutoModel: AutoModelForVision2Seq.

@Qsingle
Copy link
Author

Qsingle commented Aug 4, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

This is the error I encountered: ValueError: Unrecognized configuration class <class 'transformers.models.gemma3.configuration_gemma3.Gemma3Config'> for this kind of AutoModel: AutoModelForVision2Seq.

You could add the following code before elif "ForConditionalGeneration" in self.model_config.architectures[0]:

elif "gemma3" in self.model_config.architectures[0].lower():
            return AutoModelForImageTextToText
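This assumes AutoModelForImageTextToText is importable in that file (it is available in recent transformers releases); if it is not imported yet, also add:

from transformers import AutoModelForImageTextToText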

@dle666
Copy link

dle666 commented Aug 4, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

This is the error I encountered: ValueError: Unrecognized configuration class <class 'transformers.models.gemma3.configuration_gemma3.Gemma3Config'> for this kind of AutoModel: AutoModelForVision2Seq.

You could add the following code before elif "ForConditionalGeneration" in self.model_config.architectures[0]:

elif "gemma3" in self.model_config.architectures[0].lower():
            return AutoModelForImageTextToText

Thank you for your reply. After I merged the weights using this method, I encountered an error when loading the model with vLLM:

ValueError: There is no module or parameter named 'lm_head' in Gemma3ForConditionalGeneration

vllm=0.8.2, transformers=4.52.2. Below is my loading script:

if __name__ == '__main__':
    from vllm import LLM, SamplingParams
    import torch
    import logging
    from transformers import AutoProcessor

    # Set up the log file
    logging.basicConfig(filename="vllm_gemma3_model_parameters.log", level=logging.INFO, format="%(message)s")

    # Load the processor
    processor = AutoProcessor.from_pretrained(
        "/workspace/images-ks3-starfs-hd/workspace/denglinger/Internvl-GRPO/verl-main/checkpoints/gemma3_4b_it/gemma3_4b_recon_e8_2025-07-31_12-00-56/global_step_60/actor/huggingface",
        padding_side="left"
    )

    # Prepare the input data
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ]
        },
        {
            "role": "user", "content": [
                {"type": "image", "url": "/workspace/images-ks3-starfs-hd/workspace/denglinger/comparison_models_linechart.png"},
                {"type": "text", "text": "What is shown in this image?"},
            ]
        },
    ]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
    ).to("cuda")

    # Configure vLLM parameters
    model_path = "/workspace/images-ks3-starfs-hd/workspace/denglinger/Internvl-GRPO/verl-main/checkpoints/gemma3_4b_it/gemma3_4b_recon_e8_2025-07-31_12-00-56/global_step_60/actor/huggingface"
    tensor_parallel_size = 1  # configure according to the actual environment
    max_model_len = 1024  # configure according to the model context length

    # Initialize the vLLM engine
    inference_engine = LLM(
        model=model_path,
        tensor_parallel_size=tensor_parallel_size,
        dtype=torch.bfloat16,
        max_model_len=max_model_len,
        trust_remote_code=True,
    )

    # Configure sampling parameters
    sampling_params = SamplingParams(
        n=1,
        max_tokens=50,  # maximum number of generated tokens
        temperature=1.0,
        top_p=0.9,
        top_k=50
    )

    # Prepare the vLLM inputs
    vllm_inputs = {
        "prompt_token_ids": inputs["input_ids"].tolist(),  # converted to list form
        "attention_mask": inputs["attention_mask"].tolist(),  # converted to list form
    }

    # Generate results with vLLM
    outputs = inference_engine.generate(
        prompts=[vllm_inputs],
        sampling_params=sampling_params,
    )

    # Decode and print the result
    decoded_output = processor.decode(outputs[0].outputs[0].token_ids, skip_special_tokens=True)
    print(decoded_output)

Copy link
Contributor

@rich-junwang rich-junwang left a comment


Thanks for the PR. It would be nice to add some unit tests for the data preprocessing part.

}
"""

PREPROCESSOR_REGISTER.register()
Copy link
Contributor


Missing @ sign here?

Comment on lines +30 to +50
def process_image(self, image, **kwargs):
    if isinstance(image, Image.Image):
        image_obj = image
    elif image.startswith("http://") or image.startswith("https://"):
        # fix memory leak issue while using BytesIO
        with requests.get(image, stream=True) as response:
            response.raise_for_status()
            with BytesIO(response.content) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    elif image.startswith("file://"):
        image_obj = Image.open(image[7:])
    elif image.startswith("data:image"):
        if "base64," in image:
            _, base64_data = image.split("base64,", 1)
            data = base64.b64decode(base64_data)
            # fix memory leak issue while using BytesIO
            with BytesIO(data) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    else:
        image_obj = Image.open(image)
    return image_obj.convert("RGB")
Copy link
Contributor


Would it be possible to create some kind of mixin class to handle the duplicated code? Something like:

class MediaProcessingMixin:
    """Mixin providing common media processing functionality"""
    
    def _process_image_from_source(self, image, **kwargs):
        """Shared image processing logic"""
        if isinstance(image, Image.Image):
            image_obj = image
        elif image.startswith("http://") or image.startswith("https://"):
            with requests.get(image, stream=True) as response:
                response.raise_for_status()
                with BytesIO(response.content) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        elif image.startswith("file://"):
            image_obj = Image.open(image[7:])
        elif image.startswith("data:image"):
            if "base64," in image:
                _, base64_data = image.split("base64,", 1)
                data = base64.b64decode(base64_data)
                with BytesIO(data) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        else:
            image_obj = Image.open(image)
        return image_obj.convert("RGB")

# Now each preprocessor can inherit from both the base class AND the mixin
class Gemma3Preprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class InternVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class KimiVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

Copy link
Author


Thanks for your advice, I will solve this.

@@ -411,14 +433,15 @@ def _build_model_optimizer(
# TODO: add more optimizer args into config
if role == "actor" and optim_config is not None:
from verl.utils.torch_functional import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup
optim_strategy = optim_config.get("strategy", "adamw")
Copy link
Contributor


This line is not used. Any particular reason we keep this? If not, it would be nice to remove it.

@@ -1134,7 +1157,8 @@ def _build_critic_model_optimizer(self, config):
enable_activation_offloading(critic_module, config.strategy, enable_gradient_checkpointing)

log_gpu_memory_usage("After critic FSDP", logger=None)

optim_strategy = config.optim.get("strategy", "adamw")
Copy link
Contributor


ditto

Copy link
Author


Okay, this line is intended to set the optimizer strategy. However, I did not include the implementation of that in the current version.
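For what it's worth, a minimal sketch of how that line could be wired up later (hypothetical, not part of this PR; the config fields and variable names are assumptions):

import torch

optim_strategy = optim_config.get("strategy", "adamw")
if optim_strategy == "adamw":
    actor_optimizer = torch.optim.AdamW(actor_module.parameters(), lr=optim_config.lr)
elif optim_strategy == "sgd":
    actor_optimizer = torch.optim.SGD(actor_module.parameters(), lr=optim_config.lr)
else:
    raise NotImplementedError(f"Unsupported optimizer strategy: {optim_strategy}")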
