
Conversation

Qsingle
Copy link

@Qsingle Qsingle commented Jul 2, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Add initial support for Gemma3

High-Level Design

Abstract the preprocessing procedure for LMMs, making it easy to add new multi-modal models.

Specific Changes

  • Add the preprocessor API in verl.utils.dataset.
  • Modify verl.utils.dataset.rl_dataset to support the preprocessor API.

API

Add a preprocessor abstraction for multi-modality models; a rough registration sketch is shown below.
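For illustration, a minimal sketch of how a new model's preprocessor would plug into the registry, inferred from the diff context shown later in this thread; the import path, registry name, and base-class API here are assumptions rather than the final interface:

from PIL import Image

# Assumed location of the new API added by this PR
from verl.utils.dataset import PREPROCESSOR_REGISTER, BasicPreprocessor


@PREPROCESSOR_REGISTER.register()
class MyVLMPreprocessor(BasicPreprocessor):
    """Hypothetical preprocessor for a new multi-modal model."""

    def process_image(self, image, **kwargs):
        # Normalize whatever the dataset provides (PIL image, local path, URL, base64)
        # into an RGB PIL image that the model's processor can consume.
        if not isinstance(image, Image.Image):
            image = Image.open(image)
        return image.convert("RGB")

The idea is that rl_dataset can then look up the registered preprocessor for a model instead of hard-coding per-model branches.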

Usage Example

Provide usage example(s) for easier usage.

set -x
ENGINE=${1:-vllm}
export NCCL_DEBUG=WARN


python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/geo3k/train.parquet \
    data.val_files=$HOME/data/geo3k/test.parquet \
    data.train_batch_size=4 \
    data.max_prompt_length=4096 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.image_key=images \
    actor_rollout_ref.model.path=google/gemma-3-4b-it \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.actor.optim.lr=2e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=$ENGINE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_colon' \
    trainer.experiment_name='gemma3_12b_it_colon' \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.save_freq=3000 \
    trainer.val_before_train=False \
    trainer.test_freq=-1 \
    trainer.total_epochs=15 $@

Test

  • To verify the model: currently, my GPU does not have enough memory to verify training of a model like Gemma3 (out of memory on the GPU), but I have checked that support for the Qwen2VL series is not broken by this PR.

The training curve for InternVL2.5-1B: [training-curve image]

The training curve for InternVL3-1B: [training-curve image]

Additional Info.

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

Qsingle and others added 2 commits July 2, 2025 17:42
add support for gemma3 and internvl for grpo training
@Weiyun1025
Copy link

2025-07-04 17:06:43,418	INFO cli.py:88 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 355, in generate_sequences
    response_attention_mask = get_response_mask(
  File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 242, in get_response_mask
    eos_mask = torch.isin(response_id, torch.tensor(eos_token, device=response_id.device)).int()
RuntimeError: Could not infer dtype of NoneType

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I encountered this error while using this PR to train InternVL3. Do you have any suggestions?
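For reference, the failure reproduces whenever eos_token_id resolves to None before reaching get_response_mask, because torch.tensor(None) cannot infer a dtype. A minimal standalone repro, independent of verl:

import torch

response_id = torch.tensor([[1, 2, 3, 4]])
eos_token = None  # e.g. generation_config.eos_token_id was never populated

# Raises: RuntimeError: Could not infer dtype of NoneType
torch.isin(response_id, torch.tensor(eos_token, device=response_id.device))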

@xylcbd
Copy link
Contributor

xylcbd commented Jul 4, 2025

LoRA training will report an error; it needs to be fixed like this:

# verl/utils/fsdp_utils.py, around line 90
default_transformer_cls_names_to_wrap = getattr(module, "_no_split_modules", None)
if re.match("internvl", module.__class__.__name__, re.IGNORECASE) or (
    module.__class__.__name__ == "PeftModelForCausalLM"
    and re.match("internvl", module.base_model.model.__class__.__name__, re.IGNORECASE)
):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "LlamaDecoderLayer":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
elif re.match("gemma3", module.__class__.__name__, re.IGNORECASE) or (
    module.__class__.__name__ == "PeftModelForCausalLM"
    and re.match("gemma3", module.base_model.model.__class__.__name__, re.IGNORECASE)
):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "SiglipMultiheadAttentionPoolingHead":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
fsdp_transformer_layer_cls_to_wrap = _get_attr(
    "transformer_layer_cls_to_wrap", default_transformer_cls_names_to_wrap
)

@Weiyun1025
Copy link

        meta_info = {
            "eos_token_id": self.generation_config.eos_token_id
            if getattr(self.generation_config, "eos_token_id", None) is not None
            else self.tokenizer.eos_token_id,
            "pad_token_id": self.generation_config.pad_token_id
            if getattr(self.generation_config, "pad_token_id", None) is not None
            else self.tokenizer.pad_token_id,
        }

Seems that fsdp_workers.py should be modified to set the correct eos_token_id when eos_token_id is not set in the generation_config.

@Qsingle
Copy link
Author

Qsingle commented Jul 5, 2025

        meta_info = {
            "eos_token_id": self.generation_config.eos_token_id
            if getattr(self.generation_config, "eos_token_id", None) is not None
            else self.tokenizer.eos_token_id,
            "pad_token_id": self.generation_config.pad_token_id
            if getattr(self.generation_config, "pad_token_id", None) is not None
            else self.tokenizer.pad_token_id,
        }

Seems that fsdp_workers.py should be modified to set the correct eos_token_id when eos_token_id is not set in the generation_config.

Could you provide the script you used to train the InternVL3?

@Weiyun1025
Copy link

ray job submit --address=${RAY_ADDRESS} \
    -- python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=[${CURRENT_PATH}/verl_data_with_gt/math_pkg_250701.json_geo3k_acc.parquet] \
    data.val_files=${CURRENT_PATH}/verl_data/geo3k/test.parquet \
    data.train_batch_size=${ROLLOUT_BATCH_SIZE} \
    data.max_prompt_length=18432 \
    data.max_response_length=32768 \
    data.filter_overlong_prompts=True \
    data.filter_overlong_prompts_workers=8 \
    data.truncation='error' \
    data.image_key=images \
    data.trust_remote_code=True \
    actor_rollout_ref.model.path=${CURRENT_PATH}/pretrained/InternVL3-1B-64K \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${MICRO_ROLLOUT_BATCH_SIZE} \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${TENSOR_PARALLEL} \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=${N_SAMPLES_PER_PROMPT} \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
    actor_rollout_ref.actor.loss_agg_mode=token-mean \
    algorithm.use_kl_in_reward=False \
    algorithm.kl_ctrl.kl_coef=0.0 \
    trainer.critic_warmup=0 \
    trainer.default_local_dir=${OUTPUT_PATH} \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name=${PROJECT_NAME} \
    trainer.experiment_name=${TASK_NAME} \
    trainer.n_gpus_per_node=${NPROC_PER_NODE} \
    trainer.nnodes=${WORLD_SIZE} \
    trainer.save_freq=20 \
    trainer.test_freq=5000 \
    trainer.val_before_train=False \
    trainer.rollout_data_dir=${OUTPUT_PATH}/rollouts \
    trainer.total_epochs=100 2>&1 | tee ${JOBLOG}

BTW, I encountered another error when setting actor_rollout_ref.actor.ulysses_sequence_parallel_size=2. It seems there are still some issues with the SP adaptation for InternVL3.

(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/ray/base.py", line 710, in func
(TaskRunner pid=5245)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/base/decorator.py", line 549, in inner
(TaskRunner pid=5245)     return func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/fsdp_workers.py", line 802, in compute_log_prob
(TaskRunner pid=5245)     output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 81, in f
(TaskRunner pid=5245)     return self.log(decorated_function, *args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 94, in log
(TaskRunner pid=5245)     output = func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 364, in compute_log_prob
(TaskRunner pid=5245)     entropy, log_probs = self._forward_micro_batch(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 197, in _forward_micro_batch
(TaskRunner pid=5245)     log_probs = logprobs_from_logits(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 87, in logprobs_from_logits
(TaskRunner pid=5245)     output = logprobs_from_logits_flash_attn(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 97, in logprobs_from_logits_flash_attn
(TaskRunner pid=5245)     output = cross_entropy_loss(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 319, in cross_entropy_loss
(TaskRunner pid=5245)     return CrossEntropyLoss.apply(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
(TaskRunner pid=5245)     return super().apply(*args, **kwargs)  # type: ignore[misc]
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 170, in forward
(TaskRunner pid=5245)     assert labels.shape == (n_rows,)
(TaskRunner pid=5245) AssertionError
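For reference, the assertion at the bottom of the trace only checks that labels has one entry per logits row. A toy illustration of the mismatch (the link to Ulysses sequence-parallel slicing is an assumption based on when the error appears):

import torch

logits = torch.randn(8, 32000)            # n_rows = 8 rows of logits after SP slicing
labels = torch.randint(0, 32000, (10,))   # label length no longer matches the sliced logits
n_rows = logits.shape[0]
assert labels.shape == (n_rows,)          # AssertionError, as in the traceback above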


__all__ = ["Gemma3Preprocessor"]

@PREPROCESSOR_REGISTER.register()
Copy link
Collaborator


BTW, I am thinking of moving all model-related code to the same folder, one per model. #2338 (review)
Given the complexity of multimodal structures, I think it's worth an RFC for the overall approach and design.

Copy link
Author


Yeah, I think it is a good strategy for the Multi-modality framework.

@eric-haibin-lin eric-haibin-lin self-assigned this Jul 7, 2025
@Qsingle
Copy link
Author

Qsingle commented Jul 7, 2025

(quoting the InternVL3 training script and sequence-parallel traceback from the comment above)
Thanks for your feedback. I will try to resolve this problem.


It looks like "model_init_kwargs" isn't used.

Copy link
Author


Sorry, I forgot to use it in the current version.


Hello, I have one question here. I didn't see any monkey-patch code for the InternVL model here. Does that mean InternVL does not require custom code, or that sequence parallel is not applicable for InternVL for now?
Thanks a lot!

Copy link
Author


InternVL does not have a special design that requires monkey patching. However, the vision model of InternVL does incur a high memory cost. For example, InternVL-Chat-V1.5, a 26B model, requires about 50 GB of memory for the model parameters in BF16 format, and, considering the additional overhead during training, around 100-150 GB in total. The special requirements for the vision encoder may need some discussion.

@sailfish009
Copy link

IMHO, verl also seems to need an approach like unsloth or rl2: something simple and lightweight. I think sources like uvg are worth referring to in a limited GPU memory environment. I was able to run InternVL3-1B with batch size 1 by combining the three patches above.

@ZZYuting
Copy link

Has anyone successfully merged the trained FSDP model into a Hugging Face model? I tried using
python -m verl.model_merger merge --backend fsdp --local_dir checkpoints/xx/global_step_1/actor --target_dir /path/to/merged_hf_model, but it failed because "InternvlChat" is not supported. Any suggestions?

@Qsingle
Copy link
Author

Qsingle commented Jul 20, 2025

Has anyone successfully merged the trained FSDP model into a Hugging Face model? I tried using
python -m verl.model_merger merge --backend fsdp --local_dir checkpoints/xx/global_step_1/actor --target_dir /path/to/merged_hf_model, but it failed because "InternvlChat" is not supported. Any suggestions?

Yeah, some code modifications are necessary to provide support.

@ZZYuting
Copy link

ZZYuting commented Jul 20, 2025

I have fixed the merge problem; we need to modify verl/model_merger/base_model_merger.py:

class BaseModelMerger(ABC):
        ......
        elif "ForConditionalGeneration" in self.model_config.architectures[0]:
            return AutoModelForVision2Seq
+       elif "InternVLChatModel" in self.model_config.architectures[0]:
+           return AutoModel

        raise NotImplementedError(f"Unknown architecture {self.model_config.architectures}")
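Note that the new branch also needs AutoModel available in that file; assuming it is not already imported there, something like:

from transformers import AutoModel  # used by the new InternVLChatModel branch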

Besides, I also found that with the latest transformers, we need to modify the tokenizer in verl/utils/tokenizer.py, since the model definition differs slightly:

tokenizer.context_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.context_image_token) #for transformers >= 4.52.2
tokenizer.start_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.start_image_token) #for transformers >= 4.53.2
tokenizer.end_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.end_image_token) #for transformers >= 4.53.2

@SStoica12
Copy link

SStoica12 commented Jul 23, 2025

Thank you for your work!
I just found this, but I would like to ask some questions about your integration with InternVL. I am an aspiring researcher looking into integrating InternVL3 (and maybe 2.5 as well) into EasyR1, which is built off of verl.

  1. Why do you define your processors the way you do? For example, why do you not use the Qwen2VLImageProcessor and CLIPImageProcessor for Qwen and InternVL, respectively? You seem to have defined your own. In addition, it seems that the preprocessor for InternVL does not take the CLIPFeatureExtractor into account when you process the image or the video (my understanding is that you just extract the video or image?). Wouldn't we want to use the CLIPFeatureExtractor, since that is what InternVL's preprocessor_config.json uses: https://huggingface.co/OpenGVLab/InternVL3-2B/blob/main/preprocessor_config.json?
  2. Why did you not change the reliance of dp_actor.py and dp_critic.py on position_ids (e.g., position_ids is still passed to the ulysses_pad_and_slice_inputs function in dp_actor.py and dp_critic.py), even though InternVL does not use position ids?
  3. Why do you only add image_flags for the InternVL model in dp_actor?

Thank you very much.

@dle666
Copy link

dle666 commented Jul 23, 2025

I encountered an error while running internvl3: "Only support config type of {'deepseek_v3', 'minicpmo', 'qwen2_5_vl', 'qwen3_moe', 'qwen3', 'minicpmv', 'llama', 'qwen2', 'qwen2_vl'}, but got internvl_chat. MFU will always be zero."

Could you please provide me with some guidance?

@ZZYuting
Copy link

I encountered an error while running internvl3: "Only support config type of {'deepseek_v3', 'minicpmo', 'qwen2_5_vl', 'qwen3_moe', 'qwen3', 'minicpmv', 'llama', 'qwen2', 'qwen2_vl'}, but got internvl_chat. MFU will always be zero."

Could you please provide me with some guidance?

I think this is only a warning for printing some information such as MFU; in my case, it works well.

@Qsingle
Copy link
Author

Qsingle commented Jul 28, 2025

Thank you for your work! I just found this, but I would like to ask some questions about your integration with InternVL. I am an aspiring researcher looking into integrating InternVL3 (and maybe 2.5 as well) into EasyR1, which is built off of verl.

  1. Why do you define your processors the way you do? For example, why do you not use the Qwen2VLImageProcessor and CLIPImageProcessor for Qwen and InternVL, respectively? You seem to have defined your own. In addition, it seems that the preprocessor for InternVL does not take the CLIPFeatureExtractor into account when you process the image or the video (my understanding is that you just extract the video or image?). Wouldn't we want to use the CLIPFeatureExtractor, since that is what InternVL's preprocessor_config.json uses: https://huggingface.co/OpenGVLab/InternVL3-2B/blob/main/preprocessor_config.json?
  2. Why did you not change the reliance of dp_actor.py and dp_critic.py on position_ids (e.g., position_ids is still passed to the ulysses_pad_and_slice_inputs function in dp_actor.py and dp_critic.py), even though InternVL does not use position ids?
  3. Why do you only add image_flags for the InternVL model in dp_actor?

Thank you very much.

  1. I think the preprocessing method for different models may be different. Using a per-model preprocessor wraps the preprocess method and can be easily customised. For InternVL, the processor is based on GotOcr2ImageProcessor, and InternVLVideoProcessor was added in the latest version of transformers.
  2. The position_ids in dp_actor.py are necessary; they are important when using flash_attention.
  3. The image_flags are used as the mask for image tokens in InternVL, but in the HuggingFace version they may not be necessary. A toy illustration is below.
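A toy illustration of the masking role described in point 3 (shapes and names are illustrative only, not InternVL's exact code):

import torch

# image_flags marks which image tiles in the batch are real (1) vs. padded/absent (0),
# so vision embeddings of padded tiles are dropped before being merged into the LM inputs.
vit_embeds = torch.randn(4, 256, 1024)      # 4 image tiles, 256 patch tokens each, hidden size 1024
image_flags = torch.tensor([1, 1, 0, 1])    # the third tile is padding
real_embeds = vit_embeds[image_flags == 1]  # shape (3, 256, 1024): only real tiles are kept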

@dle666
Copy link

dle666 commented Aug 1, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

@dle666
Copy link

dle666 commented Aug 1, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

This is the error I encountered:
ValueError: Unrecognized configuration class <class 'transformers.models.gemma3.configuration_gemma3.Gemma3Config'> for this kind of AutoModel: AutoModelForVision2Seq.

@Qsingle
Copy link
Author

Qsingle commented Aug 4, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

This is the error I encountered: ValueError: Unrecognized configuration class <class 'transformers.models.gemma3.configuration_gemma3.Gemma3Config'> for this kind of AutoModel: AutoModelForVision2Seq.

You could add the following code before elif "ForConditionalGeneration" in self.model_config.architectures[0]:

elif "gemma3" in self.model_config.architectures[0].lower():
            return AutoModelForImageTextToText
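This assumes AutoModelForImageTextToText is importable in that file (it is available in recent transformers releases); if it is not imported yet, also add:

from transformers import AutoModelForImageTextToText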

@dle666
Copy link

dle666 commented Aug 4, 2025

How do I merge the trained gemma3 weights? The current code shows Unknown architecture and I'm not sure how to modify it.

This is the error I encountered: ValueError: Unrecognized configuration class <class 'transformers.models.gemma3.configuration_gemma3.Gemma3Config'> for this kind of AutoModel: AutoModelForVision2Seq.

You could add the following code before elif "ForConditionalGeneration" in self.model_config.architectures[0]:

elif "gemma3" in self.model_config.architectures[0].lower():
            return AutoModelForImageTextToText

Thank you for your reply. After I merged the weights using this method, I encountered an error when loading the model with vLLM:

ValueError: There is no module or parameter named 'lm_head' in Gemma3ForConditionalGeneration

vllm=0.8.2, transformers=4.52.2. Below is my loading script:

if __name__ == '__main__':
    from vllm import LLM, SamplingParams
    import torch
    import logging
    from transformers import AutoProcessor

    # Set up the log file
    logging.basicConfig(filename="vllm_gemma3_model_parameters.log", level=logging.INFO, format="%(message)s")

    # Load the processor
    processor = AutoProcessor.from_pretrained(
        "/workspace/images-ks3-starfs-hd/workspace/denglinger/Internvl-GRPO/verl-main/checkpoints/gemma3_4b_it/gemma3_4b_recon_e8_2025-07-31_12-00-56/global_step_60/actor/huggingface",
        padding_side="left"
    )

    # Prepare the input data
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ]
        },
        {
            "role": "user", "content": [
                {"type": "image", "url": "/workspace/images-ks3-starfs-hd/workspace/denglinger/comparison_models_linechart.png"},
                {"type": "text", "text": "What is shown in this image?"},
            ]
        },
    ]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
    ).to("cuda")

    # Configure vLLM parameters
    model_path = "/workspace/images-ks3-starfs-hd/workspace/denglinger/Internvl-GRPO/verl-main/checkpoints/gemma3_4b_it/gemma3_4b_recon_e8_2025-07-31_12-00-56/global_step_60/actor/huggingface"
    tensor_parallel_size = 1  # configure according to the actual environment
    max_model_len = 1024  # configure according to the model context length

    # Initialize the vLLM engine
    inference_engine = LLM(
        model=model_path,
        tensor_parallel_size=tensor_parallel_size,
        dtype=torch.bfloat16,
        max_model_len=max_model_len,
        trust_remote_code=True,
    )

    # Configure sampling parameters
    sampling_params = SamplingParams(
        n=1,
        max_tokens=50,  # maximum number of generated tokens
        temperature=1.0,
        top_p=0.9,
        top_k=50
    )

    # Prepare the vLLM inputs
    vllm_inputs = {
        "prompt_token_ids": inputs["input_ids"].tolist(),  # converted to list form
        "attention_mask": inputs["attention_mask"].tolist(),  # converted to list form
    }

    # Generate results with vLLM
    outputs = inference_engine.generate(
        prompts=[vllm_inputs],
        sampling_params=sampling_params,
    )

    # Decode and print the result
    decoded_output = processor.decode(outputs[0].outputs[0].token_ids, skip_special_tokens=True)
    print(decoded_output)

Copy link
Contributor

@rich-junwang rich-junwang left a comment


Thanks for the PR. It would be nice to add some unit tests for the data preprocessing part.

}
"""

PREPROCESSOR_REGISTER.register()
Copy link
Contributor


Missing @ sign here?

Comment on lines +30 to +50
def process_image(self, image, **kwargs):
    if isinstance(image, Image.Image):
        image_obj = image
    elif image.startswith("http://") or image.startswith("https://"):
        # fix memory leak issue while using BytesIO
        with requests.get(image, stream=True) as response:
            response.raise_for_status()
            with BytesIO(response.content) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    elif image.startswith("file://"):
        image_obj = Image.open(image[7:])
    elif image.startswith("data:image"):
        if "base64," in image:
            _, base64_data = image.split("base64,", 1)
            data = base64.b64decode(base64_data)
            # fix memory leak issue while using BytesIO
            with BytesIO(data) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    else:
        image_obj = Image.open(image)
    return image_obj.convert("RGB")
Copy link
Contributor


Would it be possible to create some kind of mixin class to handle the duplicated code? Something like:

class MediaProcessingMixin:
    """Mixin providing common media processing functionality"""
    
    def _process_image_from_source(self, image, **kwargs):
        """Shared image processing logic"""
        if isinstance(image, Image.Image):
            image_obj = image
        elif image.startswith("http://") or image.startswith("https://"):
            with requests.get(image, stream=True) as response:
                response.raise_for_status()
                with BytesIO(response.content) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        elif image.startswith("file://"):
            image_obj = Image.open(image[7:])
        elif image.startswith("data:image"):
            if "base64," in image:
                _, base64_data = image.split("base64,", 1)
                data = base64.b64decode(base64_data)
                with BytesIO(data) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        else:
            image_obj = Image.open(image)
        return image_obj.convert("RGB")

# Now each preprocessor can inherit from both the base class AND the mixin
class Gemma3Preprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class InternVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class KimiVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

Copy link
Author


Thanks for your advice, I will solve this.

@@ -411,14 +433,15 @@ def _build_model_optimizer(
# TODO: add more optimizer args into config
if role == "actor" and optim_config is not None:
from verl.utils.torch_functional import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup
optim_strategy = optim_config.get("strategy", "adamw")
Copy link
Contributor


This line is not used. Any particular reason we keep this? If not, it would be nice to remove it.

@@ -1134,7 +1157,8 @@ def _build_critic_model_optimizer(self, config):
enable_activation_offloading(critic_module, config.strategy, enable_gradient_checkpointing)

log_gpu_memory_usage("After critic FSDP", logger=None)

optim_strategy = config.optim.get("strategy", "adamw")
Copy link
Contributor


ditto

Copy link
Author


Okay, this line is intended to set the optimizer strategy. However, I did not include the implementation of that in the current version.
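For what it's worth, a minimal sketch of how that line could be wired up later (hypothetical, not part of this PR; the config fields and variable names are assumptions):

import torch

optim_strategy = optim_config.get("strategy", "adamw")
if optim_strategy == "adamw":
    actor_optimizer = torch.optim.AdamW(actor_module.parameters(), lr=optim_config.lr)
elif optim_strategy == "sgd":
    actor_optimizer = torch.optim.SGD(actor_module.parameters(), lr=optim_config.lr)
else:
    raise NotImplementedError(f"Unsupported optimizer strategy: {optim_strategy}")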
