Add support for more VLMs (Gemma3 and InternVL) #2327
Conversation
Add support for Gemma3 and InternVL for GRPO training.
2025-07-04 17:06:43,418 INFO cli.py:88 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 355, in generate_sequences
response_attention_mask = get_response_mask(
File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 242, in get_response_mask
eos_mask = torch.isin(response_id, torch.tensor(eos_token, device=response_id.device)).int()
RuntimeError: Could not infer dtype of NoneType
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I encountered this error while using this PR to train InternVL3. Do you have any suggestions?
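For context, `torch.tensor(None)` is exactly what raises `Could not infer dtype of NoneType`, so the failure points at `eos_token` being `None` when the response mask is built. A minimal sketch of the behaviour, and of the tokenizer fallback suggested later in this thread:

```python
import torch

response_id = torch.tensor([[5, 7, 2, 0, 0]])

# With a valid eos id the mask works as expected:
eos_mask = torch.isin(response_id, torch.tensor(2, device=response_id.device)).int()
print(eos_mask)  # tensor([[0, 0, 1, 0, 0]], dtype=torch.int32)

# With eos_token=None, torch.tensor(None) raises:
#   RuntimeError: Could not infer dtype of NoneType
# hence the suggestion below to fall back to tokenizer.eos_token_id when the
# generation config does not define it.
```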
LoRA training will report an error, and it needs to be fixed like this (in verl/utils/fsdp_utils.py, around line 90):

default_transformer_cls_names_to_wrap = getattr(module, "_no_split_modules", None)
if re.match("internvl", module.__class__.__name__, re.IGNORECASE) or (module.__class__.__name__ == "PeftModelForCausalLM" and re.match("internvl", module.base_model.model.__class__.__name__, re.IGNORECASE)):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "LlamaDecoderLayer":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
elif re.match("gemma3", module.__class__.__name__, re.IGNORECASE) or (module.__class__.__name__ == "PeftModelForCausalLM" and re.match("gemma3", module.base_model.model.__class__.__name__, re.IGNORECASE)):
    update_cls_names_to_wrap = []
    for mod in default_transformer_cls_names_to_wrap:
        if mod != "SiglipMultiheadAttentionPoolingHead":
            update_cls_names_to_wrap.append(mod)
    default_transformer_cls_names_to_wrap = update_cls_names_to_wrap
fsdp_transformer_layer_cls_to_wrap = _get_attr(
    "transformer_layer_cls_to_wrap", default_transformer_cls_names_to_wrap
)
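The same idea could be expressed more declaratively with a per-model table of layer classes to exclude from FSDP wrapping. This is only a hedged sketch; the names `EXCLUDED_WRAP_CLASSES` and `_filter_wrap_classes` are hypothetical and not part of the PR:

```python
import re

# Hypothetical mapping: model-family pattern -> transformer classes that should
# not be FSDP-wrapped for that family (assumption, mirroring the fix above).
EXCLUDED_WRAP_CLASSES = {
    r"internvl": {"LlamaDecoderLayer"},
    r"gemma3": {"SiglipMultiheadAttentionPoolingHead"},
}

def _filter_wrap_classes(model_cls_name: str, default_cls_names):
    """Drop excluded wrap-class names for the matching model family, if any."""
    for pattern, excluded in EXCLUDED_WRAP_CLASSES.items():
        if re.match(pattern, model_cls_name, re.IGNORECASE):
            return [name for name in default_cls_names if name not in excluded]
    return list(default_cls_names)
```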
meta_info = {
    "eos_token_id": self.generation_config.eos_token_id
    if getattr(self.generation_config, "eos_token_id", None) is not None
    else self.tokenizer.eos_token_id,
    "pad_token_id": self.generation_config.pad_token_id
    if getattr(self.generation_config, "pad_token_id", None) is not None
    else self.tokenizer.pad_token_id,
}

Seems that the generation config does not define `eos_token_id`/`pad_token_id` for this model, so falling back to the tokenizer's values avoids the `NoneType` error above.
Could you provide the script you used to train InternVL3?
ray job submit --address=${RAY_ADDRESS} \
-- python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=[${CURRENT_PATH}/verl_data_with_gt/math_pkg_250701.json_geo3k_acc.parquet] \
data.val_files=${CURRENT_PATH}/verl_data/geo3k/test.parquet \
data.train_batch_size=${ROLLOUT_BATCH_SIZE} \
data.max_prompt_length=18432 \
data.max_response_length=32768 \
data.filter_overlong_prompts=True \
data.filter_overlong_prompts_workers=8 \
data.truncation='error' \
data.image_key=images \
data.trust_remote_code=True \
actor_rollout_ref.model.path=${CURRENT_PATH}/pretrained/InternVL3-1B-64K \
actor_rollout_ref.model.trust_remote_code=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.kl_loss_coef=0.0 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${MICRO_ROLLOUT_BATCH_SIZE} \
actor_rollout_ref.rollout.tensor_model_parallel_size=${TENSOR_PARALLEL} \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.free_cache_engine=False \
actor_rollout_ref.rollout.n=${N_SAMPLES_PER_PROMPT} \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${MICRO_TRAIN_BATCH_SIZE} \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${SEQUENCE_PARALLEL} \
actor_rollout_ref.actor.loss_agg_mode=token-mean \
algorithm.use_kl_in_reward=False \
algorithm.kl_ctrl.kl_coef=0.0 \
trainer.critic_warmup=0 \
trainer.default_local_dir=${OUTPUT_PATH} \
trainer.logger=['console','tensorboard'] \
trainer.project_name=${PROJECT_NAME} \
trainer.experiment_name=${TASK_NAME} \
trainer.n_gpus_per_node=${NPROC_PER_NODE} \
trainer.nnodes=${WORLD_SIZE} \
trainer.save_freq=20 \
trainer.test_freq=5000 \
trainer.val_before_train=False \
trainer.rollout_data_dir=${OUTPUT_PATH}/rollouts \
    trainer.total_epochs=100 2>&1 | tee ${JOBLOG}

BTW, I encountered another error when setting

(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/ray/base.py", line 710, in func
(TaskRunner pid=5245)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/single_controller/base/decorator.py", line 549, in inner
(TaskRunner pid=5245)     return func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/fsdp_workers.py", line 802, in compute_log_prob
(TaskRunner pid=5245)     output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 81, in f
(TaskRunner pid=5245)     return self.log(decorated_function, *args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/debug/performance.py", line 94, in log
(TaskRunner pid=5245)     output = func(*args, **kwargs)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 364, in compute_log_prob
(TaskRunner pid=5245)     entropy, log_probs = self._forward_micro_batch(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/workers/actor/dp_actor.py", line 197, in _forward_micro_batch
(TaskRunner pid=5245)     log_probs = logprobs_from_logits(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 87, in logprobs_from_logits
(TaskRunner pid=5245)     output = logprobs_from_logits_flash_attn(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/workspace_wwy/verl-internvl/verl/utils/torch_functional.py", line 97, in logprobs_from_logits_flash_attn
(TaskRunner pid=5245)     output = cross_entropy_loss(logits, labels, inplace_backward=inplace_backward)
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 319, in cross_entropy_loss
(TaskRunner pid=5245)     return CrossEntropyLoss.apply(
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
(TaskRunner pid=5245)     return super().apply(*args, **kwargs)  # type: ignore[misc]
(TaskRunner pid=5245)   File "/cpfs01/user/wangweiyun/miniconda3/envs/verl-qwenvl/lib/python3.10/site-packages/flash_attn/ops/triton/cross_entropy.py", line 170, in forward
(TaskRunner pid=5245)     assert labels.shape == (n_rows,)
(TaskRunner pid=5245) AssertionError
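For what it's worth, the assertion in flash-attn's Triton cross-entropy appears to check only a shape contract: logits flattened to `(n_rows, vocab)` and labels to `(n_rows,)` with matching `n_rows`. A minimal sketch of that contract using a plain PyTorch fallback, not the PR's code and not a claim about the root cause here:

```python
import torch
import torch.nn.functional as F

def logprobs_from_logits_naive(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token log-probs; both inputs must be flattened so their first dims agree."""
    n_rows, _vocab = logits.shape
    assert labels.shape == (n_rows,), f"labels {labels.shape} vs logits rows {n_rows}"
    logp = F.log_softmax(logits.float(), dim=-1)
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

logits = torch.randn(2 * 5, 32)          # (batch * seq_len, vocab)
labels = torch.randint(0, 32, (2 * 5,))  # must match the flattened row count
print(logprobs_from_logits_naive(logits, labels).shape)  # torch.Size([10])
```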
__all__ = ["Gemma3Preprocessor"]

@PREPROCESSOR_REGISTER.register()
BTW, I am thinking of moving all model-related code into the same folder, one folder per model. #2338 (review)
Given the complexity of multimodal structures, I think it's worth an RFC for the overall approach and design.
Yeah, I think it is a good strategy for the multi-modality framework.
Thanks for your feedback. I will try to resolve this problem.
It looks like `model_init_kwargs` isn't used.
Sorry, I forgot to use it in the current version.
Hello, I have one question here. I didn't see any monkey-patch code for the InternVL model. Does that mean InternVL does not require custom code, or that sequence parallel is not applicable for InternVL now?
Thanks a lot!
InternVL does not have a special design that requires monkey patching. However, the vision model of InternVL does incur a high memory cost. For example, InternVL-Chat-V1.5, a 26B model, requires about 50 GB of memory just for model parameters in BF16, and considering the additional overhead during training, it needs around 100-150 GB. The special requirements of the vision encoder may need some discussion.
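A rough back-of-the-envelope check of the parameter-memory figure (the training-overhead range depends on optimizer states, gradients, activations, and sharding, so it is only an estimate):

```python
# BF16 stores 2 bytes per parameter.
params = 26e9                          # InternVL-Chat-V1.5, ~26B parameters
print(f"{params * 2 / 1e9:.0f} GB")    # ~52 GB just for the BF16 weights

# Gradients and optimizer states add several more bytes per parameter, so the
# total training footprint is a multiple of the raw weight size, which is
# consistent with the higher training estimate mentioned above.
```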
Has anyone successfully merged the trained FSDP model into a Hugging Face model? I tried using
Yeah, some code modifications are necessary to provide support.
I have fixed the merge problem; we need to modify the merger:

class BaseModelMerger(ABC):
    ......
        elif "ForConditionalGeneration" in self.model_config.architectures[0]:
            return AutoModelForVision2Seq
+       elif "InternVLChatModel" in self.model_config.architectures[0]:
+           return AutoModel
        raise NotImplementedError(f"Unknown architecture {self.model_config.architectures}")

Besides, I also find that with the latest transformers we need to modify the tokenizer:

tokenizer.context_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.context_image_token)  # for transformers >= 4.52.2
tokenizer.start_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.start_image_token)  # for transformers >= 4.53.2
tokenizer.end_image_token_id = tokenizer.convert_tokens_to_ids(tokenizer.end_image_token)  # for transformers >= 4.53.2
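As a follow-up, once the merger returns `AutoModel` for `InternVLChatModel`, the merged checkpoint can presumably be loaded via the usual remote-code path. A hedged loading sketch; the local path is a placeholder, not from the thread:

```python
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "/path/to/merged_internvl3_checkpoint"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True, torch_dtype=torch.bfloat16)
```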
Thank you for your work!
Thank you very much.
I encountered an error while running InternVL3: "Only support config type of {'deepseek_v3', 'minicpmo', 'qwen2_5_vl', 'qwen3_moe', 'qwen3', 'minicpmv', 'llama', 'qwen2', 'qwen2_vl'}, but got internvl_chat. MFU will always be zero." Could you please provide me with some guidance?
I think this is only a warning printed for information such as MFU; in my case, it works well.
How do I merge the trained Gemma3 weights? The current code reports "Unknown architecture" and I'm not sure how to modify it.
This is the error I encountered:
You could add the following code before
Thank you for your reply. After I merged the weights using this method, I encountered an error when loading the model with vLLM: ValueError: There is no module or parameter named 'lm_head' in Gemma3ForConditionalGeneration (vllm=0.8.2, transformers=4.52.2). Below is my loading script:

if __name__ == '__main__':
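Not an answer, but one way to narrow this down is to check whether the merged checkpoint contains an `lm_head` tensor that `Gemma3ForConditionalGeneration` in vLLM does not expect. A hedged diagnostic sketch using safetensors; the checkpoint path is a placeholder:

```python
import glob
from safetensors import safe_open

ckpt_dir = "/path/to/merged_gemma3_checkpoint"  # placeholder path
for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        hits = [k for k in f.keys() if "lm_head" in k]
        if hits:
            print(shard, hits)
```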
Thanks for the PR. It would be nice to add some unit tests for the data preprocessing part.
}
"""

PREPROCESSOR_REGISTER.register()
Missing @ sign here?
def process_image(self, image, **kwargs):
    if isinstance(image, Image.Image):
        image_obj = image
    elif image.startswith("http://") or image.startswith("https://"):
        # fix memory leak issue while using BytesIO
        with requests.get(image, stream=True) as response:
            response.raise_for_status()
            with BytesIO(response.content) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    elif image.startswith("file://"):
        image_obj = Image.open(image[7:])
    elif image.startswith("data:image"):
        if "base64," in image:
            _, base64_data = image.split("base64,", 1)
            data = base64.b64decode(base64_data)
            # fix memory leak issue while using BytesIO
            with BytesIO(data) as bio:
                image_obj = copy.deepcopy(Image.open(bio))
    else:
        image_obj = Image.open(image)
    return image_obj.convert("RGB")
Would it be possible to create some kind of mixin class to handle the duplicate code? Something like:

import base64
import copy
from io import BytesIO

import requests
from PIL import Image

class MediaProcessingMixin:
    """Mixin providing common media processing functionality"""

    def _process_image_from_source(self, image, **kwargs):
        """Shared image processing logic"""
        if isinstance(image, Image.Image):
            image_obj = image
        elif image.startswith("http://") or image.startswith("https://"):
            with requests.get(image, stream=True) as response:
                response.raise_for_status()
                with BytesIO(response.content) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        elif image.startswith("file://"):
            image_obj = Image.open(image[7:])
        elif image.startswith("data:image"):
            if "base64," in image:
                _, base64_data = image.split("base64,", 1)
                data = base64.b64decode(base64_data)
                with BytesIO(data) as bio:
                    image_obj = copy.deepcopy(Image.open(bio))
        else:
            image_obj = Image.open(image)
        return image_obj.convert("RGB")

# Now each preprocessor can inherit from both the base class AND the mixin
class Gemma3Preprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class InternVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)

class KimiVLPreprocessor(BasicPreprocessor, MediaProcessingMixin):
    def process_image(self, image, **kwargs):
        return self._process_image_from_source(image, **kwargs)
Thanks for your advice, I will solve this.
@@ -411,14 +433,15 @@ def _build_model_optimizer(
    # TODO: add more optimizer args into config
    if role == "actor" and optim_config is not None:
        from verl.utils.torch_functional import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup
        optim_strategy = optim_config.get("strategy", "adamw")
This line is not used. Any particular reason we keep this? If not, it would be nice to remove it.
@@ -1134,7 +1157,8 @@ def _build_critic_model_optimizer(self, config):
    enable_activation_offloading(critic_module, config.strategy, enable_gradient_checkpointing)
    log_gpu_memory_usage("After critic FSDP", logger=None)
    optim_strategy = config.optim.get("strategy", "adamw")
ditto
Okay, this line is meant to select the optimizer strategy. However, I did not include the implementation in the current version.
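For what a future implementation might look like, here is a hedged sketch of an optimizer-strategy selector; the mapping and the `build_optimizer` name are hypothetical, not part of this PR:

```python
import torch

# Hypothetical mapping from config strategy string to an optimizer class.
OPTIMIZER_STRATEGIES = {
    "adamw": torch.optim.AdamW,
    "adam": torch.optim.Adam,
    "sgd": torch.optim.SGD,
}

def build_optimizer(params, optim_config):
    """Pick the optimizer class from config; defaults to AdamW."""
    strategy = optim_config.get("strategy", "adamw")
    try:
        optim_cls = OPTIMIZER_STRATEGIES[strategy]
    except KeyError as e:
        raise NotImplementedError(f"Unknown optimizer strategy: {strategy}") from e
    return optim_cls(params, lr=optim_config.get("lr", 1e-6))
```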
Checklist Before Starting
What does this PR do?
High-Level Design
Specific Changes
API
Usage Example
Test
The training curve for InternVL2.5-1B

The training curve for InternVL3-1B

Additional Info.
Checklist Before Submitting
Add [BREAKING] to the PR title if it breaks any API.