Conversation

@Gaiejj (Member) commented Mar 11, 2025

Description

This PR includes the following changes:

  1. Added support for resuming training from a checkpoint for algorithms other than PPO.
  2. Cleaned up parts of the VLA code using pre-commit.

Usage details:

  1. Enable save_checkpoint so that both the optimizer states and the model weights are saved:
# Execute deepspeed command
deepspeed \
     --master_port ${MASTER_PORT} \
     --module align_anything.trainers.text_image_to_text.dpo \
     ...
     --save_checkpoint True \
     --epochs 2

We set the default value of save_checkpoint to True so that new users, who may not be aware of this option, do not have to train again from scratch. (A minimal sketch of how these flags might map onto DeepSpeed's checkpoint API appears after step 2 below.)

  2. Then resume training by pointing the model path at the saved checkpoint:
MODEL_NAME_OR_PATH="/PATH/TO/YOUR/CKPT/slice_100" # for example

# Execute deepspeed command
deepspeed \
     --master_port ${MASTER_PORT} \
     --module align_anything.trainers.text_image_to_text.dpo \
     --model_name_or_path ${MODEL_NAME_OR_PATH} \
     ...
     --load_checkpoint True \
     --epochs 2
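
For readers curious how these flags typically work under the hood, here is a minimal sketch of how save_checkpoint and load_checkpoint might map onto DeepSpeed's checkpoint API. The cfgs fields, the slice_<step> tag naming, and the helper functions below are illustrative assumptions, not the actual align_anything trainer code.

# Illustrative sketch only, not the align_anything implementation.
from deepspeed import DeepSpeedEngine


def maybe_save(engine: DeepSpeedEngine, cfgs, global_step: int) -> None:
    """Persist model weights and optimizer states so training can resume later."""
    if not cfgs.save_checkpoint:
        return
    # client_state is stored alongside the checkpoint and handed back on load,
    # so counters such as the global step can be restored.
    engine.save_checkpoint(
        cfgs.output_dir,
        tag=f'slice_{global_step}',  # e.g. .../CKPT/slice_100, as in the usage example above
        client_state={'global_step': global_step},
    )


def maybe_resume(engine: DeepSpeedEngine, cfgs) -> int:
    """Restore engine state (model + optimizer) and return the step to resume from."""
    if not cfgs.load_checkpoint:
        return 0
    # Assumes the given directory follows DeepSpeed's checkpoint layout
    # (i.e. it contains the tag folders and the `latest` file).
    load_path, client_state = engine.load_checkpoint(cfgs.model_name_or_path)
    if load_path is None:  # no checkpoint found at the given path
        return 0
    return int(client_state.get('global_step', 0))

The key point is that save_checkpoint stores the optimizer states in addition to the model weights, which is what makes true resumption possible rather than merely fine-tuning from saved weights.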

Motivation and Context

Resolves #150

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@Gaiejj added the enhancement (New feature or request) label on Mar 11, 2025
@XuyaoWang (Collaborator) left a comment

Only some of the .yaml files include the save_checkpoint and load_checkpoint parameters; for example, align_anything/configs/train/text_to_text/dpo.yaml does not have these two hyperparameters. It would be best to add them to the YAML files of all modalities that support resuming from checkpoints.

@XuyaoWang (Collaborator) commented

Many YAML files are still missing the save_checkpoint and load_checkpoint parameters, for example:

  1. align_anything/configs/train/text_audio_to_text/dpo.yaml
  2. align_anything/configs/train/text_to_text/rm.yaml
  3. align_anything/configs/train/text_to_text/rm_score.yaml

Additionally, do algorithms like PPO and GRPO support checkpoint resumption? I noticed that align_anything/configs/train/text_image_to_text/ppo.yaml has added checkpoint resumption parameters, but many other PPO files, such as:

  1. align_anything/configs/train/text_audio_to_text/ppo.yaml
  2. align_anything/configs/train/text_to_text/ppo.yaml
  3. align_anything/configs/train/text_to_text/grpo.yaml

don't have checkpoint resumption parameters. These are just partial results from my checks, and some files may not be listed here.
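
As a quick, hypothetical way to audit this (not part of the PR; the config root path and the simple substring check are assumptions), a short script could list the training configs that are still missing either key:

# Hypothetical audit helper, not included in this PR: lists training configs
# under an assumed directory layout that do not mention either parameter.
from pathlib import Path

REQUIRED_KEYS = ('save_checkpoint', 'load_checkpoint')


def find_missing(config_root: str = 'align_anything/configs/train') -> None:
    for path in sorted(Path(config_root).rglob('*.yaml')):
        text = path.read_text()
        # A substring check suffices here, since we only care whether the
        # keys appear anywhere in the file.
        missing = [key for key in REQUIRED_KEYS if key not in text]
        if missing:
            print(f"{path}: missing {', '.join(missing)}")


if __name__ == '__main__':
    find_missing()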

@Gaiejj Gaiejj closed this Mar 12, 2025
@Gaiejj Gaiejj reopened this Mar 12, 2025
@Gaiejj (Member, Author) commented Mar 12, 2025

Thanks @XuyaoWang, I have addressed those comments.

@XuyaoWang (Collaborator) commented

> Thanks @XuyaoWang, I have addressed those comments.

It seems that PPO checkpoint resumption is not currently supported, yet the latest commit added checkpoint resumption parameters to align_anything/configs/train/text_audio_to_text/ppo.yaml, which appears to be an erroneous modification.

@XuyaoWang (Collaborator) left a comment

LGTM.

@cby-pku (Contributor) left a comment

LGTM.

@Gaiejj merged commit 80c09b6 into PKU-Alignment:main on Mar 13, 2025
LichenLillc pushed a commit to LichenLillc/align-anything that referenced this pull request May 29, 2025
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:
[Feature Request] Add Support for Resuming Training from Checkpoints

3 participants