Conversation

@Gaiejj (Member) commented Mar 11, 2025

Description

This PR includes the following changes:

  1. Added support for resuming training from a checkpoint for algorithms other than PPO.
  2. Cleaned up parts of the VLA code using pre-commit.

Usage details:

  1. Enable save_checkpoint so that both the optimizer states and the model weights are saved:
# Execute deepspeed command
deepspeed \
     --master_port ${MASTER_PORT} \
     --module align_anything.trainers.text_image_to_text.dpo \
     ...
     --save_checkpoint True \
     --epochs 2

We set the default value of save_checkpoint to True so that new users, who may not be aware of this option, do not have to train again from scratch. (A minimal sketch of how these flags might map onto DeepSpeed's checkpoint API appears after step 2 below.)

  2. Then resume training by pointing the model path at the saved checkpoint:
MODEL_NAME_OR_PATH="/PATH/TO/YOUR/CKPT/slice_100" # for example

# Execute deepspeed command
deepspeed \
     --master_port ${MASTER_PORT} \
     --module align_anything.trainers.text_image_to_text.dpo \
     --model_name_or_path ${MODEL_NAME_OR_PATH} \
     ...
     --load_checkpoint True \
     --epochs 2
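
For readers curious how these flags typically work under the hood, here is a minimal sketch of how save_checkpoint and load_checkpoint might map onto DeepSpeed's checkpoint API. The cfgs fields, the slice_<step> tag naming, and the helper functions below are illustrative assumptions, not the actual align_anything trainer code.

# Illustrative sketch only, not the align_anything implementation.
from deepspeed import DeepSpeedEngine


def maybe_save(engine: DeepSpeedEngine, cfgs, global_step: int) -> None:
    """Persist model weights and optimizer states so training can resume later."""
    if not cfgs.save_checkpoint:
        return
    # client_state is stored alongside the checkpoint and handed back on load,
    # so counters such as the global step can be restored.
    engine.save_checkpoint(
        cfgs.output_dir,
        tag=f'slice_{global_step}',  # e.g. .../CKPT/slice_100, as in the usage example above
        client_state={'global_step': global_step},
    )


def maybe_resume(engine: DeepSpeedEngine, cfgs) -> int:
    """Restore engine state (model + optimizer) and return the step to resume from."""
    if not cfgs.load_checkpoint:
        return 0
    # Assumes the given directory follows DeepSpeed's checkpoint layout
    # (i.e. it contains the tag folders and the `latest` file).
    load_path, client_state = engine.load_checkpoint(cfgs.model_name_or_path)
    if load_path is None:  # no checkpoint found at the given path
        return 0
    return int(client_state.get('global_step', 0))

The key point is that save_checkpoint stores the optimizer states in addition to the model weights, which is what makes true resumption possible rather than merely fine-tuning from saved weights.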

Motivation and Context

Resolves #150

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@Gaiejj added the enhancement (New feature or request) label on Mar 11, 2025
@XuyaoWang (Collaborator) left a comment

Only some of the .yaml files include the save_checkpoint and load_checkpoint parameters; for example, align_anything/configs/train/text_to_text/dpo.yaml does not have these two hyperparameters. It would be best to add them to the YAML files of all modalities that support resuming from checkpoints.

@XuyaoWang (Collaborator) commented

Many YAML files are still missing the save_checkpoint and load_checkpoint parameters, for example:

  1. align_anything/configs/train/text_audio_to_text/dpo.yaml
  2. align_anything/configs/train/text_to_text/rm.yaml
  3. align_anything/configs/train/text_to_text/rm_score.yaml

Additionally, do algorithms like PPO and GRPO support checkpoint resumption? I noticed that align_anything/configs/train/text_image_to_text/ppo.yaml has added checkpoint resumption parameters, but many other PPO files, such as:

  1. align_anything/configs/train/text_audio_to_text/ppo.yaml
  2. align_anything/configs/train/text_to_text/ppo.yaml
  3. align_anything/configs/train/text_to_text/grpo.yaml

don't have checkpoint resumption parameters. These are just partial results from my checks, and some files may not be listed here.
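
As a quick, hypothetical way to audit this (not part of the PR; the config root path and the simple substring check are assumptions), a short script could list the training configs that are still missing either key:

# Hypothetical audit helper, not included in this PR: lists training configs
# under an assumed directory layout that do not mention either parameter.
from pathlib import Path

REQUIRED_KEYS = ('save_checkpoint', 'load_checkpoint')


def find_missing(config_root: str = 'align_anything/configs/train') -> None:
    for path in sorted(Path(config_root).rglob('*.yaml')):
        text = path.read_text()
        # A substring check suffices here, since we only care whether the
        # keys appear anywhere in the file.
        missing = [key for key in REQUIRED_KEYS if key not in text]
        if missing:
            print(f"{path}: missing {', '.join(missing)}")


if __name__ == '__main__':
    find_missing()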

@Gaiejj Gaiejj closed this Mar 12, 2025
@Gaiejj Gaiejj reopened this Mar 12, 2025
@Gaiejj (Member, Author) commented Mar 12, 2025

Thanks @XuyaoWang, I have addressed those comments.

@XuyaoWang (Collaborator) commented

> Thanks @XuyaoWang, I have addressed those comments.

It seems that PPO checkpoint resumption is not currently supported, yet the latest commit added checkpoint resumption parameters to align_anything/configs/train/text_audio_to_text/ppo.yaml, which appears to be an erroneous modification.

@XuyaoWang (Collaborator) left a comment

LGTM.

@cby-pku (Contributor) left a comment

LGTM.

@Gaiejj merged commit 80c09b6 into PKU-Alignment:main on Mar 13, 2025
LichenLillc pushed a commit to LichenLillc/align-anything that referenced this pull request May 29, 2025
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:
[Feature Request] Add Support for Resuming Training from Checkpoints

3 participants