Ensure that a checkpoint is saved on the final training epoch #499
This PR addresses an issue where the previous version only wrote a checkpoint when the epoch number was an exact multiple of the checkpoint frequency. If the total epoch count is not an exact multiple of the checkpoint frequency, no checkpoint is written for the final epoch and the last n epochs' worth of training is lost.
e.g. Log from a run where `max_epoch = 199` (i.e. exactly 200 epochs) and `save_ckpt_freq = 5`. Note that the final checkpoint written belongs to Epoch 195, the Test phase at the end of the log loads Epoch 195, and work for Epochs 196…199 is lost.
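The corrected save condition can be sketched as follows. This is a minimal illustration, not the actual training-loop code from this repository: the helper name `should_save_checkpoint` is hypothetical, while `save_ckpt_freq` and the epoch counting follow the logs quoted in this PR.

```python
def should_save_checkpoint(epoch: int, max_epochs: int, save_ckpt_freq: int) -> bool:
    """Decide whether to write a checkpoint for this epoch.

    The previous behavior checked only `epoch % save_ckpt_freq == 0`, so
    when max_epochs was not a multiple of save_ckpt_freq the final epoch
    was never checkpointed. Adding `epoch == max_epochs` guarantees a
    checkpoint on the last epoch regardless of the frequency.
    """
    return epoch % save_ckpt_freq == 0 or epoch == max_epochs
```

Under this condition, a run with `max_epochs = 199` and `save_ckpt_freq = 5` saves at Epochs 0, 5, …, 195 *and* 199, so no training is lost before the Test phase.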
This PR resolves the issue by always saving a checkpoint on the final epoch, as per the sample log below (`max_epochs = 2`, `save_ckpt_freq = 5`). A checkpoint is written at Epoch 0 and Epoch 2, and the Epoch 2 checkpoint is loaded for the Test phase. This change is