Ensure that a checkpoint is saved on the final training epoch #499
This PR addresses an issue where the previous version only wrote a checkpoint when the epoch number was an exact multiple of the checkpoint frequency. If the total epoch count is not an exact multiple of the checkpoint frequency, no checkpoint is written for the final epoch and the last n epochs' worth of training is lost.
e.g. Log from a run where `max_epoch = 199` (i.e. exactly 200 epochs) and `save_ckpt_freq = 5`. Note that the final checkpoint written belongs to Epoch 195, the Test phase at the end of the log loads Epoch 195, and work for Epochs 196…199 is lost.
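The corrected save condition can be sketched as follows. This is a minimal illustration, not the actual training-loop code from this repository: the helper name `should_save_checkpoint` is hypothetical, while `save_ckpt_freq` and the epoch counting follow the logs quoted in this PR.

```python
def should_save_checkpoint(epoch: int, max_epochs: int, save_ckpt_freq: int) -> bool:
    """Decide whether to write a checkpoint for this epoch.

    The previous behavior checked only `epoch % save_ckpt_freq == 0`, so
    when max_epochs was not a multiple of save_ckpt_freq the final epoch
    was never checkpointed. Adding `epoch == max_epochs` guarantees a
    checkpoint on the last epoch regardless of the frequency.
    """
    return epoch % save_ckpt_freq == 0 or epoch == max_epochs
```

Under this condition, a run with `max_epochs = 199` and `save_ckpt_freq = 5` saves at Epochs 0, 5, …, 195 *and* 199, so no training is lost before the Test phase.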
This PR resolves the issue by always saving a checkpoint on the final epoch, as per the sample log below (`max_epochs = 2`, `save_ckpt_freq = 5`). A checkpoint is written at Epoch 0 and Epoch 2, and the Epoch 2 checkpoint is loaded for the Test phase. This change is