@ntw-au commented Mar 25, 2022

This PR addresses an issue where the previous version only wrote a checkpoint when the epoch number was exactly divisible by the checkpoint frequency. If the final epoch number is not a multiple of the checkpoint frequency, no checkpoint is written for the final epoch, and the training done since the last saved checkpoint is lost.

For example, the log below is from a run with max_epoch = 199 (exactly 200 epochs, numbered 0 to 199) and save_ckpt_freq = 5.
Note that the last checkpoint written is for Epoch 195, the Test phase at the end of the log loads that Epoch 195 checkpoint, and the work for Epochs 196…199 is lost.

=== EPOCH 195/199 ===
Loss train: 0.073  eval: 0.069
Mean acc train: 0.792  eval: 0.780
Mean IoU train: 0.749  eval: 0.731
Epoch 195: save ckpt to /opt/ml/processing/logs/RandLANet_Las_torch/checkpoint
=== EPOCH 196/199 ===
Loss train: 0.076  eval: 0.073
Mean acc train: 0.792  eval: 0.781
Mean IoU train: 0.748  eval: 0.728
=== EPOCH 197/199 ===
Loss train: 0.076  eval: 0.084
Mean acc train: 0.792  eval: 0.776
Mean IoU train: 0.749  eval: 0.729
=== EPOCH 198/199 ===
Loss train: 0.075  eval: 0.101
Mean acc train: 0.794  eval: 0.777
Mean IoU train: 0.749  eval: 0.727
=== EPOCH 199/199 ===
Loss train: 0.074  eval: 0.098
Mean acc train: 0.792  eval: 0.779
Mean IoU train: 0.749  eval: 0.730
DEVICE : cuda
Logging in file : /opt/ml/processing/logs/RandLANet_Las_torch/log_test_2022-03-24_14:43:50.txt
ckpt_path not given. Restore from the latest ckpt
Loading checkpoint /opt/ml/processing/logs/RandLANet_Las_torch/checkpoint/ckpt_00195.pth
Loading checkpoint optimizer_state_dict
Loading checkpoint scheduler_state_dict
Started testing
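
The arithmetic behind the log above can be checked with a small sketch. It assumes the previous behaviour was a plain modulo test over epochs 0 to max_epoch inclusive; the function name `epochs_saved_modulo_only` is hypothetical and is not taken from the codebase.

```python
def epochs_saved_modulo_only(max_epoch, save_ckpt_freq):
    """Epochs that get a checkpoint when only `epoch % freq == 0` is tested.

    Hypothetical helper for illustration; epochs are assumed to run
    from 0 to max_epoch inclusive, as in the log above.
    """
    return [e for e in range(max_epoch + 1) if e % save_ckpt_freq == 0]


max_epoch, save_ckpt_freq = 199, 5
saved = epochs_saved_modulo_only(max_epoch, save_ckpt_freq)
last_saved = saved[-1]

# Epochs trained after the last checkpoint have no saved state.
lost = list(range(last_saved + 1, max_epoch + 1))

print(f"last checkpoint: epoch {last_saved}")
print(f"epochs without a checkpoint: {lost}")
```

With these values the last saved epoch is 195 and epochs 196 to 199 have no checkpoint, which matches the log.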

This PR resolves the issue by always saving a checkpoint on the final epoch, as shown in the sample log below (max_epoch = 2, save_ckpt_freq = 5). A checkpoint is written at Epoch 0 and Epoch 2, and the Epoch 2 checkpoint is loaded for the Test phase.

INFO - 2022-03-25 04:00:40,691 - semantic_segmentation - === EPOCH 0/2 ===
training: 100%|█████████████████████████████████████████████████████████████████████████| 63/63 [04:20<00:00,  4.13s/it]
validation: 100%|█████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.02s/it]
INFO - 2022-03-25 04:05:16,792 - semantic_segmentation - Loss train: 0.726  eval: 2.242
INFO - 2022-03-25 04:05:16,793 - semantic_segmentation - Mean acc train: 0.330  eval: 0.212
INFO - 2022-03-25 04:05:16,794 - semantic_segmentation - Mean IoU train: 0.276  eval: 0.152
INFO - 2022-03-25 04:05:17,034 - semantic_segmentation - Epoch   0: save ckpt to /opt/ml/processing/logs/RandLANet_Las_torch/checkpoint
INFO - 2022-03-25 04:05:17,034 - semantic_segmentation - === EPOCH 1/2 ===
training: 100%|█████████████████████████████████████████████████████████████████████████| 63/63 [04:01<00:00,  3.84s/it]
validation: 100%|█████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.38s/it]
INFO - 2022-03-25 04:09:32,407 - semantic_segmentation - Loss train: 0.386  eval: 0.786
INFO - 2022-03-25 04:09:32,408 - semantic_segmentation - Mean acc train: 0.404  eval: 0.298
INFO - 2022-03-25 04:09:32,408 - semantic_segmentation - Mean IoU train: 0.346  eval: 0.224
INFO - 2022-03-25 04:09:32,409 - semantic_segmentation - === EPOCH 2/2 ===
training: 100%|█████████████████████████████████████████████████████████████████████████| 63/63 [04:01<00:00,  3.83s/it]
validation: 100%|█████████████████████████████████████████████████████████████████████████| 4/4 [00:13<00:00,  3.37s/it]
INFO - 2022-03-25 04:13:47,363 - semantic_segmentation - Loss train: 0.305  eval: 0.721
INFO - 2022-03-25 04:13:47,364 - semantic_segmentation - Mean acc train: 0.469  eval: 0.413
INFO - 2022-03-25 04:13:47,364 - semantic_segmentation - Mean IoU train: 0.414  eval: 0.333
INFO - 2022-03-25 04:13:47,608 - semantic_segmentation - Epoch   2: save ckpt to /opt/ml/processing/logs/RandLANet_Las_torch/checkpoint
INFO - 2022-03-25 04:13:47,611 - train_evaluate - Launching testing
INFO - 2022-03-25 04:13:47,615 - semantic_segmentation - DEVICE : cuda
INFO - 2022-03-25 04:13:47,615 - semantic_segmentation - Logging in file : /opt/ml/processing/logs/RandLANet_Las_torch/log_test_2022-03-25_04:13:47.txt
INFO - 2022-03-25 04:13:49,898 - semantic_segmentation - ckpt_path not given. Restore from the latest ckpt
INFO - 2022-03-25 04:13:49,898 - semantic_segmentation - Loading checkpoint /opt/ml/processing/logs/RandLANet_Las_torch/checkpoint/ckpt_00002.pth
INFO - 2022-03-25 04:13:50,093 - semantic_segmentation - Loading checkpoint optimizer_state_dict
INFO - 2022-03-25 04:13:50,122 - semantic_segmentation - Loading checkpoint scheduler_state_dict
INFO - 2022-03-25 04:13:50,123 - semantic_segmentation - Started testing
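
The rule described in this PR can be sketched as a single predicate. This is a minimal illustration under the assumption that epochs run from 0 to max_epoch inclusive; the function name `should_save_checkpoint` is hypothetical, and the actual change in the pipeline code may be written differently.

```python
def should_save_checkpoint(epoch, max_epoch, save_ckpt_freq):
    """Save on every multiple of the frequency, and always on the final epoch.

    Hypothetical helper for illustration only.
    """
    return epoch % save_ckpt_freq == 0 or epoch == max_epoch


# Same settings as the sample log above: max_epoch = 2, save_ckpt_freq = 5.
max_epoch, save_ckpt_freq = 2, 5
saved = [
    epoch
    for epoch in range(max_epoch + 1)
    if should_save_checkpoint(epoch, max_epoch, save_ckpt_freq)
]
print(saved)
```

The extra `epoch == max_epoch` clause costs at most one additional checkpoint per run, and it means the latest checkpoint restored for testing always reflects the fully trained model.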

@sanskar107 self-requested a review April 7, 2022 14:37
@sanskar107 merged commit 10c7af9 into isl-org:dev Apr 7, 2022