Support Optimizer-in-the-backward #1833
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1833.
No failures as of commit ede3641 with merge base b02825a.
self._optimizer.zero_grad(set_to_none=True)
if self._optimizer_in_bwd:
    raise NotImplementedError(
        "Gradient clipping is not supported after optimizer-in-the-backward."
optimizer_in_backward frees gradient information during loss.backward(), so the correct grad_norm cannot be computed afterwards.
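For context, here is a minimal sketch of the general optimizer-in-the-backward pattern, using PyTorch's register_post_accumulate_grad_hook (available since PyTorch 2.1). This is illustrative only and not necessarily torchtune's exact implementation, but it shows why the gradients are already gone by the time a post-backward grad_norm would be computed:

```python
import torch

model = torch.nn.Linear(16, 16)

# One optimizer per parameter so each parameter can step independently inside backward.
optim_dict = {p: torch.optim.SGD([p], lr=0.01) for p in model.parameters()}


def optimizer_hook(param: torch.nn.Parameter) -> None:
    # Step with this parameter's gradient, then free it immediately.
    optim_dict[param].step()
    optim_dict[param].zero_grad()


for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

loss = model(torch.randn(4, 16)).sum()
loss.backward()  # after this, every param.grad has already been consumed and freed
```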
@@ -681,7 +735,12 @@ def train(self) -> None:
     time_per_step = time.perf_counter() - t0
     log_dict = {
         "loss": loss_to_log,
-        "lr": self._optimizer.param_groups[0]["lr"],
+        "lr": get_lr(
Combined get_lr as a util for both the distributed and single_device recipes: it validates that all the LRs are the same and returns the LR if so.
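For illustration, a minimal sketch of that validation idea; the name get_shared_lr and the iterable-of-optimizers signature are assumptions for this sketch, not the actual torchtune API (which is quoted further down):

```python
from typing import Iterable

import torch


def get_shared_lr(optimizers: Iterable[torch.optim.Optimizer]) -> float:
    """Return the LR shared by all param groups; raise if the groups disagree."""
    lrs = {group["lr"] for opt in optimizers for group in opt.param_groups}
    if len(lrs) != 1:
        raise RuntimeError(f"Expected a single learning rate, found: {sorted(lrs)}")
    return lrs.pop()


# With optimizer-in-backward there is one optimizer per parameter; otherwise just one.
model = torch.nn.Linear(8, 8)
per_param_optims = [torch.optim.SGD([p], lr=2e-5) for p in model.parameters()]
print(get_shared_lr(per_param_optims))  # 2e-05
```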
@@ -29,7 +29,10 @@


class TestFullFinetuneDistributedRecipe:
    def _get_test_config_overrides(self):
Both "optimizer_in_bwd=True" and "clip_grad_norm=100" could cause the wrong grad_norm, separate them here to avoid, loss_value would not be affected by either "optimizer_in_bwd=True" or "clip_grad_norm=100"
@@ -60,9 +63,17 @@ def _fetch_expected_loss_values(self, model_type):
        ("llama3/8B_full", "llama3", "tune", "NO_SHARD"),
    ],
)
@pytest.mark.parametrize("optim_in_bwd", [True, False])
Currently this adds one more parameter, "optim_in_bwd", to get separate test cases. Should we structure the test another way? @ebsmothers
I think this way is OK
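For illustration, a sketch of how such a parametrization keeps the two options apart (the test name and overrides here are illustrative, not the actual test code): each case exercises exactly one of optimizer_in_bwd and clip_grad_norm, since combining them would produce a wrong grad_norm as discussed above.

```python
import pytest


@pytest.mark.parametrize("optim_in_bwd", [True, False])
def test_override_selection(optim_in_bwd: bool) -> None:
    overrides = ["max_steps_per_epoch=2"]
    if optim_in_bwd:
        # Gradients are freed during backward, so clipping is not exercised here.
        overrides.append("optimizer_in_bwd=True")
    else:
        overrides.append("clip_grad_norm=100")
    # Exactly one of the two mutually exclusive options ends up in the overrides.
    assert ("optimizer_in_bwd=True" in overrides) != ("clip_grad_norm=100" in overrides)
```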
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1833       +/-   ##
===========================================
- Coverage   67.30%   25.68%   -41.62%
===========================================
  Files         304      305        +1
  Lines       16000    16082       +82
===========================================
- Hits        10768     4131     -6637
- Misses       5232    11951     +6719
☔ View full report in Codecov by Sentry.
torchtune/training/lr_schedulers.py
Outdated
def get_lr(optimizer_in_bwd, vanilla_optimizer) -> str:
    """
    Full_finetune_distributed and full_finetune_single_deivce assume all optimizers have
    the same LR, here to validate whether all the LR are the same and return if True.
    Bsed on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper,
    name it as vanilla_optimizer to be more general.
    """
Given this API is used in our recipes, we should
a) expose this as a public API here
b) add it to the API docs here
c) make sure the docstring's format matches those of our other public APIs (for example).
Also do you have pre-commit hooks installed? I think pydoclint should be complaining about this since you have raises that aren't documented in the docstring.
torchtune/training/lr_schedulers.py
Outdated
""" | ||
Full_finetune_distributed and full_finetune_single_deivce assume all optimizers have | ||
the same LR, here to validate whether all the LR are the same and return if True. | ||
Bsed on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper, |
- Bsed on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper,
+ Based on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper,
One more small comment on the versioning question. After that this should be good to go.
Great work.
Optimizer in backward and global gradient norm clipping do not make algorithmic sense together 🤔
So if you want gradient clipping with optimizer-in-backward, I think you would need to do something different mathematically, e.g. use the previous iteration's total norm or clip each gradient separately.
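For concreteness, a sketch of the "clip each gradient separately" alternative: clip the per-parameter gradient inside the post-accumulate hook, before the fused step. This is an illustration of the idea only, not code from this PR; note that per-parameter clipping bounds each gradient's norm individually rather than the global norm across all parameters.

```python
import torch

model = torch.nn.Linear(16, 16)
optim_dict = {p: torch.optim.SGD([p], lr=0.01) for p in model.parameters()}
max_norm = 1.0


def clip_then_step(param: torch.nn.Parameter) -> None:
    # Clip this parameter's gradient on its own, then step and free it.
    torch.nn.utils.clip_grad_norm_([param], max_norm)
    optim_dict[param].step()
    optim_dict[param].zero_grad()


for p in model.parameters():
    p.register_post_accumulate_grad_hook(clip_then_step)

model(torch.randn(4, 16)).sum().backward()
```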
Context
What is the purpose of this PR?
Enable Optimizer-in-the-backward for full_finetune_distributed.
Changelog
Add an _optimizer_in_bwd path to the recipe, enabled via the optimizer_in_bwd config flag.
Test plan
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 optimizer_in_bwd=True
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 epochs=10 optimizer_in_bwd=True resume_from_checkpoint=True checkpointer.recipe_checkpoint=/tmp/Llama-2-7b-hf/recipe_state.pt checkpointer.checkpoint_files=[hf_model_0001_1.pt,hf_model_0002_1.pt]
pytest tests/torchtune/training/test_distributed.py -k test_optimizer_in_backward
Memory cost analysis:

With each layer's gradients costing 193 MB of memory, the original (left) case reaches peak memory at the 31st layer, after accumulating 193 MB for each of the preceding 30 layers (roughly 5.8 GB of gradients).
The right case, with optimizer-in-the-backward, frees this memory during the backward pass and reaches a lower peak.
Training time and loss analysis:
