Support Optimizer-in-the-backward #1833
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1833.
No failures as of commit ede3641 with merge base b02825a.
self._optimizer.zero_grad(set_to_none=True)
if self._optimizer_in_bwd:
    raise NotImplementedError(
        "Gradient clipping is not supported after optimizer-in-the-backward."
optimizer_in_backward frees gradient information during loss.backward(), so the correct grad_norm cannot be computed afterwards.
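For context, here is a minimal sketch of the general optimizer-in-the-backward pattern, using PyTorch's register_post_accumulate_grad_hook (available since PyTorch 2.1). This is illustrative only and not necessarily torchtune's exact implementation, but it shows why the gradients are already gone by the time a post-backward grad_norm would be computed:

```python
import torch

model = torch.nn.Linear(16, 16)

# One optimizer per parameter so each parameter can step independently inside backward.
optim_dict = {p: torch.optim.SGD([p], lr=0.01) for p in model.parameters()}


def optimizer_hook(param: torch.nn.Parameter) -> None:
    # Step with this parameter's gradient, then free it immediately.
    optim_dict[param].step()
    optim_dict[param].zero_grad()


for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

loss = model(torch.randn(4, 16)).sum()
loss.backward()  # after this, every param.grad has already been consumed and freed
```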
@@ -681,7 +735,12 @@ def train(self) -> None:
     time_per_step = time.perf_counter() - t0
     log_dict = {
         "loss": loss_to_log,
-        "lr": self._optimizer.param_groups[0]["lr"],
+        "lr": get_lr(
Combined get_lr as a util for both the distributed and single_device recipes: it validates that all the LRs are the same and returns the LR if so.
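For illustration, a minimal sketch of that validation idea; the name get_shared_lr and the iterable-of-optimizers signature are assumptions for this sketch, not the actual torchtune API (which is quoted further down):

```python
from typing import Iterable

import torch


def get_shared_lr(optimizers: Iterable[torch.optim.Optimizer]) -> float:
    """Return the LR shared by all param groups; raise if the groups disagree."""
    lrs = {group["lr"] for opt in optimizers for group in opt.param_groups}
    if len(lrs) != 1:
        raise RuntimeError(f"Expected a single learning rate, found: {sorted(lrs)}")
    return lrs.pop()


# With optimizer-in-backward there is one optimizer per parameter; otherwise just one.
model = torch.nn.Linear(8, 8)
per_param_optims = [torch.optim.SGD([p], lr=2e-5) for p in model.parameters()]
print(get_shared_lr(per_param_optims))  # 2e-05
```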
@@ -29,7 +29,10 @@


class TestFullFinetuneDistributedRecipe:
    def _get_test_config_overrides(self):
Both "optimizer_in_bwd=True" and "clip_grad_norm=100" could cause the wrong grad_norm, separate them here to avoid, loss_value would not be affected by either "optimizer_in_bwd=True" or "clip_grad_norm=100"
@@ -60,9 +63,17 @@ def _fetch_expected_loss_values(self, model_type):
        ("llama3/8B_full", "llama3", "tune", "NO_SHARD"),
    ],
)
@pytest.mark.parametrize("optim_in_bwd", [True, False])
Currently this adds one more parameter, "optim_in_bwd", to get separate test cases. Should we structure the test another way? @ebsmothers
I think this way is OK
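For illustration, a sketch of how such a parametrization keeps the two options apart (the test name and overrides here are illustrative, not the actual test code): each case exercises exactly one of optimizer_in_bwd and clip_grad_norm, since combining them would produce a wrong grad_norm as discussed above.

```python
import pytest


@pytest.mark.parametrize("optim_in_bwd", [True, False])
def test_override_selection(optim_in_bwd: bool) -> None:
    overrides = ["max_steps_per_epoch=2"]
    if optim_in_bwd:
        # Gradients are freed during backward, so clipping is not exercised here.
        overrides.append("optimizer_in_bwd=True")
    else:
        overrides.append("clip_grad_norm=100")
    # Exactly one of the two mutually exclusive options ends up in the overrides.
    assert ("optimizer_in_bwd=True" in overrides) != ("clip_grad_norm=100" in overrides)
```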
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1833       +/-   ##
===========================================
- Coverage   67.30%   25.68%   -41.62%
===========================================
  Files         304      305        +1
  Lines       16000    16082       +82
===========================================
- Hits        10768     4131     -6637
- Misses       5232    11951     +6719
☔ View full report in Codecov by Sentry.
torchtune/training/lr_schedulers.py
Outdated
def get_lr(optimizer_in_bwd, vanilla_optimizer) -> str:
    """
    Full_finetune_distributed and full_finetune_single_deivce assume all optimizers have
    the same LR, here to validate whether all the LR are the same and return if True.
    Bsed on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper,
    name it as vanilla_optimizer to be more general.
    """
Given this API is used in our recipes, we should
a) expose this as a public API here
b) add it to the API docs here
c) make sure the docstring's format matches those of our other public APIs (for example).
Also do you have pre-commit hooks installed? I think pydoclint should be complaining about this since you have raises that aren't documented in the docstring.
torchtune/training/lr_schedulers.py
Outdated
""" | ||
Full_finetune_distributed and full_finetune_single_deivce assume all optimizers have | ||
the same LR, here to validate whether all the LR are the same and return if True. | ||
Bsed on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper, |
- Bsed on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper,
+ Based on optimizer_in_bwd, the second input here could be optimizer or optim_wrapper,
One more small comment on the versioning question. After that this should be good to go.
Great work.
Optimizer in backward and global gradient norm clipping do not make algorithmic sense together 🤔
So if you want gradient clipping with optimizer-in-backward, I think you would need to do something different mathematically, e.g. use the previous iteration's total norm or clip each gradient separately.
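For concreteness, a sketch of the "clip each gradient separately" alternative: clip the per-parameter gradient inside the post-accumulate hook, before the fused step. This is an illustration of the idea only, not code from this PR; note that per-parameter clipping bounds each gradient's norm individually rather than the global norm across all parameters.

```python
import torch

model = torch.nn.Linear(16, 16)
optim_dict = {p: torch.optim.SGD([p], lr=0.01) for p in model.parameters()}
max_norm = 1.0


def clip_then_step(param: torch.nn.Parameter) -> None:
    # Clip this parameter's gradient on its own, then step and free it.
    torch.nn.utils.clip_grad_norm_([param], max_norm)
    optim_dict[param].step()
    optim_dict[param].zero_grad()


for p in model.parameters():
    p.register_post_accumulate_grad_hook(clip_then_step)

model(torch.randn(4, 16)).sum().backward()
```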
Context
What is the purpose of this PR?
Enable Optimizer-in-the-backward for full_finetune_distributed.
Changelog
Add an _optimizer_in_bwd path to the recipe, enabled via the optimizer_in_bwd config flag.
Test plan
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 optimizer_in_bwd=True
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 epochs=10 optimizer_in_bwd=True resume_from_checkpoint=True checkpointer.recipe_checkpoint=/tmp/Llama-2-7b-hf/recipe_state.pt checkpointer.checkpoint_files=[hf_model_0001_1.pt,hf_model_0002_1.pt]
pytest tests/torchtune/training/test_distributed.py -k test_optimizer_in_backward
Memory cost analysis:

With each layer's gradients costing 193 MB of memory, the original (left) case reaches peak memory at the 31st layer, after accumulating 193 MB for each of the preceding 30 layers (roughly 5.8 GB of gradients).
The right case, with optimizer-in-the-backward, frees this memory during the backward pass and reaches a lower peak.
Training time and loss analysis:
