
Conversation

@lindawangg (Contributor) commented on Sep 11, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Changelog

What are the changes made in this PR?

  • KD recipe (knowledge_distillation_single_device.py) is similar to lora_finetune_single_device.py. Main differences are:
    • adds kd loss, currently just ForwardKLLoss (in kd_losses.py), to CE loss (a rough sketch of the combined step follows this list)
    • adds teacher model inference to get logits
  • KD config: knowledge_distillation_single_device.yaml
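
As a rough illustration of the core change (not the recipe's exact code; the function and argument names below, such as kd_step and kd_ratio, are placeholders), the student's cross-entropy loss is blended with a forward-KL distillation loss computed against logits from a frozen teacher:

import torch
from torch import nn

def kd_step(
    student: nn.Module,
    teacher: nn.Module,
    ce_loss_fn: nn.Module,
    kd_loss_fn: nn.Module,
    tokens: torch.Tensor,
    labels: torch.Tensor,
    kd_ratio: float = 0.5,
) -> torch.Tensor:
    student_logits = student(tokens)
    # The teacher is frozen; run it without gradient tracking just to get target logits.
    with torch.no_grad():
        teacher_logits = teacher(tokens)
    class_loss = ce_loss_fn(student_logits, labels)
    kd_loss = kd_loss_fn(student_logits, teacher_logits, labels)
    # Blend the standard CE loss with the distillation loss.
    return (1 - kd_ratio) * class_loss + kd_ratio * kd_loss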

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
CUDA_VISIBLE_DEVICES=0 tune run knowledge_distillation_single_device --config qwen2/knowledge_distillation_single_device
Llama3.1 KD Training

Legend: KD Llama3.1 student (blue), LoRA Llama3.1 student (orange)
[three training curve plots]

Qwen2 KD Training

Legend: KD Qwen2 0.5B student (green), LoRA Qwen2 0.5B student (grey), LoRA Qwen2 1.5B teacher (blue)
[three training curve plots]

Llama3.1 Eval Results
[eval results table]
Qwen2 Eval Results
[eval results table]

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Example of docstring:


Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models

  • I did not change any public API;
  • I have added an example to docs or docstrings;


pytorch-bot bot commented Sep 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1539

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6ea3329 with merge base 63208c6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Sep 11, 2024
@lindawangg lindawangg marked this pull request as ready for review September 12, 2024 02:52
Comment on lines 6 to 7
# tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct
# tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
Contributor:

Can you add a bit more detail here? Tbh when I first looked at it I thought you had accidentally just copy-pasted a tune run command from another config 😅 . Maybe just add a couple statements explicitly saying something like "Run this to download the model: {tune download ...}. You will then need to fine-tune the teacher model, you can do this with {tune run...}"

# Teacher checkpoint
teacher_checkpointer:
_component_: torchtune.training.FullModelMetaCheckpointer
checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/lora_finetuned_single_device_epoch_1/
Contributor:

Should make sure that this directory matches the output of whatever command you give for LoRA finetuning at the top of this file

epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 16
compile: False
Contributor:

Out of curiosity, did you try with compile yet?

Contributor Author:

Yeah, tested that compile works (shown in pink):
[compile vs. no-compile training curves]

Contributor:

Looks like some nice improvements in training speed. I'm curious about the difference in the loss curves, do you see non-determinism across runs there without compile enabled?

Contributor Author:

The loss curves between compile and non-compile are fairly similar.
[loss curve comparison plots]

I do see some non-determinism during eval. The losses are all around 1.2, but there are slight differences in the eval metrics. It's also interesting that fine-tuning on the alpaca dataset actually hurts performance on all benchmarks except TruthfulQA.
[eval results table]

Comment on lines 15 to 17
The Kullback-Leibler divergence loss for valid indexes.
Implementation of https://github.com/jongwooko/distillm/blob/master/distillm/losses.py.
"""
Contributor:

I'm surprised the linter didn't yell at you for this, can we add args (well I guess just single arg) with typehints here?

Contributor:

Also two nits on the link: (1) don't include the period at the end (makes it not clickable), and (2) replace master with a specific commit hash (in case things change in the future)

"""

teacher_prob = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
inf_mask = torch.isinf(student_logits)
Contributor:

Noob q: why would student logits be infinite? Does it mean there is some numerical issue? (I know it's in the original implementation, just curious about the rationale)

Contributor Author:

The teacher_logits could be infinite too. I believe the original implementation only considered that the student logits could be infinite because that's the model that's training. The inf in the student logits would cause the torch.sum part to be inf.
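
For reference, a minimal sketch of the forward-KL loss with this inf masking, modeled on the distillm implementation linked in the docstring (not a verbatim copy of the PR's ForwardKLLoss):

import torch
import torch.nn.functional as F

def forward_kl(
    student_logits: torch.Tensor,   # [batch, seq_len, vocab]
    teacher_logits: torch.Tensor,   # [batch, seq_len, vocab]
    labels: torch.Tensor,           # [batch, seq_len]
    ignore_index: int = -100,
) -> torch.Tensor:
    # Upcast to fp32 for numerical stability.
    teacher_prob = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
    student_logprob = F.log_softmax(student_logits, dim=-1, dtype=torch.float32)
    # Zero out vocab positions where the student produced +/-inf logits so they
    # don't turn the per-token sum into inf/NaN.
    inf_mask = torch.isinf(student_logits)
    prod = (teacher_prob * student_logprob).masked_fill(inf_mask, 0)
    # -sum_k p_teacher(k) * log p_student(k), per token position.
    x = torch.sum(prod, dim=-1).view(-1)
    # Average only over valid (non-padding) token positions.
    mask = (labels != ignore_index).int().view(-1)
    return -torch.sum(x * mask, dim=0) / torch.sum(mask, dim=0)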

Comment on lines 61 to 62
the cross entropy normally, but upcasting only one chunk at a time saves considerable memory.
"""
Contributor:

Similar comment here about init args

standard_loss = fkl_loss(logits, teacher_logits, labels)

# Assert
assert_expected(chunked_loss, standard_loss, rtol=1e-2, atol=1e-2)
Contributor:

Would also be good to run the loss from jongwooko repo with identical sets of values, use that to determine the expected value, then compare both chunked and standard losses to that (that way we know that we have numerical parity with a reference implementation)

Contributor Author:

Good idea. I couldn't find any tests in the repo, so I randomly generated the logits and ran them through the distillm implementation.

Contributor:

Looks great! You can also use something like the fixed_init_tensor util in case you don't want to generate them manually. But in this case the tensors are relatively small anyway, so no need for it.

@ebsmothers (Contributor) left a comment:

OK I took another, more thorough pass, and this looks great! I left a handful more comments but after that there are no real concerns from me


Comment on lines 284 to 287
Config(
name="llama3_1/kd_single_device",
file_path="llama3_1/kd_single_device.yaml",
),
Contributor:

Need to remove this in the latest version

# Logging
output_dir: /tmp/qwen_kd
metric_logger:
_component_: torchtune.training.metric_logging.TensorBoardLogger
Contributor:

nit: at least before landing make sure to switch this to torchtune.training.metric_logging.DiskLogger, since tensorboard is technically a dev dependency of our library

Comment on lines 72 to 73
library (https://huggingface.co/docs/bitsandbytes/main/en/index). We've tested the recipe with
8-bit AdamW and Paged AdamW.
Contributor:

Might wanna check this if you haven't already (I would be surprised if it doesn't work though)


Comment on lines 264 to 283
backend = os.environ.get("TORCH_COMPILE_BACKEND", "inductor")
if self._loss_fn.__class__.__name__ == "CEWithChunkedOutputLoss":
# set num_output_chunks for model
assert (
self._loss_fn.num_output_chunks == self._kd_loss_fn.num_output_chunks
), "Number of output chunks for loss_fn and kd_loss_fn must be the same."
self._model.set_num_output_chunks(self._loss_fn.num_output_chunks)
self._teacher_model.set_num_output_chunks(self._loss_fn.num_output_chunks)
if self._model_compile:
log.info("Compiling loss with torch.compile...")
# For CEWithChunkedOutputLoss, if we compile the entire class
# we lose the benefits from the chunked loss.
# Therefore, we only compile the cross entropy function + upcasting
self._loss_fn.compute_cross_entropy = torch.compile(
self._loss_fn.compute_cross_entropy, backend=backend
)
else:
if self._model_compile:
log.info("Compiling loss with torch.compile...")
self._loss_fn = torch.compile(self._loss_fn, backend=backend)
Contributor:

Oh we have some utilities for this now, you can try to use those instead. See usage in our LoRA single-device recipe here. (If they don't work for KD out of the box lmk, we can refactor as needed)

Contributor:

Can also try compiling KD loss function
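
A sketch of how the KD recipe could use those utilities, assuming they are training.compile_model and training.compile_loss as in the LoRA single-device recipe, and that compile_loss returns the compiled loss module (handling the chunked-loss case internally):

from torch import nn
from torchtune import training

def setup_compile(model: nn.Module, loss_fn: nn.Module, kd_loss_fn: nn.Module):
    # Compile the transformer layers in place via the shared utility ...
    training.compile_model(model)
    # ... and compile both the CE loss and the KD loss.
    loss_fn = training.compile_loss(loss_fn)
    kd_loss_fn = training.compile_loss(kd_loss_fn)
    return loss_fn, kd_loss_fn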

Comment on lines 415 to 420
if compile_model:
log.info("Compiling model layers with torch.compile...")
backend = os.environ.get("TORCH_COMPILE_BACKEND", "inductor")
for m in reversed(list(model.modules())):
if isinstance(m, modules.transformer.TransformerSelfAttentionLayer):
m.compile(backend=backend)
Contributor:

Similar comment here about using the compile utilities

training.log_memory_stats(memory_stats)
return model

def _setup_teacher_model(
Contributor:

Wonder if it's worth it to also compile the teacher model when compile=True?

Contributor Author:

I got an error when trying to compile the teacher model. I'm using torch.no_grad when running inference on the teacher model to reduce memory consumption. However, it seems that torch.no_grad isn't compatible with torch.compile (pytorch/pytorch#100241).
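
For context, the teacher-side pattern under discussion looks roughly like the following sketch (illustrative names, not the recipe's exact code): the teacher stays frozen in eval mode and its forward pass runs under torch.no_grad, which is what conflicts with torch.compile per the issue above.

import torch
from torch import nn

def setup_teacher(teacher: nn.Module) -> nn.Module:
    # The teacher is never updated: keep it in eval mode with gradients disabled.
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def get_teacher_logits(teacher: nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    # no_grad avoids storing activations for backward, saving memory during
    # teacher inference.
    with torch.no_grad():
        return teacher(tokens)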

# Update the number of steps when the weights are updated
self.global_step += 1

loss_to_log = running_loss.item()
Contributor:

Nit: I'm not positive, but it might be slightly slower to do things this way. Each .item() will cause a sync and you could probably get away with calling just running_class_loss.item() and running_kd_loss.item() and figuring out loss_to_log from that (I think).

Contributor Author:

Calling .item() twice instead of three times helps slightly; I don't see much of a difference:
[training speed comparison]

I removed running_loss and now compute the logged loss from the class and KD losses.
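
Roughly, the resulting logging pattern looks like this sketch (kd_ratio and the running_* names are assumptions based on this thread, not the recipe's exact code):

import torch

def losses_for_logging(
    running_class_loss: torch.Tensor,
    running_kd_loss: torch.Tensor,
    kd_ratio: float,
) -> tuple[float, float, float]:
    # Two device-to-host syncs instead of three: derive the combined loss on
    # the host from the two component values.
    class_loss_to_log = running_class_loss.item()
    kd_loss_to_log = running_kd_loss.item()
    loss_to_log = (1 - kd_ratio) * class_loss_to_log + kd_ratio * kd_loss_to_log
    return class_loss_to_log, kd_loss_to_log, loss_to_log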

@joecummings (Member) left a comment:

Super nit / discussion point:

kd -> distillation. I think it's much easier to understand at a glance what it is. I didn't understand the abbreviation KD until I read through the PR description.

@lindawangg (Contributor Author) replied, quoting the above:

Super nit / discussion point:

kd -> distillation. I think it's much easier to understand at a glance what it is. I didn't understand the abbreviation KD until I read through the PR description.

We could rename from kd_single_device to knowledge_distillation_single_device. Distillation alone might be confusing since there are many types.

@ebsmothers (Contributor) replied, quoting the above:

We could rename from kd_single_device to knowledge_distillation_single_device. Distillation might be confusing since there's many types.

Yeah this sounds good to me. I agree that it's nice to just be explicit in the recipe name so it's obvious what the recipe is doing (even if the name is a bit longer as a result).

metric_logger._component_=torchtune.training.metric_logging.DiskLogger \
metric_logger.filename={log_file} \
compile={compile} \
kd_loss._component_=torchtune.modules.loss.ForwardKLWithChunkedOutputLoss \
Contributor:

Do you actually need this override? This should be the default in the config, right?

Contributor Author:

Removed.

Comment on lines 119 to 120
print(loss_values)
print(expected_loss_values)
Contributor:

remove

@codecov-commenter commented on Sep 19, 2024

Codecov Report

Attention: Patch coverage is 22.01835% with 340 lines in your changes missing coverage. Please review.

Project coverage is 69.00%. Comparing base (dd348ce) to head (6ea3329).
Report is 474 commits behind head on main.

Files with missing lines                                      Patch %   Lines missing
recipes/knowledge_distillation_single_device.py                 0.00%   261 ⚠️
...cipes/test_knowledge_distillation_single_device.py          22.00%    78 ⚠️
torchtune/modules/loss/kd_losses.py                            96.77%     1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1539      +/-   ##
==========================================
- Coverage   72.26%   69.00%   -3.26%     
==========================================
  Files         290      295       +5     
  Lines       14554    15079     +525     
==========================================
- Hits        10517    10406     -111     
- Misses       4037     4673     +636     

☔ View full report in Codecov by Sentry.

@ebsmothers (Contributor) left a comment:

Thank you for adding this!

@ebsmothers ebsmothers merged commit 4234b78 into pytorch:main Sep 19, 2024
17 checks passed
@lindawangg mentioned this pull request on Sep 20, 2024