
Conversation

mirceamironenco
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

I've set the PR as a draft and added only one test so that I can first get some feedback on this approach. In this case the PackedDataset acts more as a chunking utility. Note that since seq_len > max_seq_len, to avoid out-of-bounds errors I've changed input_pos to be generated by:

current_pack["input_pos"] += [x % self.max_seq_len for x in range(seq_len)]

This might be unexpected for some users (?).
Also, I assume some of the testing utilities in PackedDataset were not written with this case in mind (e.g. _get_expected_seq_lens_and_input_pos), so before I make further changes I'd like to know whether this direction is fine.
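To illustrate the new input_pos behavior, here is a minimal standalone sketch (not code from this PR; the numbers are made up):

max_seq_len = 4
seq_len = 10  # a single sample longer than max_seq_len
input_pos = [x % max_seq_len for x in range(seq_len)]
print(input_pos)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1] -- positions wrap at each max_seq_len boundary

So the part of a sample that spills past max_seq_len restarts its positions at 0, which is what might surprise users expecting monotonically increasing positions within a single sample.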

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1697

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0e14d7c with merge base 3fddc56:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 26, 2024
@codecov-commenter

codecov-commenter commented Sep 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.56%. Comparing base (6bc143f) to head (9742982).
Report is 15 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1697      +/-   ##
==========================================
- Coverage   70.67%   67.56%   -3.11%     
==========================================
  Files         299      304       +5     
  Lines       15251    15615     +364     
==========================================
- Hits        10778    10551     -227     
- Misses       4473     5064     +591     
Flag Coverage Δ
67.56% <100.00%> (-3.11%) ⬇️


☔ View full report in Codecov by Sentry.

@@ -136,12 +136,15 @@ def _pack(self) -> None:
             # Update the current pack
             current_pack["tokens"] += tokens
             current_pack["labels"] += labels
-            current_pack["input_pos"] += list(range(seq_len))
+            current_pack["input_pos"] += [x % self.max_seq_len for x in range(seq_len)]
Contributor


So this is the correct thing to do in terms of fixing the bug; my main question is whether it's the correct thing to do from a model training perspective. I think it is? Basically we are treating the 2nd half of a long sample as a new sample starting from position 0, right? And as long as we are using RoPE (or some kind of positional encoding that only cares about relative position) it doesn't matter that the $(\text{max\_seq\_len}+1)^{\text{th}}$ element now has input_pos 0. I guess with absolute positional encoding we should not do this though.
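For reference, the relative-position property being relied on here is the standard RoPE identity (not something specific to this PR): with the rotation $R(m)$ applied at position $m$, attention scores satisfy $\langle R(m)\,q,\; R(n)\,k \rangle = \langle q,\; R(n-m)\,k \rangle$, so they depend only on the offset $n - m$. Restarting input_pos at 0 for the spilled-over chunk therefore leaves attention within that chunk unchanged, whereas a learned absolute positional embedding would see genuinely different position indices.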

Contributor Author


I agree with your analysis, and it's unclear to me what choice would be best in all scenarios; some ideas:

  1. Disallow any samples that have seq_len > max_seq_len (my assumption was that such samples are allowed due to the message of the ValueError on line 131).
  2. Truncate samples that are too long and throw away the rest of each sample.
  3. Chunk the samples, as done in this PR.
  4. Allow either 2 or 3, controlled by a new flag added to the PackedDataset class, e.g. truncate_samples (possibly with truncation as the default behavior). A warning could also be emitted whenever a sample is truncated. (A short sketch contrasting (2) and (3) follows after this list.)
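A rough sketch of the difference between (2) and (3) for a single over-long sample (purely illustrative; none of these names come from the PR):

tokens = list(range(10))
max_seq_len = 4

# Option 2: truncate and discard the remainder
truncated = tokens[:max_seq_len]  # [0, 1, 2, 3]

# Option 3: chunk into pieces of at most max_seq_len (what split_across_pack effectively does)
chunks = [tokens[i : i + max_seq_len] for i in range(0, len(tokens), max_seq_len)]
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]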

Contributor


Personally I think (3) is most in line with the spirit of split_across_pack=True anyway: we don't care too much whether the split is caused by a single really long sample or just awkward spacing of a sample starting right near the max_seq_len boundary of the pack. (2) is also reasonable, but I actually wouldn't add another config option (I don't want to make things more complicated than they have to be). One thing we could consider is modifying the behavior of split_across_pack=False so that we put any sample whose length exceeds max_seq_len in its own pack, truncate it, and raise a warning. I don't have a strong preference between this and the current state of things, though.
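As a very rough sketch of that alternative behavior for split_across_pack=False (hypothetical helper, not part of this PR or of torchtune):

import warnings

def maybe_truncate(tokens, max_seq_len):
    # Hypothetical helper: with split_across_pack=False, an over-long sample would be
    # truncated to fit in its own pack, with a warning instead of a hard error.
    if len(tokens) > max_seq_len:
        warnings.warn(
            f"Sample of length {len(tokens)} exceeds max_seq_len={max_seq_len}; truncating."
        )
        tokens = tokens[:max_seq_len]
    return tokens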

Contributor

@ebsmothers ebsmothers left a comment


Very nice find, very nice fix, and great unit test. One concern though: I think we will also need to change the seq_lens values in the pack to not exceed max_seq_len, right? It may not error when packing the dataset as we saw with input_pos, but I suspect it will break in flex attention (see here)

I was going to suggest just changing L140 to current_pack["seq_lens"] += [min(seq_len, self.max_seq_len)], but I think you may need to use the sequence lengths of the entire pack to avoid a case like max_seq_len=50, seq_lens=[2, 50] breaking the sum(seq_lens) == max_seq_len assumption. (Hopefully that makes sense, lmk if not)
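To spell out the concern with a per-sample min() (hypothetical numbers, not from the PR):

max_seq_len = 50
seq_lens = [2, 50]  # second sample already clipped to max_seq_len by min()
clipped = [min(s, max_seq_len) for s in seq_lens]
print(sum(clipped))  # 52 -- still violates the sum(seq_lens) == max_seq_len assumption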

@mirceamironenco
Contributor Author

Thanks for taking a look!

> Very nice find, very nice fix, and great unit test. One concern though: I think we will also need to change the seq_lens values in the pack to not exceed max_seq_len, right? It may not error when packing the dataset as we saw with input_pos, but I suspect it will break in flex attention (see here)
>
> I was going to suggest just changing L140 to current_pack["seq_lens"] += [min(seq_len, self.max_seq_len)], but I think you may need to use the sequence lengths of the entire pack to avoid a case like max_seq_len=50, seq_lens=[2, 50] breaking the sum(seq_lens) == max_seq_len assumption. (Hopefully that makes sense, lmk if not)

I think I understand what you mean, but unless I've misunderstood how splitting happens this might already be fine:

if self.split_across_pack:
    boundary = self.max_seq_len
    # The last elem in ``seq_lens`` ensures that ``sum(seq_lens) == self.max_seq_len``
    leftover_seq_len = self.max_seq_len - sum(current_pack["seq_lens"][:-1])
    seq_len_padding = [leftover_seq_len] if leftover_seq_len > 0 else []

Here, sum(current_pack["seq_lens"][:-1]) should always be <= self.max_seq_len, and in the case where all seq_lens are greater than max_seq_len it will always be 0. If you have a counter-example I can address it. I've also modified the unit test to explicitly check that the sum(seq_lens) == max_seq_len assumption holds (see the latest commit at the time of this comment).
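To make that concrete, a worked trace of the snippet above with hypothetical numbers, assuming the split pack's final seq_lens is assembled as current_pack["seq_lens"][:-1] plus the padding entry (my reading of the surrounding code):

max_seq_len = 50
current_seq_lens = [2, 50]  # second sample is over-long and gets split across packs
leftover_seq_len = max_seq_len - sum(current_seq_lens[:-1])  # 50 - 2 = 48
seq_len_padding = [leftover_seq_len] if leftover_seq_len > 0 else []
split_seq_lens = current_seq_lens[:-1] + seq_len_padding  # [2, 48]
assert sum(split_seq_lens) == max_seq_len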

Note that I've played around with this a bit more, and the current version of the code (on torchtune/main) will error out even for a simpler case. It is not necessary to have seq_len > 2 * max_seq_len; even a single example where seq_len > max_seq_len can cause the same error:

from torchtune.datasets import PackedDataset

max_seq_len = 60
sample_size = [max_seq_len // 2] * 10 + [max_seq_len + 1]
dataset = [dict(tokens=list(range(size)), labels=list(range(size))) for size in sample_size]
x = PackedDataset(dataset, max_seq_len=max_seq_len, split_across_pack=True)
# RuntimeError: upper bound and larger bound inconsistent with step sign.
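And, assuming the fix in this PR, the same construction should go through; a hypothetical check along the lines of the modified unit test (dict keys and tensor types of the packed samples assumed):

# With the fix, the construction above should succeed, and each resulting pack
# should satisfy the seq_lens invariant discussed above:
for i in range(len(x)):
    assert x[i]["seq_lens"].sum().item() == max_seq_len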

@ebsmothers
Contributor


Oh, you're completely correct on this... somehow when I was looking at the code I missed the [:-1] slice on seq_lens, which got me confused. Then I agree: I think we will always satisfy sum(seq_lens) == max_seq_len, even when dealing with samples whose length exceeds max_seq_len. So there are no major concerns from my side on this PR; we can settle the truncation-vs-chunking discussion separately, but I don't think it's a blocker here.

@mirceamironenco mirceamironenco marked this pull request as ready for review September 28, 2024 20:59
Contributor

@ebsmothers ebsmothers left a comment


Thank you for the fix!

@ebsmothers ebsmothers merged commit ded8958 into pytorch:main Sep 30, 2024
17 checks passed
@mirceamironenco mirceamironenco deleted the fix-packedds-seqlen branch September 30, 2024 18:12
Successfully merging this pull request may close these issues.

PackedDataset cannot handle long sequence whose length is larger than 2*max_seq_len when using split_across_pack=True