
Conversation

derekhiggins
Contributor

Reset the train_data list for each iteration of the synth instruction loop. When it is written to file it now only contains one of each entry.

This should dramatically reduce training time.

Fixes #752


Signed-off-by: Derek Higgins <derekh@redhat.com>
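For illustration, the duplication pattern and the fix can be sketched like this (a minimal stand-alone sketch with hypothetical function names, not the actual cli code — the real loop calls a model and writes a timestamped JSONL file):

```python
import json

def generate_samples(i):
    # stand-in for the model call; returns one new sample per iteration
    return [{"instruction": f"sample-{i}"}]

def write_buggy(path, num_iters):
    """Buggy pattern: the accumulated list is appended to the file every
    iteration, so earlier samples are written again and again."""
    open(path, "w").close()  # truncate
    train_data = []
    for i in range(num_iters):
        train_data.extend(generate_samples(i))
        with open(path, "a") as f:  # append mode + full list = duplicates
            for entry in train_data:
                f.write(json.dumps(entry) + "\n")

def write_fixed(path, num_iters):
    """Fixed pattern: reset the per-iteration list so only the new
    entries reach the file each time."""
    for i in range(num_iters):
        train_data = generate_samples(i)  # reset each iteration
        with open(path, "w" if i == 0 else "a") as f:
            for entry in train_data:
                f.write(json.dumps(entry) + "\n")
```

Under this idealized model (one sample per iteration), the buggy pattern writes 1 + 2 + … + n lines after n iterations, while the fixed pattern writes exactly n.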
@derekhiggins
Contributor Author

There is a refactor that could be done here to make things more efficient, but this should fix the problem.

@DennisPeriquet
Contributor

Looks good and thank you!

$ git show|head -1
commit 030117bf1dd52cb8617b6de8321cb6a5fefe4923

# Small test showing results after 'lab generate':
#
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_01_09.jsonl |wc -l
11
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_01_09.jsonl |sort -u |wc -l
11

$ rm -rf generated
...
# configure num_instructions: 100 since that's the default
# 
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_05_22.jsonl |wc -l
100
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_05_22.jsonl |sort -u|wc -l
100
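The `sort -u` comparison above can also be done directly in Python; a small hypothetical helper (not part of the cli) that returns any JSONL rows appearing more than once:

```python
from collections import Counter

def duplicate_rows(path):
    """Return {row: count} for every line that appears more than once."""
    with open(path) as f:
        counts = Counter(line for line in f)
    return {row: n for row, n in counts.items() if n > 1}

# an empty dict means every row in the training file is unique
```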

@abhi1092
Member

abhi1092 commented Apr 2, 2024

It's definitely better not to write the entire data set every iteration. But with this new change, isn't it overwriting the entire file with the current train data?

@derekhiggins
Contributor Author

derekhiggins commented Apr 2, 2024

> It's definitely better not to write the entire data set every iteration. But with this new change, isn't it overwriting the entire file with the current train data?

It is overwriting the entire file with all of the train data found so far. It would be more efficient to just append the new data, but I saw this as part of the potential bigger refactor I mentioned. I was going to look into this after my other refactor is reviewed, as the two will probably conflict (#688).
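The append-only refactor mentioned here could look roughly like this (a sketch with hypothetical names, not the actual cli code): open the JSONL file in append mode each iteration and write only the newly generated entries, so earlier lines are never rewritten.

```python
import json

def append_new_entries(path, new_entries):
    """Append only this iteration's entries; earlier lines are untouched.

    Append mode creates the file on the first call if it does not exist.
    """
    with open(path, "a", encoding="utf-8") as f:
        for entry in new_entries:
            f.write(json.dumps(entry) + "\n")

# usage sketch: inside the generation loop
# for i in range(num_instructions):
#     new_entries = synthesize_batch(i)   # hypothetical model call
#     append_new_entries("train.jsonl", new_entries)
```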

@xukai92
Member

xukai92 commented Apr 4, 2024

I have the same question as Abhishek.
Do we still have 100 samples if we run ilab generate by default?
If not, the new behavior isn't right, no?
I do understand the original motivation is to get rid of duplicated samples, but maybe we should just deduplicate them, or make sure the parameter rouge_threshold is set to a smaller value (https://github.com/instruct-lab/cli/blob/main/cli/lab.py#L366).

@derekhiggins
Contributor Author

> I have the same question as Abhishek. Do we still have 100 samples if we run ilab generate by default? If not, the new behavior isn't right, no? I do understand the original motivation is to get rid of duplicated samples, but maybe we should just deduplicate them, or make sure the parameter rouge_threshold is set to a smaller value (https://github.com/instruct-lab/cli/blob/main/cli/lab.py#L366).

With the new code a user will get a training file with 100 samples (sometimes 101); with the old code they would get thousands. The problem has nothing to do with the rouge threshold, and it is not that the model is producing duplicate samples. We were writing each sample (the same sample) to the file multiple times, sometimes hundreds of times. This doesn't seem to have been intended, and it adds thousands of iterations to the training stage.
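The scale of the blow-up follows from the accumulation pattern: if one sample is kept per iteration and the whole accumulated list is written out each time, iteration i contributes i lines, so n iterations produce n(n+1)/2 lines in total. A quick sanity check of that arithmetic (an idealized model; the real multiplicity depends on batch sizes):

```python
def total_lines(n):
    """Lines written when iteration i re-emits all i samples seen so far."""
    return sum(i for i in range(1, n + 1))  # equals n * (n + 1) // 2

print(total_lines(100))  # 5050 lines for only 100 unique samples
```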

@xukai92
Member

xukai92 commented Apr 4, 2024

Thanks for the explanation @derekhiggins.
It indeed looks like a bug.
Let me take a closer review of the change.

@xukai92
Member

xukai92 commented Apr 4, 2024

I guess I know what's going on.
There was a refactoring PR that changed the behavior to dump data to file on the fly.
However, with the old logic it accumulated a partially complete train set and dumped it on the fly multiple times, which is what you observed.

Successfully merging this pull request may close these issues.

The "lab generate" produces training file containing large number of duplicate rows (slows training down significantly)