
Conversation

derekhiggins
Contributor

Reset the train_data list for each iteration of the synth instruction loop. When it is written to file it now only contains one of each entry.

This should dramatically reduce training time.

Fixes #752


Signed-off-by: Derek Higgins <derekh@redhat.com>
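For illustration, the duplication pattern and the fix can be sketched like this (a minimal stand-alone sketch with hypothetical function names, not the actual cli code — the real loop calls a model and writes a timestamped JSONL file):

```python
import json

def generate_samples(i):
    # stand-in for the model call; returns one new sample per iteration
    return [{"instruction": f"sample-{i}"}]

def write_buggy(path, num_iters):
    """Buggy pattern: the accumulated list is appended to the file every
    iteration, so earlier samples are written again and again."""
    open(path, "w").close()  # truncate
    train_data = []
    for i in range(num_iters):
        train_data.extend(generate_samples(i))
        with open(path, "a") as f:  # append mode + full list = duplicates
            for entry in train_data:
                f.write(json.dumps(entry) + "\n")

def write_fixed(path, num_iters):
    """Fixed pattern: reset the per-iteration list so only the new
    entries reach the file each time."""
    for i in range(num_iters):
        train_data = generate_samples(i)  # reset each iteration
        with open(path, "w" if i == 0 else "a") as f:
            for entry in train_data:
                f.write(json.dumps(entry) + "\n")
```

Under this idealized model (one sample per iteration), the buggy pattern writes 1 + 2 + … + n lines after n iterations, while the fixed pattern writes exactly n.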
@derekhiggins
Contributor Author

There is a refactor that could be done here to make things more efficient, but this should fix the problem.

@DennisPeriquet
Contributor

Looks good and thank you!

$ git show|head -1
commit 030117bf1dd52cb8617b6de8321cb6a5fefe4923

# Small test showing results after 'lab generate':
#
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_01_09.jsonl |wc -l
11
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_01_09.jsonl |sort -u |wc -l
11

$ rm -rf generated
...
# configure num_instructions: 100 since that's the default
# 
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_05_22.jsonl |wc -l
100
$ cat generated/train_llama-2-7b-chat.Q4_0_2024-03-28T07_05_22.jsonl |sort -u|wc -l
100
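The `sort -u` comparison above can also be done directly in Python; a small hypothetical helper (not part of the cli) that returns any JSONL rows appearing more than once:

```python
from collections import Counter

def duplicate_rows(path):
    """Return {row: count} for every line that appears more than once."""
    with open(path) as f:
        counts = Counter(line for line in f)
    return {row: n for row, n in counts.items() if n > 1}

# an empty dict means every row in the training file is unique
```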

@abhi1092
Member

abhi1092 commented Apr 2, 2024

It's definitely better not to write the entire data set every iteration. But with this new change, isn't it overwriting the entire file with the current train data?

@derekhiggins
Contributor Author

derekhiggins commented Apr 2, 2024

> It's definitely better not to write the entire data set every iteration. But with this new change, isn't it overwriting the entire file with the current train data?

It is overwriting the entire file with all of the train data found so far. It would be more efficient to just append the new data, but I saw this as part of the potential bigger refactor I mentioned. I was going to look into this after my other refactor is reviewed, as the two will probably conflict (#688).
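The append-only refactor mentioned here could look roughly like this (a sketch with hypothetical names, not the actual cli code): open the JSONL file in append mode each iteration and write only the newly generated entries, so earlier lines are never rewritten.

```python
import json

def append_new_entries(path, new_entries):
    """Append only this iteration's entries; earlier lines are untouched.

    Append mode creates the file on the first call if it does not exist.
    """
    with open(path, "a", encoding="utf-8") as f:
        for entry in new_entries:
            f.write(json.dumps(entry) + "\n")

# usage sketch: inside the generation loop
# for i in range(num_instructions):
#     new_entries = synthesize_batch(i)   # hypothetical model call
#     append_new_entries("train.jsonl", new_entries)
```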

@xukai92
Member

xukai92 commented Apr 4, 2024

I have the same question as Abhishek.
Do we still have 100 samples if we run ilab generate by default?
If not, the new behavior isn't right, no?
I do understand the original motivation is to get rid of duplicated samples, but maybe we should just deduplicate them, or make sure the parameter rouge_threshold is set to a smaller value (https://github.com/instruct-lab/cli/blob/main/cli/lab.py#L366).

@derekhiggins
Contributor Author

> I have the same question as Abhishek. Do we still have 100 samples if we run ilab generate by default? If not, the new behavior isn't right, no? I do understand the original motivation is to get rid of duplicated samples, but maybe we should just deduplicate them, or make sure the parameter rouge_threshold is set to a smaller value (https://github.com/instruct-lab/cli/blob/main/cli/lab.py#L366).

With the new code a user will get a training file with 100 samples (sometimes 101); with the old code they would get thousands. The problem has nothing to do with the rouge threshold, and it is not that the model is producing duplicate samples. We were writing each sample (the same sample) to the file multiple times, sometimes hundreds of times. This doesn't seem to have been intended, and it adds thousands of iterations to the training stage.
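The scale of the blow-up follows from the accumulation pattern: if one sample is kept per iteration and the whole accumulated list is written out each time, iteration i contributes i lines, so n iterations produce n(n+1)/2 lines in total. A quick sanity check of that arithmetic (an idealized model; the real multiplicity depends on batch sizes):

```python
def total_lines(n):
    """Lines written when iteration i re-emits all i samples seen so far."""
    return sum(i for i in range(1, n + 1))  # equals n * (n + 1) // 2

print(total_lines(100))  # 5050 lines for only 100 unique samples
```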

@xukai92
Member

xukai92 commented Apr 4, 2024

Thanks for the explanation @derekhiggins.
It indeed looks like a bug.
Let me take a closer review of the change.

@xukai92
Member

xukai92 commented Apr 4, 2024

I guess I know what's going on.
There was a refactoring PR that changed the behavior to dump data to file on the fly.
However, with the old logic it accumulated a partially complete train set and dumped it on the fly multiple times, which is what you observed.

Successfully merging this pull request may close these issues.

The "lab generate" produces training file containing large number of duplicate rows (slows training down significantly)