Don't duplicate training data #763
Reset the train_data list for each iteration of the synth instruction loop. When it is written to file it now only contains one of each entry. This should dramatically reduce training time.
Fixes #752
Signed-off-by: Derek Higgins <derekh@redhat.com>
There is a refactor that could be done here to make things more efficient, but this should fix the problem.
Looks good and thank you!
It's definitely better not to write the entire data set on every iteration. But with this new change, isn't it overwriting the entire file with the current train data?
It is overwriting the entire file with all of the train data found so far. It would be more efficient to just append the new data, but I saw that as part of the potential bigger refactor I mentioned. I was going to look into it after my other refactor (#688) is reviewed, as the two will probably conflict.
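As a rough illustration of the "append only the new data" idea, here is a minimal sketch. It assumes the training file is JSON Lines, and the `append_samples` helper and the `user`/`assistant` field names are hypothetical, not taken from the project:

```python
import json
import os
import tempfile


def append_samples(path, new_samples):
    """Append only the new samples to a JSON Lines file (hypothetical helper)."""
    # Mode "a" adds to the end of the file instead of rewriting it.
    with open(path, "a", encoding="utf-8") as f:
        for sample in new_samples:
            f.write(json.dumps(sample) + "\n")


# Two loop iterations, each contributing one new sample.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    append_samples(path, [{"user": "q1", "assistant": "a1"}])
    append_samples(path, [{"user": "q2", "assistant": "a2"}])
    with open(path, encoding="utf-8") as f:
        lines = [json.loads(line) for line in f]
```

Each sample is written once, so the cost per iteration depends on the number of new samples rather than on everything generated so far.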
I have the same question as Abhishek.
With the new code a user will get a training file with 100 samples (sometimes 101); with the old code they would get thousands. The problem has nothing to do with the ROUGE threshold, and it is not because the model is producing duplicate samples. We are writing each sample (the same sample) to file multiple times, sometimes hundreds of times. It doesn't seem like this was intended, and it adds thousands of iterations to the training stage.
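To make the duplication concrete, here is a simplified sketch of the pattern as I understand it from this thread. It is not the project's actual code, and the function names and the `batches` input are made up for illustration:

```python
def build_train_data_buggy(batches):
    """train_data is never reset, so earlier samples are re-added every iteration."""
    all_instructions = []
    train_data = []
    for batch in batches:
        all_instructions.extend(batch)
        # Every instruction seen so far is appended again on each pass.
        for sample in all_instructions:
            train_data.append(sample)
    return train_data


def build_train_data_fixed(batches):
    """train_data is reset on each iteration, so each sample appears once."""
    all_instructions = []
    train_data = []
    for batch in batches:
        all_instructions.extend(batch)
        train_data = []  # the reset this change introduces
        for sample in all_instructions:
            train_data.append(sample)
    return train_data


batches = [["a"], ["b"], ["c"]]
buggy = build_train_data_buggy(batches)
fixed = build_train_data_fixed(batches)
print(len(buggy), len(fixed))  # 6 3
```

With n iterations of one new sample each, the unreset list grows to n(n+1)/2 entries, which is how about 100 samples can turn into thousands of lines in the training file.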
Thanks for the explanation @derekhiggins.
I guess I know what's going on.