Generate examples with simpler heading format #857
Conversation
Today `ilab generate` creates a large percentage (> 50% in my testing) of discarded examples because the examples do not match our expected format. Typically, the model fails to follow our complicated numbering scheme for instructions, inputs, and outputs. This change simplifies that scheme, taking inspiration from Emacs org-mode without any rigid adherence to it.

Before this change, we expected the model to continue a format like:

```
1. Instruction: Generate a joke involving three horses.
1. Input: <noinput>
1. Output: There once were three horses...
```

After this change, we now give the model:

```
* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
```

In my local testing, the model is able to replicate our new format with a much higher accuracy than the previous format, resulting in substantially fewer discarded generations due to format.

Other formats I considered but decided against:

- JSON was excluded because the model was not able to reliably handle quotes or special characters when generating JSON, even though overall it did seem to understand the expected format.
- XML was excluded because it resulted in a more verbose prompt, consuming quite a bit more of our available context window. The model did seem to have a good grasp of XML, with CDATA elements used to handle special characters.
- Markdown was excluded because knowledge documents are in Markdown format, and I wanted to minimize the chances of the embedded knowledge document interfering with the expected parsing structure.

Resolves #466

Signed-off-by: Ben Browning <bbrownin@redhat.com>
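For illustration only, here is a minimal sketch of how output in the new heading format could be split back into instruction/input/output triples. This is not the CLI's actual parsing code; the `TASK_HEADING`, `FIELD_HEADING`, and `parse_tasks` names are hypothetical.

```python
import re

# Hypothetical parser for the org-mode-inspired format; the real CLI code
# may differ in naming and details.
TASK_HEADING = re.compile(r"^\* Task \d+\s*$", re.MULTILINE)
FIELD_HEADING = re.compile(r"^\*\* (Instruction|Input|Output)\s*$", re.MULTILINE)

def parse_tasks(generated_text):
    """Split generated text into a list of {instruction, input, output} dicts."""
    tasks = []
    for chunk in TASK_HEADING.split(generated_text):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The capture group keeps the heading names, so split() alternates
        # between heading name and the text that follows it.
        parts = FIELD_HEADING.split(chunk)
        fields = {name.lower(): body.strip()
                  for name, body in zip(parts[1::2], parts[2::2])}
        if {"instruction", "input", "output"} <= fields.keys():
            tasks.append(fields)
    return tasks

sample = """* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
"""
print(parse_tasks(sample))
# [{'instruction': 'Generate a joke involving three horses.',
#   'input': '<noinput>', 'output': 'There once were three horses...'}]
```

Because each heading sits on its own line, parsing no longer depends on the model reproducing an exact numbering scheme.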
LGTM, will see if I get similar benefits with this PR in runs sometime today.
Occasionally the model inserts a trailing colon in the generated example headings, so optionally allow that in the regular expression that splits based on those headings. This is another incremental decrease in the overall number of discarded instructions with this new format.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
I pushed one more commit with a minor additional change here that allows for a trailing colon at the end of the headings (for example, `** Output:` instead of `** Output`).

Also, here are some generation results without this PR and with this PR. In both cases I'm generating only 10 instructions on the exact same dataset (a simple knowledge document test case I was using for other reasons).

Without this PR
With this PR

It's a dramatic difference in wall clock time here as well. Having to generate 197 total instructions only to discard 187 of them is a big waste compared to only having to generate 12 total instructions to discard 2.
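Building on the sketch above, the only change needed to tolerate the trailing colon is an optional `:?` in the heading pattern. Again, this is a hypothetical pattern, not the CLI's exact regex:

```python
import re

# Hypothetical heading pattern; the real splitting regex in the CLI may differ.
FIELD_HEADING = re.compile(r"^\*\* (Instruction|Input|Output):?\s*$", re.MULTILINE)

# Both variants of a generated heading now match:
assert FIELD_HEADING.match("** Output")
assert FIELD_HEADING.match("** Output:")
```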
Looks pretty promising to me.

File 1: orig / new
File 2: orig / new
I can't add a comment to it as it didn't change, but this line might need updating to prevent the model from endlessly rambling (I assume that is why it was there).

Note: your version now returns 3 synthetic samples at a time; the old way returned 2 (as it stopped at No. 5).
I missed this in the prior commits, but this updates the stop sequence passed to the Completions API to match our new task format so that the model only generates two new instructions per run. We ask it to generate five but seed it with two examples; it generates two more and stops as soon as it starts generating the fifth.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
```diff
@@ -281,7 +274,7 @@ def get_instructions_from_model(
     # Requests will be automatically adjusted.
     max_tokens=3072,
     top_p=top_p,
-    stop=["\n5", "5.", "5."],
+    stop=["* Task 5"],
```
@derekhiggins Great catch, thanks! I entirely missed the stop sequence, but adjusted it here, and now the model only outputs two new instructions per call to the Completions API, just like it did in the old format.
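To make the effect of the new stop sequence concrete, here is a rough sketch of an OpenAI-compatible completions call. The endpoint, model name, and prompt contents below are placeholders, and the real `get_instructions_from_model()` has more parameters and error handling:

```python
from openai import OpenAI

# Placeholder endpoint and credentials for a locally served model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# Placeholder prompt: in practice this holds the generation instructions plus
# two seed examples formatted with the new "* Task N" headings.
prompt = (
    "Come up with 5 diverse tasks in the format shown.\n\n"
    "* Task 1\n** Instruction\n...\n** Input\n...\n** Output\n...\n\n"
    "* Task 2\n** Instruction\n...\n** Input\n...\n** Output\n...\n"
)

response = client.completions.create(
    model="example-model",  # placeholder model name
    prompt=prompt,
    max_tokens=3072,
    top_p=0.9,
    # Generation halts as soon as the model begins a fifth task, so each call
    # yields exactly two new tasks (Task 3 and Task 4) beyond the two seeds.
    stop=["* Task 5"],
)
print(response.choices[0].text)
```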
I'm continuing to see big wins with this change in my local generate runs. For now I'm carrying this patch locally just to have reasonable generate times as I increase the number of instructions. I believe this would be a great quality-of-life improvement for all CLI users, especially for knowledge docs, and would love to see it make it into the next release.
cc: @alimaredia |
Reopened as #919 |