
Conversation

bbrowning (Contributor)

Today `ilab generate` creates a large percentage (> 50% in my testing) of discarded examples because the examples do not match our expected format. Typically, the model fails to follow our complicated numbering scheme for instructions, inputs, and outputs. This change simplifies that scheme, taking inspiration from Emacs org-mode without any rigid adherence to it.

Before this change, we expected the model to continue a format like:

```
1. Instruction: Generate a joke involving three horses.
1. Input:
<noinput>
1. Output:
There once were three horses...
```

After this change, we now give the model:

```
* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
```

In my local testing, the model is able to replicate our new format with a much higher accuracy than the previous format, resulting in substantially fewer discarded generations due to format.
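
For illustration, here is a minimal sketch of how output in this format might be split back into its parts. This is not the actual parsing code from this PR; the regular expressions and the single-task example string are assumptions for the sake of the sketch:

```python
import re

# Hypothetical raw completion from the model in the new format.
raw_output = """* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses..."""

# Split the completion into individual tasks on the "* Task N" headings.
tasks = re.split(r"^\* Task \d+\n", raw_output, flags=re.MULTILINE)

for task in tasks:
    if not task.strip():
        continue
    # Split each task on its "** Instruction/Input/Output" headings; the
    # capturing group keeps the heading names in the resulting list.
    parts = re.split(r"^\*\* (Instruction|Input|Output)\n", task, flags=re.MULTILINE)
    sections = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))
    print(sections)
    # {'Instruction': 'Generate a joke involving three horses.',
    #  'Input': '<noinput>',
    #  'Output': 'There once were three horses...'}
```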

Other formats I considered but decided against:

- JSON was excluded because the model was not able to reliably handle quotes or special characters when generating JSON, even though overall it did seem to understand the expected format.
- XML was excluded because it resulted in a more verbose prompt, consuming quite a bit more of our available context window. The model did seem to have a good grasp of XML with CDATA elements used to handle special characters.
- Markdown was excluded because knowledge documents are in Markdown format, and I wanted to minimize the chances of the embedded knowledge document interfering with the expected parsing structure.

Resolves #466

bbrowning requested a review from abhi1092 as a code owner April 12, 2024 15:50
@n1hility (Member)

LGTM, will see if I get similar benefits with this PR in runs sometime today.

Occasionally the model inserts a trailing colon in the generated
example headings, so optionally allow that in the regular expression
that splits based on those headings. This is another incremental
decrease in the overall number of discarded instructions with this new
format.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@bbrowning (Contributor, Author)

I pushed one more commit with a minor additional change here that allows for a trailing colon at the end of the headings. For example, `** Instruction:` will now be accepted as well as `** Instruction` for the heading, since occasionally the default model wanted to add that trailing colon.
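
As a rough illustration of the idea (not the exact pattern from the PR), a `:?` in the splitting regular expression is what makes the trailing colon optional:

```python
import re

# ":?" makes a trailing colon optional, so "** Instruction" and
# "** Instruction:" both match the same heading.
heading_re = re.compile(r"^\*\* (Instruction|Input|Output):?\s*$", re.MULTILINE)

print(heading_re.findall("** Instruction:\n...\n** Output\n..."))
# ['Instruction', 'Output']
```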

Also, here are some generation results without this PR and with this PR. In both cases I'm generating only 10 instructions on the exact same dataset (a simple knowledge document test case I was using for other reasons).

Without this PR:
187 total discards, with 170 of those due to a problem matching the expected format.

With this PR:
2 total discards, with 1 of those due to a problem matching the expected format.

It's a dramatic difference in wall clock time here as well. Having to generate 197 total instructions only to discard 187 of them is a big waste compared to only having to generate 12 total instructions to discard 2.

@derekhiggins (Contributor) commented Apr 12, 2024

Looks pretty promising to me. I've run 2 different taxonomies with and without this PR, and in both cases it seems to be a big improvement.

File 1

orig

```
INFO 2024-04-12 13:20:21,223 generate_data.py:545 100 instructions generated, 83 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T12_45_14.log), 13 discarded due to rouge score
INFO 2024-04-12 13:20:21,223 generate_data.py:549 Generation took 2106.38s
```

new

```
INFO 2024-04-12 17:21:48,462 generate_data.py:538 102 instructions generated, 1 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T16_58_33.log), 33 discarded due to rouge score
INFO 2024-04-12 17:21:48,462 generate_data.py:542 Generation took 1395.21s
```

File 2

orig

```
INFO 2024-04-12 18:37:07,882 generate_data.py:545 101 instructions generated, 58 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T18_04_46.log), 3 discarded due to rouge score
INFO 2024-04-12 18:37:07,882 generate_data.py:549 Generation took 1941.73s
```

new

```
INFO 2024-04-12 18:03:54,115 generate_data.py:538 102 instructions generated, 2 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T17_41_15.log), 6 discarded due to rouge score
INFO 2024-04-12 18:03:54,115 generate_data.py:542 Generation took 1358.51s
```

@derekhiggins (Contributor) commented Apr 12, 2024

I can't add a comment to it as it didn't change, but this line might need updating to prevent the model from endlessly rambling (I assume that is why it was there) in the call to `utils.OpenAIDecodingArguments`:

```python
    stop=["\n5", "5.", "5."],
```

Note: your version now returns 3 synthetic samples at a time; the old way returned 2 (as it stopped at No. 5).

I missed this in the prior commits, but this updates the stop sequence
passed to the Completions API to match our new task format so that the
model only generates two new instructions per run. We ask it to
generate five but seed it with two examples, so it generates two new
ones and stops as soon as it starts generating the fifth.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
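
To make the mechanics concrete, here is a rough sketch of the prompt/stop interaction. The instruction text and the dict below are illustrative only; the actual values flow through `utils.OpenAIDecodingArguments`, as the diff below shows:

```python
# The prompt asks for five tasks but seeds the first two, so the model
# itself writes only Task 3 and Task 4. Generation halts as soon as the
# model emits the "* Task 5" heading, which is then discarded.
prompt = """Come up with 5 diverse tasks.

* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...

* Task 2
** Instruction
...
"""

# Illustrative decoding arguments mirroring the values in the diff.
decoding_args = {
    "max_tokens": 3072,
    "top_p": 0.9,  # example value; the real code passes top_p through
    "stop": ["* Task 5"],  # truncate before a fifth task begins
}
```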
```diff
@@ -281,7 +274,7 @@ def get_instructions_from_model(
         # Requests will be automatically adjusted.
         max_tokens=3072,
         top_p=top_p,
-        stop=["\n5", "5.", "5."],
+        stop=["* Task 5"],
```
@bbrowning (Contributor, Author)

@derekhiggins Great catch, thanks! I entirely missed the stop sequence, but adjusted it here, and now the model only outputs two new instructions per call to the Completions API, just like it did in the old format.

@bbrowning (Contributor, Author)

I'm continuing to see big wins with this change: running `ilab generate` in an automated pipeline to generate just 10 instructions (a simple sanity check) for a knowledge contribution, I'm seeing this patch take the generate time from 14 minutes down to just 2 minutes.

For now I'm carrying this patch locally just to have reasonable generate times as I increase the number of instructions. I believe this would be a great quality-of-life improvement for all CLI users, especially for knowledge docs, and would love to see it make it into the next release.

@oindrillac

cc: @alimaredia

@bbrowning (Contributor, Author)

Reopened as #919

Successfully merging this pull request may close: Generated model results returning an inconsistent format.