Conversation

bbrowning
Contributor

Today `ilab generate` creates a large percentage (> 50% in my testing) of discarded examples because the examples do not match our expected format. Typically, the model fails to follow our complicated numbering scheme for instructions, inputs, and outputs. This change simplifies that scheme, taking inspiration from Emacs org-mode without any rigid adherence to it.

Before this change, we expected the model to continue a format like:

```
1. Instruction: Generate a joke involving three horses.
1. Input:
<noinput>
1. Output:
There once were three horses...
```

After this change, we now give the model:

```
* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
```

In my local testing, the model is able to replicate our new format with a much higher accuracy than the previous format, resulting in substantially fewer discarded generations due to format.

Other formats I considered but decided against:

- JSON was excluded because the model was not able to reliably handle quotes or special characters when generating JSON, even though overall it did seem to understand the expected format.
- XML was excluded because it resulted in a more verbose prompt, consuming quite a bit more of our available context window. The model did seem to have a good grasp of XML with CDATA elements used to handle special characters.
- Markdown was excluded because knowledge documents are in Markdown format, and I wanted to minimize the chances of the embedded knowledge document interfering with the expected parsing structure.
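The new headings also make parsing straightforward. Below is a minimal sketch of how output in this format could be split into tasks; the `parse_tasks` helper is hypothetical and the real parsing logic in `generate_data.py` may differ.

```python
import re

# Hypothetical parser for the org-mode-inspired task format; the
# actual implementation in generate_data.py may differ.
def parse_tasks(text):
    tasks = []
    # Split on "* Task N" headings; drop anything before the first task.
    for chunk in re.split(r"\* Task \d+", text)[1:]:
        task = {}
        # Each section within a task starts with a "** Heading" line.
        for section in re.split(r"\*\* ", chunk)[1:]:
            heading, _, body = section.partition("\n")
            task[heading.strip().lower()] = body.strip()
        tasks.append(task)
    return tasks

sample = """* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
"""
print(parse_tasks(sample))
```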

Resolves #466

This is a reopened version of #857 which accidentally got closed.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Occasionally the model inserts a trailing colon in the generated
example headings, so optionally allow that in the regular expression
that splits based on those headings. This is another incremental
decrease in the overall number of discarded instructions with this new
format.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
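Allowing the optional trailing colon can be illustrated with a pattern like the one below; the exact regular expression used in the code may differ, so treat this as a sketch.

```python
import re

# Accept both "** Instruction" and "** Instruction:" as heading lines.
# The ":?" makes the trailing colon optional; this pattern is
# illustrative, not the exact one used in generate_data.py.
HEADING_RE = re.compile(r"^\*\* (Instruction|Input|Output):?\s*$", re.MULTILINE)

text = "** Instruction:\nTell a joke.\n** Output\nWhy did..."
parts = HEADING_RE.split(text)
print(parts)
```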
I missed this in the prior commits, but this updates the stop sequence
passed to the Completions API to match our new task format so that the
model only generates two new instructions per run. We ask it to
generate five but seed it with two examples; it generates two more and
stops as soon as it starts generating the fifth.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
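The effect of the stop sequence can be sketched client-side as below. Note this is only an illustration: the real change passes the stop string to the Completions API, which truncates server-side, and `"* Task 5"` is an assumed stop string inferred from the description (two seed examples, two new ones, stop at the fifth).

```python
# Assumed stop string based on the commit description; the actual
# value passed to the Completions API may differ.
STOP = "* Task 5"

def truncate_at_stop(generated, stop=STOP):
    # Drop the stop string and everything after it, mimicking what the
    # Completions API does server-side when given a stop sequence.
    idx = generated.find(stop)
    return generated if idx == -1 else generated[:idx]

out = "* Task 3\n** Instruction\n...\n* Task 4\n** Instruction\n...\n* Task 5\n** Instr"
print(truncate_at_stop(out))
```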
@bbrowning bbrowning requested a review from abhi1092 as a code owner April 19, 2024 00:52
@bbrowning bbrowning changed the title from "Generate examples with simpler heading format" to "Reduce number of discarded generated data samples" Apr 19, 2024
@bbrowning
Contributor Author

For the sake of anyone newly reviewing this PR, there is some additional useful context and data in the comments of #857 that demonstrates the large reduction in discarded samples this achieves.

@derekhiggins
Contributor

Looks pretty promising to me. I've run two different taxonomies with and without this PR, and in both cases it seems to be a big improvement.

File 1

orig

```
INFO 2024-04-12 13:20:21,223 generate_data.py:545 100 instructions generated, 83 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T12_45_14.log), 13 discarded due to rouge score
INFO 2024-04-12 13:20:21,223 generate_data.py:549 Generation took 2106.38s
```

new

```
INFO 2024-04-12 17:21:48,462 generate_data.py:538 102 instructions generated, 1 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T16_58_33.log), 33 discarded due to rouge score
INFO 2024-04-12 17:21:48,462 generate_data.py:542 Generation took 1395.21s
```

File 2

orig

```
INFO 2024-04-12 18:37:07,882 generate_data.py:545 101 instructions generated, 58 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T18_04_46.log), 3 discarded due to rouge score
INFO 2024-04-12 18:37:07,882 generate_data.py:549 Generation took 1941.73s
```

new

```
INFO 2024-04-12 18:03:54,115 generate_data.py:538 102 instructions generated, 2 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T17_41_15.log), 6 discarded due to rouge score
INFO 2024-04-12 18:03:54,115 generate_data.py:542 Generation took 1358.51s
```

@anik120 anik120 merged commit e9b5305 into instructlab:main Apr 22, 2024
@bbrowning
Contributor Author

Thanks for reviewing/merging!
