Conversation

bbrowning
Contributor

Today `ilab generate` creates a large percentage (> 50% in my testing) of discarded examples because the examples do not match our expected format. Typically, the model fails to follow our complicated numbering scheme for instructions, inputs, and outputs. This change simplifies that scheme, taking inspiration from Emacs org-mode without any rigid adherence to it.

Before this change, we expected the model to continue a format like:

```
1. Instruction: Generate a joke involving three horses.
1. Input:
<noinput>
1. Output:
There once were three horses...
```

After this change, we now give the model:

```
* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
```

In my local testing, the model is able to replicate our new format with a much higher accuracy than the previous format, resulting in substantially fewer discarded generations due to format.

Other formats I considered but decided against:

- JSON was excluded because the model was not able to reliably handle quotes or special characters when generating JSON, even though overall it did seem to understand the expected format.
- XML was excluded because it resulted in a more verbose prompt, consuming quite a bit more of our available context window. The model did seem to have a good grasp of XML with CDATA elements used to handle special characters.
- Markdown was excluded because knowledge documents are in Markdown format, and I wanted to minimize the chances of the embedded knowledge document interfering with the expected parsing structure.
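The new headings also make parsing straightforward. Below is a minimal sketch of how output in this format could be split into tasks; the `parse_tasks` helper is hypothetical and the real parsing logic in `generate_data.py` may differ.

```python
import re

# Hypothetical parser for the org-mode-inspired task format; the
# actual implementation in generate_data.py may differ.
def parse_tasks(text):
    tasks = []
    # Split on "* Task N" headings; drop anything before the first task.
    for chunk in re.split(r"\* Task \d+", text)[1:]:
        task = {}
        # Each section within a task starts with a "** Heading" line.
        for section in re.split(r"\*\* ", chunk)[1:]:
            heading, _, body = section.partition("\n")
            task[heading.strip().lower()] = body.strip()
        tasks.append(task)
    return tasks

sample = """* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
"""
print(parse_tasks(sample))
```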

Resolves #466

This is a reopened version of #857 which accidentally got closed.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Occasionally the model inserts a trailing colon in the generated
example headings, so optionally allow that in the regular expression
that splits based on those headings. This is another incremental
decrease in the overall number of discarded instructions with this new
format.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
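Allowing the optional trailing colon can be illustrated with a pattern like the one below; the exact regular expression used in the code may differ, so treat this as a sketch.

```python
import re

# Accept both "** Instruction" and "** Instruction:" as heading lines.
# The ":?" makes the trailing colon optional; this pattern is
# illustrative, not the exact one used in generate_data.py.
HEADING_RE = re.compile(r"^\*\* (Instruction|Input|Output):?\s*$", re.MULTILINE)

text = "** Instruction:\nTell a joke.\n** Output\nWhy did..."
parts = HEADING_RE.split(text)
print(parts)
```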
I missed this in the prior commits, but this updates the stop sequence
passed to the Completions API to match our new task format so that the
model only generates two new instructions per run. We ask it to
generate five but seed it with two examples; it generates two more and
stops as soon as it starts generating the fifth.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
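The effect of the stop sequence can be sketched client-side as below. Note this is only an illustration: the real change passes the stop string to the Completions API, which truncates server-side, and `"* Task 5"` is an assumed stop string inferred from the description (two seed examples, two new ones, stop at the fifth).

```python
# Assumed stop string based on the commit description; the actual
# value passed to the Completions API may differ.
STOP = "* Task 5"

def truncate_at_stop(generated, stop=STOP):
    # Drop the stop string and everything after it, mimicking what the
    # Completions API does server-side when given a stop sequence.
    idx = generated.find(stop)
    return generated if idx == -1 else generated[:idx]

out = "* Task 3\n** Instruction\n...\n* Task 4\n** Instruction\n...\n* Task 5\n** Instr"
print(truncate_at_stop(out))
```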
@bbrowning bbrowning requested a review from abhi1092 as a code owner April 19, 2024 00:52
@bbrowning bbrowning changed the title from "Generate examples with simpler heading format" to "Reduce number of discarded generated data samples" Apr 19, 2024
@bbrowning
Contributor Author

For the sake of anyone newly reviewing this PR, there is some additional useful context and data in the comments of #857 that demonstrates the large reduction in discarded samples this achieves.

@derekhiggins
Contributor

Looks pretty promising to me. I've run two different taxonomies with and without this PR, and in both cases it seems to be a big improvement.

File 1

orig

```
INFO 2024-04-12 13:20:21,223 generate_data.py:545 100 instructions generated, 83 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T12_45_14.log), 13 discarded due to rouge score
INFO 2024-04-12 13:20:21,223 generate_data.py:549 Generation took 2106.38s
```

new

```
INFO 2024-04-12 17:21:48,462 generate_data.py:538 102 instructions generated, 1 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T16_58_33.log), 33 discarded due to rouge score
INFO 2024-04-12 17:21:48,462 generate_data.py:542 Generation took 1395.21s
```

File 2

orig

```
INFO 2024-04-12 18:37:07,882 generate_data.py:545 101 instructions generated, 58 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T18_04_46.log), 3 discarded due to rouge score
INFO 2024-04-12 18:37:07,882 generate_data.py:549 Generation took 1941.73s
```

new

```
INFO 2024-04-12 18:03:54,115 generate_data.py:538 102 instructions generated, 2 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T17_41_15.log), 6 discarded due to rouge score
INFO 2024-04-12 18:03:54,115 generate_data.py:542 Generation took 1358.51s
```

@anik120 anik120 merged commit e9b5305 into instructlab:main Apr 22, 2024
@bbrowning
Contributor Author

Thanks for reviewing/merging!
