
Conversation

bbrowning (Contributor)

Today `ilab generate` creates a large percentage (> 50% in my testing) of discarded examples because the examples do not match our expected format. Typically, the model fails to follow our complicated numbering scheme for instructions, inputs, and outputs. This change simplifies that scheme, taking inspiration from Emacs org-mode without any rigid adherence to it.

Before this change, we expected the model to continue a format like:

```
1. Instruction: Generate a joke involving three horses.
1. Input:
<noinput>
1. Output:
There once were three horses...
```

After this change, we now give the model:

```
* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...
```

In my local testing, the model is able to replicate our new format with a much higher accuracy than the previous format, resulting in substantially fewer discarded generations due to format.
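
For illustration, here is a minimal sketch of how output in this format might be split back into its parts. This is not the actual parsing code from this PR; the regular expressions and the single-task example string are assumptions for the sake of the sketch:

```python
import re

# Hypothetical raw completion from the model in the new format.
raw_output = """* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses..."""

# Split the completion into individual tasks on the "* Task N" headings.
tasks = re.split(r"^\* Task \d+\n", raw_output, flags=re.MULTILINE)

for task in tasks:
    if not task.strip():
        continue
    # Split each task on its "** Instruction/Input/Output" headings; the
    # capturing group keeps the heading names in the resulting list.
    parts = re.split(r"^\*\* (Instruction|Input|Output)\n", task, flags=re.MULTILINE)
    sections = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))
    print(sections)
    # {'Instruction': 'Generate a joke involving three horses.',
    #  'Input': '<noinput>',
    #  'Output': 'There once were three horses...'}
```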

Other formats I considered but decided against:

- JSON was excluded because the model was not able to reliably handle quotes or special characters when generating JSON, even though overall it did seem to understand the expected format.
- XML was excluded because it resulted in a more verbose prompt, consuming quite a bit more of our available context window. The model did seem to have a good grasp of XML with CDATA elements used to handle special characters.
- Markdown was excluded because knowledge documents are in Markdown format, and I wanted to minimize the chances of the embedded knowledge document interfering with the expected parsing structure.

Resolves #466

bbrowning requested a review from abhi1092 as a code owner April 12, 2024 15:50
@n1hility (Member)

LGTM, will see if I get similar benefits with this PR in runs sometime today.

Occasionally the model inserts a trailing colon in the generated
example headings, so optionally allow that in the regular expression
that splits based on those headings. This is another incremental
decrease in the overall number of discarded instructions with this new
format.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@bbrowning (Contributor, Author)

I pushed one more commit with a minor additional change here that allows for a trailing colon at the end of the headings. For example, `** Instruction:` will now be accepted as well as `** Instruction` for the heading, since occasionally the default model wanted to add that trailing colon.
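
As a rough illustration of the idea (not the exact pattern from the PR), a `:?` in the splitting regular expression is what makes the trailing colon optional:

```python
import re

# ":?" makes a trailing colon optional, so "** Instruction" and
# "** Instruction:" both match the same heading.
heading_re = re.compile(r"^\*\* (Instruction|Input|Output):?\s*$", re.MULTILINE)

print(heading_re.findall("** Instruction:\n...\n** Output\n..."))
# ['Instruction', 'Output']
```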

Also, here are some generation results without this PR and with this PR. In both cases I'm generating only 10 instructions on the exact same dataset (a simple knowledge document test case I was using for other reasons).

Without this PR:
187 total discards, with 170 of those due to a problem matching the expected format.

With this PR:
2 total discards, with 1 of those due to a problem matching the expected format.

It's a dramatic difference in wall clock time here as well. Having to generate 197 total instructions only to discard 187 of them is a big waste compared to only having to generate 12 total instructions to discard 2.

@derekhiggins (Contributor) commented Apr 12, 2024

Looks pretty promising to me. I've run 2 different taxonomies with and without this PR, and in both cases it seems to be a big improvement.

File 1

orig

```
INFO 2024-04-12 13:20:21,223 generate_data.py:545 100 instructions generated, 83 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T12_45_14.log), 13 discarded due to rouge score
INFO 2024-04-12 13:20:21,223 generate_data.py:549 Generation took 2106.38s
```

new

```
INFO 2024-04-12 17:21:48,462 generate_data.py:538 102 instructions generated, 1 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T16_58_33.log), 33 discarded due to rouge score
INFO 2024-04-12 17:21:48,462 generate_data.py:542 Generation took 1395.21s
```

File 2

orig

```
INFO 2024-04-12 18:37:07,882 generate_data.py:545 101 instructions generated, 58 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T18_04_46.log), 3 discarded due to rouge score
INFO 2024-04-12 18:37:07,882 generate_data.py:549 Generation took 1941.73s
```

new

```
INFO 2024-04-12 18:03:54,115 generate_data.py:538 102 instructions generated, 2 discarded due to format (see generated/discarded_merlinite-7b-Q4_K_M_2024-04-12T17_41_15.log), 6 discarded due to rouge score
INFO 2024-04-12 18:03:54,115 generate_data.py:542 Generation took 1358.51s
```

@derekhiggins (Contributor) commented Apr 12, 2024

I can't add a comment to it as it didn't change, but this line might need updating to prevent the model from endlessly rambling (I assume that is why it was there) in the call to `utils.OpenAIDecodingArguments`:

```python
    stop=["\n5", "5.", "5."],
```

Note: your version now returns 3 synthetic samples at a time; the old way returned 2 (as it stopped at No. 5).

I missed this in the prior commits, but this updates the stop sequence
passed to the Completions API to match our new task format so that the
model only generates two new instructions per run. We ask it to
generate five but seed it with two examples, so it generates two new
ones and stops as soon as it starts generating the fifth.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
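
To make the mechanics concrete, here is a rough sketch of the prompt/stop interaction. The instruction text and the dict below are illustrative only; the actual values flow through `utils.OpenAIDecodingArguments`, as the diff below shows:

```python
# The prompt asks for five tasks but seeds the first two, so the model
# itself writes only Task 3 and Task 4. Generation halts as soon as the
# model emits the "* Task 5" heading, which is then discarded.
prompt = """Come up with 5 diverse tasks.

* Task 1
** Instruction
Generate a joke involving three horses.
** Input
<noinput>
** Output
There once were three horses...

* Task 2
** Instruction
...
"""

# Illustrative decoding arguments mirroring the values in the diff.
decoding_args = {
    "max_tokens": 3072,
    "top_p": 0.9,  # example value; the real code passes top_p through
    "stop": ["* Task 5"],  # truncate before a fifth task begins
}
```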
```diff
@@ -281,7 +274,7 @@ def get_instructions_from_model(
         # Requests will be automatically adjusted.
         max_tokens=3072,
         top_p=top_p,
-        stop=["\n5", "5.", "5."],
+        stop=["* Task 5"],
```
@bbrowning (Contributor, Author)

@derekhiggins Great catch, thanks! I entirely missed the stop sequence, but adjusted it here, and now the model only outputs two new instructions per call to the Completions API, just like it did in the old format.

@bbrowning (Contributor, Author)

I'm continuing to see big wins with this change: running `ilab generate` in an automated pipeline to generate just 10 instructions (a simple sanity check) for a knowledge contribution, I'm seeing this patch take the generate time from 14 minutes down to just 2 minutes.

For now I'm carrying this patch locally just to have reasonable generate times as I increase the number of instructions. I believe this would be a great quality-of-life improvement for all CLI users, especially for knowledge docs, and would love to see it make it into the next release.

@oindrillac

cc: @alimaredia

@bbrowning (Contributor, Author)

Reopened as #919

Successfully merging this pull request may close: Generated model results returning an inconsistent format.