
Conversation

qeternity
Contributor

Motivation

By default, Outlines does not permit newlines or whitespace formatting during constrained JSON generation. Its rationale is that smaller models otherwise have a tendency to enter infinite generation loops. I have not observed this behavior with any recent smaller models (Llama 3.1 8B and Mistral 7B v3); rather, we find material quality degradation, even with larger models (Llama 3.1 70B and Mistral Large 2), particularly when a JSON property is a list and the number of elements is variable.

If you observe the natural generation behavior of these models, they emit syntactically formatted JSON, and it strikes me as a better default to permit that behavior.

Modifications

Modify the default Outlines whitespace regex to permit newlines and additional whitespace.
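To make the change concrete, here is a minimal sketch of what relaxing the inter-token whitespace pattern means for schema-derived regexes. The constant names and the one-field object pattern below are illustrative, not the library's actual source; the point is only that a strict pattern rejects pretty-printed JSON while a relaxed one accepts both forms.

```python
import re

# Illustrative whitespace patterns (hypothetical names, not Outlines source):
STRICT_WS = r"[ ]?"        # strict default: at most one space, no newlines
RELAXED_WS = r"[\n\t ]*"   # relaxed: newlines and indentation allowed

def object_pattern(ws: str) -> str:
    # Regex for a one-field object {"name": "<string>"} with the given
    # inter-token whitespace rule.
    return rf'\{{{ws}"name"{ws}:{ws}"[^"]*"{ws}\}}'

compact = '{"name": "x"}'
pretty = '{\n  "name": "x"\n}'

# The strict pattern rejects pretty-printed output...
assert re.fullmatch(object_pattern(STRICT_WS), compact)
assert not re.fullmatch(object_pattern(STRICT_WS), pretty)
# ...while the relaxed pattern accepts both layouts.
assert re.fullmatch(object_pattern(RELAXED_WS), compact)
assert re.fullmatch(object_pattern(RELAXED_WS), pretty)
print("ok")
```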

@max99x
Contributor

max99x commented Aug 31, 2024

While this change doesn't impact the issue, one thing to note is that with non-constant whitespace like the regexes generated from schemas or pydantic models, the jump-ahead optimization ends up doing nothing. Having predefined spacing (pretty-printed or compressed) will run much faster. If that reduces quality/accuracy, instructing or fine-tuning models to produce either standard pretty-printed or compressed formatting can help.
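A small sketch of why fixed spacing helps the jump-ahead optimization: with a constant whitespace convention, every structural chunk between generated values is a fixed literal, so a constrained decoder can append it in one step instead of sampling it token by token. With variable whitespace, each gap admits many strings, so the decoder must consult the model for every position. The skeleton below is illustrative, not sglang's implementation.

```python
# With compressed (fixed) spacing, the structural skeleton between values
# is fully deterministic, so all of it can be "jumped" for free:
fixed_skeleton = ['{"name":"', '","age":', "}"]

def render(values):
    # Simulates a decoder that emits each literal chunk without any model
    # forward passes, only generating the values in between.
    out = []
    for literal, value in zip(fixed_skeleton, values + [""]):
        out.append(literal)
        out.append(value)
    return "".join(out)

print(render(["Ada", "36"]))  # {"name":"Ada","age":36}
```

With a pattern like `[\n\t ]*` in each gap instead, none of these chunks is a fixed literal, which is why the optimization ends up doing little for the whitespace itself.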

@qeternity
Contributor Author

qeternity commented Aug 31, 2024

Is it truly the case that it would do nothing? It seems it would increase the number of tokens between jump-aheads, but once the model has finished generating whitespace, the jump-ahead would still kick in. I have not dug into the implementation, and we are just now starting to evaluate the performance impact; I will report back with empirical results.

That said, I can strongly attest that restricted whitespace generation greatly reduces the quality of larger SOTA models (we have tested basically everything short of L3 405). Ideally, of course, the jump-ahead would be pretty-print-aware, but at the very least it seems reasonable to expose a quality vs. performance tradeoff to users. The alternative is to make this a per-request parameter, which seemed less desirable to us at first glance.

@max99x
Contributor

max99x commented Aug 31, 2024

It's true that jump-ahead will still kick in for field names, etc., but it is common for whitespace to account for a large chunk of the generated tokens. I'm a user, not an sglang developer, so they may have more insight, but in our usage we've seen 30%+ speed increase from using fixed spacing in JSON regexes.

FWIW, for our use case we ended up ditching JSON generation and instead use a free-text user/assistant chat structure that asks for the relevant fields (sometimes in parallel forks) and composing the JSON from the output ourselves, and at least for smaller models we've gotten much better quality output that way.
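The approach described above can be sketched as follows: query the model once per field in free text (possibly in parallel), then assemble the JSON in host code. `ask` is a stand-in for whatever chat API is in use; the questions and canned answers are purely illustrative.

```python
import json

def ask(question: str) -> str:
    # Placeholder for a free-text chat call to the model; in practice this
    # would be a user/assistant turn, possibly run as a parallel fork.
    canned = {
        "What is the person's name?": "Ada Lovelace",
        "What is the person's age?": "36",
    }
    return canned[question]

# Compose the JSON ourselves instead of constraining the model's output:
record = {
    "name": ask("What is the person's name?"),
    "age": int(ask("What is the person's age?")),
}
print(json.dumps(record))  # {"name": "Ada Lovelace", "age": 36}
```

Because the model only ever produces free text, no constrained-decoding machinery is involved, and parsing or validation happens entirely on the host side.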

@qeternity
Contributor Author

Yes, I very much agree with this, and it has been well studied that constrained JSON generation reduces overall quality (we have observed this with 4o and 3.5 Sonnet). Constrained generation is still very much a tradeoff, but what we have found is that prompting the model to "build a list" to elicit data extraction, followed by a constrained generation phase, is the happy medium.

@merrymercy merrymercy merged commit 32a4141 into sgl-project:main Sep 1, 2024
4 of 8 checks passed
qeternity added a commit to qeternity/sglang that referenced this pull request Sep 1, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025