Allow new lines during JSON generation #1277
Conversation
While this change doesn't impact the issue, one thing to note is that with non-constant whitespace, like the regexes generated from schemas or pydantic models, the jump-ahead optimization ends up doing nothing. Having predefined spacing (pretty-printed or compressed) will run much faster. If that reduces quality/accuracy, instructing or fine-tuning models to produce either standard pretty-printed or compressed formatting can help.
Is it truly the case that it would do nothing? It would certainly increase the number of inter-jump-ahead tokens, but once the model has finished generating whitespace, wouldn't the optimization still kick in? I have not dug into the implementation, and we are just now starting to evaluate the performance impact; I will report back with empirical results. That said, I can strongly attest that restricted whitespace generation greatly reduces the quality of larger SOTA models (we have tested basically everything short of Llama 3 405B). Ideally, of course, the jump-ahead would be pretty-print aware, but at the very least it seems reasonable to expose a quality-vs-performance tradeoff to users. The alternative is to make this a per-request parameter, which seemed less desirable to us at first glance.
It's true that jump-ahead will still kick in for field names, etc., but it is common for whitespace to account for a large chunk of the generated tokens. I'm a user, not an sglang developer, so they may have more insight, but in our usage we've seen a 30%+ speed increase from using fixed spacing in JSON regexes. FWIW, for our use case we ended up ditching JSON generation entirely: we use a free-text user/assistant chat structure that asks for the relevant fields (sometimes in parallel forks) and compose the JSON from the output ourselves. At least for smaller models, we've gotten much better quality output that way.
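To illustrate the point about fixed spacing: when the constraining automaton allows exactly one character at a position, the decoder can emit it without sampling, but a variable-whitespace class like `[\n ]*` makes the next character ambiguous. This is a toy sketch of that behavior, not sglang's actual jump-ahead implementation; all names here are illustrative.

```python
# Toy model of jump-ahead (not sglang's implementation): each position in
# the constrained output is represented by the set of characters the
# grammar allows there.
def forced_prefix(positions):
    """Return the prefix that can be emitted without sampling: we keep
    advancing while exactly one character is legal at the position."""
    forced = []
    for allowed in positions:
        if len(allowed) == 1:                  # only one legal character:
            forced.append(next(iter(allowed)))  # emit it for free
        else:
            break                               # model must choose; stop
    return "".join(forced)

# Fixed pretty-printed spacing: every whitespace character is determined,
# so the decoder can jump through '{', newline, indentation, and the
# opening quote of the first key.
fixed = [{"{"}, {"\n"}, {" "}, {" "}, {'"'}]

# Variable spacing ([\n ]*): after '{' the automaton may loop on
# whitespace or move on to the key, so the next character is ambiguous.
variable = [{"{"}, {"\n", " ", '"'}]

print(repr(forced_prefix(fixed)))     # jumps through the whole prefix
print(repr(forced_prefix(variable)))  # stops immediately after '{'
```

With fixed spacing the whitespace tokens cost nothing; with a variable class each one is a real decoding step, which is consistent with the speedups described above.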
Yes, I very much agree with this, and it's been well studied that constrained JSON generation reduces overall quality (we have observed this on 4o and 3.5 Sonnet). Constrained generation is still very much a trade-off, but what we have found is that prompting the model to "build a list" to elicit data extraction, followed by a constrained generation phase, is the happy medium.
Motivation
By default, Outlines does not permit newlines and whitespace formatting during constrained JSON generation. Their rationale is that smaller models otherwise have a tendency to enter infinite generation loops. I have not observed this behavior with any recent smaller models (Llama 3.1 8B and Mistral 7B v3); rather, we find material quality degradation, even with larger models (Llama 3.1 70B and Mistral Large 2), particularly when the JSON property is a list and the number of elements is variable.
If you observe the natural generation behavior of these models, they emit syntactically formatted JSON, and it strikes me as a better default to permit such behavior.
Modifications
Modify the default Outlines whitespace regex to permit newlines and additional whitespace.
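As a sketch of the change (the constant names are assumptions for illustration, not the actual sglang/Outlines source), the difference is in the whitespace sub-pattern interleaved between JSON tokens when the schema regex is built:

```python
import re

# Hypothetical whitespace sub-patterns (illustrative names, not the
# actual library constants):
STRICT_WS = r"[ ]?"       # old default: at most one space, no newlines
RELAXED_WS = r"[\n\t ]*"  # proposed default: newlines and indentation allowed

def object_pattern(ws: str) -> re.Pattern:
    # Regex for a tiny schema fragment, {"a": <integer>}, with the
    # whitespace sub-pattern interleaved between tokens.
    return re.compile(r"\{" + ws + r'"a"' + ws + r":" + ws + r"\d+" + ws + r"\}")

compact = '{"a": 1}'
pretty = '{\n  "a": 1\n}'

print(bool(object_pattern(STRICT_WS).fullmatch(compact)))   # True
print(bool(object_pattern(STRICT_WS).fullmatch(pretty)))    # False: newlines rejected
print(bool(object_pattern(RELAXED_WS).fullmatch(pretty)))   # True: pretty-printing allowed
```

With the relaxed sub-pattern, the constrained decoder can accept the pretty-printed output these models produce naturally, at the cost of the jump-ahead considerations discussed in the conversation above.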