
Conversation

@vhain commented Aug 17, 2024

Motivation

The official OpenAI API and vLLM support stream: true with parallel sampling (n > 1) and multi-prompt completions (len(prompt) > 1). However, the current version of SGLang does not support this.

The goal here is to support requests with stream: true AND (n > 1 OR len(prompt) > 1).
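For reference, a minimal sketch of the request shape this enables, assuming a local SGLang server on its default port (30000), a placeholder model name, and the openai>=1.0 Python SDK:

```python
from openai import OpenAI

# Assumed local SGLang endpoint; adjust base_url/model for your deployment.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="default",  # placeholder model name
    prompt=["The capital of France is", "The capital of Japan is"],  # len(prompt) > 1
    n=2,              # parallel sampling
    stream=True,      # combined with the above, previously unsupported
    max_tokens=16,
)

# Each streamed chunk carries a choice index identifying its (prompt, sample) pair.
for chunk in stream:
    choice = chunk.choices[0]
    print(choice.index, repr(choice.text))
```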

Modification

This PR:

  • Updates the tokenizer_manager._handle_batch_request function to support streaming responses when enabled. This is implemented by reusing tokenizer_manager._wait_for_response, which currently serves tokenizer_manager._handle_single_request.
  • The index of each response is computed as n * batch_index + parallel_sample_index, where n is the number of parallel samples, matching the behavior of the official OpenAI API and vLLM (see the sketch after this list).
  • Updates tokenizer_manager._wait_for_response to handle batch requests as well.
  • Updates the openai.v1_completions and openai.v1_chat_completions functions to correctly handle batched streaming responses.
  • Moves the stream = False assertion out of tokenizer_manager, where it applied to all batch requests, into openai.process_batch, so it now applies only to /v1/batches requests.
  • Adds relevant tests.
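As noted above, the choice index flattens (prompt, sample) pairs. A hedged sketch of the indexing scheme (not the PR's actual code; n here is the number of parallel samples per prompt):

```python
def choice_index(batch_index: int, parallel_sample_index: int, n: int) -> int:
    # Enumerate (prompt, sample) pairs the way OpenAI and vLLM order choices.
    return n * batch_index + parallel_sample_index

# Two prompts with n=2 samples each yield indices 0..3.
assert [choice_index(b, s, n=2) for b in range(2) for s in range(2)] == [0, 1, 2, 3]
```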

Notes to reviewers

I have tested this in my local environment, and will be rolling it out to a few of our production GPU nodes soon for further testing.

Checklist

  • Before submitting a PR for review, make sure it has at least passed verification in your local development environment.
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@vhain force-pushed the ryan/feat/allow-streaming-in-some-batching branch from f979ef0 to 33baccb on August 17, 2024 at 20:32
@vhain force-pushed the ryan/feat/allow-streaming-in-some-batching branch from 33baccb to e311cf4 on August 20, 2024 at 00:31
@vhain commented Aug 20, 2024

@zhyncs Rebased onto the latest main, but E2E fails for a reason unrelated to the content of this PR. Should I consider re-running it? https://github.com/sgl-project/sglang/actions/runs/10463104011/job/28974563014?pr=1134#step:3:147

@merrymercy merged commit d847681 into sgl-project:main on Aug 20, 2024 (5 checks passed)