
Conversation

@vhain commented Aug 17, 2024

Motivation

The official OpenAI API and vLLM support stream: true with parallel sampling (n > 1) and multi-prompt completions (len(prompt) > 1). However, the current version of SGLang does not support this.

The goal here is to support requests with stream: true AND (n > 1 OR len(prompt) > 1).
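For reference, a minimal sketch of the request shape this enables, assuming a local SGLang server on its default port (30000), a placeholder model name, and the openai>=1.0 Python SDK:

```python
from openai import OpenAI

# Assumed local SGLang endpoint; adjust base_url/model for your deployment.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="default",  # placeholder model name
    prompt=["The capital of France is", "The capital of Japan is"],  # len(prompt) > 1
    n=2,              # parallel sampling
    stream=True,      # combined with the above, previously unsupported
    max_tokens=16,
)

# Each streamed chunk carries a choice index identifying its (prompt, sample) pair.
for chunk in stream:
    choice = chunk.choices[0]
    print(choice.index, repr(choice.text))
```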

Modification

This PR:

  • Updates the tokenizer_manager._handle_batch_request function to support streaming responses when enabled. This is implemented by reusing tokenizer_manager._wait_for_response, which currently serves tokenizer_manager._handle_single_request.
  • The index of each response is computed as n * batch_index + parallel_sample_index, where n is the number of parallel samples, matching the behavior of the official OpenAI API and vLLM (see the sketch after this list).
  • Updates tokenizer_manager._wait_for_response to handle batch requests as well.
  • Updates the openai.v1_completions and openai.v1_chat_completions functions to correctly handle batched streaming responses.
  • Moves the stream = False assertion out of tokenizer_manager, where it applied to all batch requests, into openai.process_batch, so it now applies only to /v1/batches requests.
  • Adds relevant tests.
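As noted above, the choice index flattens (prompt, sample) pairs. A hedged sketch of the indexing scheme (not the PR's actual code; n here is the number of parallel samples per prompt):

```python
def choice_index(batch_index: int, parallel_sample_index: int, n: int) -> int:
    # Enumerate (prompt, sample) pairs the way OpenAI and vLLM order choices.
    return n * batch_index + parallel_sample_index

# Two prompts with n=2 samples each yield indices 0..3.
assert [choice_index(b, s, n=2) for b in range(2) for s in range(2)] == [0, 1, 2, 3]
```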

Notes to reviewers

I have tested this in my local environment, and will be rolling it out to a few of our production GPU nodes soon for further testing.

Checklist

  • Before submitting a PR for review, make sure it has at least passed verification in your local development environment.
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@vhain force-pushed the ryan/feat/allow-streaming-in-some-batching branch from f979ef0 to 33baccb on August 17, 2024 at 20:32
@vhain force-pushed the ryan/feat/allow-streaming-in-some-batching branch from 33baccb to e311cf4 on August 20, 2024 at 00:31
@vhain commented Aug 20, 2024

@zhyncs Rebased onto the latest main, but E2E fails for a reason unrelated to the content of this PR. Should I consider re-running it? https://github.com/sgl-project/sglang/actions/runs/10463104011/job/28974563014?pr=1134#step:3:147

@merrymercy merged commit d847681 into sgl-project:main on Aug 20, 2024 (5 checks passed)