[Bug] Fix the problem of long inference timeouts when using Async rollout #1483

U-rara · 2025-05-12T02:55:09Z

Checklist Before Starting

Search for similar PR(s).

What does this PR do?

In Async rollout, the AsyncOpenAI client has a default 600-second timeout, so requests for long generations can time out before inference finishes. See details at #1138 (comment).
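The following minimal, stdlib-only simulation illustrates the failure mode. It is not the verl code or the OpenAI SDK. A hypothetical `fake_generate` coroutine sleeps in proportion to the number of tokens, and `asyncio.wait_for` stands in for the client's request timeout. Timings are scaled down so the script runs in a couple of seconds. A capped request that outlives its deadline is abandoned, while `timeout=None` waits for the generation to finish.

```python
import asyncio


async def fake_generate(num_tokens: int, seconds_per_token: float) -> str:
    """Hypothetical stand-in for a server-side generation call.

    Sleeps for num_tokens * seconds_per_token to mimic decode time.
    """
    await asyncio.sleep(num_tokens * seconds_per_token)
    return f"<{num_tokens} tokens>"


async def request(num_tokens: int, seconds_per_token: float, timeout):
    """Issue one request under a client-side deadline.

    timeout=None means no deadline: wait_for blocks until the call completes.
    """
    try:
        return await asyncio.wait_for(
            fake_generate(num_tokens, seconds_per_token), timeout=timeout
        )
    except asyncio.TimeoutError:
        # The client gave up; the generation is cancelled and the sample is lost.
        return None


async def main():
    # Scaled-down timings: 0.001 s per token stands in for real decode speed.
    short = await request(50, 0.001, timeout=0.5)         # ~0.05 s, fits under the cap
    long_capped = await request(1000, 0.001, timeout=0.5)  # ~1.0 s, exceeds the cap
    long_uncapped = await request(1000, 0.001, timeout=None)
    return short, long_capped, long_uncapped


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In the real client, the corresponding setting is the `timeout` argument of `AsyncOpenAI`. The exact value this PR uses is in the `async_server.py` change and #1138 and is not reproduced here.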

High-Level Design

See details at #1138 (comment).

Specific Changes

See details at #1138 (comment).

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

Issue Number: Fixes issue # or discussion # if any.
Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks.
Add [BREAKING] to the PR title if it breaks any API.
Update the documentation about your changes in the docs.
Add CI test(s) if necessary.

casper-hansen · 2025-05-12T05:04:17Z

Great bugfix! I think a lot of users are likely to run into this issue when using, e.g., a 32k completion length.
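Here is a back-of-the-envelope sketch of why long completions hit the limit. It assumes "32k" means 32,768 completion tokens and uses hypothetical per-request decode rates. A request times out once `completion_tokens / tokens_per_second` exceeds 600 s. A 32,768-token completion therefore needs roughly 54.6 tokens/s per request to finish in time. The sketch ignores prefill and queueing, which only make things worse, and the per-request rate may fall well below that under heavy batching.

```python
# AsyncOpenAI's default request timeout, as stated in the PR description.
DEFAULT_TIMEOUT_S = 600.0


def generation_seconds(completion_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock decode time for one request, ignoring prefill and queueing."""
    return completion_tokens / tokens_per_second


def exceeds_timeout(completion_tokens: int, tokens_per_second: float,
                    timeout_s: float = DEFAULT_TIMEOUT_S) -> bool:
    """True if the request would still be decoding when the client gives up."""
    return generation_seconds(completion_tokens, tokens_per_second) > timeout_s


def min_tokens_per_second(completion_tokens: int,
                          timeout_s: float = DEFAULT_TIMEOUT_S) -> float:
    """Slowest per-request decode rate that still finishes within timeout_s."""
    return completion_tokens / timeout_s


if __name__ == "__main__":
    tokens = 32_768  # assumption: "32k completion length" = 32,768 tokens
    print(f"{min_tokens_per_second(tokens):.1f} tok/s needed")  # 54.6 tok/s needed
    for rate in (30.0, 100.0):  # hypothetical per-request decode rates
        print(rate, generation_seconds(tokens, rate), exceeds_timeout(tokens, rate))
```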

Update async_server.py (2dce1b7)

wuxibin89 approved these changes May 12, 2025


wuxibin89 merged commit cb1adda into volcengine:main May 12, 2025
27 of 28 checks passed

casper-hansen mentioned this pull request May 26, 2025

[rollout] perf: replace AsyncOpenAI to aiohttp client in ChatCompletionScheduler #1588 (Merged)
