Conversation

@ByronHsu ByronHsu commented Nov 25, 2024

Motivation

Probability-based load balancing can disturb cache-aware routing even when the load is actually balanced. Switching to threshold-based load balancing improves the cache hit rate and also makes the behavior more deterministic.

Modifications

  1. bench_serving: for the generated shared prefix dataset, remove argument-based data caching and use automatic caching keyed on the dataset parameters, so users don't have to configure the arguments manually.
  2. Shortest-queue load balancing is triggered only when the load is imbalanced. Imbalance is defined as `(max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold`. By default only `abs_threshold` is effective, since `rel_threshold` is set to a very small value.
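The policy above can be sketched as follows. This is an illustrative Python sketch, not the actual router implementation (which lives in Rust); the default values for `abs_threshold` and `rel_threshold` are placeholders, and `cache_aware_choice` stands in for whatever worker the cache-aware policy would pick:

```python
def is_imbalanced(loads, abs_threshold=32, rel_threshold=1.0001):
    """Imbalance as defined in the PR: BOTH conditions must hold."""
    max_load, min_load = max(loads.values()), min(loads.values())
    return (max_load - min_load) > abs_threshold and max_load > min_load * rel_threshold

def select_worker(loads, cache_aware_choice):
    # Fall back to shortest-queue routing only when the load is imbalanced;
    # otherwise keep the cache-aware choice to preserve the cache hit rate.
    if is_imbalanced(loads):
        return min(loads, key=loads.get)
    return cache_aware_choice
```

Because `rel_threshold` defaults to a value barely above 1, the relative condition is almost always satisfied, so in practice only `abs_threshold` gates the fallback.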

Benchmark

Benchmark Results

TL;DR: performance is on par with v1 in the perfectly divided and low-hit-rate cases. In the single long system prompt case, it outperforms the original RR DP, though it is slower than v1 with routing prob 0.5 because one worker ends up with `abs_threshold` more requests.

Generated Shared Prefix Dataset

```shell
python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix \
    --generated-input-path ~/.cache/gen.json --generated-input-save-path ~/.cache/gen.json
```

| Method | Throughput | Cache Rate |
| --- | --- | --- |
| Original RR DP | 82,665 | 20% |
| Cache Aware v1 | 158,596.72 | 75% |
| Perfect | 160,288 | 75% |
| Cache Aware v1.1 | 158,554 | 75% |

ShareGPT Dataset

```shell
python bench_serving.py --host 127.0.0.1 --port 30000
```

| Method | Throughput | Cache Rate |
| --- | --- | --- |
| Original RR DP | 17,164 | 2% |
| Cache Aware v1 | 17,775 | 2% |
| Cache Aware v1.1 | 17,779 | 2% |

Multi Turn Dataset

```shell
python long_prompt_multi_turn.py --port 30000 --tokenizer "/shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693/" | tee client.log
```

| Method | Latency | Cache Rate |
| --- | --- | --- |
| Original RR DP | 34 | 35% |
| Cache Aware v1 | 19 | 88% |
| Perfect | 19 | 88% |
| Cache Aware v1.1 | 19 | 88% |

Generated Shared Prefix Dataset with only one system prompt

```shell
python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix --gen-num-groups 1 --gen-prompts-per-group 1024
```

| Version | Throughput | Cache Rate |
| --- | --- | --- |
| Original RR DP | 154,535.56 | — |
| Cache Aware v1 | 36,510.71 | — |
| Cache Aware v1 - routing prob 0.5 | 190,026.64 | — |
| Cache Aware v1.1 | 174,807 | — |

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@ByronHsu ByronHsu requested a review from Ying1123 as a code owner November 25, 2024 05:49
@ByronHsu (Collaborator, Author) commented:
TODO:

  1. Replace the per-request value in `running_queue` with `text.chars().count()`, so the load reflects the actual per-character workload
  2. Further tune the threshold values based on real-world workloads
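TODO item 1 can be sketched as below. This is an illustrative Python sketch of the idea only; the actual router is in Rust, where `text.chars().count()` counts Unicode scalar values (Python's `len()` on a `str` similarly counts code points). The `enqueue`/`dequeue` names are hypothetical:

```python
# Weight the running queue by character count instead of a flat +1 per
# request, so one long-prompt request counts as more load than one short one.
running_queue = {"w0": 0, "w1": 0}

def enqueue(worker, text):
    # len() on a Python str counts code points, analogous to chars().count()
    running_queue[worker] += len(text)

def dequeue(worker, text):
    running_queue[worker] -= len(text)
```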

@ByronHsu ByronHsu merged commit 4b0a1c9 into sgl-project:main Nov 25, 2024
15 of 16 checks passed
@ByronHsu (Collaborator, Author) commented:
#1732

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025