Conversation

@ByronHsu ByronHsu commented Nov 25, 2024

Motivation

Probability-based load balancing can disturb cache-aware routing even when the load is actually balanced. Switching to threshold-based load balancing improves the cache hit rate and also makes the behavior more deterministic.

Modifications

  1. bench_serving: for the generated shared prefix dataset, remove argument-based data caching and use automatic caching keyed on the dataset parameters, so users don't have to configure the arguments manually.
  2. Shortest-queue load balancing is triggered only when the load is imbalanced. Imbalance is defined as `(max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold`. By default only `abs_threshold` is effective, since `rel_threshold` is set to a very small value.
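The policy above can be sketched as follows. This is an illustrative Python sketch, not the actual router implementation (which lives in Rust); the default values for `abs_threshold` and `rel_threshold` are placeholders, and `cache_aware_choice` stands in for whatever worker the cache-aware policy would pick:

```python
def is_imbalanced(loads, abs_threshold=32, rel_threshold=1.0001):
    """Imbalance as defined in the PR: BOTH conditions must hold."""
    max_load, min_load = max(loads.values()), min(loads.values())
    return (max_load - min_load) > abs_threshold and max_load > min_load * rel_threshold

def select_worker(loads, cache_aware_choice):
    # Fall back to shortest-queue routing only when the load is imbalanced;
    # otherwise keep the cache-aware choice to preserve the cache hit rate.
    if is_imbalanced(loads):
        return min(loads, key=loads.get)
    return cache_aware_choice
```

Because `rel_threshold` defaults to a value barely above 1, the relative condition is almost always satisfied, so in practice only `abs_threshold` gates the fallback.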

Benchmark

Benchmark Results

TL;DR: performance is on par with v1 in the perfectly divided and low-hit-rate cases. In the single long system prompt case, it outperforms the original RR DP, though it is slower than v1 with routing prob 0.5 because one worker ends up with `abs_threshold` more requests.

Generated Shared Prefix Dataset

```shell
python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix \
    --generated-input-path ~/.cache/gen.json --generated-input-save-path ~/.cache/gen.json
```

| Method | Throughput | Cache Rate |
| --- | --- | --- |
| Original RR DP | 82,665 | 20% |
| Cache Aware v1 | 158,596.72 | 75% |
| Perfect | 160,288 | 75% |
| Cache Aware v1.1 | 158,554 | 75% |

ShareGPT Dataset

```shell
python bench_serving.py --host 127.0.0.1 --port 30000
```

| Method | Throughput | Cache Rate |
| --- | --- | --- |
| Original RR DP | 17,164 | 2% |
| Cache Aware v1 | 17,775 | 2% |
| Cache Aware v1.1 | 17,779 | 2% |

Multi Turn Dataset

```shell
python long_prompt_multi_turn.py --port 30000 --tokenizer "/shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693/" | tee client.log
```

| Method | Latency | Cache Rate |
| --- | --- | --- |
| Original RR DP | 34 | 35% |
| Cache Aware v1 | 19 | 88% |
| Perfect | 19 | 88% |
| Cache Aware v1.1 | 19 | 88% |

Generated Shared Prefix Dataset with only one system prompt

```shell
python bench_serving.py --host 127.0.0.1 --port 30000 --dataset-name generated-shared-prefix --gen-num-groups 1 --gen-prompts-per-group 1024
```

| Version | Throughput | Cache Rate |
| --- | --- | --- |
| Original RR DP | 154,535.56 | — |
| Cache Aware v1 | 36,510.71 | — |
| Cache Aware v1 - routing prob 0.5 | 190,026.64 | — |
| Cache Aware v1.1 | 174,807 | — |

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@ByronHsu ByronHsu requested a review from Ying1123 as a code owner November 25, 2024 05:49
@ByronHsu (Collaborator, Author) commented:
TODO:

  1. Replace the per-request value in `running_queue` with `text.chars().count()`, so the load reflects the actual per-character workload
  2. Further tune the threshold values based on real-world workloads
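TODO item 1 can be sketched as below. This is an illustrative Python sketch of the idea only; the actual router is in Rust, where `text.chars().count()` counts Unicode scalar values (Python's `len()` on a `str` similarly counts code points). The `enqueue`/`dequeue` names are hypothetical:

```python
# Weight the running queue by character count instead of a flat +1 per
# request, so one long-prompt request counts as more load than one short one.
running_queue = {"w0": 0, "w1": 0}

def enqueue(worker, text):
    # len() on a Python str counts code points, analogous to chars().count()
    running_queue[worker] += len(text)

def dequeue(worker, text):
    running_queue[worker] -= len(text)
```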

@ByronHsu ByronHsu merged commit 4b0a1c9 into sgl-project:main Nov 25, 2024
15 of 16 checks passed
@ByronHsu (Collaborator, Author) commented:
#1732

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025