Replace prob based with threshold based load balancing #2170
Motivation
Probability-based load balancing (LB) can disturb cache-aware routing even when the load is actually balanced. Switching to threshold-based LB can improve the cache hit rate and also makes the behavior more deterministic.
Modifications
bench_serving: for the generated shared prefix dataset, remove argument-based data caching and use automatic caching keyed on the dataset parameters, so users don't have to configure the arguments manually.
Threshold condition: (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold. By default it only uses abs_threshold, and rel_threshold is set to a very small value.
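The two-part threshold check stated in Modifications can be sketched as follows. This is a minimal illustration, not the router's actual code: the function names, the default values, and the fallback to the least-loaded worker are all assumptions made for the example, and only the condition itself comes from the description above.

```python
from typing import Sequence


def is_imbalanced(
    loads: Sequence[int],
    abs_threshold: int,
    rel_threshold: float,
) -> bool:
    """Return True when both threshold conditions from the PR text hold."""
    max_load = max(loads)
    min_load = min(loads)
    return (max_load - min_load) > abs_threshold and max_load > min_load * rel_threshold


def pick_worker(
    loads: Sequence[int],
    cache_choice: int,
    abs_threshold: int = 32,        # hypothetical default, not from the PR
    rel_threshold: float = 1.0001,  # hypothetical "very small" value
) -> int:
    """Keep the cache-aware choice unless the load counts as imbalanced.

    Assumption: on imbalance, the request goes to the least-loaded worker.
    """
    if is_imbalanced(loads, abs_threshold, rel_threshold):
        return min(range(len(loads)), key=lambda i: loads[i])
    return cache_choice
```

With these assumed defaults, loads of `[40, 2]` exceed the absolute gap and are rebalanced, while `[10, 8]` stay with the cache-aware choice. Because the decision is a pure function of the current loads, the same inputs always give the same routing, which is the determinism the Motivation refers to.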
Benchmark Results
TLDR: performance is on par with v1 in the perfectly divided and low-hit-rate cases. In the case with one long system prompt, it can outperform the original RR DP, but it is slower than v1 because one worker will have "abs_threshold" more requests.
Generated Shared Prefix Dataset
ShareGPT Dataset
Multi Turn Dataset
Generated Shared Prefix Dataset with only one system prompt
Checklist