feat: predictive active blocks for routing without load metrics #1731

PeaBrane · 2025-07-02T10:11:49Z

Overview:

The core is a simple data structure, HashMap<SequenceHash, HashSet<RequestId>>, which stores the active blocks. It should make every operation we need O(1), namely reading and writing the number of active blocks. It extends to multiple workers by giving each worker its own instance on a dedicated OS thread, with read and write requests sent over channels.
The data structure stays locked for the whole scheduling cycle: the read, the best-worker computation, and the update. This avoids race conditions and stale reads, and in testing it gave considerably better results.
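A minimal sketch of how such a map can track active blocks, for illustration only. The type aliases and the names `ActiveBlocks`, `add_request`, `free_request`, and `active_blocks` are assumptions and are not taken from the PR.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical aliases; the real types in the PR may differ.
type SequenceHash = u64;
type RequestId = String;

#[derive(Default)]
struct ActiveBlocks {
    // Each block hash maps to the set of requests currently using it.
    blocks: HashMap<SequenceHash, HashSet<RequestId>>,
}

impl ActiveBlocks {
    /// Mark each block as used by `request`; amortized O(1) per block.
    fn add_request(&mut self, request: &RequestId, hashes: &[SequenceHash]) {
        for &h in hashes {
            self.blocks.entry(h).or_default().insert(request.clone());
        }
    }

    /// Drop `request`'s references; a block with no remaining users is removed.
    fn free_request(&mut self, request: &RequestId, hashes: &[SequenceHash]) {
        for h in hashes {
            let now_unused = match self.blocks.get_mut(h) {
                Some(users) => {
                    users.remove(request);
                    users.is_empty()
                }
                None => false,
            };
            if now_unused {
                self.blocks.remove(h);
            }
        }
    }

    /// Number of distinct active blocks; `HashMap::len` is O(1).
    fn active_blocks(&self) -> usize {
        self.blocks.len()
    }
}
```

Keeping a set of request ids per block means a block shared by several requests is counted once and only disappears when its last user is freed.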
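The locking pattern can be sketched as follows. This is an assumption-laden illustration: `Scheduler`, `loads`, and `schedule` are hypothetical names, and "pick the least-loaded worker" stands in for whatever best-worker computation the PR actually performs.

```rust
use std::sync::Mutex;

// Hypothetical: one predicted active-block count per worker, indexed by worker id.
struct Scheduler {
    loads: Mutex<Vec<usize>>,
}

impl Scheduler {
    fn new(num_workers: usize) -> Self {
        Scheduler { loads: Mutex::new(vec![0; num_workers]) }
    }

    /// Read the loads, pick the least-loaded worker, and record the new blocks,
    /// all under one lock guard so no other scheduling call can interleave.
    fn schedule(&self, new_blocks: usize) -> Option<usize> {
        let mut loads = self.loads.lock().unwrap();
        let worker = loads
            .iter()
            .enumerate()
            .min_by_key(|&(_, &load)| load)
            .map(|(id, _)| id)?;
        loads[worker] += new_blocks;
        Some(worker)
    }
}
```

Because the guard is held from the read through the update, two concurrent calls cannot both observe the same stale load and pick the same worker.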
The KvPushRouter has visibility into the output stream, so it is able to update the active blocks when an output token is generated, and free (or deref) the corresponding active blocks when the output stream is completed.

Turns out the performance is not bad at all! Setup: 8 x 8b model, L40S backend, 7000 ISL, 100 OSL, 10 prefix prompts of half ISL.

Same as above but with varied ISL (introducing realistic randomness), where KV routing is expected to perform slightly better, and it did.

And with varied OSL (even more randomness)

(Note that the Python bindings for KvRouter are removed for now because they are not currently used; they will be reworked and reintroduced in future PRs.)

Closes #1723

Summary by CodeRabbit

New Features
- Added advanced context-aware scheduling and token tracking for request streams, improving resource management and efficiency.
- Introduced a new configuration option for controlling worker selection randomness via a "temperature" setting.
- Enhanced metrics with new predictive load tracking and improved endpoint collection and filtering.
Refactor
- Simplified and updated scheduling logic, consolidating configuration parameters and improving concurrency safety.
- Deprecated several legacy Prometheus metrics and streamlined update logic for active KV blocks.
Bug Fixes
- Improved handling of worker selection to support deterministic behavior when randomness is disabled.
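The summary does not say how the "temperature" setting is applied. One common reading is softmax sampling over per-worker costs, falling back to a deterministic argmin when the temperature is zero. The sketch below assumes that reading; `select_worker`, `costs`, and the caller-supplied uniform draw `u` are all hypothetical.

```rust
/// Pick an index from `costs` (lower is better).
/// `temperature <= 0.0` returns the first argmin deterministically; otherwise
/// sample from softmax(-cost / temperature) using `u`, a uniform draw in [0, 1).
fn select_worker(costs: &[f64], temperature: f64, u: f64) -> Option<usize> {
    if costs.is_empty() {
        return None;
    }
    let min = costs.iter().cloned().fold(f64::INFINITY, f64::min);
    if temperature <= 0.0 {
        return costs.iter().position(|&c| c == min);
    }
    // Subtract the minimum before exponentiating for numerical stability.
    let weights: Vec<f64> = costs
        .iter()
        .map(|&c| (-(c - min) / temperature).exp())
        .collect();
    let total: f64 = weights.iter().sum();
    // Walk the cumulative distribution until it exceeds `u`.
    let mut acc = 0.0;
    for (i, w) in weights.iter().enumerate() {
        acc += w / total;
        if u < acc {
            return Some(i);
        }
    }
    // Guard against rounding when `u` is very close to 1.
    Some(costs.len() - 1)
}
```

Passing `u` in from the caller keeps the function free of external crates and makes the sampling path reproducible in tests.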

PeaBrane · 2025-07-08T07:16:19Z

Looks like the relative TTFT is better at a lower concurrency of 10 instead of 20, but the ITL is worse. This suggests that concurrency may still be a major factor in router performance; a decode queue could potentially close this gap.

Signed-off-by: Yan Ru Pei <yanrpei@gmail.com> Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>

initial commit, data structure, lightly tested

1e867f6

pull-request-size bot added the size/L label Jul 2, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 10:11 Inactive

github-actions bot added the feat label Jul 2, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 10:12 Inactive

clippy

415cf39

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 10:24 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 10:25 Inactive

multi workers, one per thread

8c49f44

pull-request-size bot added size/XL and removed size/L labels Jul 2, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 20:31 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 20:32 Inactive

update_workers instead of remove_worker

87c1cb0

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 22:41 Inactive

poll_worker_ids in metrics aggregator

2b7b2c1

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 22:50 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 2, 2025 22:52 Inactive

potential blocks

50c9371

copy-pr-bot bot temporarily deployed to GITLAB July 3, 2025 01:48 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 3, 2025 01:49 Inactive

predictive load metrics (note no request updates yet at all)

6cee1fe

pull-request-size bot added size/XXL and removed size/XL labels Jul 3, 2025

copy-pr-bot bot temporarily deployed to GITLAB July 3, 2025 04:46 Inactive

small clippy

ccd87de

copy-pr-bot bot temporarily deployed to GITLAB July 3, 2025 04:49 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 3, 2025 04:51 Inactive

passed rust compiler

bee206c

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 01:01 Inactive

remove double free test

6e270cf

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 01:28 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 01:29 Inactive

typo in doc

08c979b

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 01:52 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 01:53 Inactive

update vllm_inc.py to use new metrics publisher api

0065743

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 04:08 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 04:09 Inactive

modify vllm patch for new metrics api

749fab1

PeaBrane requested a review from richardhuo-nv as a code owner July 8, 2025 04:46

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 04:47 Inactive

rm extra space

e387bd6

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 04:53 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 04:54 Inactive

revert the deleted patches

1735f32

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 05:41 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 05:42 Inactive

get imports back

3b1bdc3

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 06:16 Inactive

copy-pr-bot bot temporarily deployed to GITLAB July 8, 2025 06:17 Inactive

PeaBrane merged commit 84e71e2 into main Jul 8, 2025
14 of 15 checks passed

PeaBrane deleted the rupei/router-predictive branch July 8, 2025 07:18

atchernych pushed a commit that referenced this pull request Jul 9, 2025

feat: predictive active blocks for routing without load metrics (#1731)

a042a3a

Signed-off-by: Yan Ru Pei <yanrpei@gmail.com> Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>

This was referenced Jul 10, 2025

feat: update active blocks in chunks only when necessary #1848

Merged

feat: prefill aware routing #1895

Merged

coderabbitai bot mentioned this pull request Aug 6, 2025

feat: graceful drop of recorder #2326

Closed
