

PeaBrane
Contributor

@PeaBrane commented Jul 2, 2025

Overview:

  1. The core is really a simple data structure, HashMap<SequenceHash, HashSet<RequestId>>, storing the active blocks. This makes all the operations we need O(1), namely reading and writing the number of active blocks. It extends straightforwardly to multiple workers by giving each worker one such map per OS thread, with the various read / write requests performed via channels.

  2. This data structure is held locked across the read, best-worker computation, and update cycle during scheduling. This avoids race conditions and staleness, and is shown empirically to give considerably better results.

  3. The KvPushRouter has visibility into the output stream, so it can update the active blocks when an output token is generated, and free (or deref) the corresponding active blocks when the output stream completes. (A minimal Rust sketch of points 1-3 follows this list.)
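Below is a minimal, self-contained Rust sketch of points 1-3, intended only as an illustration: the names (ActiveBlocks, schedule, the type aliases) are made up for this example, and a plain Mutex stands in for the per-OS-thread instances and channel plumbing described above.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

// Hypothetical aliases standing in for the real identifier types.
type SequenceHash = u64;
type RequestId = u64;
type WorkerId = usize;

/// Per-worker tracker of active blocks: block hash -> requests holding it.
/// Reads and writes of the active-block count are O(1).
#[derive(Default)]
struct ActiveBlocks {
    blocks: HashMap<SequenceHash, HashSet<RequestId>>,
}

impl ActiveBlocks {
    /// O(1) read of the current number of active blocks.
    fn num_active_blocks(&self) -> usize {
        self.blocks.len()
    }

    /// Register (or add a reference to) a block for a request.
    fn add_block(&mut self, hash: SequenceHash, request: RequestId) {
        self.blocks.entry(hash).or_default().insert(request);
    }

    /// Deref a request's hold on a block; the block is freed once no request
    /// references it (e.g. when the request's output stream completes).
    fn remove_request(&mut self, hash: SequenceHash, request: RequestId) {
        if let Some(holders) = self.blocks.get_mut(&hash) {
            holders.remove(&request);
            if holders.is_empty() {
                self.blocks.remove(&hash);
            }
        }
    }
}

/// The read -> pick-best-worker -> update cycle done under one lock, so the
/// load snapshot cannot go stale between the read and the update (point 2).
fn schedule(
    workers: &[Arc<Mutex<ActiveBlocks>>],
    new_blocks: &[SequenceHash],
    request: RequestId,
) -> WorkerId {
    // Hold all per-worker trackers locked for the duration of the decision.
    let mut guards: Vec<_> = workers.iter().map(|w| w.lock().unwrap()).collect();

    // Read: pick the worker with the fewest active blocks (a stand-in for the
    // real cost model, which also accounts for prefix-cache overlap).
    let best = guards
        .iter()
        .enumerate()
        .min_by_key(|(_, g)| g.num_active_blocks())
        .map(|(i, _)| i)
        .expect("at least one worker");

    // Update: charge the chosen worker with the new blocks before releasing
    // the locks, so concurrent scheduling decisions see the added load.
    for &hash in new_blocks {
        guards[best].add_block(hash, request);
    }
    best
}

fn main() {
    let workers: Vec<_> = (0..2)
        .map(|_| Arc::new(Mutex::new(ActiveBlocks::default())))
        .collect();
    let chosen = schedule(&workers, &[0xAB, 0xCD], 1);
    println!("request routed to worker {chosen}");

    // Point 3: when the output stream for request 1 completes, deref its blocks.
    let mut w = workers[chosen].lock().unwrap();
    w.remove_request(0xAB, 1);
    w.remove_request(0xCD, 1);
    println!("active blocks after completion: {}", w.num_active_blocks());
}
```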

Turns out the performance is not bad at all! Setup: 8 x 8B model, L40S backend, 7000 ISL, 100 OSL, 10 prefix prompts of half the ISL.
[Plot: predictive_unnormalized_waiting]

Same as above but with varied ISL (introducing realistic randomness), where KV routing is expected to perform slightly better, and it did.
[Plot: varied_isl]

And with varied OSL (even more randomness):
[Plot: osl_varied]

(Note that the Python bindings for KvRouter are removed for now, as they are not currently being used; they will be reworked and reintroduced in future PRs.)

Closes #1723

Summary by CodeRabbit

  • New Features

    • Added advanced context-aware scheduling and token tracking for request streams, improving resource management and efficiency.
    • Introduced a new configuration option for controlling worker selection randomness via a "temperature" setting.
    • Enhanced metrics with new predictive load tracking and improved endpoint collection and filtering.
  • Refactor

    • Simplified and updated scheduling logic, consolidating configuration parameters and improving concurrency safety.
    • Deprecated several legacy Prometheus metrics and streamlined update logic for active KV blocks.
  • Bug Fixes

    • Improved handling of worker selection to support deterministic behavior when randomness is disabled (see the sketch below).
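As a rough illustration of the temperature setting and the deterministic fallback (hypothetical code, not the project's actual API): a temperature of zero reduces selection to an argmin over worker costs, while a positive temperature softmax-samples workers, with higher values flattening the distribution.

```rust
/// Hypothetical sketch of temperature-controlled worker selection. `costs[i]`
/// is the estimated load of worker i; `uniform_sample` is a caller-supplied
/// value in [0, 1), so the sketch needs no RNG dependency.
fn select_worker(costs: &[f64], temperature: f64, uniform_sample: f64) -> usize {
    if temperature == 0.0 {
        // Temperature 0: deterministic, always pick the lowest-cost worker.
        return costs
            .iter()
            .enumerate()
            .min_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
            .map(|(i, _)| i)
            .expect("at least one worker");
    }
    // Positive temperature: softmax over negated costs; higher temperature
    // gives a flatter distribution (more randomness in worker choice).
    let logits: Vec<f64> = costs.iter().map(|c| -c / temperature).collect();
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let weights: Vec<f64> = logits.iter().map(|l| (l - max).exp()).collect();
    let total: f64 = weights.iter().sum();
    // Inverse-CDF sampling with the supplied uniform value.
    let mut acc = 0.0;
    for (i, w) in weights.iter().enumerate() {
        acc += w / total;
        if uniform_sample < acc {
            return i;
        }
    }
    weights.len() - 1
}

fn main() {
    let costs = [3.0, 1.0, 2.0];
    // Deterministic: worker 1 (lowest cost) is always chosen.
    assert_eq!(select_worker(&costs, 0.0, 0.5), 1);
    // Sampled: any worker can be chosen, biased toward lower cost.
    println!("sampled worker: {}", select_worker(&costs, 1.0, 0.42));
}
```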

@PeaBrane
Contributor Author

PeaBrane commented Jul 8, 2025

Looks like the relative TTFT is better under a smaller concurrency of 10 instead of 20, but the ITL is worse, suggesting that concurrency may still play a major role in router performance; adding a decode queue could potentially close this gap.
[Plot: conc_10]

@PeaBrane merged commit 84e71e2 into main Jul 8, 2025
14 of 15 checks passed
@PeaBrane deleted the rupei/router-predictive branch July 8, 2025 07:18
atchernych pushed a commit that referenced this pull request Jul 9, 2025
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[FEATURE]: An Accurate, No-Change in Framework, and Router-Engine Communication-Free Method to Approximate Active KV Blocks in KV-Aware Router