[Feature] Integration into Dynamo Planner #6163

@yicwang

Motivation

Dynamo Planner is a Dynamo service that monitors the state of the inference system and scales prefill/decode workers up or down based on KV cache load and prefill queue size. For now it supports aggregated and disaggregated vLLM workers, but not yet SGLang.

From reading the code, Dynamo Planner has well-abstracted interfaces that collect metrics from different inference framework backends. The server side is implemented in metrics_aggregator.rs, and the client side uses its Python bindings to publish the metrics. The key part of the current vLLM implementation is below:

self.metrics_publisher.publish(
    metrics.request_active_slots,
    metrics.request_total_slots,
    metrics.kv_active_blocks,
    metrics.kv_total_blocks,
    metrics.num_requests_waiting,
    metrics.gpu_cache_usage_perc,
    metrics.gpu_prefix_cache_hit_rate,
)
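
To make the shape of that call concrete, here is a minimal sketch of the seven values it passes. Only the field names come from the snippet above; the `WorkerMetrics` dataclass, the field types, the `RecordingPublisher` stand-in, and the `publish_metrics` helper are all assumptions made for illustration and are not the actual Dynamo or vLLM API.

```python
from dataclasses import dataclass


@dataclass
class WorkerMetrics:
    """Hypothetical container; field names mirror the publish() arguments above.

    The int/float types are assumptions, not taken from the Dynamo source.
    """
    request_active_slots: int
    request_total_slots: int
    kv_active_blocks: int
    kv_total_blocks: int
    num_requests_waiting: int
    gpu_cache_usage_perc: float
    gpu_prefix_cache_hit_rate: float


class RecordingPublisher:
    """Stand-in for the real publisher binding; it only records what it is given."""

    def __init__(self) -> None:
        self.published = []

    def publish(self, *values) -> None:
        self.published.append(values)


def publish_metrics(publisher, metrics: WorkerMetrics) -> None:
    """Forward one metrics sample in the same positional order as the vLLM call."""
    publisher.publish(
        metrics.request_active_slots,
        metrics.request_total_slots,
        metrics.kv_active_blocks,
        metrics.kv_total_blocks,
        metrics.num_requests_waiting,
        metrics.gpu_cache_usage_perc,
        metrics.gpu_prefix_cache_hit_rate,
    )


publisher = RecordingPublisher()
sample = WorkerMetrics(1, 8, 10, 100, 2, 0.1, 0.5)
publish_metrics(publisher, sample)
```

Whatever SGLang ends up doing, it would need to produce these same seven values for the existing Planner to consume them unchanged.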

In today's Dynamo repo, this vLLM support is maintained by the Dynamo community as one huge patch (container/deps/vllm/vllm_v0.8.4-dynamo-kv-disagg-patch.patch). To me this is not a good idea, since the patch is hard to maintain unless it is merged into the vLLM repo. But I do understand the concern that an inference framework probably should not accept code that is too intrusive. The same concern applies to SGLang as it does to vLLM.

We would really like to contribute the missing piece that makes Planner run on SGLang, so I am starting this thread to discuss and explore ideas with the community. Some options I can think of:

  1. Implement a new class, say "DynamoPlannerMetrics". We would instantiate it and call its APIs at the several places where metrics are collected, then send the metrics out through the Dynamo API. This is similar to how vLLM is supported, but we would aim to keep the intrusion minimal.
  2. Implement a new service in SGLang, say "metrics". This would be purely an SGLang asset: we define and provide interfaces and endpoints through which internal and external services can access the metrics as needed. It would need to support both "pull" and "push" modes so that it can be integrated into the current Dynamo Planner framework.
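
As a rough illustration of what option 2 could look like, below is a minimal in-process sketch that supports both modes. Everything in it is hypothetical: `MetricsService`, `subscribe`, `snapshot`, and `update` are names invented here, not an existing SGLang or Dynamo API, and a real service would also need threading and transport concerns that this sketch ignores.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, float]


class MetricsService:
    """Hypothetical metrics hub; not an existing SGLang API."""

    def __init__(self) -> None:
        self._values: Snapshot = {}
        self._subscribers: List[Callable[[Snapshot], None]] = []

    def subscribe(self, callback: Callable[[Snapshot], None]) -> None:
        """Push mode: the callback receives a snapshot after every update."""
        self._subscribers.append(callback)

    def snapshot(self) -> Snapshot:
        """Pull mode: return a copy of the latest values."""
        return dict(self._values)

    def update(self, **values: float) -> None:
        """Record new values, then notify every push subscriber."""
        self._values.update(values)
        snap = self.snapshot()
        for callback in self._subscribers:
            callback(snap)


service = MetricsService()
pushed: List[Snapshot] = []
service.subscribe(pushed.append)  # push consumer, e.g. a Dynamo adapter
service.update(kv_active_blocks=10, kv_total_blocks=100)
service.update(num_requests_waiting=3)
latest = service.snapshot()       # pull consumer, e.g. an HTTP endpoint
```

Under this kind of design, the Dynamo-specific code could shrink to a single subscriber that forwards each snapshot to the publisher binding, which is one way to keep the intrusion into SGLang small.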

Please correct me if I have misunderstood anything. Suggestions and feedback on this topic are welcome.

Thanks!

Related resources

No response
