[Feature] Integration into Dynamo Planner

### Checklist

- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.

### Motivation

Dynamo Planner is a Dynamo service that monitors the state of the inference system and scales prefill/decode workers up or down based on KV cache load and prefill queue size. It currently supports aggregated and disaggregated vLLM workers, but not yet SGLang.

From reading the code, Dynamo Planner has a well-abstracted interface for collecting metrics from different inference framework backends. The server side is implemented in `metrics_aggregator.rs`, and the client side uses its Python bindings to publish the metrics. The key part of the current vLLM implementation is below:
```python
self.metrics_publisher.publish(
    metrics.request_active_slots,
    metrics.request_total_slots,
    metrics.kv_active_blocks,
    metrics.kv_total_blocks,
    metrics.num_requests_waiting,
    metrics.gpu_cache_usage_perc,
    metrics.gpu_prefix_cache_hit_rate,
)
```
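To make the contract easier to discuss, the seven positional arguments in that call could be grouped into one value object. The sketch below is illustrative only: `WorkerMetrics`, `MetricsPublisher`, `publish_snapshot`, and `RecordingPublisher` are hypothetical names, and the `Protocol` merely mirrors the argument order of the quoted call. It is not the actual Dynamo binding.

```python
from dataclasses import astuple, dataclass
from typing import Protocol


@dataclass(frozen=True)
class WorkerMetrics:
    """Hypothetical container; field order mirrors the publish() call quoted above."""

    request_active_slots: int
    request_total_slots: int
    kv_active_blocks: int
    kv_total_blocks: int
    num_requests_waiting: int
    gpu_cache_usage_perc: float
    gpu_prefix_cache_hit_rate: float


class MetricsPublisher(Protocol):
    """Assumed publisher shape: one positional argument per metric."""

    def publish(self, *args: float) -> None: ...


def publish_snapshot(publisher: MetricsPublisher, metrics: WorkerMetrics) -> None:
    # Unpack the dataclass fields in declaration order.
    publisher.publish(*astuple(metrics))


class RecordingPublisher:
    """Stand-in publisher that records calls; useful for tests only."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def publish(self, *args: float) -> None:
        self.calls.append(args)
```

With something like this, a backend only has to fill in one object, and the call into the publisher stays in a single place.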

In today's Dynamo repo, these features are maintained by the Dynamo community as one large patch (`container/deps/vllm/vllm_v0.8.4-dynamo-kv-disagg-patch.patch`). To me this is not a good approach, since such a patch is hard to maintain unless it is merged into the vLLM repo. I do understand the concern, though: an inference framework may not want to accept code that is too intrusive. The same applies to SGLang as to vLLM.

We would really like to contribute the missing piece that lets Planner run on SGLang, so I am starting this thread to discuss and explore ideas with the community. Some options I can think of:
1. Implement a new class, say `DynamoPlannerMetrics`. We would create an instance, call its APIs in the places where the metrics are produced, and eventually send them out using the Dynamo API. This is similar to how vLLM is supported, but we would aim to keep the intrusion to a minimum.
2. Implement a new service in SGLang, say "metrics". This would be purely an SGLang asset: we define and provide interfaces and endpoints for internal and external services to access the metrics as needed. It would need to support both "pull" and "push" modes so it can be integrated into the current Dynamo Planner framework.
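
As a starting point for discussion, option 2 could look roughly like the sketch below. Everything here is an assumption made for illustration: `MetricsService`, `update`, `subscribe`, and `snapshot` are hypothetical names, not existing SGLang or Dynamo APIs. It only shows how one in-process object can serve both "pull" (read the latest snapshot) and "push" (callbacks invoked on every update).

```python
import threading
from typing import Callable, Dict, List

Snapshot = Dict[str, float]


class MetricsService:
    """Hypothetical in-process metrics service supporting pull and push."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Snapshot = {}
        self._subscribers: List[Callable[[Snapshot], None]] = []

    def subscribe(self, callback: Callable[[Snapshot], None]) -> None:
        """Push mode: register a callback that receives each new snapshot."""
        with self._lock:
            self._subscribers.append(callback)

    def update(self, **values: float) -> None:
        """Called by the engine whenever its metrics change."""
        with self._lock:
            self._latest.update(values)
            snapshot = dict(self._latest)
            subscribers = list(self._subscribers)
        # Invoke callbacks outside the lock so a slow subscriber
        # cannot block readers or other updates.
        for callback in subscribers:
            callback(snapshot)

    def snapshot(self) -> Snapshot:
        """Pull mode: return a copy of the latest metrics."""
        with self._lock:
            return dict(self._latest)
```

Under this design, a Dynamo adapter would be one subscriber that forwards each snapshot to the Dynamo publisher, while an HTTP endpoint or another internal service could call `snapshot()` on demand. The engine itself would only ever call `update(...)`.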

Please correct me if I have misunderstood anything. Suggestions and feedback on this topic are welcome.

Thanks!

### Related resources

_No response_

[Feature] Integration into Dynamo Planner #6163

Checklist

Motivation

Related resources

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Feature] Integration into Dynamo Planner #6163

Description

Checklist

Motivation

Related resources

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions