Conversation

robertgshaw2-redhat (Collaborator) commented Jul 20, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  • Previously, we created a PrometheusStatLogger for each EngineCore. This looks fine on the surface, but in practice only the final EngineCore could log stats, because each constructor reset the Prometheus state via unregister_vllm_metrics.
  • Simply removing the unregister_vllm_metrics call does not work, because we can only *create* each metric once. We want multiple labels on the same metric, not multiple metrics.
  • This PR adds a class called StatLoggerManager to deal with this (see the sketch after this list):
StatLoggerManager:
        Logging happens at the level of the EngineCore (per scheduler).
         * DP: >1 EngineCore per AsyncLLM - one logger per EngineCore.
         * With the local logger, just make N copies for N EngineCores.
         * With Prometheus, we need a single logger with N "labels".
        This class abstracts this implementation detail away from the
        AsyncLLM, allowing the AsyncLLM to simply call .record() and
        .log() on a single interface.
  • This PR refactors PrometheusStatLogger to enable logging from multiple EngineCores.
  • This PR ensures that the AsyncLLM only logs the metrics of the EngineCores that it directly manages.
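
As a rough illustration of the design above (not the PR's actual code; the class and method names below are simplified stand-ins), each Prometheus metric can be created exactly once with an "engine" label and then given one labelled child per EngineCore, while a thin manager fans record() calls out to the right per-engine loggers:

from prometheus_client import Gauge


class LocalLoggerSketch:
    """Stand-in for a per-engine stdout stat logger (one instance per EngineCore)."""

    def __init__(self, engine_idx: int):
        self.engine_idx = engine_idx

    def log(self, num_running: int) -> None:
        print(f"Engine {self.engine_idx:03d}: Running: {num_running} reqs")


class PrometheusLoggerSketch:
    """Creates each metric exactly once, then labels it per EngineCore.

    Creating a second Gauge with the same name in the same registry raises
    ValueError ("Duplicated timeseries"), which is why we want one logger
    with N engine labels rather than N loggers."""

    _num_running = Gauge(
        "vllm_num_requests_running_sketch",
        "Number of requests currently running (sketch).",
        labelnames=["model_name", "engine"],
    )

    def __init__(self, model_name: str, engine_idxs: list[int]):
        # One labelled child per engine, all backed by the same metric family.
        self._per_engine = {
            idx: self._num_running.labels(model_name=model_name, engine=str(idx))
            for idx in engine_idxs
        }

    def record(self, engine_idx: int, num_running: int) -> None:
        self._per_engine[engine_idx].set(num_running)


class StatLoggerManagerSketch:
    """Fans record() out to the per-engine local logger and the single shared
    Prometheus logger, so the caller never deals with per-engine details."""

    def __init__(self, model_name: str, engine_idxs: list[int]):
        self._local = {idx: LocalLoggerSketch(idx) for idx in engine_idxs}
        self._prom = PrometheusLoggerSketch(model_name, engine_idxs)

    def record(self, engine_idx: int, num_running: int) -> None:
        self._local[engine_idx].log(num_running)
        self._prom.record(engine_idx, num_running)

The real loggers (loggers.py in the v1 metrics code, as seen in the log lines below) track far more metrics, but the labelling pattern is the same: one metric family, one child per engine index.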

Follow-up:

  • Make it work with LoRA
  • Make it work with SpecDecoding
  • Make it work with elastic EP

Test Plan

  • existing CI

  • justfile used for manual testing:

MODEL := "Qwen/Qwen3-30B-A3B-FP8"

tp PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --disable-log-requests

dp_a_internal_lb PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_internal_lb:
  vllm serve {{MODEL}} \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_a_external_lb PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 0 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests

dp_b_external_lb PORT:
  vllm serve {{MODEL}} \
    --port {{PORT}} \
    --data-parallel-size 2 \
    --data-parallel-rank 1 \
    --data-parallel-rpc-port 5555 \
    --enable-expert-parallel \
    --enforce-eager \
    --disable-log-requests


eval PORT CONCURRENT LIMIT:
  lm_eval --model local-completions --tasks gsm8k \
    --model_args model={{MODEL}},base_url=http://127.0.0.1:{{PORT}}/v1/completions,num_concurrent={{CONCURRENT}},num_retries=0,tokenized_requests=False \
    --limit {{LIMIT}}

metrics PORT:
  curl http://localhost:{{PORT}}/metrics
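
To spot-check the per-engine series without reading the raw curl output, a small helper along these lines can parse the /metrics payload and group one metric's samples by the engine label. This is only a convenience sketch for this test plan (the port and metric name match the samples below); it is not part of the PR.

# Sketch: group a vLLM metric's samples by their "engine" label.
# Assumes a server is already running on the given port (see the justfile
# above); requires `pip install requests prometheus-client`.
import requests
from prometheus_client.parser import text_string_to_metric_families


def samples_by_engine(port: int, metric: str = "vllm:num_requests_running"):
    text = requests.get(f"http://localhost:{port}/metrics", timeout=5).text
    out = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == metric:
                out[sample.labels.get("engine", "?")] = sample.value
    return out


if __name__ == "__main__":
    # With DP=4 behind one frontend, this should print one entry per engine.
    print(samples_by_engine(8100))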

Test Result

Sample:

  • tp:
just tp 8100
just eval 8100
just metrics 8100
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0",model_name="Qwen/Qwen3-30B-A3B-FP8"} 4.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{engine="0",model_name="Qwen/Qwen3-30B-A3B-FP8"} 0.0

INFO 07-20 18:05:20 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
  • dp (internal LB): the head node reports metrics and logs for all ranks
just dp_a_internal_lb 8100
just dp_b_internal_lb
just eval 8100
just metrics 8100
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.6821875921956284e-05
vllm:kv_cache_usage_perc{engine="1",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.686799752815716e-05
vllm:kv_cache_usage_perc{engine="2",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.686799752815716e-05
vllm:kv_cache_usage_perc{engine="3",model_name="Qwen/Qwen3-30B-A3B-FP8"} 2.663896214605277e-05

INFO 07-20 18:10:58 [loggers.py:122] Engine 000: Avg prompt throughput: 3064.0 tokens/s, Avg generation throughput: 359.0 tokens/s, Running: 26 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 0.0%
INFO 07-20 18:10:58 [loggers.py:122] Engine 001: Avg prompt throughput: 2508.4 tokens/s, Avg generation throughput: 353.7 tokens/s, Running: 25 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.4%, Prefix cache hit rate: 0.0%
INFO 07-20 18:10:58 [loggers.py:122] Engine 002: Avg prompt throughput: 1962.9 tokens/s, Avg generation throughput: 353.5 tokens/s, Running: 24 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.3%, Prefix cache hit rate: 0.0%
INFO 07-20 18:10:58 [loggers.py:122] Engine 003: Avg prompt throughput: 2619.2 tokens/s, Avg generation throughput: 354.6 tokens/s, Running: 25 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.6%
  • dp (external LB): each node reports metrics and logs for its own rank only
just dp_a_external_lb 8100
just dp_b_external_lb 8200
just eval 8100
just eval 8200
just metrics 8100
just metrics 8200

rank 0:

vllm:request_success_total{engine="0",finished_reason="stop",model_name="Qwen/Qwen3-30B-A3B-FP8"} 88.0
vllm:request_success_total{engine="0",finished_reason="length",model_name="Qwen/Qwen3-30B-A3B-FP8"} 12.0
vllm:request_success_total{engine="0",finished_reason="abort",model_name="Qwen/Qwen3-30B-A3B-FP8"} 0.0

INFO 07-20 18:15:37 [loggers.py:122] Engine 000: Avg prompt throughput: 10130.6 tokens/s, Avg generation throughput: 506.1 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 20.0%, Prefix cache hit rate: 0.0%

rank 1:

vllm:request_success_total{engine="1",finished_reason="stop",model_name="Qwen/Qwen3-30B-A3B-FP8"} 88.0
vllm:request_success_total{engine="1",finished_reason="length",model_name="Qwen/Qwen3-30B-A3B-FP8"} 12.0
vllm:request_success_total{engine="1",finished_reason="abort",model_name="Qwen/Qwen3-30B-A3B-FP8"} 0.0

INFO 07-20 18:15:47 [loggers.py:122] Engine 001: Avg prompt throughput: 10129.2 tokens/s, Avg generation throughput: 894.6 tokens/s, Running: 69 reqs, Waiting: 0 reqs, GPU KV cache usage: 14.4%, Prefix cache hit rate: 0.0%

(Optional) Documentation Update

Robert Shaw added 30 commits July 19, 2025 16:27
Robert Shaw added 2 commits July 20, 2025 17:52

mergify bot commented Jul 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Jul 20, 2025
mergify bot removed the needs-rebase label Jul 20, 2025
simon-mo added this to the v0.10.0 milestone Jul 20, 2025
robertgshaw2-redhat added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 20, 2025
Robert Shaw added 3 commits July 21, 2025 00:53
vllm-bot merged commit 29d1ffc into vllm-project:main Jul 21, 2025
65 of 67 checks passed
DarkLight1337 (Member) commented:

Merging to unblock release

njhill (Member) commented Jul 21, 2025

I will do a retroactive review :)
