
Conversation

NickLucche
Contributor

@NickLucche NickLucche commented May 28, 2025

Yet another take at #18079. It builds on the same commits, so the rationale of the PR is the same.

The issue with the previous approach is that using a higher number of descriptors per read (block_size times as many, each region block_size times smaller, so the total number of bytes moved is unchanged) appears to cause significant slowdowns. Note that this does not happen for homogeneous TP, where memory regions are seemingly merged by NIXL prior to transfer.

Changing NIXL+UCX versions has a noticeable effect on performance, so we could in principle tackle the above directly at the transport layer.
Instead, here we take a different approach that factors out the transport layer altogether: we instantiate the KV cache with a memory layout of [2, num_blocks, kv_heads, block_size, head_dim]. We then permute back to the original NHD layout to provide a view that guarantees correctness in the rest of the codebase.

This enables the splitting to be carried out on dim 2, leading to much better performance, as we maintain the same number of descriptors as well as bytes per read (minus a factor of tp_ratio). The code is also somewhat easier to read, with one less nested dim to account for.
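
To make the layout trick concrete, here is a minimal PyTorch sketch; the shapes, the tp_ratio value, and the variable names are illustrative assumptions, not the actual vLLM allocation code:

import torch

num_blocks, kv_heads, block_size, head_dim = 128, 8, 16, 64
tp_ratio = 2  # D_TP // P_TP

# Physical allocation in HND order: [2, num_blocks, kv_heads, block_size, head_dim]
kv_cache_hnd = torch.zeros(2, num_blocks, kv_heads, block_size, head_dim)

# Permuted NHD view handed to the rest of the codebase (no copy, same storage):
# [2, num_blocks, block_size, kv_heads, head_dim]
kv_cache_nhd_view = kv_cache_hnd.permute(0, 1, 3, 2, 4)

# The heterogeneous-TP split happens on dim 2 of the physical layout (kv_heads),
# so each D rank reads one contiguous chunk of heads per block:
heads_per_rank = kv_heads // tp_ratio
rank_in_group = 1  # illustrative position of this D rank within its tp_ratio group
start = rank_in_group * heads_per_rank
local_heads = kv_cache_hnd[:, :, start:start + heads_per_rank]

Because each per-block chunk above is contiguous in memory, the descriptor count stays the same and only the bytes per read shrink by a factor of tp_ratio.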

This PR requires #18775 to be merged first, as we need the proper scaffolding code for enforcing a different KV cache layout.

Here are some numbers:
[benchmark results image]

This has been tested on NIXL 0.2.1 (4f37f07) and UCX 1.18.0.


In the MLA case, most of the splitting complexity above is not needed: KV caches are replicated, so they can just be copied over in their entirety, just like homogeneous TP.
Code-wise the changes are minimal, as we're using the same logic for discovery as well as rank assignment, so we can conveniently support both cases.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 28, 2025

mergify bot commented May 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 29, 2025
@NickLucche NickLucche force-pushed the heterogenous-tp-permutekv branch from 7766cc4 to 39118d2 on May 30, 2025 10:30
@mergify mergify bot removed the needs-rebase label May 30, 2025
@NickLucche
Contributor Author

Also tested on MLA with deepseek-vl2-small

@tlrmchlsmth
Collaborator

Also tested on MLA with deepseek-vl2-small

@NickLucche it looks like that's not an MLA model fortunately.

We look for kv_lora_rank to see if the model uses MLA:

return self.hf_text_config.kv_lora_rank is not None

And there is no kv_lora_rank in that model's config:
https://huggingface.co/deepseek-ai/deepseek-vl2-small/blob/main/config.json

Could you try on deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct?
Or RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 if you want a smaller model.

Comment on lines 96 to 99
@functools.lru_cache
def get_kv_connector_cache_layout():
vllm_config = get_current_vllm_config()
Collaborator


Will the @functools.lru_cache break things if someone creates two LLMEngines? (Maybe only when using the UniProcExecutor?)
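
(For illustration, a toy sketch of the concern: an lru_cache on a function that reads whatever config is "current" at the first call will keep returning that value even if a second engine later changes it. The dict below is a stand-in, not vLLM's get_current_vllm_config:)

import functools

_current_config = {"kv_cache_layout": "NHD"}  # stand-in for the "current" vLLM config

@functools.lru_cache
def get_kv_connector_cache_layout():
    # Whatever config is current at the first call gets baked into the cache.
    return _current_config["kv_cache_layout"]

print(get_kv_connector_cache_layout())      # NHD
_current_config["kv_cache_layout"] = "HND"  # e.g. a second LLMEngine configured differently
print(get_kv_connector_cache_layout())      # still NHD: the cached value is returned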

Contributor Author


Not sure; this should be a no-op for all cases except PD with NIXL, and in that case every instance must have the same KV shape and layout to transfer anyway.
@njhill do you see how I could break things here?

Member


I don't think this needs to be cached, it should only be called during initialization anyhow.

Contributor Author


I actually forgot to mention: I was anticipating a potential runtime use, as in v0: https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flashinfer.py#L1033.
I can still remove it since this is not the case in v1 right now, but we should consider it.

Collaborator


Let's remove the @functools.lru_cache. I'm strongly suspicious of some edge cases where this could break and there's no benefit to caching here

Collaborator

@tlrmchlsmth tlrmchlsmth left a comment


It looks like this PR only works with attention backends that can use the HND layout (i.e. only FlashInfer and FlashAttention).

This is OK for this PR but please make sure we're raising an exception if the wrong attn backend is used.
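
A minimal sketch of the kind of guard meant here; the helper name and backend strings are hypothetical stand-ins, not vLLM's actual API:

SUPPORTED_HND_BACKENDS = {"FLASHINFER", "FLASH_ATTN"}  # hypothetical identifiers

def validate_attn_backend_for_hnd(backend_name: str) -> None:
    # Heterogeneous-TP KV transfer relies on the HND cache layout, which only
    # some attention backends support; fail fast if another backend is selected.
    if backend_name.upper() not in SUPPORTED_HND_BACKENDS:
        raise ValueError(
            f"Attention backend {backend_name!r} does not support the HND KV cache "
            "layout required for heterogeneous TP with the NIXL connector.")

validate_attn_backend_for_hnd("FLASHINFER")   # passes
# validate_attn_backend_for_hnd("TORCH_SDPA") # would raise ValueError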


mergify bot commented Jun 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 2, 2025
@NickLucche
Contributor Author

Thanks a lot for reviewing!

it looks like that's not an MLA model fortunately.

Hmm, I think it is, and provided we start it with hf_overrides we seem to detect it just fine:

 vllm serve deepseek-ai/deepseek-vl2-small --trust_remote_code --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
. . .
INFO 06-02 08:08:29 [cuda.py:153] Forcing kv cache block size to 64 for FlashMLA backend.
INFO 06-02 08:08:39 [cuda.py:192] Using FlashMLA backend on V1 engine.

Anyway, for the sake of completeness I also tested with DeepSeek-Coder-V2-Lite-Instruct, getting Measured value: 0.7619408642911296.
Basically there's no KV permutation involved with MLA, and no rank splitting due to replication.

@NickLucche NickLucche force-pushed the heterogenous-tp-permutekv branch from 378ec58 to 72a9da3 on June 2, 2025 09:16
@mergify mergify bot removed the needs-rebase label Jun 2, 2025
Member

@njhill njhill left a comment


Great work thanks @NickLucche



mergify bot commented Jun 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 3, 2025
@NickLucche NickLucche force-pushed the heterogenous-tp-permutekv branch from cc8afb3 to 51a6309 on June 4, 2025 07:32
@mergify mergify bot removed the needs-rebase label Jun 4, 2025
address race condition

optimize req state checking loop

release_xfer_handle on status DONE

send notif to agent_name

fix req abort

Signed-off-by: nicklucche <nlucches@redhat.com>
docs

Signed-off-by: nicklucche <nlucches@redhat.com>
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 4, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) June 4, 2025 14:43
@njhill njhill disabled auto-merge June 4, 2025 16:52
Member

@njhill njhill left a comment


Awesome work thanks @NickLucche.

Most of my comments are minor - I don't think any of my comments necessarily need to hold up getting this merged.

Comment on lines +364 to +365
# Map of engine_id -> kv_caches_base_addr. For TP case, each local
# rank will still only pull from a single remote TP worker.
Member


@NickLucche probably a stupid question, but we only support D_TP >= P_TP, specifically D_TP = N * P_TP, right? We can't have a larger P size than D size.

Contributor Author


This is a simplification I carried over from the Dynamo work.
Basically, it just makes sense given the framing of the problem: D is memory-bound, so a greater TP size will yield better performance.

In theory one could support both, but the code gets messier because you have to discern between a single D reading from N prefill workers or N Ds reading from a single P (as in this case). The sync code also gets less clean.
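
An illustrative sketch of the mapping this constraint implies; the numbers and names are just for demonstration, not the connector code:

d_tp, p_tp = 8, 2
assert d_tp % p_tp == 0, "only D_TP = N * P_TP is supported"
tp_ratio = d_tp // p_tp  # = N

for d_rank in range(d_tp):
    p_rank = d_rank // tp_ratio    # the single remote P worker this D rank pulls from
    head_slot = d_rank % tp_ratio  # which 1/tp_ratio slice of that worker's KV heads it reads
    print(f"D rank {d_rank} <- P rank {p_rank}, head slice {head_slot}")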

Comment on lines +672 to +677
remote_block_size = nixl_agent_meta.block_len / (
self.slot_size_bytes)
assert self.block_len == nixl_agent_meta.block_len
else:
remote_block_size = nixl_agent_meta.block_len / (
self.slot_size_bytes * tp_ratio)
Member


Should these be // rather than /?

Contributor Author


I am asserting equality, so this should be an exact division. It's something like A = 2ABC / 2BC.

Member

@njhill njhill Jun 4, 2025


OK sure... I was suggesting it more because this is integer division rather than float division; here remote_block_size will actually be a float:

>>> type(8 / 2)
<class 'float'>
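
(Illustrative numbers for the type difference; the value is the same whenever the division is exact:)

block_len, slot_size_bytes, tp_ratio = 8192, 512, 2
print(block_len / (slot_size_bytes * tp_ratio))   # 8.0, a float
print(block_len // (slot_size_bytes * tp_ratio))  # 8, an int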

Signed-off-by: nicklucche <nlucches@redhat.com>
Member

@njhill njhill left a comment


Thanks again @NickLucche!

@njhill njhill enabled auto-merge (squash) June 4, 2025 21:11
@njhill njhill merged commit b2fac67 into vllm-project:main Jun 4, 2025
74 checks passed
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
@lhtin
Contributor

lhtin commented Jul 3, 2025

@NickLucche Do you know whether homogeneous and heterogeneous TP are supported in multinode TP scenarios? Based on the code I've reviewed, there's currently only a single remote_host and remote_port configuration, which suggests that multinode TP in NIXL PD is not supported.

@NickLucche
Contributor Author

The remote_host/port pair is forwarded from the proxy/sidecar server, which is aware of the deployment layout. So yes, this is intended for multi-node use.

@lhtin
Contributor

lhtin commented Jul 4, 2025

@NickLucche Could you provide an example for this part? From the code snippet I see below, it appears that the Decode node only receives a single remote_host and remote_port from kv_transfer_params, not an array. However, in a multi-node scenario (e.g., TP16 with ranks 0-7 on one node and ranks 8-15 on another), wouldn’t we need at least two remote_host entries?

self.requests[request_id] = ReqMeta(
    local_block_ids=local_block_ids,
    remote_block_ids=kv_transfer_params["remote_block_ids"],
    remote_engine_id=kv_transfer_params["remote_engine_id"],
    remote_host=kv_transfer_params["remote_host"],
    remote_port=kv_transfer_params["remote_port"],
    # P workers don't need to receive tp_size from proxy here.
    tp_size=kv_transfer_params.get("tp_size", 1),
)

leoli1208 pushed a commit to leoli1208/vllm that referenced this pull request Jul 22, 2025
Signed-off-by: nicklucche <nlucches@redhat.com>
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: nicklucche <nlucches@redhat.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Signed-off-by: nicklucche <nlucches@redhat.com>