[P/D] Heterogeneous TP #18833
Conversation
Force-pushed from 7766cc4 to 39118d2.
Also tested on MLA with deepseek-vl2-small.
@NickLucche it looks like that's not an MLA model fortunately. We look for it at line 1104 (in 432ec99), and there is no such attribute here. Could you try deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct?
@functools.lru_cache
def get_kv_connector_cache_layout():
    vllm_config = get_current_vllm_config()
Will the @functools.lru_cache break things if someone creates two LLMEngines? (Maybe only when using the UniProcExecutor?)
Not sure; this should be a no-op in all cases except P/D with Nixl, and in that case every instance must have the same KV shape and layout to transfer anyway.
@njhill do you see how I could break things here?
I don't think this needs to be cached, it should only be called during initialization anyhow.
I actually forgot to mention, I was anticipating a potential runtime use as in v0 https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flashinfer.py#L1033.
I can still remove it as this is not the case in v1 right now, but we should consider it.
Let's remove the @functools.lru_cache. I'm strongly suspicious of some edge cases where this could break, and there's no benefit to caching here.
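For reference, a minimal sketch of what the uncached helper could look like, assuming it only inspects the KV transfer config to pick between the default NHD layout and HND for the Nixl connector (exact config field names may differ):

```python
from vllm.config import get_current_vllm_config  # assumed import path


def get_kv_connector_cache_layout() -> str:
    # No lru_cache: this is cheap to recompute, and a process-wide cache could
    # leak state across multiple engines created in the same process
    # (e.g. with UniProcExecutor).
    vllm_config = get_current_vllm_config()
    kv_config = vllm_config.kv_transfer_config
    if kv_config is not None and kv_config.kv_connector == "NixlConnector":
        # Heterogeneous-TP transfers split on the head dim, so HND is needed.
        return "HND"
    return "NHD"
```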
It looks like this PR only works with attn backends that can use the HND layout (i.e. only FlashInfer and FlashAttn). This is OK for this PR, but please make sure we're raising an exception if the wrong attn backend is used.
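A rough sketch of the kind of guard this suggests; the backend names and error type here are illustrative, not the exact vLLM identifiers:

```python
# Attention backends assumed to support the HND KV cache layout.
_HND_CAPABLE_BACKENDS = {"FLASH_ATTN", "FLASHINFER"}


def check_attn_backend_for_hnd(backend_name: str) -> None:
    """Fail fast if the selected attention backend cannot use the HND layout."""
    if backend_name.upper() not in _HND_CAPABLE_BACKENDS:
        raise ValueError(
            f"The KV connector requires the HND cache layout, but attention "
            f"backend {backend_name!r} does not support it. "
            f"Supported backends: {sorted(_HND_CAPABLE_BACKENDS)}.")
```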
Thanks a lot for reviewing!
Mm, I think it is, and provided we start it with hf_overrides we seem to detect it just fine.
Anyways, for the sake of completeness I also tested with DeepSeek-Coder-V2-Lite-Instruct, getting:
Force-pushed from 378ec58 to 72a9da3.
Great work thanks @NickLucche
Force-pushed from cc8afb3 to 51a6309.
address race condition; optimize req state checking loop; release_xfer_handle on status DONE; send notif to agent_name; fix req abort. Signed-off-by: nicklucche <nlucches@redhat.com>
docs Signed-off-by: nicklucche <nlucches@redhat.com>
Force-pushed from 51a6309 to 9098232.
Awesome work thanks @NickLucche.
Most of my comments are minor - I don't think any of my comments necessarily need to hold up getting this merged.
# Map of engine_id -> kv_caches_base_addr. For TP case, each local
# rank will still only pull from a single remote TP worker.
@NickLucche probably a stupid question, but we only support D_TP >= P_TP, specifically D_TP = N*P_TP, right? We can't have a larger P size than D size.
This is a simplification I carried over from dynamo work.
Basically it just makes sense given the framing of the problem: D is memory bound, so a greater TP size will yield better performance.
In theory one could support both, but the code gets messier because you have to discern between a single D reading from N prefill workers or N Ds reading from a single P (as in this case). The sync code also gets less clean.
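To make the D_TP = N * P_TP framing concrete, here is a small illustrative sketch of the rank mapping (variable names are mine, not the exact connector code):

```python
def map_decode_rank_to_prefill(d_rank: int, d_tp: int, p_tp: int) -> tuple[int, int]:
    """Each decode rank pulls from exactly one prefill rank, reading a
    contiguous 1/tp_ratio slice of that worker's KV heads."""
    assert d_tp % p_tp == 0, "only D_TP = N * P_TP is supported"
    tp_ratio = d_tp // p_tp
    p_rank = d_rank // tp_ratio         # the single remote prefill worker to read from
    head_slice_idx = d_rank % tp_ratio  # which slice of that worker's heads to read
    return p_rank, head_slice_idx


# Example with P_TP=2, D_TP=4 (tp_ratio=2):
#   decode rank 0 -> prefill rank 0, head slice 0
#   decode rank 1 -> prefill rank 0, head slice 1
#   decode rank 3 -> prefill rank 1, head slice 1
print(map_decode_rank_to_prefill(3, d_tp=4, p_tp=2))  # (1, 1)
```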
    remote_block_size = nixl_agent_meta.block_len / (
        self.slot_size_bytes)
    assert self.block_len == nixl_agent_meta.block_len
else:
    remote_block_size = nixl_agent_meta.block_len / (
        self.slot_size_bytes * tp_ratio)
Should these be // rather than /?
I am asserting equality, so this should be an exact division. It's something like A = 2ABC / 2BC.
Ok sure... I was suggesting it mostly because // is integer division rather than float division; here remote_block_size will actually be a float:
>>> type(8 / 2)
<class 'float'>
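For the record, a small self-contained sketch of the floor-division variant with an explicit exactness check (purely illustrative):

```python
def exact_div(numerator: int, denominator: int) -> int:
    """Floor division guarded by an exactness check, so the result stays an int."""
    assert numerator % denominator == 0, (numerator, denominator)
    return numerator // denominator


# e.g. remote_block_size = exact_div(nixl_agent_meta.block_len,
#                                    self.slot_size_bytes * tp_ratio)
print(type(exact_div(8, 2)))  # <class 'int'>, whereas type(8 / 2) is <class 'float'>
```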
Thanks again @NickLucche!
@NickLucche Do you know whether homogeneous and heterogeneous TP are supported in multinode TP scenarios? Based on the code I've reviewed, there's currently only a single remote_host and remote_port configuration, which suggests that multinode TP in NIXL PD is not supported.
The remote_host/port pair is forwarded from the proxy/sidecar server, which is aware of the deployment layout. So yeah, this is intended for multi-node use.
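For illustration, the per-request parameters a proxy might forward to the decode instance could look roughly like this (field names are indicative, not the exact schema):

```python
# Hypothetical per-request KV transfer parameters attached by the proxy, which
# knows where each prefill worker lives in the deployment.
kv_transfer_params = {
    "remote_engine_id": "prefill-0",   # which prefill instance produced the KV cache
    "remote_host": "10.0.0.12",        # host of that prefill worker
    "remote_port": 5600,               # NIXL side-channel port of that worker
    "remote_block_ids": [3, 4, 5],     # blocks to pull for this request
}
# With multiple prefill nodes, the proxy fills in a different remote_host and
# remote_port per request, so a single static pair is not a limitation.
```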
@NickLucche Could you provide an example for this part? From the code snippet I see below, it appears that the Decode node only receives a single remote_host and remote_port (vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py, lines 90 to 98 in 25950dc).
Yet another take at #18079. It builds on the same commits so the rationale of the PR is the same.
The issue with the previous approach is that using a higher number of descriptors per read (block_size times as many, each region block_size times smaller, so the total number of bytes moved is unchanged) appears to cause significant slowdowns. Mind that this is not happening for homogeneous TP, where memory regions are seemingly merged by NIXL prior to transfer. Changing NIXL+UCX versions has a noticeable effect on performance, so we could in principle tackle the above directly at the transport layer.
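To spell out the descriptor arithmetic with some made-up numbers (purely illustrative, not measured values):

```python
# Previous approach: split each transfer along block_size, multiplying the
# descriptor count without changing the total bytes moved.
num_blocks = 1000
block_len_bytes = 256 * 1024   # bytes per block region (made-up value)
block_size = 16                # tokens per block

descriptors_old = num_blocks * block_size            # block_size times as many
bytes_per_desc_old = block_len_bytes // block_size   # each region block_size times smaller

descriptors_new = num_blocks
bytes_per_desc_new = block_len_bytes

# Same total traffic either way; only the descriptor granularity differs.
assert descriptors_old * bytes_per_desc_old == descriptors_new * bytes_per_desc_new
```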
Instead, here we take a different approach to factor out the transport layer altogether, and instantiate the KV cache with a memory layout of [2, num_blocks, kv_heads, block_size, head_dim] (HND). We then permute back to the original NHD layout to provide a view that guarantees correctness in the rest of the codebase. This enables the splitting to be carried out on dim 2 (kv_heads), leading to much better performance, as we maintain the same number of descriptors as well as bytes per read (minus a factor of tp_ratio). The code is also somewhat easier to read with one less nested dim to account for.
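A short sketch of the idea in PyTorch, with made-up sizes (the actual allocation and view logic in vLLM is more involved):

```python
import torch

num_blocks, kv_heads, block_size, head_dim = 16, 8, 16, 128

# Physical layout is HND: [2, num_blocks, kv_heads, block_size, head_dim],
# so for a given block each head's data is a contiguous chunk of memory.
kv_cache_hnd = torch.zeros(2, num_blocks, kv_heads, block_size, head_dim)

# The rest of the code sees the usual NHD view:
# [2, num_blocks, block_size, kv_heads, head_dim].
kv_cache_nhd = kv_cache_hnd.permute(0, 1, 3, 2, 4)
assert kv_cache_nhd.shape == (2, num_blocks, block_size, kv_heads, head_dim)

# Heterogeneous-TP splitting happens on the head dim of the physical layout:
# a decode rank reads a contiguous slice of heads from the remote worker.
tp_ratio = 2                                  # D_TP / P_TP
heads_per_decode_rank = kv_heads // tp_ratio
slice_idx = 1                                 # this decode rank's slice
start = slice_idx * heads_per_decode_rank
local_view = kv_cache_hnd[:, :, start:start + heads_per_decode_rank]
```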
This PR requires #18775 to be merged first, as we need the proper scaffolding code for enforcing a different KV cache layout.
Here's some numbers:

This has been tested on NIXL 0.2.1 (4f37f07) and UCX 1.18.0. In the MLA case, most of the splitting complexity above is not needed, as KV caches are replicated and can just be copied over in their entirety, just like homogeneous TP.
Code-wise, the changes are minimal, as we're using the same logic for discovery as well as rank assignment, so we can conveniently support both.