Conversation

zhyncs
Member

@zhyncs zhyncs commented Feb 17, 2025

Motivation

This version should be used only with --enable-flashinfer-mla.

For other LLM engines, if you refer to this PR, please include "Adapted from https://github.com/sgl-project/sglang/pull/3643/files", thank you :-)

ref #3550

Modifications

Checklist

@zhyncs
Member Author

zhyncs commented Feb 17, 2025

@zhyncs zhyncs self-assigned this Feb 17, 2025
@zhyncs zhyncs merged commit 714f3e6 into main Feb 17, 2025
21 of 22 checks passed
@zhyncs zhyncs deleted the zhyncs/prefix branch February 17, 2025 18:06
@zhyncs zhyncs mentioned this pull request Feb 17, 2025
```
@@ -1004,6 +1055,26 @@ def call_begin_forward(
    custom_mask=custom_mask,
    non_blocking=True,
)
elif (
    global_config.enable_flashinfer_mla
    and not global_server_args_dict["disable_radix_cache"]
```
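For context, the condition above gates which prefill path is taken when flashinfer MLA is enabled together with the radix cache. A minimal sketch of this kind of flag-based dispatch (function and return names are illustrative, not sglang's actual API):

```python
# Hypothetical sketch of flag-based prefill path selection.
# Names are illustrative only, not sglang's actual API.
def choose_prefill_path(enable_flashinfer_mla: bool, disable_radix_cache: bool) -> str:
    if enable_flashinfer_mla and not disable_radix_cache:
        # MLA with radix-cache prefix matching: take the paged prefill path
        # so cached prefix pages can be reused.
        return "paged_prefill"
    elif enable_flashinfer_mla:
        # No prefix cache: the whole sequence is new tokens.
        return "ragged_prefill"
    return "default_prefill"
```

The real code branches on `global_config.enable_flashinfer_mla` and `global_server_args_dict["disable_radix_cache"]` in the same spirit.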


@zhyncs A quick question: if I want to make MTP work with the flashinfer backend, does this code block have to run during the target-verify stage? Because the incremental prefill computation has to use the absorb trick with flashinfer?

Member Author


The full paged version is a temporary solution. I will soon support ragged prefill + paged prefill + paged decoding. I've been quite busy these days, so I haven't updated it yet.
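The "absorb trick" referenced above folds the key up-projection into the query, so attention scores can be computed directly against the compressed latent KV cache without ever materializing full keys. A minimal numpy sketch of the equivalence (all names and shapes are made up for illustration, not sglang's actual code):

```python
import numpy as np

# Illustrative sketch of the MLA weight-absorption idea: score the query
# against the compressed latent cache instead of decompressed keys.
rng = np.random.default_rng(0)
seq_len, d_latent, d_head = 5, 4, 8

q = rng.standard_normal(d_head)                  # one decode-step query head
c_kv = rng.standard_normal((seq_len, d_latent))  # compressed latent KV cache
w_uk = rng.standard_normal((d_latent, d_head))   # key up-projection (latent -> key)

# Naive path: decompress keys, then score against the query.
k = c_kv @ w_uk                                  # (seq_len, d_head)
scores_naive = k @ q                             # (seq_len,)

# Absorbed path: fold w_uk into the query, score directly on the latent cache.
q_absorbed = w_uk @ q                            # (d_latent,)
scores_absorbed = c_kv @ q_absorbed              # (seq_len,)

assert np.allclose(scores_naive, scores_absorbed)
```

The equivalence is plain matrix associativity, (C W) q = C (W q); the practical win is that the decompressed keys are never materialized, so attention runs over the small latent dimension.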
