Conversation

@ivanium ivanium commented Sep 7, 2024

Merge the latest commits from the upstream main branch.

hnyls2002 and others added 16 commits September 2, 2024 23:18
Co-authored-by: ispobock <ISPObaoke@163.com>
* test: test cases of combining multiple attention kernel calls to implement a sequence parallel kernel. Verified with 2 sp workers

* fix: simplify flashinfer kernel initialization (begin_forward() and end_forward())

* test: add logic for sp worker 1 which is basically the same but with different orders of kernel calls

* chore: format tweak

* feat: a general seq parallel attention kernel that achieves workload balance

* fix: minor tweak to the loop iteration within ring attention

* feat [radix_attention]: seq_parallel kernel with sync communication.

TODO: make the communication asynchronous and overlap it with computation

* test: update test cases for seq parallel attn kernel. Need to disable kv cache management before testing because we haven't implemented kv cache management for seq parallel yet

* chore [radix_attention]: format tweak

* feat: async communication within ring attention (see the ring-attention sketch below)

* fix [parallel_utils]: add missed files

* fix [infer_batch]: set default values for newly added sp-related metadata

* fix [bench_latency]: minor fixes to input args

* feat [parallel_utils]: get actual tp rank and size when both TP and SP are enabled

* feat [linear]: add QKVParallelLinear

* feat [llama2]: update llama model to use our QKVParallelLinear

* feat [model_runner]: initialize model parallel with sequence parallel

* fix [infer_batch]: 1. a minor issue when calling get_prefill_indices; 2. flashinfer initialization args

* fix [bench_latency]: load model with sp_rank

* feat [radix_attention]: automatically dispatch to seq-parallel attn kernel when sp_size > 1

* debug: stash current debug changes

* fix [radix_attention]: reshape q tensor before running the kernel

* bug fix for sp layout types

* fix: adjust tensor layout. TODO: fix many dirty hacks and hardcoded values

* fix [wip]: disable p2p communication within ring attention for now. TODO: fix the bug that causes communication to hang.

* chore [bench_latency]: disable decode for now since we haven't supported it

* upstream with correct prefill sp layout

* fix early exit on decode SP

* chore: tweak format

* update layout

* bug fix

* fix [linear, radix_attention]: fix q head indexes per SP worker to align with GQA setting.

* fix [infer_batch]: set up flashinfer kernels for the batch size > 1 case

* chore: tweak format

* fix [radix_attention]: revert commented-out kv cache store operations in normal attention

* fix: adjust k, v tensor shape to align with both TP and SP setting

* chore [llama2]: minor adjustment

* fix: update bench_latency to evenly distribute each sequence across all SP workers to avoid the layout issue

* test: update test cases to align with the current kernel's args

* fix [model_runner]: initialize TokenToKVPool with correct num_heads and enable KV cache store in SP attention

* chore [radix_attention]: clean up comments

* fix [model_runner]: correct num_heads in memory profiling as well to avoid OOM

* fix [infer_batch]: adopt SP KV cache allocation

* feat [linear]: correctly partition q proj along the num_heads dimension with GQA (see the head-partitioning sketch below)

* chore [llama2]: clean up stable variables

* feat [infer_batch]: adjust positions to SP layout when preparing input_metadata

* feat [infer_batch]: use a dedicated paged attn kernel for cross-SP-shard attn

* feat [parallel_state]: create sequence parallel comm groups (see the comm-group sketch below)

* test [sp_comm_group]: simple test case with sp_size = 2

* doc [parallel_state]: doc string for our SP group organization

* fix [infer_batch]: add padding zeros to positions tensor and out_cache_loc to fix positional encoding and KV cache store

* feat [radix_attn, infer_batch]: create masks for padded sequences so attn now works for unevenly distributed sequences too (see the padding sketch below)

* chore [bench_latency]: revert original prompts

* fix [parallel_state]: rename "actual" to "kv"

* refactor [radix_attention]: unified the two cases with different comm-comp tradeoffs

* chore: rename "actual_tp_[size|rank]" to "kv_tp_[size|rank]"

* fix [infer_batch]: ensure prefix_lens is not None in init_flashinfer_args

* fix [infer_batch]: only pad positions and out_cache_loc for prefill

* chore [linear]: clean up and revise comments

* chore [parallel_state]: revise comments

* chore [linear]: revise comments and class names

* chore [radix_attention]: add defensive checks

---------

Co-authored-by: ZYHowell <yhzhuang@cmu.edu>
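
The sketches below illustrate the main mechanisms the commit log refers to. They are simplified, hedged stand-ins written for this summary, not the repository's actual code.

First, a minimal sketch of the ring-attention pattern behind the sequence-parallel kernel commits (sync communication first, then the async follow-up): each SP worker keeps its query shard, rotates KV shards around the ring, and overlaps the P2P transfer with local attention. The `local_attention` callback and the log-sum-exp merge are assumptions about the kernel interface.

```python
# Minimal ring-attention sketch: rotate KV shards around the SP ring and
# overlap async send/recv with local attention. `local_attention` is an
# assumed callback returning (partial_out, lse) with lse = per-(token, head)
# log-sum-exp of the attention scores.
import torch
import torch.distributed as dist


def merge_partials(out_a, lse_a, out_b, lse_b):
    # Combine two partial attention results using their log-sum-exp statistics.
    lse = torch.logaddexp(lse_a, lse_b)
    out = (out_a * torch.exp(lse_a - lse).unsqueeze(-1)
           + out_b * torch.exp(lse_b - lse).unsqueeze(-1))
    return out, lse


def ring_attention(q, k, v, sp_group, local_attention):
    sp_size = dist.get_world_size(sp_group)
    sp_rank = dist.get_rank(sp_group)
    # P2P peers are specified as global ranks.
    send_dst = dist.get_global_rank(sp_group, (sp_rank + 1) % sp_size)
    recv_src = dist.get_global_rank(sp_group, (sp_rank - 1) % sp_size)

    out, lse = None, None
    k_cur, v_cur = k.contiguous(), v.contiguous()
    for step in range(sp_size):
        reqs = []
        if step + 1 < sp_size:
            # Post async P2P ops for the next KV shard before computing.
            k_next, v_next = torch.empty_like(k_cur), torch.empty_like(v_cur)
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, k_cur, send_dst, group=sp_group),
                dist.P2POp(dist.irecv, k_next, recv_src, group=sp_group),
                dist.P2POp(dist.isend, v_cur, send_dst, group=sp_group),
                dist.P2POp(dist.irecv, v_next, recv_src, group=sp_group),
            ])
        # Attend to the KV shard currently held; overlaps with the transfer above.
        part_out, part_lse = local_attention(q, k_cur, v_cur)
        if out is None:
            out, lse = part_out, part_lse
        else:
            out, lse = merge_partials(out, lse, part_out, part_lse)
        for req in reqs:
            req.wait()
        if step + 1 < sp_size:
            k_cur, v_cur = k_next, v_next
    return out
```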
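
The QKVParallelLinear commits and the GQA fix ("fix q head indexes per SP worker to align with GQA setting") concern how query heads are split when TP and SP are combined. The arithmetic below is one plausible partitioning, written to match the kv_tp_* naming in the log; the exact index mapping in the PR may differ.

```python
# Illustrative head partitioning under combined TP + SP with GQA. Q heads are
# sharded across all tp_size * sp_size workers, while KV heads are sharded only
# along the "kv" TP dimension, so the sp_size workers of one ring share the
# same KV heads. The slice arithmetic is an assumption for illustration.
def partition_heads(num_q_heads, num_kv_heads, tp_size, sp_size, tp_rank, sp_rank):
    q_heads_per_worker = num_q_heads // (tp_size * sp_size)
    kv_heads_per_worker = max(1, num_kv_heads // tp_size)

    # Each SP worker takes a contiguous chunk inside its TP rank's Q-head range,
    # keeping every chunk aligned with the KV heads that TP rank holds.
    q_start = (tp_rank * sp_size + sp_rank) * q_heads_per_worker
    q_slice = slice(q_start, q_start + q_heads_per_worker)
    kv_start = tp_rank * kv_heads_per_worker
    kv_slice = slice(kv_start, kv_start + kv_heads_per_worker)
    return q_slice, kv_slice
```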
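
The parallel_state commits create dedicated sequence-parallel communication groups. The sketch below shows one possible layout (ranks that share a TP index but differ in SP index form a group); every rank must call `dist.new_group` for every group, in the same order. The rank-to-group mapping is an assumption, not necessarily the PR's exact organization.

```python
# Sketch of building SP process groups next to TP groups.
import torch.distributed as dist


def initialize_sp_groups(tp_size: int, sp_size: int):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % (tp_size * sp_size) == 0, "world size must cover TP x SP"

    sp_group = None
    for start in range(0, world_size, tp_size * sp_size):
        for tp in range(tp_size):
            # One SP group per TP slice: same tp index, all sp indices.
            ranks = [start + sp * tp_size + tp for sp in range(sp_size)]
            group = dist.new_group(ranks)  # collective: called on every rank
            if rank in ranks:
                sp_group = group
    return sp_group
```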
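
The infer_batch commits about padding `positions` and `out_cache_loc` handle sequences whose lengths do not divide evenly across SP workers. The sketch below zero-pads each sequence to a multiple of `sp_size` and keeps a boolean mask so padded slots can be excluded from attention and KV-cache writes; the exact tensors padded in the PR may differ.

```python
# Sketch of padding each sequence's positions to a multiple of sp_size so the
# sequence can be split evenly across SP workers; the mask marks real tokens.
import torch


def pad_positions_for_sp(seq_lens, sp_size, device="cuda"):
    padded_positions, masks = [], []
    for seq_len in seq_lens:
        pad = (-seq_len) % sp_size  # extra slots needed for an even split
        pos = torch.arange(seq_len, dtype=torch.long, device=device)
        padded_positions.append(
            torch.cat([pos, torch.zeros(pad, dtype=torch.long, device=device)])
        )
        masks.append(
            torch.cat([torch.ones(seq_len, dtype=torch.bool, device=device),
                       torch.zeros(pad, dtype=torch.bool, device=device)])
        )
    return torch.cat(padded_positions), torch.cat(masks)
```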
@ivanium ivanium requested a review from ZYHowell September 7, 2024 02:17
@ZYHowell ZYHowell merged commit 0175ca2 into main Sep 9, 2024
2 of 9 checks passed