[Bug] Using hiradix_cache with dp attention cause hang #7158

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

I tried hicache together with dp-attention on the latest `0.4.7` version of sglang.

However, the server sometimes hangs on the initial request issued before it is fired up, and sometimes hangs during benchmarking.

By adding print statements to the writing check function in `hiradix_cache.py`, I found that the server hangs at the `all_reduce` call. I think that with dp attention we should pass the `attn_tp_group` process group to `HiCacheController` instead of `tp_group`. If one dp rank has no requests, it never reaches `all_reduce`, so the other ranks in `tp_group` wait there indefinitely.

![Image](https://github.com/user-attachments/assets/c834428e-7922-4a27-9fe9-26a639ddcc9d)
The second print message is never reached:
![Image](https://github.com/user-attachments/assets/196f2496-cb65-463a-8828-b0a1a4a7a56d)
~~I'm working on a PR of it.~~
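
To illustrate the suspected mechanism, here is a minimal stdlib-only sketch, not sglang code. It assumes that the attention-TP group size is `tp_size // dp_size` and that groups are contiguous blocks of ranks. `threading.Barrier` stands in for `all_reduce`, and the names `attn_tp_group_ranks` and `simulate` are hypothetical helpers introduced only for this example.

```python
import threading


def attn_tp_group_ranks(tp_rank, tp_size, dp_size):
    """Ranks sharing one attention-TP group (assumed: contiguous blocks)."""
    attn_tp_size = tp_size // dp_size
    start = (tp_rank // attn_tp_size) * attn_tp_size
    return list(range(start, start + attn_tp_size))


def simulate(groups, busy_ranks, timeout=0.5):
    """Map each busy rank to True if its collective completed, False on timeout.

    A Barrier plays the role of all_reduce: every member of the group must
    arrive, otherwise the ranks that did arrive wait until the timeout.
    """
    barriers = {tuple(g): threading.Barrier(len(g)) for g in groups}
    rank_to_barrier = {r: barriers[tuple(g)] for g in groups for r in g}
    results = {}

    def worker(rank):
        try:
            rank_to_barrier[rank].wait(timeout=timeout)
            results[rank] = True
        except threading.BrokenBarrierError:
            results[rank] = False  # stand-in for "hangs forever"

    threads = [threading.Thread(target=worker, args=(r,)) for r in busy_ranks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    tp_size, dp_size = 4, 4
    busy = [0, 1, 2]  # rank 3 has no request and never enters the collective

    full_tp_group = [list(range(tp_size))]
    print("tp_group:", simulate(full_tp_group, busy))

    attn_groups = sorted(
        {tuple(attn_tp_group_ranks(r, tp_size, dp_size)) for r in range(tp_size)}
    )
    print("attn_tp_group:", simulate([list(g) for g in attn_groups], busy))
```

With `--tp-size 4 --dp-size 4`, each attention-TP group in this model contains a single rank, so a busy rank never waits on an idle one, whereas a collective over the full `tp_group` needs all four ranks to arrive.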


### Reproduction

```bash
SGLANG_USE_MODELSCOPE=true \
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V2-Lite \
    --port 30001 --enable-hierarchical-cache --page-size 32 \
    --dp-size 4 --enable-dp-attention --tp-size 4 \
    --hicache-ratio 2 \
    --trust-remote-code

python ../benchmark/hicache/bench_multiturn.py \
    --model-path deepseek-ai/DeepSeek-V2-Lite --num-clients 24 --port 30001
```
You may also need to change `i[0]` to `i.prompt` in `bench_multiturn.py`, as I hit another bug at this line:
https://github.com/sgl-project/sglang/blob/8ab7d93c2e0da6bb0f3b08a43b57f2d604b57fe0/benchmark/hicache/bench_multiturn.py#L242
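
As a rough sketch of that workaround, the accessor below handles both shapes. It assumes the sampled items are either plain tuples whose first element is the prompt or objects exposing a `.prompt` attribute. `Row` and `get_prompt` are hypothetical names for illustration, not part of sglang.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for an object-style sample with a .prompt field."""
    prompt: str
    prompt_len: int
    output_len: int


def get_prompt(item):
    """Return the prompt from either an object with .prompt or a tuple."""
    prompt = getattr(item, "prompt", None)
    if prompt is not None:
        return prompt
    return item[0]  # tuple-style sample: (prompt, prompt_len, output_len)


if __name__ == "__main__":
    print(get_prompt(("hello", 1, 2)))
    print(get_prompt(Row("hello", 1, 2)))
```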

### Environment

Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A10
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07
PyTorch: 2.7.1+cu126
sglang: 0.4.7
sgl_kernel: 0.1.7
flashinfer_python: 0.2.6.post1
triton: 3.3.1
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.6
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.3
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.82.1
tiktoken: 0.9.0
anthropic: 0.52.1
litellm: 1.72.0
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    PIX      X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    NODE    NODE     X      PIX     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU3    NODE    NODE    PIX      X      SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    32-63,96-127    1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    32-63,96-127    1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     32-63,96-127    1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      32-63,96-127    1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 524288
