
On H20, the internode test fails when run on more than two nodes, but any pair of nodes passes without issues #255

@ZhenshengWu


error logs

[testing] Running with BF16, without top-k (async=False, previous=False) ...
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:941084 [6] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:941087 [3] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:941090 [1] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:941087 [3] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:941086 [4] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:941084 [6] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:941083 [7] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:941088 [2] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:941086 [4] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:941091 [5] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:941083 [7] NCCL INFO misc/socket.cc:881 -> 3
[rank0]:[W625 01:10:05.149569580 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938499:941095 [0] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:941090 [1] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO comm 0x97e3d00 rank 3 nranks 24 cudaDev 3 busId c6000 - Abort COMPLETE
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO comm 0x859f5c0 rank 6 nranks 24 cudaDev 6 busId 1a3000 - Abort COMPLETE
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO comm 0x8fbdd10 rank 1 nranks 24 cudaDev 1 busId 7e000 - Abort COMPLETE
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO comm 0x9198390 rank 7 nranks 24 cudaDev 7 busId 1c7000 - Abort COMPLETE
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO comm 0x8f00270 rank 4 nranks 24 cudaDev 4 busId 109000 - Abort COMPLETE
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO comm 0x9583580 rank 2 nranks 24 cudaDev 2 busId a2000 - Abort COMPLETE
ali-f8xhrryi-0001:938501:942465 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942465 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942465 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO comm 0x7fe9250 rank 5 nranks 24 cudaDev 5 busId 17f000 - Abort COMPLETE
ali-f8xhrryi-0001:938504:942530 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942530 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942530 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO comm 0x7c92cc0 rank 0 nranks 24 cudaDev 0 busId 8000 - Abort COMPLETE
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO comm 0x8316580 rank 6 nranks 24 cudaDev 6 busId 1a3000 - Abort COMPLETE
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO comm 0x8f0f540 rank 7 nranks 24 cudaDev 7 busId 1c7000 - Abort COMPLETE
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO comm 0x955af30 rank 3 nranks 24 cudaDev 3 busId c6000 - Abort COMPLETE
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO comm 0x8d34ec0 rank 1 nranks 24 cudaDev 1 busId 7e000 - Abort COMPLETE
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO comm 0x8c77230 rank 4 nranks 24 cudaDev 4 busId 109000 - Abort COMPLETE
W0625 01:10:06.967000 938221 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 938499 via signal SIGTERM
W0625 01:10:06.968000 938221 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 938501 via signal SIGTERM
W0625 01:10:06.968000 938221 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 938504 via signal SIGTERM
Traceback (most recent call last):
  File "/data1/workdata/DeepEP/tests/test_internode.py", line 252, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 215, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/data1/workdata/DeepEP/tests/test_internode.py", line 236, in test_loop
    test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
  File "/data1/workdata/DeepEP/tests/test_internode.py", line 110, in test_main
    recv_x, recv_topk_idx, recv_topk_weights, recv_num_tokens_per_expert_list, handle, event = buffer.dispatch(**dispatch_args)
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/deep_ep-1.1.0-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 311, in dispatch
    return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/deep_ep-1.1.0-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 421, in internode_dispatch
    recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
RuntimeError: Failed: Assertion error /data1/workdata/DeepEP/csrc/kernels/internode.cu:328 'false and "Unsupported RDMA ranks" 
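
The assertion message refers to the number of RDMA ranks, not to a socket or network failure. As a rough sanity check, and assuming (the log does not confirm this) that the RDMA rank count is the total rank count divided by the number of GPUs per node, the arithmetic for this run can be sketched as below. The names `rdma_rank_count` and `is_supported` are hypothetical, and the `observed_working` set only records what this report states (two nodes work); the actual list of accepted counts has to be read from the local DeepEP checkout around csrc/kernels/internode.cu:328.

```python
def rdma_rank_count(num_ranks: int, gpus_per_node: int) -> int:
    """Node-level (RDMA) rank count, assuming one RDMA rank per node."""
    if num_ranks % gpus_per_node != 0:
        raise ValueError("num_ranks must be a multiple of gpus_per_node")
    return num_ranks // gpus_per_node


def is_supported(num_rdma_ranks: int, supported: set[int]) -> bool:
    """Check a count against a caller-supplied set of accepted values."""
    return num_rdma_ranks in supported


if __name__ == "__main__":
    # From the log: "nranks 24" with cudaDev 0..7, i.e. 8 GPUs per node.
    count = rdma_rank_count(num_ranks=24, gpus_per_node=8)
    # Placeholder: only the two-node case is known to work from this report.
    observed_working = {2}
    print(f"RDMA ranks: {count}, known to work: {is_supported(count, observed_working)}")
```

If the count computed this way is absent from the set compiled into the kernel, the failure would be a build-time limitation and not a cluster misconfiguration, which would be consistent with the NCCL test passing on the whole cluster.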

Network (IB): RoCE, 4 × (2 × 200G)

env

export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_IB_TIMEOUT=23
export NCCL_IB_RETRY_CNT=7
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export NCCL_SET_THREAD_NAME=1
export MASTER_ADDR=10.0.0.114
export WORLD_SIZE=3
export RANK=2
export NCCL_SOCKET_IFNAME=eth0
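
For reference, `WORLD_SIZE=3` and `RANK=2` appear to describe nodes and not processes: together with `nranks 24` in the log they suggest 3 nodes with 8 processes each. A minimal sketch of that mapping follows, assuming the test spawns one process per GPU and numbers global ranks node by node. The function name `global_rank_layout` is hypothetical and is not a DeepEP API.

```python
import os
from typing import Mapping


def global_rank_layout(
    num_local_ranks: int, env: Mapping[str, str] = os.environ
) -> tuple[int, list[int]]:
    """Return (total rank count, global ranks hosted on this node).

    Assumes WORLD_SIZE is the node count and RANK is this node's index.
    """
    num_nodes = int(env.get("WORLD_SIZE", "1"))
    node_rank = int(env.get("RANK", "0"))
    total_ranks = num_nodes * num_local_ranks
    first = node_rank * num_local_ranks
    return total_ranks, list(range(first, first + num_local_ranks))


if __name__ == "__main__":
    # Values taken from the exports above; 8 GPUs per node per the log.
    total, ranks = global_rank_layout(8, {"WORLD_SIZE": "3", "RANK": "2"})
    print(total, ranks)
```

Under these assumptions the node with `RANK=2` would host global ranks 16 to 23, and the node that produced the log above (ranks 0 to 7) would be the one with `RANK=0`.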

The NCCL test across the entire cluster works as expected.
In addition, the low_latency test with three nodes also runs normally.

How can I solve this issue?
DeepEP code: latest version
nvshmem_src: 3.2.5-1
