error logs
[testing] Running with BF16, without top-k (async=False, previous=False) ...
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:941084 [6] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:941087 [3] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:941090 [1] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:941087 [3] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:941086 [4] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:941084 [6] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:941083 [7] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:941088 [2] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:941086 [4] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:941091 [5] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:941083 [7] NCCL INFO misc/socket.cc:881 -> 3
[rank0]:[W625 01:10:05.149569580 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938499:941095 [0] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:941090 [1] NCCL INFO misc/socket.cc:881 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938502:942218 [3] NCCL INFO comm 0x97e3d00 rank 3 nranks 24 cudaDev 3 busId c6000 - Abort COMPLETE
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938505:942217 [6] NCCL INFO comm 0x859f5c0 rank 6 nranks 24 cudaDev 6 busId 1a3000 - Abort COMPLETE
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938500:942233 [1] NCCL INFO comm 0x8fbdd10 rank 1 nranks 24 cudaDev 1 busId 7e000 - Abort COMPLETE
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938506:942237 [7] NCCL INFO comm 0x9198390 rank 7 nranks 24 cudaDev 7 busId 1c7000 - Abort COMPLETE
ali-f8xhrryi-0001:938503:942235 [4] NCCL INFO comm 0x8f00270 rank 4 nranks 24 cudaDev 4 busId 109000 - Abort COMPLETE
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938501:942245 [2] NCCL INFO comm 0x9583580 rank 2 nranks 24 cudaDev 2 busId a2000 - Abort COMPLETE
ali-f8xhrryi-0001:938501:942465 [2] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938501:942465 [2] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938501:942465 [2] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938504:942252 [5] NCCL INFO comm 0x7fe9250 rank 5 nranks 24 cudaDev 5 busId 17f000 - Abort COMPLETE
ali-f8xhrryi-0001:938504:942530 [5] NCCL INFO misc/socket.cc:64 -> 3
ali-f8xhrryi-0001:938504:942530 [5] NCCL INFO misc/socket.cc:80 -> 3
ali-f8xhrryi-0001:938504:942530 [5] NCCL INFO misc/socket.cc:829 -> 3
ali-f8xhrryi-0001:938499:942265 [0] NCCL INFO comm 0x7c92cc0 rank 0 nranks 24 cudaDev 0 busId 8000 - Abort COMPLETE
ali-f8xhrryi-0001:938505:942315 [6] NCCL INFO comm 0x8316580 rank 6 nranks 24 cudaDev 6 busId 1a3000 - Abort COMPLETE
ali-f8xhrryi-0001:938506:942449 [7] NCCL INFO comm 0x8f0f540 rank 7 nranks 24 cudaDev 7 busId 1c7000 - Abort COMPLETE
ali-f8xhrryi-0001:938502:942313 [3] NCCL INFO comm 0x955af30 rank 3 nranks 24 cudaDev 3 busId c6000 - Abort COMPLETE
ali-f8xhrryi-0001:938500:942446 [1] NCCL INFO comm 0x8d34ec0 rank 1 nranks 24 cudaDev 1 busId 7e000 - Abort COMPLETE
ali-f8xhrryi-0001:938503:942451 [4] NCCL INFO comm 0x8c77230 rank 4 nranks 24 cudaDev 4 busId 109000 - Abort COMPLETE
W0625 01:10:06.967000 938221 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 938499 via signal SIGTERM
W0625 01:10:06.968000 938221 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 938501 via signal SIGTERM
W0625 01:10:06.968000 938221 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 938504 via signal SIGTERM
Traceback (most recent call last):
  File "/data1/workdata/DeepEP/tests/test_internode.py", line 252, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, ), nprocs=num_processes)
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 215, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/data1/workdata/DeepEP/tests/test_internode.py", line 236, in test_loop
    test_main(i, local_rank, num_local_ranks, num_ranks, num_nodes, rank, buffer, group)
  File "/data1/workdata/DeepEP/tests/test_internode.py", line 110, in test_main
    recv_x, recv_topk_idx, recv_topk_weights, recv_num_tokens_per_expert_list, handle, event = buffer.dispatch(**dispatch_args)
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/deep_ep-1.1.0-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 311, in dispatch
    return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
  File "/data1/anaconda/envs/test-deepep/lib/python3.10/site-packages/deep_ep-1.1.0-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 421, in internode_dispatch
    recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
RuntimeError: Failed: Assertion error /data1/workdata/DeepEP/csrc/kernels/internode.cu:328 'false and "Unsupported RDMA ranks"'
IB: RoCE, 4 * (200G * 2)
Environment variables:
export NCCL_IB_HCA=mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_IB_TIMEOUT=23
export NCCL_IB_RETRY_CNT=7
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export NCCL_SET_THREAD_NAME=1
export MASTER_ADDR=10.0.0.114
export WORLD_SIZE=3
export RANK=2
export NCCL_SOCKET_IFNAME=eth0
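For reference, here is a minimal sketch of the rank layout these settings appear to imply. It assumes that WORLD_SIZE counts nodes and RANK is the node index (which matches the `nranks 24` in the logs: 3 nodes with 8 GPUs each), so the internode kernel would see 3 RDMA ranks. The helper names `rank_layout` and `is_supported_node_count` are hypothetical, and the set of supported node counts is a placeholder: this report does not say which counts the kernels are compiled for, so the real values would have to be read from the dispatch switch near csrc/kernels/internode.cu:328 in the installed source.

```python
import os


def rank_layout(num_nodes: int, node_rank: int, num_local_ranks: int = 8):
    """Return (total rank count, global ranks hosted on this node).

    Assumes every node runs the same number of local ranks (one per GPU).
    """
    num_ranks = num_nodes * num_local_ranks
    first = node_rank * num_local_ranks
    return num_ranks, list(range(first, first + num_local_ranks))


def is_supported_node_count(num_nodes: int, supported: set) -> bool:
    """Check the node count (number of RDMA ranks) against a caller-supplied set."""
    return num_nodes in supported


if __name__ == "__main__":
    # Assumption: WORLD_SIZE is the number of nodes and RANK the node index.
    num_nodes = int(os.environ.get("WORLD_SIZE", "3"))
    node_rank = int(os.environ.get("RANK", "2"))

    num_ranks, local_globals = rank_layout(num_nodes, node_rank)
    print(f"nodes={num_nodes} total_ranks={num_ranks} ranks_on_this_node={local_globals}")

    # PLACEHOLDER set, not verified against DeepEP: replace it with the
    # node counts listed in the kernel source you actually built.
    placeholder_supported = {2, 4, 8, 16}
    if not is_supported_node_count(num_nodes, placeholder_supported):
        print(f"{num_nodes} RDMA ranks is not in {sorted(placeholder_supported)}")
```

If the real set turns out not to contain 3, that would explain why the normal internode test asserts while NCCL and the low_latency test still run on the same three nodes.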
The NCCL test across the entire cluster works as expected.
In addition, the low_latency test with three nodes also runs normally.
How can I solve this issue?
DeepEP code: the latest version
nvshmem_src: 3.2.5-1