
Conversation

@yewentao256 (Collaborator) commented on Jun 25, 2025

Purpose

Fix the address/port already in use error that intermittently fails the pplx MoE tests:

>       raise ProcessRaisedException(msg, error_index, failed_process.pid)
E       torch.multiprocessing.spawn.ProcessRaisedException: 
E       
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E           fn(i, *args)
E         File "/home/wentao/vllm-source/tests/kernels/moe/utils.py", line 51, in _worker_parallel_launch
E           torch.distributed.init_process_group(
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
E           return func(*args, **kwargs)
E                  ^^^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
E           func_return = func(*args, **kwargs)
E                         ^^^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1710, in init_process_group
E           store, rank, world_size = next(rendezvous_iterator)
E                                     ^^^^^^^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/rendezvous.py", line 230, in _tcp_rendezvous_handler
E           store = _create_c10d_store(
E                   ^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/rendezvous.py", line 198, in _create_c10d_store
E           return TCPStore(
E                  ^^^^^^^^^
E       torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use

../../../../.wentao_env/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:215: ProcessRaisedException
------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------
INFO 06-25 15:21:43 [__init__.py:244] Automatically detected platform cuda.
INFO 06-25 15:21:43 [__init__.py:244] Automatically detected platform cuda.
------------------------------------------------------------------ Captured stderr call ------------------------------------------------------------------
W0625 15:21:48.812000 100212 torch/multiprocessing/spawn.py:169] Terminating process 114296 via signal SIGTERM
================================================================ short test summary info =================================================================
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-64-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-64-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/usr/lib/python3.12/selectors.py:415: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
======================================================= 22 failed, 216 passed in 264.29s (0:04:24) =======================================================
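The root cause: every launch rendezvoused on the fixed default port 29500, so concurrent or back-to-back test runs collide. A minimal standalone illustration of the failure mode with plain sockets (not vLLM code) reproduces the same errno 98 seen in the log above:

```python
import socket

# The first listener claims the fixed port.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 29500))
first.listen()

# A second bind to the same port fails exactly like the TCPStore above:
# OSError: [Errno 98] Address already in use
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
second.bind(("127.0.0.1", 29500))
```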

Test

[screenshot of test results omitted]

Signed-off-by: yewentao256 <zhyanwentao@126.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot, Contributor) left a comment


Summary of Changes

Hello @yewentao256, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical address already in use error that was causing intermittent failures in distributed training tests. By implementing dynamic port allocation for torch.distributed process group initialization, it ensures that tests can reliably find and use an available port, thereby enhancing the stability and consistency of the test suite.

Highlights

  • Bug Fix - Port Contention: This pull request addresses the EADDRINUSE error (address already in use) that occurred during torch.distributed process group initialization, specifically when the default port 29500 was unavailable, leading to test failures.
  • Dynamic Port Allocation: A new utility function, find_free_port, has been introduced in tests/kernels/moe/deepep_utils.py. This function dynamically identifies and returns an available network port by binding a temporary socket to port 0 (see the sketch after these highlights).
  • Improved Test Reliability: The parallel_launch function, used for setting up distributed tests, has been updated to utilize the find_free_port utility. This ensures that torch.distributed.init_process_group always attempts to use a free port, significantly improving the reliability and robustness of distributed tests by preventing port conflicts.
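
The function body is not shown in this thread; below is a minimal sketch of the bind-to-port-0 approach the highlights describe (an assumption about the implementation, not the PR's verbatim code):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding a temporary socket to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))            # port 0: the kernel assigns any free port
        return sock.getsockname()[1]  # read back the assigned port number
```

There is an inherent race here (another process could grab the port between discovery and use), but it reliably sidesteps the fixed-29500 collision across test runs.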
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (bot, Contributor) left a comment


Code Review

The pull request introduces a fix for the address/port already in use error by dynamically finding a free port for the TCP store. It adds a find_free_port function and integrates it into the parallel_launch function. The changes enhance the robustness of the tests by avoiding port conflicts.

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@@ -79,6 +82,13 @@ def _worker_parallel_launch(
torch.distributed.destroy_process_group()


def find_free_port():
Collaborator

We can probably move this to utils?

Collaborator Author

Sounds good, moved to vllm/model_executor/layers/fused_moe/utils.py
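
For context, here is a sketch of how the discovered port might be threaded into the rendezvous in parallel_launch, reusing the find_free_port sketch above; the backend choice and exact signatures are illustrative assumptions, not the PR's code:

```python
import torch.distributed as dist
import torch.multiprocessing as mp

def parallel_launch(world_size: int, worker_fn, *args) -> None:
    # Pick a fresh port per launch instead of the fixed default 29500.
    port = find_free_port()
    mp.spawn(
        _worker,
        args=(world_size, port, worker_fn) + args,
        nprocs=world_size,
        join=True,
    )

def _worker(rank: int, world_size: int, port: int, worker_fn, *args) -> None:
    # torch.multiprocessing.spawn passes the process index as the first argument.
    dist.init_process_group(
        backend="gloo",  # assumption; the real tests likely use nccl on GPUs
        init_method=f"tcp://127.0.0.1:{port}",  # free port avoids EADDRINUSE
        rank=rank,
        world_size=world_size,
    )
    try:
        worker_fn(rank, *args)
    finally:
        dist.destroy_process_group()
```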

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@mgoin mgoin changed the title [Bug Fix] Fix address/port already in use error [Bug Fix] Fix address/port already in use error for deep_ep test Jun 25, 2025
@mgoin mgoin added bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed ci-failure Issue about an unexpected test failure in CI labels Jun 25, 2025
@mgoin mgoin added ci/build and removed ci-failure Issue about an unexpected test failure in CI labels Jun 25, 2025
@mgoin mgoin removed this from CI Failures Jun 25, 2025
@DarkLight1337 DarkLight1337 merged commit c894c5d into vllm-project:main Jun 26, 2025
80 checks passed
@yewentao256 yewentao256 deleted the wye-fix-address-already-in-use-issue branch June 26, 2025 15:00
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wwl2755-google pushed a commit to wwl2755-google/vllm that referenced this pull request Jul 1, 2025
@yewentao256 yewentao256 changed the title [Bug Fix] Fix address/port already in use error for deep_ep test [Bug Fix] Fix address/port already in use error for pplx test Jul 2, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Labels: bug (Something isn't working), ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet
4 participants