
Conversation

@yewentao256 (Collaborator) commented on Jun 25, 2025

Purpose

Fix the address/port already in use error that intermittently fails the pplx MoE tests:

>       raise ProcessRaisedException(msg, error_index, failed_process.pid)
E       torch.multiprocessing.spawn.ProcessRaisedException: 
E       
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E           fn(i, *args)
E         File "/home/wentao/vllm-source/tests/kernels/moe/utils.py", line 51, in _worker_parallel_launch
E           torch.distributed.init_process_group(
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
E           return func(*args, **kwargs)
E                  ^^^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
E           func_return = func(*args, **kwargs)
E                         ^^^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1710, in init_process_group
E           store, rank, world_size = next(rendezvous_iterator)
E                                     ^^^^^^^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/rendezvous.py", line 230, in _tcp_rendezvous_handler
E           store = _create_c10d_store(
E                   ^^^^^^^^^^^^^^^^^^^
E         File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/distributed/rendezvous.py", line 198, in _create_c10d_store
E           return TCPStore(
E                  ^^^^^^^^^
E       torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use

../../../../.wentao_env/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:215: ProcessRaisedException
------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------
INFO 06-25 15:21:43 [__init__.py:244] Automatically detected platform cuda.
INFO 06-25 15:21:43 [__init__.py:244] Automatically detected platform cuda.
------------------------------------------------------------------ Captured stderr call ------------------------------------------------------------------
W0625 15:21:48.812000 100212 torch/multiprocessing/spawn.py:169] Terminating process 114296 via signal SIGTERM
================================================================ short test summary info =================================================================
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-8-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-1-64-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-8-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-2-64-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk2] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-8-mnk3] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-64-mnk0] - torch.multiprocessing.spawn.ProcessRaisedException: 
FAILED test_pplx_moe.py::test_pplx_prepare_finalize[False-world_dp_size0-dtype0-6-64-mnk1] - torch.multiprocessing.spawn.ProcessRaisedException: 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/usr/lib/python3.12/selectors.py:415: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
======================================================= 22 failed, 216 passed in 264.29s (0:04:24) =======================================================
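The root cause: every launch rendezvoused on the fixed default port 29500, so concurrent or back-to-back test runs collide. A minimal standalone illustration of the failure mode with plain sockets (not vLLM code) reproduces the same errno 98 seen in the log above:

```python
import socket

# The first listener claims the fixed port.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 29500))
first.listen()

# A second bind to the same port fails exactly like the TCPStore above:
# OSError: [Errno 98] Address already in use
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
second.bind(("127.0.0.1", 29500))
```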

Test

[screenshot of test results omitted]

Signed-off-by: yewentao256 <zhyanwentao@126.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot, Contributor) left a comment


Summary of Changes

Hello @yewentao256, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical address already in use error that was causing intermittent failures in distributed training tests. By implementing dynamic port allocation for torch.distributed process group initialization, it ensures that tests can reliably find and use an available port, thereby enhancing the stability and consistency of the test suite.

Highlights

  • Bug Fix - Port Contention: This pull request addresses the EADDRINUSE error (address already in use) that occurred during torch.distributed process group initialization, specifically when the default port 29500 was unavailable, leading to test failures.
  • Dynamic Port Allocation: A new utility function, find_free_port, has been introduced in tests/kernels/moe/deepep_utils.py. This function dynamically identifies and returns an available network port by binding a temporary socket to port 0 (see the sketch after these highlights).
  • Improved Test Reliability: The parallel_launch function, used for setting up distributed tests, has been updated to utilize the find_free_port utility. This ensures that torch.distributed.init_process_group always attempts to use a free port, significantly improving the reliability and robustness of distributed tests by preventing port conflicts.
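
The function body is not shown in this thread; below is a minimal sketch of the bind-to-port-0 approach the highlights describe (an assumption about the implementation, not the PR's verbatim code):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding a temporary socket to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))            # port 0: the kernel assigns any free port
        return sock.getsockname()[1]  # read back the assigned port number
```

There is an inherent race here (another process could grab the port between discovery and use), but it reliably sidesteps the fixed-29500 collision across test runs.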
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (bot, Contributor) left a comment


Code Review

The pull request introduces a fix for the address/port already in use error by dynamically finding a free port for the TCP store. It adds a find_free_port function and integrates it into the parallel_launch function. The changes enhance the robustness of the tests by avoiding port conflicts.

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@@ -79,6 +82,13 @@ def _worker_parallel_launch(
torch.distributed.destroy_process_group()


def find_free_port():
Collaborator

We can probably move this to utils?

Collaborator Author

Sounds good, moved to vllm/model_executor/layers/fused_moe/utils.py
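
For context, here is a sketch of how the discovered port might be threaded into the rendezvous in parallel_launch, reusing the find_free_port sketch above; the backend choice and exact signatures are illustrative assumptions, not the PR's code:

```python
import torch.distributed as dist
import torch.multiprocessing as mp

def parallel_launch(world_size: int, worker_fn, *args) -> None:
    # Pick a fresh port per launch instead of the fixed default 29500.
    port = find_free_port()
    mp.spawn(
        _worker,
        args=(world_size, port, worker_fn) + args,
        nprocs=world_size,
        join=True,
    )

def _worker(rank: int, world_size: int, port: int, worker_fn, *args) -> None:
    # torch.multiprocessing.spawn passes the process index as the first argument.
    dist.init_process_group(
        backend="gloo",  # assumption; the real tests likely use nccl on GPUs
        init_method=f"tcp://127.0.0.1:{port}",  # free port avoids EADDRINUSE
        rank=rank,
        world_size=world_size,
    )
    try:
        worker_fn(rank, *args)
    finally:
        dist.destroy_process_group()
```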

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@mgoin mgoin changed the title [Bug Fix] Fix address/port already in use error [Bug Fix] Fix address/port already in use error for deep_ep test Jun 25, 2025
@mgoin mgoin added bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed ci-failure Issue about an unexpected test failure in CI labels Jun 25, 2025
@mgoin mgoin added ci/build and removed ci-failure Issue about an unexpected test failure in CI labels Jun 25, 2025
@mgoin mgoin removed this from CI Failures Jun 25, 2025
@DarkLight1337 DarkLight1337 merged commit c894c5d into vllm-project:main Jun 26, 2025
80 checks passed
@yewentao256 yewentao256 deleted the wye-fix-address-already-in-use-issue branch June 26, 2025 15:00
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
wwl2755-google pushed a commit to wwl2755-google/vllm that referenced this pull request Jul 1, 2025
@yewentao256 yewentao256 changed the title [Bug Fix] Fix address/port already in use error for deep_ep test [Bug Fix] Fix address/port already in use error for pplx test Jul 2, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Labels: bug (Something isn't working), ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet
4 participants