
add support for port 0 auto-selection in multi-GPU environments #3501

Merged: SunMarc merged 2 commits into huggingface:main from hellobiondi:fix-auto-port-selection on May 12, 2025

Conversation

Contributor

@hellobiondi hellobiondi commented Apr 11, 2025

What does this PR do?

This PR implements support for port 0 auto-selection in multi-GPU environments (prepare_multi_gpu_env()). The documentation already states that setting the port to 0 automatically selects the next available port, but this behavior was never implemented in the code.

When main_process_port is set to 0, the code now:

  • Automatically finds an available port through socket binding
  • Updates relevant arguments (master_port, rdzv_endpoint) with the selected port
  • Provides a more seamless experience for users working in environments where specific ports might be occupied
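
As a rough illustration of the steps listed above, here is a minimal sketch of port 0 auto-selection through socket binding. It is not the code from this PR: `find_free_port`, `resolve_main_process_port`, and `build_rdzv_endpoint` are hypothetical helper names, and the `args` object is assumed to carry only a `main_process_port` attribute.

```python
import socket
from types import SimpleNamespace


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding a socket to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (host, port); the port is the one the OS picked.
        return sock.getsockname()[1]


def resolve_main_process_port(args) -> int:
    """Hypothetical helper: swap a requested port of 0 for a concrete free port."""
    if args.main_process_port == 0:
        args.main_process_port = find_free_port()
    return args.main_process_port


def build_rdzv_endpoint(ip: str, port: int) -> str:
    """Hypothetical helper: format a rendezvous endpoint as 'ip:port'."""
    return f"{ip}:{port}"


if __name__ == "__main__":
    args = SimpleNamespace(main_process_port=0)  # assumed stand-in for parsed CLI args
    port = resolve_main_process_port(args)
    print(build_rdzv_endpoint("127.0.0.1", port))
```

Note that the socket is closed before the port is handed to the launcher, so another process could in principle claim the port in between. The sketch accepts that small race for simplicity.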

Before submitting

Who can review?

@SunMarc @zach-huggingface - This relates to the Command Line Interface and distributed training functionality.

Comment thread src/accelerate/utils/launch.py Outdated
Member

@SunMarc SunMarc left a comment

Thanks for adding this! Is this something we can also do for DeepSpeed? I left a minor comment; can you also resolve @BenjaminBossan's comment?

Comment on lines 234 to 236
    if need_port_check and is_port_in_use(main_process_port):
        raise ConnectionError(
            f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
Member

Can we also update the docs? It should work on multiple machines now.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc
Member

SunMarc commented Apr 23, 2025

friendly ping @hellobiondi

Member

@SunMarc SunMarc left a comment

Thanks! One last thing that would be nice is to confirm that it actually works when training a model. Do you think you can quickly test that?

@SunMarc SunMarc merged commit 9b2d6ea into huggingface:main May 12, 2025
23 of 25 checks passed

5 participants