add support for port 0 auto-selection in multi-GPU environments #3501
Merged
SunMarc (Member) reviewed on Apr 17, 2025 and left a comment:
Thanks for adding this! Is this something we can also do for DeepSpeed? I left a minor comment. Can you also resolve @BenjaminBossan's comment?
Comment on lines 234 to 236:
    if need_port_check and is_port_in_use(main_process_port):
        raise ConnectionError(
            f"Tried to launch distributed communication on port `{main_process_port}`, but another process is utilizing it. "
Member
Can we also update the docs? It should work on multiple machines now.
Member
Friendly ping @hellobiondi
SunMarc (Member) approved these changes on Apr 29, 2025 and left a comment:
Thanks! One last thing that would be nice is to confirm that it actually works when training a model. Do you think you can quickly test that?
What does this PR do?
This PR implements support for port 0 auto-selection in multi-GPU environments (prepare_multi_gpu_env()). The documentation already mentions that setting the port to 0 will automatically select the next available port, but this functionality wasn't actually implemented in the code.
When main_process_port is set to 0, the code now:
… (master_port, rdzv_endpoint) with the selected port
Before submitting
… tests/test_launch.py
Who can review?
@SunMarc @zach-huggingface - This relates to the Command Line Interface and distributed training functionality.
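For readers unfamiliar with the port 0 convention described in this PR, here is a minimal sketch of the general technique, not the code merged here. It assumes a single host and uses only the standard library. The helper names find_free_port, port_in_use, and resolve_main_process_port are hypothetical and are not part of Accelerate's API.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0 (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]  # the port the OS actually assigned


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) == 0


def resolve_main_process_port(main_process_port: int) -> int:
    """Turn a requested port into a concrete one (hypothetical helper).

    0 means "pick any free port"; any other value is checked for conflicts.
    """
    if main_process_port == 0:
        return find_free_port()
    if port_in_use(main_process_port):
        raise ConnectionError(
            f"Port {main_process_port} is already in use by another process."
        )
    return main_process_port


if __name__ == "__main__":
    print(resolve_main_process_port(0))
```

Note that the socket is closed before the launcher would bind the chosen port, so another process could in principle claim it in between. A sketch like this narrows that window but does not remove it.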