Skip to content

remove use_xpu to fix ut issues, we don't need this since XPU is OOB …#3460

Merged
SunMarc merged 4 commits into
huggingface:mainfrom
yao-matrix:issue53
Apr 1, 2025
Merged

remove use_xpu to fix ut issues, we don't need this since XPU is OOB …#3460
SunMarc merged 4 commits into
huggingface:mainfrom
yao-matrix:issue53

Conversation

@yao-matrix
Copy link
Copy Markdown
Contributor

@yao-matrix yao-matrix commented Mar 27, 2025

Sympton
When run transformers e2e training ut pytest -rA tests/trainer/test_trainer.py::TrainerIntegrationTest::test_end_to_end_example on XPU, we will get below error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and xpu:0! (when checking argument for argument index in method wrapper_XPU__index_select)

Root Cause
When run this test case while there is no default accelerate config in “/root/.cache/huggingface/accelerate/default_config.yaml”(generated by accelerate config), the use_xpu flag in launch.py (default False) blocked the using of XPU, so accelerate is launched both in single process and in cpu, but when data comes, it automatically detected xpu and put data into xpu, so model is in cpu and data is in xpu, lead to this error.

Fix

XPU is now native citizen of PyTorch, we need align the same behavior w/ CUDA, so remove these special handling.

…supported now

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
@yao-matrix yao-matrix marked this pull request as draft March 27, 2025 08:03
@yao-matrix
Copy link
Copy Markdown
Contributor Author

@sywangyi @faaany , this is a pretty fundamental change, pls review first

@yao-matrix yao-matrix marked this pull request as ready for review March 28, 2025 02:35
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
@yao-matrix
Copy link
Copy Markdown
Contributor Author

@faaany , since it changes the default behavior, could you help run accelerate UTs for this? Thx

@faaany
Copy link
Copy Markdown
Contributor

faaany commented Mar 28, 2025

@faaany , since it changes the default behavior, could you help run accelerate UTs for this? Thx

sure

@faaany
Copy link
Copy Markdown
Contributor

faaany commented Mar 28, 2025

@yao-matrix no test error is found related to this PR.

@yao-matrix
Copy link
Copy Markdown
Contributor Author

@SunMarc @muellerzr , pls help review, thx.

Copy link
Copy Markdown
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall on the surface: good. But let's be careful to deprecate please!

Comment thread docs/source/package_reference/cli.md
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
@yao-matrix
Copy link
Copy Markdown
Contributor Author

@muellerzr @SunMarc, updated per your comments. Pls help review whether any other changes needed.

Copy link
Copy Markdown
Member

@SunMarc SunMarc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks !

@SunMarc SunMarc merged commit 67a768b into huggingface:main Apr 1, 2025
25 checks passed
@yao-matrix yao-matrix deleted the issue53 branch April 2, 2025 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants