set backend correctly for CUDA+FSDP2+cpu-offload by SunMarc · Pull Request #3574 · huggingface/accelerate

SunMarc · 2025-05-15T09:29:29Z

What does this PR do?

Supersedes #3544

HuggingFaceDocBuilderDev · 2025-05-15T09:33:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

universuen · 2025-08-04T06:45:27Z

@SunMarc Hi, this is a really nice patch! However, I found a corner case when configuring FSDP through keyword arguments like this:

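from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

# FSDP2 + CPU offload configured through the plugin only; this is the case that hits the error.
Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision='bf16',
    fsdp_plugin=FullyShardedDataParallelPlugin(
        fsdp_version=2,
        cpu_offload=True,
    ),
)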

Currently, I have to set the backend explicitly to avoid the error, but I haven't had time to find a proper fix for this.

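from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin, InitProcessGroupKwargs

# Workaround: request NCCL for CUDA tensors and Gloo for CPU tensors explicitly.
Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision='bf16',
    fsdp_plugin=FullyShardedDataParallelPlugin(
        fsdp_version=2,
        cpu_offload=True,
    ),
    kwargs_handlers=[
        InitProcessGroupKwargs(
            backend="cuda:nccl,cpu:gloo"
        ),
    ],
)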
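For readers unfamiliar with the backend string in the workaround above, here is a minimal sketch of how a comma-separated "device:backend" specification can be read as a per-device mapping. The helper name parse_backend_spec is hypothetical and is not part of accelerate or torch; it only illustrates the string format and does no real process-group setup.

```python
def parse_backend_spec(spec: str) -> dict[str, str]:
    """Split a 'device:backend,device:backend' string into a mapping.

    Hypothetical helper for illustration only; not an accelerate or torch API.
    """
    mapping = {}
    for entry in spec.split(","):
        device, sep, backend = entry.strip().partition(":")
        if not sep or not device or not backend:
            raise ValueError(f"expected 'device:backend', got {entry!r}")
        mapping[device] = backend
    return mapping


# The string from the workaround: NCCL handles CUDA tensors, Gloo handles CPU tensors.
spec = parse_backend_spec("cuda:nccl,cpu:gloo")
print(spec)
```

With CPU offload, some collectives operate on CPU tensors, which is presumably why a CPU-capable backend such as Gloo has to be registered alongside NCCL.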

SunMarc · 2025-08-05T11:02:10Z

Indeed, that's an edge case that we might need to fix if we want to allow users to depend only on the plugin in the future. cc @S1ro1

SunMarc · 2025-08-05T11:03:01Z

I guess the easiest way for now is to update the kwargs passed to PartialState depending on fsdp_plugin.

S1ro1 · 2025-08-05T11:40:45Z

> I guess the easiest way for now is to update the kwargs passed to PartialState depending on fsdp_plugin.

I think we should just set gloo by default together with nccl whenever FSDP2 is in use. For example, the async checkpointing I'm working on also requires gloo, so defaulting to both seems sensible, even if it costs a little overhead at launch.
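The two options discussed here can be compared in a short sketch. This is not accelerate's implementation: pick_backend and its parameters are hypothetical, a CUDA device is assumed, and returning None stands for "leave the choice to the framework default". With always_gloo_for_fsdp2=False the backend depends on the plugin's cpu_offload setting (the first suggestion); with True, Gloo is added for every FSDP2 run (the second suggestion).

```python
from typing import Optional


def pick_backend(
    fsdp_version: int,
    cpu_offload: bool,
    user_backend: Optional[str] = None,
    always_gloo_for_fsdp2: bool = False,
) -> Optional[str]:
    """Choose a process-group backend string (hypothetical sketch, CUDA assumed)."""
    # An explicit user choice (e.g. via InitProcessGroupKwargs) always wins.
    if user_backend is not None:
        return user_backend
    # FSDP2 needs a CPU backend when offloading, or always under the second policy.
    if fsdp_version == 2 and (cpu_offload or always_gloo_for_fsdp2):
        return "cuda:nccl,cpu:gloo"
    # Otherwise leave the decision to the framework default.
    return None


print(pick_backend(fsdp_version=2, cpu_offload=True))
print(pick_backend(fsdp_version=2, cpu_offload=False, always_gloo_for_fsdp2=True))
```

The second policy is simpler to reason about and also covers features such as async checkpointing, at the cost of initializing Gloo even when nothing runs on the CPU.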

SunMarc · 2025-08-05T12:24:44Z

Okay, then we can do that in the async checkpoint PR.

winglian and others added 2 commits May 1, 2025 16:24

set backend correctly for CUDA+FSDP2+cpu-offload

0af83a3

offload

7a7d6e7

format

82377b0

SunMarc mentioned this pull request May 15, 2025

set backend correctly for CUDA+FSDP2+cpu-offload #3544

Closed

5 tasks

SunMarc merged commit cd37bbb into main May 15, 2025
28 of 29 checks passed

SunMarc deleted the offload-fsdp branch May 15, 2025 09:38

SunMarc merged 3 commits into main from offload-fsdp
