Skip to content

set backend correctly for CUDA+FSDP2+cpu-offload#3574

Merged
SunMarc merged 3 commits into
mainfrom
offload-fsdp
May 15, 2025
Merged

set backend correctly for CUDA+FSDP2+cpu-offload#3574
SunMarc merged 3 commits into
mainfrom
offload-fsdp

Conversation

@SunMarc
Copy link
Copy Markdown
Member

@SunMarc SunMarc commented May 15, 2025

What does this PR do?

Supersedes #3544

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc merged commit cd37bbb into main May 15, 2025
28 of 29 checks passed
@SunMarc SunMarc deleted the offload-fsdp branch May 15, 2025 09:38
@universuen
Copy link
Copy Markdown
Contributor

@SunMarc Hi, it is really a nice patch! However, I found a corner case when setting fsdp using kwargs like this:

Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision='bf16',
    fsdp_plugin=FullyShardedDataParallelPlugin(
        fsdp_version=2,
        cpu_offload=True,
    ),
)

Currently, I have to set the backend explicitly to avoid the error, but didn't have time to find a final solution to this.

Accelerator(
    gradient_accumulation_steps=1,
    mixed_precision='bf16',
    fsdp_plugin=FullyShardedDataParallelPlugin(
        fsdp_version=2,
        cpu_offload=True,
    ),
    kwargs_handlers=[
        InitProcessGroupKwargs(
            backend="cuda:nccl,cpu:gloo"
        ),
    ]
)

@SunMarc
Copy link
Copy Markdown
Member Author

SunMarc commented Aug 5, 2025

Indeed that's an edge case that we might need to fix if we want to allow users to depend only on the plugin in the future. cc @S1ro1

@SunMarc
Copy link
Copy Markdown
Member Author

SunMarc commented Aug 5, 2025

I guess the easiest way for now is to update kwargs that is passed in partial state depending on fsdp_plugin

@S1ro1
Copy link
Copy Markdown
Contributor

S1ro1 commented Aug 5, 2025

I guess the easiest way for now is to update kwargs that is passed in partial state depending on fsdp_plugin

I think we should just set gloo by default together with nccl if fsdp2 is happening, i.e. async checkpointing I work on also requires gloo so I feel like defaulting to both is sensible, even costing a little overhead in launch

@SunMarc
Copy link
Copy Markdown
Member Author

SunMarc commented Aug 5, 2025

Okay, then we can do that in the async checkpoint pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants