Fix stateful dataloader DDP by SunMarc · Pull Request #3952 · huggingface/accelerate

SunMarc · 2026-03-03T14:50:34Z

What does this PR do?

Fixes #3938. The existing tests didn't catch the issue the user reported, so I changed them. Here's a summary of the changes:

Fix a bug where DataLoaderShard.adjust_state_dict_for_prefetch() incorrectly subtracted num_processes - 1 from the state dict counters (_sampler_iter_yielded, _num_yielded) in DDP, causing the resumed dataloader to replay already-consumed batches.
The num_processes - 1 correction in the base class DataLoaderAdapter is only valid for DataLoaderDispatcher (which fetches num_processes batches per step from process 0), not for DataLoaderShard (where each process has its own sharded iterator with a single 1-batch look-ahead).
Strengthen the existing test_stateful_dataloader and test_stateful_dataloader_save_state tests: save state earlier (after 3 batches instead of at the second-to-last batch), add a length assertion, and test both iterable and map-style datasets.
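To make the dispatcher/shard distinction concrete, here is a minimal sketch of how a per-class prefetch correction could be expressed. The classes `AdapterSketch`, `DispatcherSketch`, `ShardSketch` and the method `prefetch_correction` are hypothetical stand-ins invented for illustration; they are not accelerate's real classes, and the actual patch may be structured differently. Only the counter names and `adjust_state_dict_for_prefetch` come from the description above.

```python
# Hypothetical stand-ins for illustration only; NOT accelerate's real classes.
class AdapterSketch:
    """Base class: rewinds counters to compensate for prefetched batches."""

    def __init__(self, num_processes):
        self.num_processes = num_processes

    def prefetch_correction(self):
        # Assumed to be valid when one process fetches num_processes
        # batches per step on behalf of all processes.
        return self.num_processes - 1

    def adjust_state_dict_for_prefetch(self, state):
        factor = self.prefetch_correction()
        adjusted = dict(state)  # leave the caller's dict untouched
        for key in ("_sampler_iter_yielded", "_num_yielded"):
            if key in adjusted and adjusted[key] > 0:
                adjusted[key] -= factor
        return adjusted


class DispatcherSketch(AdapterSketch):
    """Process 0 fetches for everyone, so the base correction applies."""


class ShardSketch(AdapterSketch):
    """Each process iterates its own shard; the snapshot is already exact."""

    def prefetch_correction(self):
        return 0


state = {"_sampler_iter_yielded": 3, "_num_yielded": 3}
shard_state = ShardSketch(num_processes=2).adjust_state_dict_for_prefetch(state)
dispatcher_state = DispatcherSketch(num_processes=2).adjust_state_dict_for_prefetch(state)
```

In this sketch the shard variant returns the counters unchanged, while the dispatcher variant still rewinds them by `num_processes - 1`.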

Root cause

In DataLoaderShard.__iter__, _update_state_dict() is called before the inner next(), so the captured state already equals the number of batches yielded to the user, and no DDP adjustment is needed. The base class adjustment of num_processes - 1 caused the resume point to be 1 batch too early, producing duplicate data.
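The ordering argument can be checked with a small pure-Python simulation. This is a sketch under stated assumptions, not accelerate's code: `lookahead_iter` and the `num_yielded` key are hypothetical names, the "dataloader" is a plain list, and resuming is modelled as slicing that list.

```python
def lookahead_iter(batches, state):
    """Yield batches with a one-batch look-ahead.

    state["num_yielded"] is snapshotted BEFORE the next batch is fetched,
    mirroring the ordering described above.
    """
    inner = iter(batches)
    fetched = 0  # batches pulled from the inner iterator so far
    try:
        current = next(inner)
        fetched += 1
    except StopIteration:
        return
    while True:
        state["num_yielded"] = fetched  # snapshot before the look-ahead fetch
        try:
            upcoming = next(inner)
            fetched += 1
        except StopIteration:
            yield current  # last batch, nothing left to look ahead to
            return
        yield current
        current = upcoming


batches = list(range(8))
state = {}
it = lookahead_iter(batches, state)
seen = [next(it) for _ in range(3)]  # the user consumed batches 0, 1, 2

num_processes = 2
# Resuming from the snapshot as-is continues with the first unseen batch.
correct_resume = batches[state["num_yielded"]:]
# Subtracting num_processes - 1 rewinds into data that was already consumed.
buggy_resume = batches[state["num_yielded"] - (num_processes - 1):]
```

Here the snapshot equals the number of batches the user received even though one extra batch has already been fetched, so the unadjusted resume starts at batch 3 and the adjusted one replays batch 2.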
The bug only affects map-style datasets with use_stateful_dataloader=True in multi-process DDP. Iterable datasets were unaffected because their _sampler_iter_state provides correct resume info independently of _num_yielded. The previous tests didn't catch this because they (a) only used iterable datasets, (b) saved state at the second-to-last batch leaving only 1 batch remaining, and (c) used zip without a length check.
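Point (c) is a general property of `zip`, illustrated below with made-up data. The lists are hypothetical and do not reproduce the exact failure mode of the original tests; the sketch only shows why a comparison loop built on `zip` needs a separate length assertion.

```python
# Batches expected after resuming (hypothetical values).
expected_remaining = [3, 4, 5, 6, 7]

# A resumed loader that starts at the right batch but stops early.
resumed_short = [3]

# zip() stops at the shorter input, so this comparison passes silently
# even though four batches are missing.
zip_only_ok = all(a == b for a, b in zip(expected_remaining, resumed_short))

# Adding a length check exposes the mismatch.
with_length_ok = (
    len(resumed_short) == len(expected_remaining) and zip_only_ok
)
```

On Python 3.10 and later, `zip(..., strict=True)` is another way to make the length mismatch raise instead of passing silently.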

fix-stateful-dataloader

f498771

SunMarc changed the title ~~fix-stateful-dataloader~~ Fix stateful dataloader DDP Mar 3, 2026

SunMarc mentioned this pull request Mar 3, 2026

[Bug] DataLoaderShard with StatefulDataLoader produces wrong state dict in DDP #3938

Closed

SunMarc merged commit 5cf9cf8 into main Mar 4, 2026
28 of 29 checks passed

SunMarc deleted the fix-stateful branch March 4, 2026 16:28
