Fix stateful dataloader DDP #3952

Merged
SunMarc merged 1 commit into main from fix-stateful
Mar 4, 2026

@SunMarc SunMarc commented Mar 3, 2026

What does this PR do?

Fixes #3938. The existing tests didn't catch the issue the user was hitting, so I changed them. Here's a summary of the changes:

  • Fix a bug where DataLoaderShard.adjust_state_dict_for_prefetch() incorrectly subtracted num_processes - 1 from state dict counters (_sampler_iter_yielded, _num_yielded) in DDP, causing the resumed dataloader to replay already-consumed batches
  • The num_processes - 1 correction in the base class DataLoaderAdapter is only valid for DataLoaderDispatcher (which fetches num_processes batches per step from process 0), not for DataLoaderShard (where each process has its own sharded iterator with a single 1-batch look-ahead)
  • Strengthen existing test_stateful_dataloader and test_stateful_dataloader_save_state tests: save state earlier (after 3 batches instead of second-to-last), add length assertion, and test both iterable and map-style datasets
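
To make the effect of the extra subtraction concrete, here is a minimal pure-Python sketch. It does not call Accelerate at all: `resume_index` and `batches_after_resume` are hypothetical helpers that model only the counter arithmetic described above, treating one rank's shard as a plain list of batch ids.

```python
def resume_index(num_yielded, num_processes, apply_dispatcher_correction):
    """Toy model: index of the first batch a resumed loader would yield.

    `apply_dispatcher_correction` mimics the `num_processes - 1` subtraction
    that this PR says is only appropriate for DataLoaderDispatcher.
    """
    if apply_dispatcher_correction and num_yielded > 0:
        return num_yielded - (num_processes - 1)
    return num_yielded


def batches_after_resume(all_batches, num_yielded, num_processes,
                         apply_dispatcher_correction):
    """Return the batches one rank would see after resuming (toy model)."""
    start = resume_index(num_yielded, num_processes, apply_dispatcher_correction)
    return all_batches[start:]


batches = list(range(8))  # this rank's shard: batch ids 0..7
consumed = 3              # state saved after 3 batches (ids 0, 1, 2)

# With the correction applied on 2 processes, batch 2 is replayed.
buggy = batches_after_resume(batches, consumed, 2, True)
# Without it, the loader resumes at the first unseen batch.
fixed = batches_after_resume(batches, consumed, 2, False)

print(buggy)  # [2, 3, 4, 5, 6, 7]
print(fixed)  # [3, 4, 5, 6, 7]
```

In this model the number of replayed batches grows with the process count, which is why saving state early in the epoch makes the duplicate easier to observe.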

Root cause

  • In DataLoaderShard.__iter__, _update_state_dict() is called before the inner next(), so the captured state already equals the number of batches yielded to the user, and no DDP adjustment is needed. The base class adjustment of num_processes - 1 moved the resume point num_processes - 1 batches too early (1 batch with two processes), producing duplicate data.

  • The bug only affects map-style datasets with use_stateful_dataloader=True in multi-process DDP. Iterable datasets were unaffected because their _sampler_iter_state provides correct resume info independently of _num_yielded. The previous tests didn't catch this because they (a) only used iterable datasets, (b) saved state at the second-to-last batch leaving only 1 batch remaining, and (c) used zip without a length check.
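
Point (c) can be illustrated in isolation. The sketch below is not the project's test code: `zip_only_check` and `strict_check` are hypothetical helpers, and the batch lists are made-up integers. It only shows that `zip` stops at the shorter input, so a comparison built on it cannot notice that the resumed loader produced the wrong number of batches unless the lengths are also compared.

```python
def zip_only_check(expected, resumed):
    """Compare pairwise; zip silently stops at the shorter sequence."""
    return all(e == r for e, r in zip(expected, resumed))


def strict_check(expected, resumed):
    """Same pairwise comparison, plus an explicit length assertion."""
    return len(expected) == len(resumed) and zip_only_check(expected, resumed)


expected_remaining = [3, 4, 5, 6, 7]  # batches that should follow the save point
resumed_short = [3, 4]                # hypothetical resume that ends too early
resumed_empty = []                    # hypothetical resume that yields nothing

print(zip_only_check(expected_remaining, resumed_short))  # True
print(zip_only_check(expected_remaining, resumed_empty))  # True
print(strict_check(expected_remaining, resumed_short))    # False
print(strict_check(expected_remaining, expected_remaining))  # True
```

On Python 3.10 and later, `zip(expected, resumed, strict=True)` raises `ValueError` on a length mismatch, which is another way to get the same guarantee.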

@SunMarc SunMarc changed the title from "fix-stateful-dataloader" to "Fix stateful dataloader DDP" Mar 3, 2026
@SunMarc SunMarc merged commit 5cf9cf8 into main Mar 4, 2026
28 of 29 checks passed
@SunMarc SunMarc deleted the fix-stateful branch March 4, 2026 16:28

Development

Successfully merging this pull request may close these issues.

[Bug] DataLoaderShard with StatefulDataLoader produces wrong state dict in DDP
