fix(datasets): make deterministic-sampler resume robust to world-size changes

compute_sampler_state mapped a checkpointed step back to (epoch, start_index) using the *current* num_processes, but the number of sampler positions a step consumes scales with the world size that produced it. Resuming on a different GPU count therefore landed on the wrong epoch/offset, silently re-seeing or skipping data. Record num_processes in training_step.json at checkpoint time and feed the checkpoint's value into compute_sampler_state on resume, so the data order resumes at the right position regardless of the new world size. Warn when the world size changed (the global offset is correct, but per-rank sample-exactness needs the same topology). Old checkpoints without the field fall back to the current world size. Also document compute_sampler_state's assumptions explicitly: num_processes / batch_size must match the checkpointing run, and accelerate's even_batches=True padding is mirrored by the ceil(... / num_processes) term. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-19 01:07:18 +00:00 · 2026-06-11 14:37:19 +00:00
parent 7a62235bac
commit 1aa937aad2
4 changed files with 52 additions and 5 deletions
@@ -20,6 +20,7 @@ from unittest.mock import Mock, patch
 from lerobot.common.train_utils import (
    get_step_checkpoint_dir,
    get_step_identifier,
+    load_training_num_processes,
    load_training_state,
    load_training_step,
    save_checkpoint,
@@ -63,6 +64,17 @@ def test_load_training_step(tmp_path):
    assert loaded_step == step


+def test_save_training_state_records_num_processes(tmp_path, optimizer, scheduler):
+    save_training_state(tmp_path, 10, optimizer, scheduler, num_processes=4)
+    assert load_training_num_processes(tmp_path) == 4
+
+
+def test_load_training_num_processes_absent_returns_none(tmp_path, optimizer, scheduler):
+    # Checkpoints written before the world size was recorded must still load (back-compat).
+    save_training_state(tmp_path, 10, optimizer, scheduler)
+    assert load_training_num_processes(tmp_path) is None
+
+
 def test_update_last_checkpoint(tmp_path):
    checkpoint = tmp_path / "0005"
    checkpoint.mkdir()