fix(datasets): guard Feistel cycle-walking loop against non-convergence

Replace the unbounded while True in EpisodeAwareSampler._permute with a bounded for loop capped at _MAX_CYCLE_WALK_STEPS (100), and raise RuntimeError if the cycle-walk fails to land in [0, num_frames) within that cap. The loop is expected to converge in fewer than 4 steps on the chosen power-of-two domain, so the bound is a safety net that should never trip in practice but rules out a pathological infinite loop.

https://claude.ai/code/session_01HQ15tFrBsHYScjGWosEv22
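
The bounded cycle-walk described above can be sketched outside the sampler. This is a minimal standalone adaptation of `_mix64`, `_round_keys` and `_permute` from the diff below; the free functions `mix64`, `round_keys` and `permute` and the constant `MAX_STEPS` are illustrative names, not part of the sampler's API.

```python
MASK_64 = (1 << 64) - 1
MAX_STEPS = 100  # mirrors _MAX_CYCLE_WALK_STEPS in the diff


def mix64(x: int) -> int:
    """SplitMix64 finalizer, copied from the diff's _mix64."""
    x = (x + 0x9E3779B97F4A7C15) & MASK_64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK_64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK_64
    x ^= x >> 31
    return x


def round_keys(seed: int, epoch: int, rounds: int = 4) -> list[int]:
    """Derive one key per Feistel round from (seed, epoch)."""
    state = mix64(mix64(seed) ^ mix64(epoch))
    keys = []
    for _ in range(rounds):
        state = mix64(state)
        keys.append(state)
    return keys


def permute(index: int, num_frames: int, keys: list[int]) -> int:
    """Map index in [0, num_frames) to another index in the same range."""
    # Even-bit-width power-of-two domain, so both Feistel halves have equal width.
    bits = max((num_frames - 1).bit_length(), 2)
    half_bits = (bits + 1) // 2
    half_mask = (1 << half_bits) - 1
    for _ in range(MAX_STEPS):
        left, right = index >> half_bits, index & half_mask
        for key in keys:
            left, right = right, left ^ (mix64(right ^ key) & half_mask)
        index = (left << half_bits) | right
        if index < num_frames:  # landed inside the target range: done
            return index
        # Otherwise re-apply the permutation to the out-of-range value (cycle-walking).
    raise RuntimeError(f"cycle-walking did not converge within {MAX_STEPS} steps")


keys = round_keys(seed=42, epoch=0)
order = [permute(k, 37, keys) for k in range(37)]
assert sorted(order) == list(range(37))  # every frame index appears exactly once
```

For 37 frames the domain has 64 values, so a walk can pass through at most 27 out-of-range values before returning to the range, and the cap of 100 cannot trip here. On larger domains the cap is a probabilistic safety net rather than a proven bound.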
test(sampler): drain resumed trillion-frame sampler via iter() to avoid list() prealloc
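
The reason for draining through `iter()` can be seen with a toy iterable. The sketch below assumes CPython, where `list(obj)` consults `obj.__len__` as a size hint before iterating; the class `Resumable` is hypothetical and only imitates a sampler whose `__len__` reports the full epoch while `__iter__` yields a short tail.

```python
class Resumable:
    """Toy iterable: __len__ reports a full epoch, __iter__ yields only the tail."""

    def __init__(self, full_length: int, tail: list[int]):
        self.full_length = full_length
        self.tail = tail
        self.len_calls = 0  # counts how often list() asked for a size hint

    def __len__(self) -> int:
        self.len_calls += 1
        return self.full_length

    def __iter__(self):
        # A generator function: the returned generator has no __len__ or __length_hint__.
        yield from self.tail


direct = Resumable(1000, [7, 8, 9])
assert list(direct) == [7, 8, 9]
assert direct.len_calls >= 1  # CPython asked __len__ for a pre-allocation hint

via_iter = Resumable(1000, [7, 8, 9])
assert list(iter(via_iter)) == [7, 8, 9]
assert via_iter.len_calls == 0  # the generator hides __len__, so no hint is taken
```

With a reported length of 1000 the pre-allocation is harmless; with 10**12 it would request terabytes before the first item is produced.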
2026-06-18 08:47:05 +00:00 · 2026-06-11 13:20:31 +00:00 · 2026-06-11 10:39:13 +00:00 · 2026-06-11 11:54:22 +02:00 · 2026-06-11 11:45:36 +02:00 · 2026-06-11 11:37:44 +02:00
10 changed files with 436 additions and 254 deletions
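
As a worked check of the step-to-state arithmetic in the `compute_sampler_state` docstring in the diff below, the function body is reproduced here so that it runs without lerobot installed. The dataset size, batch size and process count are made-up illustrative values.

```python
import math


def compute_sampler_state(step: int, num_frames: int, batch_size: int, num_processes: int) -> dict:
    """Copy of the function added in the diff, reproduced for a standalone check."""
    batches_per_epoch = math.ceil(math.ceil(num_frames / batch_size) / num_processes)
    epoch, batches_into_epoch = divmod(step, batches_per_epoch)
    start_index = min(batches_into_epoch * batch_size * num_processes, num_frames)
    return {"epoch": epoch, "start_index": start_index}


# 1000 frames, batch size 8, 4 processes:
#   ceil(1000 / 8) = 125 global batches, ceil(125 / 4) = 32 steps per epoch.
# Step 40 is therefore 8 steps into epoch 1, i.e. 8 * 8 * 4 = 256 positions consumed.
state = compute_sampler_state(step=40, num_frames=1000, batch_size=8, num_processes=4)
assert state == {"epoch": 1, "start_index": 256}

# The last step of an epoch still starts below num_frames (31 * 32 = 992 < 1000).
last = compute_sampler_state(step=31, num_frames=1000, batch_size=8, num_processes=4)
assert last == {"epoch": 0, "start_index": 992}
```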
@@ -124,7 +124,7 @@ hardware = [
    "lerobot[deepdiff-dep]",
 ]
 viz = [
-    "rerun-sdk>=0.24.0,<0.34.0",
+    "rerun-sdk>=0.24.0,<0.27.0",
 ]
 # ── User-facing composite extras (map to CLI scripts) ─────
 # lerobot-record, lerobot-replay, lerobot-calibrate, lerobot-teleoperate, etc.
@@ -99,6 +99,10 @@ class TrainPipelineConfig(HubMixin):
    batch_size: int = 8
    prefetch_factor: int = 4
    persistent_workers: bool = True
+    # Deterministic data order (pure function of seed and epoch): immune to cross-rank RNG
+    # desync and enables sample-exact resume. Set to false for the legacy RNG-based shuffle.
+    # Ignored when dataset.streaming is enabled.
+    deterministic_sampler: bool = True
    steps: int = 100_000
    eval_freq: int = 20_000
    log_freq: int = 200
@@ -50,7 +50,7 @@ from .lerobot_dataset import LeRobotDataset
 from .multi_dataset import MultiLeRobotDataset
 from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
 from .pyav_utils import check_video_encoder_parameters_pyav, detect_available_encoders_pyav
-from .sampler import EpisodeAwareSampler
+from .sampler import EpisodeAwareSampler, compute_sampler_state
 from .streaming_dataset import StreamingLeRobotDataset
 from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
 from .video_utils import VideoEncodingManager
@@ -82,6 +82,7 @@ __all__ = [
    "aggregate_stats",
    "convert_image_to_video_dataset",
    "create_initial_features",
+    "compute_sampler_state",
    "create_lerobot_dataset_card",
    "column_for_style",
    "delete_episodes",
@@ -14,14 +14,49 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import logging
+import math
 from collections.abc import Iterator

+import numpy as np
 import torch

 logger = logging.getLogger(__name__)

+_MASK_64 = (1 << 64) - 1
+_FEISTEL_ROUNDS = 4
+# Cycle-walking converges in <4 expected steps on the chosen domain; this bound is a generous
+# safety net that should never be hit in practice.
+_MAX_CYCLE_WALK_STEPS = 100
+
+
+def _mix64(x: int) -> int:
+    """SplitMix64 finalizer (64-bit integer hash)."""
+    x = (x + 0x9E3779B97F4A7C15) & _MASK_64
+    x ^= x >> 30
+    x = (x * 0xBF58476D1CE4E5B9) & _MASK_64
+    x ^= x >> 27
+    x = (x * 0x94D049BB133111EB) & _MASK_64
+    x ^= x >> 31
+    return x
+

 class EpisodeAwareSampler:
+    """Sampler over episode frames with O(num_episodes) memory.
+
+    Only episode boundaries are stored; logical positions map to frame indices on the fly, so
+    memory does not grow with the number of frames.
+
+    By default (`deterministic=True`) shuffling uses a seeded Feistel permutation over
+    `[0, num_frames)`: the data order is a pure function of `(seed, epoch)`, needs no RNG
+    synchronization across distributed ranks, and any position can be sought in O(1), enabling
+    sample-exact resume via `state_dict` / `load_state_dict`. Each completed `__iter__`
+    advances the epoch. The shuffle is pseudo-random rather than truly uniform — the standard
+    large-scale trade-off. During a resumed epoch, `__len__` still reports the full length.
+
+    With `deterministic=False`, shuffling falls back to `torch.randperm` driven by `generator`
+    (accelerate synchronizes the generator across ranks when preparing the dataloader).
+    """
+
    def __init__(
        self,
        dataset_from_indices: list[int],
@@ -30,57 +65,161 @@ class EpisodeAwareSampler:
        drop_n_first_frames: int = 0,
        drop_n_last_frames: int = 0,
        shuffle: bool = False,
+        generator: torch.Generator | None = None,
+        deterministic: bool = True,
+        seed: int = 0,
    ):
-        """Sampler that optionally incorporates episode boundary information.
-
+        """
        Args:
-            dataset_from_indices: List of indices containing the start of each episode in the dataset.
-            dataset_to_indices: List of indices containing the end of each episode in the dataset.
-            episode_indices_to_use: List of episode indices to use. If None, all episodes are used.
-                                    Assumes that episodes are indexed from 0 to N-1.
-            drop_n_first_frames: Number of frames to drop from the start of each episode.
-            drop_n_last_frames: Number of frames to drop from the end of each episode.
+            dataset_from_indices: Start index of each episode in the dataset.
+            dataset_to_indices: End index of each episode in the dataset.
+            episode_indices_to_use: Episode indices to use; None means all.
+            drop_n_first_frames: Frames to drop from the start of each episode.
+            drop_n_last_frames: Frames to drop from the end of each episode.
            shuffle: Whether to shuffle the indices.
+            generator: Generator for non-deterministic shuffling (global torch RNG when None).
+            deterministic: Use the seeded Feistel permutation instead of `torch.randperm`.
+            seed: Seed the deterministic permutation is derived from (together with the epoch).
        """
        if drop_n_first_frames < 0:
            raise ValueError(f"drop_n_first_frames must be >= 0, got {drop_n_first_frames}")
        if drop_n_last_frames < 0:
            raise ValueError(f"drop_n_last_frames must be >= 0, got {drop_n_last_frames}")
+        if deterministic and generator is not None:
+            raise ValueError("generator is unused in deterministic mode; pass seed instead.")

-        indices = []
-        for episode_idx, (start_index, end_index) in enumerate(
-            zip(dataset_from_indices, dataset_to_indices, strict=True)
-        ):
-            if episode_indices_to_use is None or episode_idx in episode_indices_to_use:
-                ep_length = end_index - start_index
-                if drop_n_first_frames + drop_n_last_frames >= ep_length:
-                    logger.warning(
-                        "Episode %d has %d frames but drop_n_first_frames=%d and "
-                        "drop_n_last_frames=%d removes all frames. Skipping.",
-                        episode_idx,
-                        ep_length,
-                        drop_n_first_frames,
-                        drop_n_last_frames,
-                    )
-                    continue
-                indices.extend(range(start_index + drop_n_first_frames, end_index - drop_n_last_frames))
+        from_indices = np.asarray(dataset_from_indices, dtype=np.int64)
+        to_indices = np.asarray(dataset_to_indices, dtype=np.int64)
+        if from_indices.shape != to_indices.shape:
+            raise ValueError(
+                f"dataset_from_indices and dataset_to_indices must have the same length, "
+                f"got {len(from_indices)} and {len(to_indices)}"
+            )

-        if not indices:
+        used = np.ones(len(from_indices), dtype=bool)
+        if episode_indices_to_use is not None:
+            used = np.zeros(len(from_indices), dtype=bool)
+            used[np.asarray(episode_indices_to_use, dtype=np.int64)] = True
+
+        starts = from_indices + drop_n_first_frames
+        lengths = to_indices - drop_n_last_frames - starts
+        for episode_idx in np.flatnonzero(used & (lengths <= 0)):
+            logger.warning(
+                "Episode %d has %d frames but drop_n_first_frames=%d and "
+                "drop_n_last_frames=%d removes all frames. Skipping.",
+                episode_idx,
+                to_indices[episode_idx] - from_indices[episode_idx],
+                drop_n_first_frames,
+                drop_n_last_frames,
+            )
+        used &= lengths > 0
+        if not used.any():
            raise ValueError(
                "No valid frames remain after applying drop_n_first_frames and drop_n_last_frames. "
                "All episodes were either filtered out or had too few frames."
            )

-        self.indices = indices
+        self._starts = starts[used]
+        self._cum_lengths = np.cumsum(lengths[used])
+        self._num_frames = int(self._cum_lengths[-1])
        self.shuffle = shuffle
+        self.generator = generator
+        self.deterministic = deterministic
+        self.seed = seed
+        self._epoch = 0
+        self._start_index = 0
+
+        # Smallest even-bit-width power-of-two domain >= num_frames: equal Feistel halves,
+        # cycle-walking converges in <4 expected steps.
+        bits = max((self._num_frames - 1).bit_length(), 2)
+        self._half_bits = (bits + 1) // 2
+        self._half_mask = (1 << self._half_bits) - 1
+
+    @property
+    def indices(self) -> list[int]:
+        """Materialized frame indices in unshuffled order; O(num_frames), introspection only."""
+        return [self._frame_index(k) for k in range(self._num_frames)]
+
+    def set_epoch(self, epoch: int) -> None:
+        self._require_deterministic("set_epoch")
+        self._epoch = epoch
+
+    def state_dict(self) -> dict:
+        self._require_deterministic("state_dict")
+        return {"epoch": self._epoch, "start_index": self._start_index}
+
+    def load_state_dict(self, state: dict) -> None:
+        self._require_deterministic("load_state_dict")
+        self._epoch = state["epoch"]
+        self._start_index = state["start_index"]
+
+    def _require_deterministic(self, method: str) -> None:
+        if not self.deterministic:
+            raise RuntimeError(f"{method} requires deterministic=True: an RNG order cannot be sought.")
+
+    def _round_keys(self, epoch: int) -> list[int]:
+        state = _mix64(_mix64(self.seed) ^ _mix64(epoch))
+        keys = []
+        for _ in range(_FEISTEL_ROUNDS):
+            state = _mix64(state)
+            keys.append(state)
+        return keys
+
+    def _permute(self, index: int, keys: list[int]) -> int:
+        # Feistel network with cycle-walking: a bijection on [0, num_frames).
+        half_bits, half_mask = self._half_bits, self._half_mask
+        for _ in range(_MAX_CYCLE_WALK_STEPS):
+            left, right = index >> half_bits, index & half_mask
+            for key in keys:
+                left, right = right, left ^ (_mix64(right ^ key) & half_mask)
+            index = (left << half_bits) | right
+            if index < self._num_frames:
+                return index
+        raise RuntimeError(
+            f"Feistel cycle-walking did not converge within {_MAX_CYCLE_WALK_STEPS} steps; "
+            "this should never happen for a valid domain."
+        )
+
+    def _frame_index(self, position: int) -> int:
+        episode = int(np.searchsorted(self._cum_lengths, position, side="right"))
+        position_in_episode = position - (int(self._cum_lengths[episode - 1]) if episode > 0 else 0)
+        return int(self._starts[episode]) + position_in_episode

    def __iter__(self) -> Iterator[int]:
+        if not self.deterministic:
+            return self._iter_default()
+        # Advance epoch state eagerly, not on first consumption of the generator.
+        epoch, start = self._epoch, self._start_index
+        self._epoch += 1
+        self._start_index = 0
+        return self._iter_deterministic_epoch(epoch, start)
+
+    def _iter_default(self) -> Iterator[int]:
        if self.shuffle:
-            for i in torch.randperm(len(self.indices)):
-                yield self.indices[i]
+            for i in torch.randperm(self._num_frames, generator=self.generator):
+                yield self._frame_index(int(i))
        else:
-            for i in self.indices:
-                yield i
+            for k in range(self._num_frames):
+                yield self._frame_index(k)
+
+    def _iter_deterministic_epoch(self, epoch: int, start: int) -> Iterator[int]:
+        keys = self._round_keys(epoch) if self.shuffle else None
+        for k in range(start, self._num_frames):
+            yield self._frame_index(self._permute(k, keys) if self.shuffle else k)

    def __len__(self) -> int:
-        return len(self.indices)
+        return self._num_frames
+
+
+def compute_sampler_state(step: int, num_frames: int, batch_size: int, num_processes: int) -> dict:
+    """Map an optimization step to an `EpisodeAwareSampler` state for sample-exact resume.
+
+    Under accelerate's batch sharding, one step consumes `batch_size * num_processes` sampler
+    positions and each rank sees `ceil(ceil(num_frames / batch_size) / num_processes)` batches
+    per epoch (`even_batches` padding included). The start index provably stays below
+    `num_frames`; the `min` is defensive.
+    """
+    batches_per_epoch = math.ceil(math.ceil(num_frames / batch_size) / num_processes)
+    epoch, batches_into_epoch = divmod(step, batches_per_epoch)
+    start_index = min(batches_into_epoch * batch_size * num_processes, num_frames)
+    return {"epoch": epoch, "start_index": start_index}
@@ -77,21 +77,6 @@ from lerobot.utils.constants import ACTION, DONE, OBS_STATE, REWARD
 from lerobot.utils.utils import init_logging


-def get_feature_names(dataset: LeRobotDataset, key: str) -> list[str]:
-    """Return per-dimension names for a feature from the dataset metadata.
-
-    Only flat-list ``names`` metadata is used. Dict-style ``names`` and missing names fall back to ``{key}_{i}`` indices.
-    """
-    feature = dataset.features[key]
-    dim = feature["shape"][-1]
-
-    names = feature.get("names")
-    if isinstance(names, list) and len(names) == dim:
-        return [str(name) for name in names]
-
-    return [f"{key}_{d}" for d in range(dim)]
-
-
 def to_hwc_uint8_numpy(chw_float32_torch: torch.Tensor) -> np.ndarray:
    assert chw_float32_torch.dtype == torch.float32
    assert chw_float32_torch.ndim == 3
@@ -101,31 +86,6 @@ def to_hwc_uint8_numpy(chw_float32_torch: torch.Tensor) -> np.ndarray:
    return hwc_uint8_numpy


-def build_blueprint_from_dataset(dataset: LeRobotDataset):
-    """Build a Rerun blueprint laying out camera images and time series for the given dataset.
-
-    Camera images and scalar signals (action, state, reward, done, success) are arranged in a grid.
-    The per-dimension series names for ``action`` and ``state`` are applied directly
-    via blueprint overrides.
-    """
-    import rerun as rr
-    import rerun.blueprint as rrb
-
-    views = [rrb.Spatial2DView(origin=key, name=key) for key in dataset.meta.camera_keys]
-
-    # Style multi-dimensional signals (action, state) with per-dimension names.
-    for origin, key in ((ACTION, ACTION), ("state", OBS_STATE)):
-        if key in dataset.features:
-            names = get_feature_names(dataset, key)
-            styling = rr.SeriesLines(names=names)
-            views.append(rrb.TimeSeriesView(origin=origin, name=origin, overrides={origin: styling}))
-    for key in (DONE, REWARD, "next.success"):
-        if key in dataset.features:
-            views.append(rrb.TimeSeriesView(origin=key, name=key))
-
-    return rrb.Blueprint(rrb.Grid(*views))
-
-
 def visualize_dataset(
    dataset: LeRobotDataset,
    episode_index: int,
@@ -164,8 +124,7 @@ def visualize_dataset(
    import rerun as rr

    spawn_local_viewer = mode == "local" and not save
-    blueprint = build_blueprint_from_dataset(dataset)
-    rr.init(f"{repo_id}/episode_{episode_index}", spawn=spawn_local_viewer, default_blueprint=blueprint)
+    rr.init(f"{repo_id}/episode_{episode_index}", spawn=spawn_local_viewer)

    # Manually call python garbage collector after `rr.init` to avoid hanging in a blocking flush
    # when iterating on a dataloader with `num_workers` > 0
@@ -183,21 +142,26 @@ def visualize_dataset(
    for batch in tqdm.tqdm(dataloader, total=len(dataloader)):
        if first_index is None:
            first_index = batch["index"][0].item()
-
+        # iterate over the batch
        for i in range(len(batch["index"])):
            rr.set_time("frame_index", sequence=batch["index"][i].item() - first_index)
            rr.set_time("timestamp", timestamp=batch["timestamp"][i].item())

+            # display each camera image
            for key in dataset.meta.camera_keys:
                img = to_hwc_uint8_numpy(batch[key][i])
                img_entity = rr.Image(img).compress() if display_compressed_images else rr.Image(img)
                rr.log(key, entity=img_entity)

+            # display each dimension of action space (e.g. actuators command)
            if ACTION in batch:
-                rr.log(ACTION, rr.Scalars(batch[ACTION][i].numpy()))
+                for dim_idx, val in enumerate(batch[ACTION][i]):
+                    rr.log(f"{ACTION}/{dim_idx}", rr.Scalars(val.item()))

+            # display each dimension of observed state space (e.g. agent position in joint space)
            if OBS_STATE in batch:
-                rr.log("state", rr.Scalars(batch[OBS_STATE][i].numpy()))
+                for dim_idx, val in enumerate(batch[OBS_STATE][i]):
+                    rr.log(f"state/{dim_idx}", rr.Scalars(val.item()))

            if DONE in batch:
                rr.log(DONE, rr.Scalars(batch[DONE][i].item()))
@@ -209,6 +173,8 @@ def visualize_dataset(
                rr.log("next.success", rr.Scalars(batch["next.success"][i].item()))

    if mode == "local" and save:
+        # save .rrd locally
+        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        repo_id_str = repo_id.replace("/", "_")
        rrd_path = output_dir / f"{repo_id_str}_episode_{episode_index}.rrd"
@@ -216,7 +182,7 @@ def visualize_dataset(
        return rrd_path

    elif mode == "distant":
-        # Keep the process alive while it serves the gRPC/web connection.
+        # stop the process from exiting since it is serving the websocket connection
        try:
            while True:
                time.sleep(1)
@@ -331,14 +297,12 @@ def main():
        )
        logging.warning("Setting grpc_port to ws_port value.")
        kwargs["grpc_port"] = kwargs.pop("ws_port")
-    else:
-        kwargs.pop("ws_port")  # Always remove ws_port from kwargs

    init_logging()
    logging.info("Loading dataset")
    dataset = LeRobotDataset(repo_id, episodes=[args.episode_index], root=root, tolerance_s=tolerance_s)

-    visualize_dataset(dataset, **kwargs)
+    visualize_dataset(dataset, **vars(args))


 if __name__ == "__main__":
@@ -43,7 +43,7 @@ from lerobot.common.train_utils import (
 from lerobot.common.wandb_utils import WandBLogger
 from lerobot.configs import parser
 from lerobot.configs.train import TrainPipelineConfig
-from lerobot.datasets import EpisodeAwareSampler, make_dataset
+from lerobot.datasets import EpisodeAwareSampler, compute_sampler_state, make_dataset
 from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
 from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
@@ -232,15 +232,18 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        torch.backends.cudnn.benchmark = True
    torch.backends.cuda.matmul.allow_tf32 = True

-    # Dataset loading synchronization: main process downloads first to avoid race conditions
-    if is_main_process:
-        logging.info("Creating dataset")
+    # Dataset loading synchronization: each node's local main process downloads first to avoid
+    # race conditions (the global main process only exists on node 0, so gating on it would let
+    # all ranks of the other nodes download and build the Arrow cache concurrently).
+    if accelerator.is_local_main_process:
+        if is_main_process:
+            logging.info("Creating dataset")
        dataset = make_dataset(cfg)

    accelerator.wait_for_everyone()

-    # Now all other processes can safely load the dataset
-    if not is_main_process:
+    # Now all other processes can safely load the dataset from the local cache
+    if not accelerator.is_local_main_process:
        dataset = make_dataset(cfg)

    # Create environment used for evaluating checkpoints during training on simulation data.
@@ -384,14 +387,41 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        logging.info(f"{num_total_params=} ({format_big_number(num_total_params)})")

    # create dataloader for offline training
-    if hasattr(active_cfg, "drop_n_last_frames"):
+    if cfg.deterministic_sampler and not cfg.dataset.streaming:
+        # Deterministic data order: no cross-rank RNG sync needed, sample-exact resume.
        shuffle = False
+        sampler = EpisodeAwareSampler(
+            dataset.meta.episodes["dataset_from_index"],
+            dataset.meta.episodes["dataset_to_index"],
+            episode_indices_to_use=dataset.episodes,
+            drop_n_last_frames=getattr(active_cfg, "drop_n_last_frames", 0),
+            shuffle=True,
+            seed=cfg.seed if cfg.seed is not None else 0,
+        )
+        if cfg.resume and step > 0:
+            sampler_state = compute_sampler_state(
+                step, len(sampler), cfg.batch_size, accelerator.num_processes
+            )
+            sampler.load_state_dict(sampler_state)
+            if is_main_process:
+                logging.info(
+                    f"Resuming data order at epoch {sampler_state['epoch']}, "
+                    f"sample {sampler_state['start_index']}"
+                )
+    elif hasattr(active_cfg, "drop_n_last_frames"):
+        shuffle = False
+        # Legacy RNG shuffle: a dedicated generator lets accelerate synchronize it across ranks.
+        sampler_generator = torch.Generator()
+        if cfg.seed is not None:
+            sampler_generator.manual_seed(cfg.seed)
        sampler = EpisodeAwareSampler(
            dataset.meta.episodes["dataset_from_index"],
            dataset.meta.episodes["dataset_to_index"],
            episode_indices_to_use=dataset.episodes,
            drop_n_last_frames=active_cfg.drop_n_last_frames,
            shuffle=True,
+            deterministic=False,
+            generator=sampler_generator,
        )
    else:
        shuffle = True
@@ -38,8 +38,6 @@ def init_rerun(
    require_package("rerun-sdk", extra="viz", import_name="rerun")
    import rerun as rr

-    log_rerun_data.blueprint = None  # Reset blueprint cache for new session
-
    batch_size = os.getenv("RERUN_FLUSH_NUM_BYTES", "8000")
    os.environ["RERUN_FLUSH_NUM_BYTES"] = batch_size
    rr.init(session_name)
@@ -65,38 +63,6 @@ def _is_scalar(x):
    )


-def _build_blueprint(observation_paths: set[str], action_paths: set[str], image_paths: set[str]):
-    """Build a Rerun blueprint laying out camera images, observation and action scalars in separate views.
-
-    Camera images, observation and action scalars are arranged in a grid.
-    """
-
-    # Safe + zero-overhead: `log_rerun_data` already ran the `require_package` guard and imported rerun.
-    import rerun.blueprint as rrb
-
-    views = [rrb.Spatial2DView(origin=path, name=path) for path in sorted(image_paths)]
-
-    if observation_paths:
-        views.append(rrb.TimeSeriesView(name="observation", contents=sorted(observation_paths)))
-    if action_paths:
-        views.append(rrb.TimeSeriesView(name="action", contents=sorted(action_paths)))
-
-    return rrb.Blueprint(rrb.Grid(*views))
-
-
-def _ensure_blueprint(observation_paths: set[str], action_paths: set[str], image_paths: set[str]) -> None:
-    """Build and send the blueprint once, from the first observation and action data."""
-    if getattr(log_rerun_data, "blueprint", None) is not None:
-        return
-
-    # Safe + zero-overhead: `log_rerun_data` already ran the `require_package` guard and imported rerun.
-    import rerun as rr
-
-    blueprint = _build_blueprint(observation_paths, action_paths, image_paths)
-    log_rerun_data.blueprint = blueprint
-    rr.send_blueprint(blueprint)
-
-
 def log_rerun_data(
    observation: RobotObservation | None = None,
    action: RobotAction | None = None,
@@ -110,15 +76,11 @@ def log_rerun_data(
    - Scalars values (floats, ints) are logged as `rr.Scalars`.
    - 3D NumPy arrays that resemble images (e.g., with 1, 3, or 4 channels first) are transposed
      from CHW to HWC format, (optionally) compressed to JPEG and logged as `rr.Image` or `rr.EncodedImage`.
-    - 1D NumPy arrays are logged as a single `rr.Scalars` batch under one entity path, so that every
-      dimension shares the same view instead of being split across one view per element.
-    - Multi-dimensional **action** arrays are flattened and logged as a single `rr.Scalars` batch.
+    - 1D NumPy arrays are logged as a series of individual scalars, with each element indexed.
+    - Other multi-dimensional arrays are flattened and logged as individual scalars.

    Keys are automatically namespaced with "observation." or "action." if not already present.

-    On the first call, a blueprint is built and sent so observation and action scalars get separate
-    time-series views and each image gets its own spatial view.
-
    Args:
        observation: An optional dictionary containing observation data to log.
        action: An optional dictionary containing action data to log.
@@ -128,10 +90,6 @@ def log_rerun_data(
    require_package("rerun-sdk", extra="viz", import_name="rerun")
    import rerun as rr

-    observation_paths: set[str] = set()
-    action_paths: set[str] = set()
-    image_paths: set[str] = set()
-
    if observation:
        for k, v in observation.items():
            if v is None:
@@ -140,19 +98,17 @@ def log_rerun_data(

            if _is_scalar(v):
                rr.log(key, rr.Scalars(float(v)))
-                observation_paths.add(key)
            elif isinstance(v, np.ndarray):
                arr = v
                # Convert CHW -> HWC when needed
                if arr.ndim == 3 and arr.shape[0] in (1, 3, 4) and arr.shape[-1] not in (1, 3, 4):
                    arr = np.transpose(arr, (1, 2, 0))
                if arr.ndim == 1:
-                    rr.log(key, rr.Scalars(arr.astype(float)))
-                    observation_paths.add(key)
+                    for i, vi in enumerate(arr):
+                        rr.log(f"{key}_{i}", rr.Scalars(float(vi)))
                else:
                    img_entity = rr.Image(arr).compress() if compress_images else rr.Image(arr)
                    rr.log(key, entity=img_entity, static=True)
-                    image_paths.add(key)

    if action:
        for k, v in action.items():
@@ -162,9 +118,12 @@ def log_rerun_data(

            if _is_scalar(v):
                rr.log(key, rr.Scalars(float(v)))
-                action_paths.add(key)
            elif isinstance(v, np.ndarray):
-                rr.log(key, rr.Scalars(v.reshape(-1).astype(float)))
-                action_paths.add(key)
-
-    _ensure_blueprint(observation_paths, action_paths, image_paths)
+                if v.ndim == 1:
+                    for i, vi in enumerate(v):
+                        rr.log(f"{key}_{i}", rr.Scalars(float(vi)))
+                else:
+                    # Fall back to flattening higher-dimensional arrays
+                    flat = v.flatten()
+                    for i, vi in enumerate(flat):
+                        rr.log(f"{key}_{i}", rr.Scalars(float(vi)))
@@ -114,6 +114,36 @@ def test_shuffle():
    assert set(sampler) == {0, 1, 2, 3, 4, 5}


+def test_shuffle_with_generator_is_deterministic():
+    # Two samplers shuffling with same-seed generators must yield identical permutations.
+    # This is what keeps batch shards disjoint across ranks in distributed training, where
+    # accelerate synchronizes the sampler's generator state instead of the global torch RNG.
+    sampler_a = EpisodeAwareSampler(
+        [0], [6], shuffle=True, deterministic=False, generator=torch.Generator().manual_seed(42)
+    )
+    sampler_b = EpisodeAwareSampler(
+        [0], [6], shuffle=True, deterministic=False, generator=torch.Generator().manual_seed(42)
+    )
+    assert list(sampler_a) == list(sampler_b)
+
+    # Desyncing the global RNG must not affect the permutation.
+    sampler_c = EpisodeAwareSampler(
+        [0], [6], shuffle=True, deterministic=False, generator=torch.Generator().manual_seed(42)
+    )
+    order_before = list(sampler_c)
+    sampler_c.generator.manual_seed(42)
+    torch.randperm(1000)  # consume global RNG, as rank-asymmetric code (e.g. eval) would
+    assert list(sampler_c) == order_before
+
+
+def test_generator_attribute_defaults_to_none():
+    # accelerate detects synchronizable samplers via `hasattr(sampler, "generator")`,
+    # so the attribute must exist even when no generator is passed.
+    sampler = EpisodeAwareSampler([0], [6], shuffle=True, deterministic=False)
+    assert sampler.generator is None
+    assert set(sampler) == {0, 1, 2, 3, 4, 5}
+
+
 def test_negative_drop_first_frames_raises():
    with pytest.raises(ValueError, match="drop_n_first_frames must be >= 0"):
        EpisodeAwareSampler([0], [10], drop_n_first_frames=-1)
@@ -137,3 +167,127 @@ def test_partial_episode_drop_warns(caplog):
    # Episode 0 is skipped (1 frame, drop 1), Episode 1 keeps frames 2-5
    assert sampler.indices == [2, 3, 4, 5]
    assert "Episode 0" in caplog.text
+
+
+# --- deterministic mode (seeded Feistel permutation) ---
+
+from functools import partial  # noqa: E402
+
+from lerobot.datasets.sampler import compute_sampler_state  # noqa: E402
+
+deterministic_sampler = partial(EpisodeAwareSampler, deterministic=True)
+
+
+EPISODE_BOUNDS = ([0, 2, 3], [2, 3, 6])  # episodes of 2, 1 and 3 frames
+
+
+def test_deterministic_mode_unshuffled_matches_default_mode():
+    for kwargs in (
+        {},
+        {"drop_n_first_frames": 1},
+        {"drop_n_last_frames": 1},
+        {"episode_indices_to_use": [0, 2]},
+    ):
+        reference = EpisodeAwareSampler(*EPISODE_BOUNDS, shuffle=False, **kwargs)
+        sampler = deterministic_sampler(*EPISODE_BOUNDS, shuffle=False, **kwargs)
+        assert list(sampler) == list(reference), kwargs
+        assert len(sampler) == len(reference), kwargs
+
+
+def test_deterministic_mode_rejects_generator():
+    with pytest.raises(ValueError, match="generator is unused in deterministic mode"):
+        deterministic_sampler(*EPISODE_BOUNDS, shuffle=True, generator=torch.Generator())
+
+
+def test_state_methods_require_deterministic_mode():
+    sampler = EpisodeAwareSampler(*EPISODE_BOUNDS, shuffle=True, deterministic=False)
+    with pytest.raises(RuntimeError, match="deterministic=True"):
+        sampler.set_epoch(1)
+    with pytest.raises(RuntimeError, match="deterministic=True"):
+        sampler.state_dict()
+
+
+@pytest.mark.parametrize("num_frames", [1, 2, 3, 37, 64, 100])
+def test_deterministic_sampler_shuffle_is_permutation(num_frames):
+    for seed in (0, 1, 1234):
+        sampler = deterministic_sampler([0], [num_frames], shuffle=True, seed=seed)
+        assert sorted(sampler) == list(range(num_frames))
+
+
+def test_deterministic_sampler_epochs_reproduce_and_differ():
+    sampler_a = deterministic_sampler([0], [100], shuffle=True, seed=42)
+    sampler_b = deterministic_sampler([0], [100], shuffle=True, seed=42)
+    epoch_0 = list(sampler_a)
+    assert list(sampler_b) == epoch_0  # same (seed, epoch) -> same order on any process
+    epoch_1 = list(sampler_a)  # __iter__ auto-advances the epoch
+    assert epoch_1 != epoch_0
+    assert sorted(epoch_1) == sorted(epoch_0)
+    sampler_a.set_epoch(0)
+    assert list(sampler_a) == epoch_0
+    assert list(deterministic_sampler([0], [100], shuffle=True, seed=7)) != epoch_0
+
+
+def test_deterministic_sampler_resume_mid_epoch():
+    reference = deterministic_sampler(*EPISODE_BOUNDS, shuffle=True, seed=42)
+    epoch_0 = list(reference)
+    epoch_1 = list(reference)
+    for start in (0, 1, 4, len(epoch_0)):
+        resumed = deterministic_sampler(*EPISODE_BOUNDS, shuffle=True, seed=42)
+        resumed.load_state_dict({"epoch": 0, "start_index": start})
+        assert list(resumed) == epoch_0[start:]
+        # the resumed sampler continues into the same epoch 1 as the uninterrupted one
+        assert list(resumed) == epoch_1
+
+
+def test_deterministic_sampler_constant_memory():
+    # A trillion-frame dataset must instantiate instantly and seek anywhere in O(1):
+    # only per-episode boundaries are stored, never per-frame indices.
+    num_frames = 10**12
+    sampler = deterministic_sampler([0], [num_frames], shuffle=True, seed=0)
+    assert len(sampler) == num_frames
+    sampler.load_state_dict({"epoch": 3, "start_index": num_frames - 3})
+    # Collect via the iterator: list(sampler) would call PyObject_LengthHint -> sampler.__len__
+    # (the full epoch length, here 10**12) and pre-allocate that many slots before iterating. The
+    # iterator itself exposes no length hint, so this stays O(1) like the resumed epoch it drains.
+    tail = list(iter(sampler))
+    assert len(tail) == 3
+    assert all(0 <= idx < num_frames for idx in tail)
+
+
+def test_deterministic_sampler_validation_matches_episode_aware():
+    with pytest.raises(ValueError, match="drop_n_first_frames must be >= 0"):
+        deterministic_sampler([0], [10], drop_n_first_frames=-1)
+    with pytest.raises(ValueError, match="drop_n_last_frames must be >= 0"):
+        deterministic_sampler([0], [10], drop_n_last_frames=-1)
+    with pytest.raises(ValueError, match="No valid frames remain"):
+        deterministic_sampler([0, 1, 2], [1, 2, 3], drop_n_first_frames=1)
+
+
+def test_deterministic_sampler_partial_episode_drop_warns(caplog):
+    with caplog.at_level(logging.WARNING, logger="lerobot.datasets.sampler"):
+        sampler = deterministic_sampler([0, 1], [1, 6], drop_n_first_frames=1, shuffle=False)
+    assert list(sampler) == [2, 3, 4, 5]
+    assert "Episode 0" in caplog.text
+
+
+def test_compute_sampler_state():
+    # 100 frames, batch 10, 2 ranks -> 10 underlying batches, 5 per rank per epoch.
+    assert compute_sampler_state(step=0, num_frames=100, batch_size=10, num_processes=2) == {
+        "epoch": 0,
+        "start_index": 0,
+    }
+    # step 7 -> epoch 1, 2 per-rank batches in = 2 * 10 * 2 = 40 samples in
+    assert compute_sampler_state(step=7, num_frames=100, batch_size=10, num_processes=2) == {
+        "epoch": 1,
+        "start_index": 40,
+    }
+    # uneven epoch: 95 frames -> 10 underlying batches (last short), still 5 per rank
+    assert compute_sampler_state(step=12, num_frames=95, batch_size=10, num_processes=2) == {
+        "epoch": 2,
+        "start_index": 40,
+    }
+    # uneven sharding: 105 frames -> 11 underlying batches, 6 per rank (even_batches pads)
+    assert compute_sampler_state(step=11, num_frames=105, batch_size=10, num_processes=2) == {
+        "epoch": 1,
+        "start_index": 100,
+    }
@@ -30,46 +30,25 @@ from lerobot.utils.constants import OBS_STATE
@pytest.fixture
 def mock_rerun(monkeypatch):
    """
-    Provide a mock `rerun` module (and `rerun.blueprint` submodule) so tests don't
-    depend on the real library. Also reload the module-under-test so it binds to
-    this mock `rr`.
+    Provide a mock `rerun` module so tests don't depend on the real library.
+    Also reload the module-under-test so it binds to this mock `rr`.
    """
    calls = []
-    blueprints = []

    class DummyScalar:
        def __init__(self, value):
-            # Scalars may be built from a single float or from a 1D array batch.
-            self.value = value
+            self.value = float(value)

    class DummyImage:
        def __init__(self, arr):
            self.arr = arr

-        def compress(self, *a, **k):
-            return self
-
    def dummy_log(key, obj=None, **kwargs):
        # Accept either positional `obj` or keyword `entity` and record remaining kwargs.
        if obj is None and "entity" in kwargs:
            obj = kwargs.pop("entity")
        calls.append((key, obj, kwargs))

-    def dummy_send_blueprint(blueprint, *a, **k):
-        blueprints.append(blueprint)
-
-    # Mock the `rerun.blueprint` submodule used to build the layout.
-    dummy_rrb = SimpleNamespace(
-        Spatial2DView=lambda origin=None, name=None: SimpleNamespace(
-            kind="Spatial2DView", origin=origin, name=name
-        ),
-        TimeSeriesView=lambda name=None, contents=None: SimpleNamespace(
-            kind="TimeSeriesView", name=name, contents=contents
-        ),
-        Grid=lambda *views: SimpleNamespace(kind="Grid", views=list(views)),
-        Blueprint=lambda root: SimpleNamespace(kind="Blueprint", root=root),
-    )
-
    dummy_rr = SimpleNamespace(
        __name__="rerun",
        __package__="rerun",
@@ -77,23 +56,20 @@ def mock_rerun(monkeypatch):
        Scalars=DummyScalar,
        Image=DummyImage,
        log=dummy_log,
-        send_blueprint=dummy_send_blueprint,
        init=lambda *a, **k: None,
        spawn=lambda *a, **k: None,
-        blueprint=dummy_rrb,
    )

-    # Inject fake modules into sys.modules (both `rerun` and `rerun.blueprint`).
+    # Inject fake module into sys.modules
    monkeypatch.setitem(sys.modules, "rerun", dummy_rr)
-    monkeypatch.setitem(sys.modules, "rerun.blueprint", dummy_rrb)

    # Now import and reload the module under test, to bind to our rerun mock
    import lerobot.utils.visualization_utils as vu

    importlib.reload(vu)

-    # Expose the reloaded module, the call recorder and the captured blueprints
-    yield vu, calls, blueprints
+    # Expose both the reloaded module and the call recorder
+    yield vu, calls


 def _keys(calls):
@@ -116,13 +92,8 @@ def _kwargs_for(calls, key):
    raise KeyError(f"Key {key} not found in calls: {calls}")


-def _views_by_kind(blueprint, kind):
-    """Return the views of a given kind from the (single) blueprint's grid."""
-    return [v for v in blueprint.root.views if v.kind == kind]
-
-
 def test_log_rerun_data_envtransition_scalars_and_image(mock_rerun):
-    vu, calls, blueprints = mock_rerun
+    vu, calls = mock_rerun

    # Build EnvTransition dict
    obs = {
@@ -132,7 +103,7 @@ def test_log_rerun_data_envtransition_scalars_and_image(mock_rerun):
    }
    act = {
        "action.throttle": 0.7,
-        # 1D array should be logged as a single Scalars batch under one entity path
+        # 1D array should log individual Scalars with suffix _i
        "action.vector": np.array([1.0, 2.0], dtype=np.float32),
    }
    transition = {
@@ -149,28 +120,31 @@ def test_log_rerun_data_envtransition_scalars_and_image(mock_rerun):
    # - observation.state.temperature -> Scalars
    # - observation.camera -> Image (HWC) with static=True
    # - action.throttle -> Scalars
-    # - action.vector -> single Scalars batch (no per-element suffix)
+    # - action.vector_0, action.vector_1 -> Scalars
    expected_keys = {
        f"{OBS_STATE}.temperature",
        "observation.camera",
        "action.throttle",
-        "action.vector",
+        "action.vector_0",
+        "action.vector_1",
    }
    assert set(_keys(calls)) == expected_keys

    # Check scalar types and values
    temp_obj = _obj_for(calls, f"{OBS_STATE}.temperature")
    assert type(temp_obj).__name__ == "DummyScalar"
-    assert float(temp_obj.value) == pytest.approx(25.0)
+    assert temp_obj.value == pytest.approx(25.0)

    throttle_obj = _obj_for(calls, "action.throttle")
    assert type(throttle_obj).__name__ == "DummyScalar"
-    assert float(throttle_obj.value) == pytest.approx(0.7)
+    assert throttle_obj.value == pytest.approx(0.7)

-    # 1D vector logged as a single batched Scalars under one entity path
-    vec = _obj_for(calls, "action.vector")
-    assert type(vec).__name__ == "DummyScalar"
-    np.testing.assert_allclose(np.asarray(vec.value), [1.0, 2.0])
+    v0 = _obj_for(calls, "action.vector_0")
+    v1 = _obj_for(calls, "action.vector_1")
+    assert type(v0).__name__ == "DummyScalar"
+    assert type(v1).__name__ == "DummyScalar"
+    assert v0.value == pytest.approx(1.0)
+    assert v1.value == pytest.approx(2.0)

    # Check image handling: CHW -> HWC
    img_obj = _obj_for(calls, "observation.camera")
@@ -178,24 +152,9 @@ def test_log_rerun_data_envtransition_scalars_and_image(mock_rerun):
    assert img_obj.arr.shape == (10, 20, 3)  # transposed
    assert _kwargs_for(calls, "observation.camera").get("static", False) is True  # static=True for images

-    # A blueprint should have been built and sent exactly once, and cached on the function.
-    assert len(blueprints) == 1
-    assert vu.log_rerun_data.blueprint is blueprints[0]
-
-    bp = blueprints[0]
-    # One spatial view per image path
-    spatial_views = _views_by_kind(bp, "Spatial2DView")
-    assert {v.origin for v in spatial_views} == {"observation.camera"}
-
-    # One time-series view each for observation and action scalars
-    ts_views = {v.name: v for v in _views_by_kind(bp, "TimeSeriesView")}
-    assert set(ts_views) == {"observation", "action"}
-    assert ts_views["observation"].contents == [f"{OBS_STATE}.temperature"]
-    assert ts_views["action"].contents == ["action.throttle", "action.vector"]
-

 def test_log_rerun_data_plain_list_ordering_and_prefixes(mock_rerun):
-    vu, calls, blueprints = mock_rerun
+    vu, calls = mock_rerun

    # First dict without prefixes treated as observation
    # Second dict without prefixes treated as action
@@ -214,12 +173,14 @@ def test_log_rerun_data_plain_list_ordering_and_prefixes(mock_rerun):
    # First dict was treated as observation, second as action
    vu.log_rerun_data(observation=obs_plain, action=act_plain)

-    # Expected keys with auto-prefixes. The 1D vector is a single batched Scalars.
+    # Expected keys with auto-prefixes
    expected = {
        "observation.temp",
        "observation.img",
        "action.throttle",
-        "action.vec",
+        "action.vec_0",
+        "action.vec_1",
+        "action.vec_2",
    }
    logged = set(_keys(calls))
    assert logged == expected
@@ -227,11 +188,11 @@ def test_log_rerun_data_plain_list_ordering_and_prefixes(mock_rerun):
    # Scalars
    t = _obj_for(calls, "observation.temp")
    assert type(t).__name__ == "DummyScalar"
-    assert float(t.value) == pytest.approx(1.5)
+    assert t.value == pytest.approx(1.5)

    throttle = _obj_for(calls, "action.throttle")
    assert type(throttle).__name__ == "DummyScalar"
-    assert float(throttle.value) == pytest.approx(0.3)
+    assert throttle.value == pytest.approx(0.3)

    # Image stays HWC
    img = _obj_for(calls, "observation.img")
@@ -239,23 +200,15 @@ def test_log_rerun_data_plain_list_ordering_and_prefixes(mock_rerun):
    assert img.arr.shape == (5, 6, 3)
    assert _kwargs_for(calls, "observation.img").get("static", False) is True

-    # Vector logged as a single batched Scalars under one entity path
-    vec = _obj_for(calls, "action.vec")
-    assert type(vec).__name__ == "DummyScalar"
-    np.testing.assert_allclose(np.asarray(vec.value), [9, 8, 7])
-
-    # Blueprint sent once with the expected view layout
-    assert len(blueprints) == 1
-    bp = blueprints[0]
-    spatial_views = _views_by_kind(bp, "Spatial2DView")
-    assert {v.origin for v in spatial_views} == {"observation.img"}
-    ts_views = {v.name: v for v in _views_by_kind(bp, "TimeSeriesView")}
-    assert ts_views["observation"].contents == ["observation.temp"]
-    assert ts_views["action"].contents == ["action.throttle", "action.vec"]
+    # Vectors
+    for i, val in enumerate([9, 8, 7]):
+        o = _obj_for(calls, f"action.vec_{i}")
+        assert type(o).__name__ == "DummyScalar"
+        assert o.value == pytest.approx(val)


 def test_log_rerun_data_kwargs_only(mock_rerun):
-    vu, calls, blueprints = mock_rerun
+    vu, calls = mock_rerun

    vu.log_rerun_data(
        observation={"observation.temp": 10.0, "observation.gray": np.zeros((8, 8, 1), dtype=np.uint8)},
@@ -269,7 +222,7 @@ def test_log_rerun_data_kwargs_only(mock_rerun):

    temp = _obj_for(calls, "observation.temp")
    assert type(temp).__name__ == "DummyScalar"
-    assert float(temp.value) == pytest.approx(10.0)
+    assert temp.value == pytest.approx(10.0)

    img = _obj_for(calls, "observation.gray")
    assert type(img).__name__ == "DummyImage"
@@ -278,26 +231,4 @@ def test_log_rerun_data_kwargs_only(mock_rerun):

    a = _obj_for(calls, "action.a")
    assert type(a).__name__ == "DummyScalar"
-    assert float(a.value) == pytest.approx(1.0)
-
-    # Blueprint sent once, with a spatial view for the image and time-series views for scalars
-    assert len(blueprints) == 1
-    bp = blueprints[0]
-    assert {v.origin for v in _views_by_kind(bp, "Spatial2DView")} == {"observation.gray"}
-    ts_views = {v.name: v for v in _views_by_kind(bp, "TimeSeriesView")}
-    assert ts_views["observation"].contents == ["observation.temp"]
-    assert ts_views["action"].contents == ["action.a"]
-
-
-def test_log_rerun_data_blueprint_sent_only_once(mock_rerun):
-    """The blueprint is built from the first call and not resent on subsequent calls."""
-    vu, calls, blueprints = mock_rerun
-
-    vu.log_rerun_data(observation={"temp": 1.0}, action={"a": 2.0})
-    assert len(blueprints) == 1
-    first_blueprint = vu.log_rerun_data.blueprint
-
-    vu.log_rerun_data(observation={"temp": 3.0}, action={"a": 4.0})
-    # Still only one blueprint, and the cached one is unchanged.
-    assert len(blueprints) == 1
-    assert vu.log_rerun_data.blueprint is first_blueprint
+    assert a.value == pytest.approx(1.0)
@@ -1,5 +1,5 @@
 version = 1
-revision = 2
+revision = 3
 requires-python = ">=3.12"
 resolution-markers = [
    "(python_full_version >= '3.15' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'x86_64' and sys_platform == 'linux')",
@@ -3257,7 +3257,7 @@ requires-dist = [
    { name = "qwen-vl-utils", marker = "extra == 'qwen-vl-utils-dep'", specifier = ">=0.0.11,<0.1.0" },
    { name = "reachy2-sdk", marker = "extra == 'reachy2'", specifier = ">=1.0.15,<1.1.0" },
    { name = "requests", specifier = ">=2.32.0,<3.0.0" },
-    { name = "rerun-sdk", marker = "extra == 'viz'", specifier = ">=0.24.0,<0.34.0" },
+    { name = "rerun-sdk", marker = "extra == 'viz'", specifier = ">=0.24.0,<0.27.0" },
    { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.14.1" },
    { name = "safetensors", specifier = ">=0.4.3,<1.0.0" },
    { name = "scikit-image", marker = "extra == 'video-benchmark'", specifier = ">=0.23.2,<0.26.0" },
@@ -5636,21 +5636,21 @@ wheels = [

 [[package]]
 name = "rerun-sdk"
-version = "0.33.0"
+version = "0.26.2"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "attrs" },
    { name = "numpy" },
    { name = "pillow" },
-    { name = "psutil" },
    { name = "pyarrow" },
    { name = "typing-extensions" },
 ]
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/31/17/5a521e86ac0064bd0f452e3e98e2422433511b54110423c0217d2cc1234f/rerun_sdk-0.33.0-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:97f123e3ef6aa69b60194bc566e5435c7d4040757ed4f58297ea46c8ef320c5c", size = 125707606, upload-time = "2026-05-29T09:42:53.584Z" },
-    { url = "https://files.pythonhosted.org/packages/34/2f/2ca2599aca03b69fbcac7c8391ef50376968edd7c58b96de53a4b7f20624/rerun_sdk-0.33.0-cp310-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:8f734cf59419dcfbc46915bea6cec030224f16e96c3a597f0ccf7cb7b058dd43", size = 135271020, upload-time = "2026-05-29T09:43:00.106Z" },
-    { url = "https://files.pythonhosted.org/packages/2e/ba/d70997b43e6db4f58c4326c29c6a6a384ddc6c2fe125f231c885ad9b3b1f/rerun_sdk-0.33.0-cp310-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:53d95609f8b330026bcd041bf6d11b46ee1c18b6fbde155135f291fe86328eeb", size = 139552018, upload-time = "2026-05-29T09:43:06.275Z" },
-    { url = "https://files.pythonhosted.org/packages/14/a5/0cac294d16aff6c9a2f183f838428a0380b4d2fd9e053bb37b3041999ad5/rerun_sdk-0.33.0-cp310-abi3-win_amd64.whl", hash = "sha256:b152992a72ec240062c8c285bd30ab681b464a25efbe1464c66fdac82320de1f", size = 120418186, upload-time = "2026-05-29T09:43:13.733Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/4a/767c20e1529d74d9be5b5e55c6c26b63a6918ef3c1709fc422d08a460114/rerun_sdk-0.26.2-cp39-abi3-macosx_10_12_x86_64.whl", hash = "sha256:3d4151c9a3484e112b53d1df90c8fa07397dc7b8bfbb420f09e011eff20f1ef2", size = 93349439, upload-time = "2025-10-27T11:34:10.745Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/3d/d8dd0af9c287a85d51ec99d69406cc4b94a9feb1d6f192d3bbcaac9f0b81/rerun_sdk-0.26.2-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:03977d2aba4966d9a70b682eca196123fda11408fecd733441ede9916c6341e2", size = 86323042, upload-time = "2025-10-27T11:34:17.995Z" },
+    { url = "https://files.pythonhosted.org/packages/13/29/53d8d98799ab32418fd4ba6834d6a5749c31f56160d3c87f52a7219887e9/rerun_sdk-0.26.2-cp39-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:b6128c3c4f014cae5be18e4d37657c5932d1bcdb2ce5e9d4b488a6eed47f7437", size = 92677274, upload-time = "2025-10-27T11:34:22.601Z" },
+    { url = "https://files.pythonhosted.org/packages/f5/86/0b9c8f56398b4fc85f8e99279907c258413a297e5603f8f2537fe5806e51/rerun_sdk-0.26.2-cp39-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:a6f97b60aaa7d4e8c6124a3f6b97ce9dbd09520050955f0e0bdacb72b0eb106a", size = 98768129, upload-time = "2025-10-27T11:34:27.36Z" },
+    { url = "https://files.pythonhosted.org/packages/be/e7/99fc91c0f99f69d7d43e1db0a6f6cb8273ffc02111539bfc1fee43749bad/rerun_sdk-0.26.2-cp39-abi3-win_amd64.whl", hash = "sha256:a493ad6c8357022cba2ca6f8954a81d0faf984b0b22154eb1d976bfc7649df63", size = 84267089, upload-time = "2025-10-27T11:34:32.023Z" },
 ]

 [[package]]
Author	SHA1	Message	Date
Claude	7a62235bac	fix(datasets): guard Feistel cycle-walking loop against non-convergence Replace the unbounded while True in EpisodeAwareSampler._permute with a bounded for loop capped at _MAX_CYCLE_WALK_STEPS (100) and raise RuntimeError if the cycle-walk fails to land in [0, num_frames). The loop is expected to converge in <4 steps on the chosen power-of-two domain, so the bound is a safety net that should never trip in practice but prevents a pathological infinite loop. https://claude.ai/code/session_01HQ15tFrBsHYScjGWosEv22	2026-06-11 13:20:31 +00:00
Pepijn	81f0ca9ce4	test(sampler): drain resumed trillion-frame sampler via iter() to avoid list() prealloc list(sampler) calls PyObject_LengthHint -> __len__ (the full 10**12 epoch length) and preallocates that many slots before iterating, OOMing even though the resumed epoch only yields 3 frames. Collect through the iterator (no length hint) so the test exercises the real O(1) seek/drain instead of CPython's list growth heuristic.	2026-06-11 10:39:13 +00:00
Pepijn	29ca0f53d9	feat(datasets): default EpisodeAwareSampler to deterministic mode and trim comments deterministic=True is now the class default as well as the training default; the legacy RNG path requires an explicit deterministic=False (the train script's non-deterministic branch passes it). Docstrings and inline comments slimmed down across the changed files. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 11:54:22 +02:00
Pepijn	b2d5d4ccfc	feat(train): enable deterministic_sampler by default Deterministic data order (sample-exact resume, no cross-rank RNG sync, O(1) sampler memory) is now the default for map-style training; set deterministic_sampler=false to restore the legacy RNG-based shuffle. Streaming datasets ignore the flag (the sampler path only applies to map-style datasets), replacing the previous hard validation error so streaming configs keep working with the new default. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 11:45:36 +02:00
Pepijn	32b0d7d1ef	refactor(datasets): fold deterministic mode into EpisodeAwareSampler Instead of a parallel DeterministicEpisodeAwareSampler class, extend the existing EpisodeAwareSampler with a deterministic=True mode (seeded Feistel permutation, epoch auto-advance, state_dict/load_state_dict). The default mode is behavior-identical: same torch.randperm consumption and the same generator contract accelerate synchronizes; the O(N) Python index list is replaced by O(num_episodes) boundary arrays in both modes, with `indices` kept as a back-compat property. Passing a generator together with deterministic=True is rejected, and the state/seek methods raise outside deterministic mode. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 11:37:44 +02:00
Pepijn	7416b714c0	Merge remote-tracking branch 'origin/main' into feat/deterministic-sampler	2026-06-11 11:33:44 +02:00
Pepijn	41166b39fb	fix(train): synchronize EpisodeAwareSampler shuffling across ranks and gate dataset download per node (#3768) * fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync In distributed training, accelerate can only synchronize the shuffle permutation across ranks when the sampler exposes a generator attribute. EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch shards relied on every rank's global CPU RNG staying in lockstep forever; any rank-asymmetric RNG consumption (e.g. eval rollouts on the main process only) silently desynced the permutations and ranks trained on overlapping/missing samples. * fix(train): seed sampler generator and gate dataset download per node - Pass a generator seeded with cfg.seed to EpisodeAwareSampler so accelerator.prepare registers it as the synchronized RNG and the shuffle order is reproducible. - Gate the initial make_dataset call on is_local_main_process instead of is_main_process: the global main process only exists on node 0, so on every other node all local ranks were downloading the dataset and building the Arrow cache concurrently.	2026-06-11 11:07:42 +02:00
Pepijn	6fa495c6b0	feat(datasets): add DeterministicEpisodeAwareSampler with O(1) memory and sample-exact resume Add a sampler that never materializes frame indices: it stores only per-episode boundaries (numpy, a few bytes per episode) and maps logical positions to frame indices on the fly with searchsorted. Shuffling uses a seeded Feistel permutation over [0, num_frames) (cycle-walking to the exact domain), so the data order is a pure function of (seed, epoch): - no RNG state to synchronize across distributed ranks, - constant memory and zero epoch-boundary cost at any dataset size, - O(1) seek to any position, enabling sample-exact resume. Opt in with --deterministic_sampler=true. On resume, lerobot-train maps the checkpointed step back to (epoch, start_index) via compute_sampler_state and continues at the exact sample where the run left off (up to accelerate's even_batches padding at epoch boundaries). The shuffle is pseudo-random rather than a true uniform permutation, the standard trade-off in large-scale training loaders. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 10:33:52 +02:00
Pepijn	72e093dbff	fix(train): seed sampler generator and gate dataset download per node - Pass a generator seeded with cfg.seed to EpisodeAwareSampler so accelerator.prepare registers it as the synchronized RNG and the shuffle order is reproducible. - Gate the initial make_dataset call on is_local_main_process instead of is_main_process: the global main process only exists on node 0, so on every other node all local ranks were downloading the dataset and building the Arrow cache concurrently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 10:01:43 +02:00
Pepijn	3d262a6c9e	fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync In distributed training, accelerate can only synchronize the shuffle permutation across ranks when the sampler exposes a generator attribute. EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch shards relied on every rank's global CPU RNG staying in lockstep forever; any rank-asymmetric RNG consumption (e.g. eval rollouts on the main process only) silently desynced the permutations and ranks trained on overlapping/missing samples. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 10:01:42 +02:00
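
The commit messages above describe the shuffle as a seeded Feistel permutation over [0, num_frames), cycle-walked to the exact domain and bounded by _MAX_CYCLE_WALK_STEPS. The following is a minimal sketch of that technique, not the PR's actual EpisodeAwareSampler._permute. The round function (a BLAKE2b hash), the round count of 4, the way seed and epoch are mixed in, and the helper names permute and epoch_order are all assumptions made for illustration. Only the constant name, its value of 100, and the RuntimeError on non-convergence come from the commit message.

```python
import hashlib

# Name and value follow the commit message; everything else here is illustrative.
_MAX_CYCLE_WALK_STEPS = 100


def _round_value(half: int, seed: int, epoch: int, round_idx: int, half_bits: int) -> int:
    """Hypothetical round function: hash (seed, epoch, round, half) down to half_bits bits."""
    payload = f"{seed}:{epoch}:{round_idx}:{half}".encode()
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little") & ((1 << half_bits) - 1)


def permute(position: int, num_frames: int, seed: int, epoch: int, rounds: int = 4) -> int:
    """Map a logical position in [0, num_frames) to a shuffled rank in the same range."""
    if not 0 <= position < num_frames:
        raise IndexError(f"position {position} outside [0, {num_frames})")
    # Smallest even bit width whose power-of-two domain covers num_frames,
    # so the value splits into two equal halves for a balanced Feistel network.
    bits = max(2, (num_frames - 1).bit_length())
    bits += bits % 2
    half_bits = bits // 2
    mask = (1 << half_bits) - 1

    value = position
    for _ in range(_MAX_CYCLE_WALK_STEPS):
        left, right = value >> half_bits, value & mask
        for round_idx in range(rounds):
            left, right = right, left ^ _round_value(right, seed, epoch, round_idx, half_bits)
        value = (left << half_bits) | right
        # Cycle-walk: the network is a bijection on the power-of-two domain, so
        # re-applying it until the value lands in range yields a bijection on
        # [0, num_frames).
        if value < num_frames:
            return value
    raise RuntimeError(f"cycle-walk did not converge in {_MAX_CYCLE_WALK_STEPS} steps")


def epoch_order(num_frames: int, seed: int, epoch: int, start_index: int = 0):
    """Lazily yield the shuffled order of one epoch, optionally resuming mid-epoch."""
    for position in range(start_index, num_frames):
        yield permute(position, num_frames, seed, epoch)
```

Because each output depends only on (seed, epoch, position), resuming at start_index needs no stored index list or RNG state: the generator simply starts at that position. In this sketch the power-of-two domain is less than four times num_frames, so each application lands in range with probability above 1/4, which is consistent with the small expected step count the commit message mentions.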