Merge remote-tracking branch 'origin/main' into feat/smolvla-on-steerable

# Conflicts:
# docs/source/annotation_pipeline.mdx
# examples/annotations/run_hf_job.py
# pyproject.toml
# src/lerobot/annotations/steerable_pipeline/config.py
# src/lerobot/annotations/steerable_pipeline/frames.py
# src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
# src/lerobot/annotations/steerable_pipeline/vlm_client.py
# src/lerobot/annotations/steerable_pipeline/writer.py
# src/lerobot/datasets/__init__.py
# src/lerobot/datasets/sampler.py
# src/lerobot/scripts/lerobot_annotate.py
# src/lerobot/scripts/lerobot_train.py
# tests/annotations/test_frames.py
# tests/annotations/test_modules.py
# tests/annotations/test_writer.py
# tests/datasets/test_sampler.py
# tests/scripts/test_lerobot_annotate.py
# uv.lock
2026-06-27 13:17:10 +00:00 · 2026-06-23 11:07:53 +02:00
parent 3427499212 73782447f2
commit 4dbe83d3bc
91 changed files with 4267 additions and 2012 deletions
@@ -71,11 +71,21 @@ it uses a two-step **describe → segment** flow:
 2. **Segment** — that description is fed back in, and the VLM splits the
   episode into consecutive atomic subtasks.

+Both passes see the episode as **timestamped contact sheets** — frames
+sampled at `frames_per_second` (default `2.0`, i.e. one frame every 0.5s) and packed into JPEG
+grids with each frame's time burned into its corner, so the VLM cites
+exact boundary times directly. This is far cheaper in vision tokens than
+one image per frame, so the sampling can stay dense; episodes longer than
+`max_frames_per_prompt` are split into windows at the same density and
+merged. Both prompts also carry a causal **event-boundary** definition (a
+new event starts when an object becomes held / is released / reaches a new
+location / a lid changes state / contents move) to sharpen where cuts land.
+
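As a rough illustration of the sampling and auto-windowing described above, the sketch below computes sample timestamps and chunks them into windows. Only `frames_per_second` and `max_frames_per_prompt` (with their documented defaults) come from the docs; the function names and the simple consecutive-chunk split are assumptions for illustration, not the pipeline's actual implementation.

```python
def sample_timestamps(duration_s: float, frames_per_second: float = 2.0) -> list[float]:
    """Timestamps (seconds) sampled at a fixed rate over [0, duration_s)."""
    step = 1.0 / frames_per_second
    n_frames = int(duration_s * frames_per_second)
    return [round(i * step, 3) for i in range(n_frames)]


def split_into_windows(
    timestamps: list[float], max_frames_per_prompt: int = 60
) -> list[list[float]]:
    """Chunk timestamps into consecutive windows of at most max_frames_per_prompt frames.

    Every window keeps the original sampling density; only the number of
    frames per VLM call is bounded. (Hypothetical helper, not the real API.)
    """
    return [
        timestamps[i : i + max_frames_per_prompt]
        for i in range(0, len(timestamps), max_frames_per_prompt)
    ]


# A hypothetical 45 s episode at the default 2 fps yields 90 frames,
# which exceeds the 60-frame budget and is therefore split in two.
ts = sample_timestamps(45.0)
windows = split_into_windows(ts)
```

Under these assumptions the first window covers 0.0s to 29.5s and the second starts at 30.0s, so boundary times cited by the VLM stay on the same 0.5s grid in every window.
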
 The resulting spans are then stitched into a gap-free, full-episode
 cover, so **every frame has exactly one active subtask**. See
 [`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-for the production settings (single camera, embedded frames, windowed
-subtask generation).
+for the production settings (single camera, timestamped contact sheets,
+auto-windowed subtask generation).
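
One way to picture the stitching step is the sketch below. It assumes each span is a `(start, end, label)` tuple in seconds and closes gaps or overlaps by ending each span where the next one starts; the helper name and this particular resolution rule are hypothetical, and the real pipeline may stitch differently.

```python
def stitch_spans(
    spans: list[tuple[float, float, str]], episode_end: float
) -> list[tuple[float, float, str]]:
    """Return a gap-free, non-overlapping cover of [0, episode_end].

    Assumed rule: the first span is snapped to start at 0, each span ends
    where the next one begins, and the last span is extended to episode_end.
    """
    ordered = sorted(spans, key=lambda s: s[0])
    if not ordered:
        return []
    starts = [0.0] + [s[0] for s in ordered[1:]]
    stitched = []
    for i, (_, _, label) in enumerate(ordered):
        end = starts[i + 1] if i + 1 < len(ordered) else episode_end
        stitched.append((starts[i], end, label))
    return stitched


# Illustrative VLM output with a gap (3.0-3.5) and an overlap (6.5-7.0).
raw = [
    (0.5, 3.0, "reach for the cup"),
    (3.5, 7.0, "lift the cup"),
    (6.5, 9.0, "place the cup"),
]
cover = stitch_spans(raw, episode_end=10.0)
```

With this rule every timestamp in the episode falls inside exactly one span, which is the property the text relies on when it says every frame has exactly one active subtask.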

 ### Tools

@@ -162,15 +172,15 @@ Every module is on by default and can be toggled independently (set to

 | Flag                            | Default    | What it does                                                                                                              |
 | ------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------- |
-| `--plan.frames_per_second`      | `1.0`      | How densely the episode video is sampled.                                                                                 |
-| `--plan.max_video_frames`       | `32`       | Hard cap on frames per call (context-budget guard — don't exceed ~32 for a 32k context).                                  |
-| `--plan.subtask_window_seconds` | `0`        | Split long episodes into fixed windows for constant frame density (`0` = whole episode).                                  |
+| `--plan.frames_per_second`      | `2.0`      | Frame sampling rate for the contact sheets (`2.0` = one frame every 0.5s).                                                |
+| `--plan.max_frames_per_prompt`  | `60`       | Frame budget per VLM call. Episodes whose sampling exceeds this are auto-windowed at the same density, then stitched.     |
+| `--plan.contact_sheet_columns`  | `5`        | Columns per contact-sheet grid; each sheet holds `contact_sheet_frames_per_sheet` tiles in row-major time order.          |
 | `--plan.plan_max_steps`         | `8`        | Upper bound on subtasks per episode.                                                                                      |
 | `--plan.subtask_describe_first` | `true`     | Run the describe→segment grounding pass (best subtask quality; +1 call/episode).                                          |
 | `--plan.emit_plan`              | `true`     | Emit the numbered `plan` rows (`false` = subtasks + memory only).                                                         |
+| `--plan.emit_memory`            | `true`     | Emit the `memory` rows (`false` = subtasks + plan only); symmetric to `emit_plan`.                                        |
 | `--plan.n_task_rephrasings`     | `10`       | How many `task_aug` rephrasings to emit (`0` disables).                                                                   |
 | `--plan.derive_task_from_video` | `if_short` | Use the dataset task as-is (`off`), only when it's missing/short (`if_short`), or always re-derive from video (`always`). |
-| `--plan.use_video_url`          | `false`    | Send a server-side video clip instead of embedded frames.                                                                 |

 ### Interjections + VQA

@@ -57,11 +57,11 @@ The `lerobot-rollout --strategy.type=dagger` mode requires **teleoperators with

 **Compatible teleoperators:**

- `openarm_mini` - OpenArm Mini
+- `bi_openarm_mini` - Bimanual OpenArm Mini
 - `so_leader` - SO100 / SO101 leader arm

 > [!IMPORTANT]
-> The provided commands default to `bi_openarm_follower` + `openarm_mini`.
+> The provided commands default to `bi_openarm_follower` + `bi_openarm_mini`.
 > `so_follower` + `so_leader` configs are also registered and can be used via CLI flags.

 ---
@@ -104,9 +104,9 @@ lerobot-rollout --strategy.type=dagger \
    --robot.right_arm_config.port=can0 \
    --robot.right_arm_config.side=right \
    --robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
-    --teleop.type=openarm_mini \
-    --teleop.port_left=/dev/ttyACM0 \
-    --teleop.port_right=/dev/ttyACM1 \
+    --teleop.type=bi_openarm_mini \
+    --teleop.left_arm_config.port=/dev/ttyACM0 \
+    --teleop.right_arm_config.port=/dev/ttyACM1 \
    --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
    --dataset.repo_id=your-username/rollout_hil_dataset \
    --dataset.single_task="Fold the T-shirt properly" \
@@ -131,9 +131,9 @@ lerobot-rollout --strategy.type=dagger \
    --robot.right_arm_config.port=can0 \
    --robot.right_arm_config.side=right \
    --robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
-    --teleop.type=openarm_mini \
-    --teleop.port_left=/dev/ttyACM0 \
-    --teleop.port_right=/dev/ttyACM1 \
+    --teleop.type=bi_openarm_mini \
+    --teleop.left_arm_config.port=/dev/ttyACM0 \
+    --teleop.right_arm_config.port=/dev/ttyACM1 \
    --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
    --dataset.repo_id=your-username/rollout_hil_rtc_dataset \
    --dataset.single_task="Fold the T-shirt properly" \
@@ -117,7 +117,7 @@ lerobot-rollout \
    --strategy.num_episodes=20 \
    --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
    --robot.type=bi_openarm_follower \
-    --teleop.type=openarm_mini \
+    --teleop.type=bi_openarm_mini \
    --dataset.repo_id=${HF_USER}/rollout_hil_data \
    --dataset.single_task="Fold the T-shirt"
 ```
@@ -113,6 +113,61 @@ accelerate launch --num_processes=2 $(which lerobot-train) \
  --policy=act
 ```

+## Training Large Models with FSDP
+
+DDP replicates the full model on every GPU, so a model that doesn't fit on one GPU won't fit under
+DDP either. For large models, use **FSDP** (Fully Sharded Data Parallel), which shards parameters,
+gradients, and optimizer state across GPUs. See the [accelerate FSDP guide](https://huggingface.co/docs/accelerate/usage_guides/fsdp) for background.
+
+For example, to launch LeRobot training with FSDP across 4 GPUs on a single machine:
+
+```bash
+accelerate launch --config_file fsdp.yaml --num_processes=4 $(which lerobot-train) \
+  --dataset.repo_id=${HF_USER}/my_dataset \
+  --policy.type=<your_policy> \
+  --output_dir=outputs/train/my_policy_fsdp
+```
+
+A minimal `fsdp.yaml` (FSDP1; shards params/grads/optimizer — ZeRO-3-equivalent):
+
+```yaml
+compute_environment: LOCAL_MACHINE
+distributed_type: FSDP
+mixed_precision: bf16
+num_machines: 1
+num_processes: 4
+fsdp_config:
+  fsdp_version: 1
+  fsdp_sharding_strategy: FULL_SHARD # params + grads + optimizer (ZeRO-3)
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: <YourTransformerBlock> # repeated block class to shard
+  fsdp_use_orig_params: true # required: optimizer is built pre-prepare
+  fsdp_state_dict_type: FULL_STATE_DICT
+```
+
+Set `fsdp_transformer_layer_cls_to_wrap` to your model's repeated transformer-block class so each
+block is sharded as its own unit. `fsdp_use_orig_params: true` is required because LeRobot builds the
+optimizer before `accelerator.prepare()`.
+
+### FSDP checkpoints
+
+LeRobot gathers the full state dict across all ranks and the main process writes it as a single
+`model.safetensors`, loadable as usual with `Policy.from_pretrained(...)`. Two things to look out for:
+
+- **Checkpoints store fp32 weights.** Under mixed precision (`bf16`/`fp16`) FSDP keeps an fp32 master
+  copy, and the checkpoint saves it (~2× the bf16 size on disk) so training can resume consistently
+  with the fp32 optimizer state; `from_pretrained` casts back to the policy dtype on load. One caveat:
+  the fp32 checkpoint is materialized in full precision on the target device _before_ casting, so
+  loading it for inference on a memory-constrained GPU can OOM even when the bf16 model would fit.
+  To avoid this, load on CPU first, or cast `model.safetensors` to the deployment dtype offline.
+- The sharded optimizer state is gathered into a full (world-size-independent) state dict and saved
+  alongside the model in the same `optimizer_state.safetensors` / `optimizer_param_groups.json`
+  format as single-GPU training, so **resume-from-checkpoint is supported** with `--resume=true`.
+  Resume reshards both the model and the optimizer state to the _current_ FSDP topology, so you can
+  resume an FSDP checkpoint on a different number of GPUs. The data sampler, however, resumes at the
+  exact sample only when the world size and batch size match the original run (a warning is logged
+  otherwise); the optimizer and model state are unaffected.
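
To make the "~2× the bf16 size" point concrete, here is a back-of-the-envelope sketch of weight memory by dtype. The parameter count is a made-up example and the helper is hypothetical; it only counts raw weight bytes and ignores file headers, activations, and optimizer state.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weights_size_gib(num_params: int, dtype: str) -> float:
    """Approximate size of the raw weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


num_params = 450_000_000  # hypothetical model size, for illustration only
fp32_gib = weights_size_gib(num_params, "fp32")
bf16_gib = weights_size_gib(num_params, "bf16")
print(f"fp32: {fp32_gib:.2f} GiB, bf16: {bf16_gib:.2f} GiB")
```

The fp32 figure is what has to fit on the target device if the checkpoint is materialized there before casting, which is why loading on CPU first or casting offline can help on a memory-constrained GPU.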
+
 ## Notes

 - The `--policy.use_amp` flag in `lerobot-train` is only used when **not** running with accelerate. When using accelerate, mixed precision is controlled by accelerate's configuration.