mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-26 20:57:28 +00:00
Merge remote-tracking branch 'origin/main' into feat/smolvla-on-steerable
# Conflicts:
#   docs/source/annotation_pipeline.mdx
#   examples/annotations/run_hf_job.py
#   pyproject.toml
#   src/lerobot/annotations/steerable_pipeline/config.py
#   src/lerobot/annotations/steerable_pipeline/frames.py
#   src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
#   src/lerobot/annotations/steerable_pipeline/vlm_client.py
#   src/lerobot/annotations/steerable_pipeline/writer.py
#   src/lerobot/datasets/__init__.py
#   src/lerobot/datasets/sampler.py
#   src/lerobot/scripts/lerobot_annotate.py
#   src/lerobot/scripts/lerobot_train.py
#   tests/annotations/test_frames.py
#   tests/annotations/test_modules.py
#   tests/annotations/test_writer.py
#   tests/datasets/test_sampler.py
#   tests/scripts/test_lerobot_annotate.py
#   uv.lock
@@ -71,11 +71,21 @@ it uses a two-step **describe → segment** flow:
 2. **Segment** — that description is fed back in, and the VLM splits the
 episode into consecutive atomic subtasks.

+Both passes see the episode as **timestamped contact sheets** — frames
+sampled at `frames_per_second` (0.5s by default) and packed into JPEG
+grids with each frame's time burned into its corner, so the VLM cites
+exact boundary times directly. This is far cheaper in vision tokens than
+one image per frame, so the sampling can stay dense; episodes longer than
+`max_frames_per_prompt` are split into windows at the same density and
+merged. Both prompts also carry a causal **event-boundary** definition (a
+new event starts when an object becomes held / is released / reaches a new
+location / a lid changes state / contents move) to sharpen where cuts land.
+
 The resulting spans are then stitched into a gap-free, full-episode
 cover, so **every frame has exactly one active subtask**. See
 [`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-for the production settings (single camera, embedded frames, windowed
-subtask generation).
+for the production settings (single camera, timestamped contact sheets,
+auto-windowed subtask generation).

 ### Tools

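The sampling, windowing and stitching described in this hunk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's code: `sample_times`, `split_windows`, `Span` and `stitch` are hypothetical names, the defaults (`2.0` frames per second, `60` frames per prompt) are taken from the flag table in this commit, and the stitching policy (each span runs until the next span starts) is one simple way to get a gap-free cover.

```python
from dataclasses import dataclass


def sample_times(duration_s: float, frames_per_second: float = 2.0) -> list[float]:
    """Timestamps in seconds at a fixed rate, starting at 0.0."""
    n_frames = int(duration_s * frames_per_second) + 1
    return [round(i / frames_per_second, 3) for i in range(n_frames)]


def split_windows(times: list[float], max_frames_per_prompt: int = 60) -> list[list[float]]:
    """Chunk the timestamps so no prompt exceeds the frame budget.

    The sampling density is unchanged; only the number of prompts grows.
    """
    return [
        times[i : i + max_frames_per_prompt]
        for i in range(0, len(times), max_frames_per_prompt)
    ]


@dataclass
class Span:
    start: float
    end: float
    text: str


def stitch(spans: list[Span], duration_s: float) -> list[Span]:
    """Turn possibly gappy spans into a gap-free cover of [0, duration_s].

    Assumed policy: the first span starts at 0, every span ends where the
    next one starts, and the last span ends at the episode end.
    """
    ordered = sorted(spans, key=lambda s: s.start)
    starts = [0.0] + [s.start for s in ordered[1:]]
    ends = starts[1:] + [duration_s]
    return [Span(a, b, s.text) for s, a, b in zip(ordered, starts, ends) if b > a]


if __name__ == "__main__":
    times = sample_times(45.0)           # 45 s episode at 2 frames per second
    windows = split_windows(times)       # two prompts for this episode
    print(len(times), [len(w) for w in windows])
    raw = [Span(0.4, 3.0, "reach"), Span(3.5, 8.0, "grasp"), Span(8.0, 9.0, "place")]
    for span in stitch(raw, duration_s=10.0):
        print(span)
```

With this policy every timestamp in the episode falls inside exactly one stitched span, which is the property the documentation text states.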
@@ -162,15 +172,15 @@ Every module is on by default and can be toggled independently (set to
 | Flag | Default | What it does |
 | ------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------- |
-| `--plan.frames_per_second` | `1.0` | How densely the episode video is sampled. |
-| `--plan.max_video_frames` | `32` | Hard cap on frames per call (context-budget guard — don't exceed ~32 for a 32k context). |
-| `--plan.subtask_window_seconds` | `0` | Split long episodes into fixed windows for constant frame density (`0` = whole episode). |
+| `--plan.frames_per_second` | `2.0` | Frame sampling rate for the contact sheets (`2.0` = one frame every 0.5s). |
+| `--plan.max_frames_per_prompt` | `60` | Frame budget per VLM call. Episodes whose sampling exceeds this are auto-windowed at the same density, then stitched. |
+| `--plan.contact_sheet_columns` | `5` | Columns per contact-sheet grid (`contact_sheet_frames_per_sheet` tiles, time row-major). |
 | `--plan.plan_max_steps` | `8` | Upper bound on subtasks per episode. |
 | `--plan.subtask_describe_first` | `true` | Run the describe→segment grounding pass (best subtask quality; +1 call/episode). |
 | `--plan.emit_plan` | `true` | Emit the numbered `plan` rows (`false` = subtasks + memory only). |
 | `--plan.emit_memory` | `true` | Emit the `memory` rows (`false` = subtasks + plan only); symmetric to `emit_plan`. |
 | `--plan.n_task_rephrasings` | `10` | How many `task_aug` rephrasings to emit (`0` disables). |
 | `--plan.derive_task_from_video` | `if_short` | Use the dataset task as-is (`off`), only when it's missing/short (`if_short`), or always re-derive from video (`always`). |
 | `--plan.use_video_url` | `false` | Send a server-side video clip instead of embedded frames. |

 ### Interjections + VQA

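To make the "time row-major" layout of a contact sheet concrete, here is a small sketch that maps a sampled frame index to its sheet, row and column. `tile_positions` is a hypothetical helper, and the value `20` for frames per sheet is an assumption chosen for illustration; the table only gives the default for `contact_sheet_columns` (`5`).

```python
def tile_positions(
    n_frames: int,
    columns: int = 5,
    frames_per_sheet: int = 20,  # assumed value, not taken from the docs
) -> list[tuple[int, int, int]]:
    """Return (sheet, row, col) for each frame, filling each sheet row by row."""
    positions = []
    for index in range(n_frames):
        sheet, slot = divmod(index, frames_per_sheet)  # which grid, which tile in it
        row, col = divmod(slot, columns)               # row-major inside the grid
        positions.append((sheet, row, col))
    return positions


if __name__ == "__main__":
    frames_per_second = 2.0
    for index, (sheet, row, col) in enumerate(tile_positions(23)):
        timestamp = index / frames_per_second  # the time burned into the tile corner
        print(f"t={timestamp:5.1f}s -> sheet {sheet}, row {row}, col {col}")
```

Reading a sheet left to right, then top to bottom, therefore follows time order, which is what lets the VLM cite boundary times directly from the grid.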
@@ -57,11 +57,11 @@ The `lerobot-rollout --strategy.type=dagger` mode requires **teleoperators with
 **Compatible teleoperators:**

 - `openarm_mini` - OpenArm Mini
 - `bi_openarm_mini` - Bimanual OpenArm Mini
 - `so_leader` - SO100 / SO101 leader arm

 > [!IMPORTANT]
-> The provided commands default to `bi_openarm_follower` + `openarm_mini`.
+> The provided commands default to `bi_openarm_follower` + `bi_openarm_mini`.
 > `so_follower` + `so_leader` configs are also registered and can be used via CLI flags.

 ---
@@ -104,9 +104,9 @@ lerobot-rollout --strategy.type=dagger \
 --robot.right_arm_config.port=can0 \
 --robot.right_arm_config.side=right \
 --robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
---teleop.type=openarm_mini \
---teleop.port_left=/dev/ttyACM0 \
---teleop.port_right=/dev/ttyACM1 \
+--teleop.type=bi_openarm_mini \
+--teleop.left_arm_config.port=/dev/ttyACM0 \
+--teleop.right_arm_config.port=/dev/ttyACM1 \
 --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
 --dataset.repo_id=your-username/rollout_hil_dataset \
 --dataset.single_task="Fold the T-shirt properly" \
@@ -131,9 +131,9 @@ lerobot-rollout --strategy.type=dagger \
 --robot.right_arm_config.port=can0 \
 --robot.right_arm_config.side=right \
 --robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
---teleop.type=openarm_mini \
---teleop.port_left=/dev/ttyACM0 \
---teleop.port_right=/dev/ttyACM1 \
+--teleop.type=bi_openarm_mini \
+--teleop.left_arm_config.port=/dev/ttyACM0 \
+--teleop.right_arm_config.port=/dev/ttyACM1 \
 --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
 --dataset.repo_id=your-username/rollout_hil_dataset \
 --dataset.single_task="Fold the T-shirt properly" \
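For anyone with saved launch scripts, the flag rename shown in the two hunks above can be applied mechanically. The sketch below is a hypothetical migration helper, not something shipped with LeRobot: `migrate_teleop_args` and `RENAMED_PREFIXES` are invented names, and the mapping is only what this diff shows. Since `openarm_mini` is still listed as a compatible teleoperator, the type is switched only when a bimanual port flag is present.

```python
# Old flag prefix -> new flag prefix, as seen in this commit's doc changes.
RENAMED_PREFIXES = {
    "--teleop.port_left=": "--teleop.left_arm_config.port=",
    "--teleop.port_right=": "--teleop.right_arm_config.port=",
}


def migrate_teleop_args(args: list[str]) -> list[str]:
    """Rewrite pre-rename bimanual teleop flags to the new spelling."""
    migrated = []
    saw_bimanual_port = False
    for arg in args:
        for old, new in RENAMED_PREFIXES.items():
            if arg.startswith(old):
                arg = new + arg[len(old):]
                saw_bimanual_port = True
                break
        migrated.append(arg)
    if saw_bimanual_port:
        # Only bimanual invocations move to the bi_openarm_mini type.
        migrated = [
            "--teleop.type=bi_openarm_mini" if a == "--teleop.type=openarm_mini" else a
            for a in migrated
        ]
    return migrated


if __name__ == "__main__":
    old_args = [
        "--teleop.type=openarm_mini",
        "--teleop.port_left=/dev/ttyACM0",
        "--teleop.port_right=/dev/ttyACM1",
    ]
    print(" \\\n  ".join(migrate_teleop_args(old_args)))
```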
@@ -117,7 +117,7 @@ lerobot-rollout \
 --strategy.num_episodes=20 \
 --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
 --robot.type=bi_openarm_follower \
---teleop.type=openarm_mini \
+--teleop.type=bi_openarm_mini \
 --dataset.repo_id=${HF_USER}/rollout_hil_data \
 --dataset.single_task="Fold the T-shirt"
 ```
@@ -113,6 +113,61 @@ accelerate launch --num_processes=2 $(which lerobot-train) \
 --policy=act
 ```

+## Training Large Models with FSDP
+
+DDP replicates the full model on every GPU, so a model that doesn't fit on one GPU won't fit under
+DDP either. For large models, use **FSDP** (Fully Sharded Data Parallel), which shards parameters,
+gradients, and optimizer state across GPUs. See the [accelerate FSDP guide](https://huggingface.co/docs/accelerate/usage_guides/fsdp) for background.
+
+An example on how to launch LeRobot training with FSDP across 4 GPUs (1 machine):
+
+```bash
+accelerate launch --config_file fsdp.yaml --num_processes=4 $(which lerobot-train) \
+--dataset.repo_id=${HF_USER}/my_dataset \
+--policy.type=<your_policy> \
+--output_dir=outputs/train/my_policy_fsdp
+```
+
+A minimal `fsdp.yaml` (FSDP1; shards params/grads/optimizer — ZeRO-3-equivalent):
+
+```yaml
+compute_environment: LOCAL_MACHINE
+distributed_type: FSDP
+mixed_precision: bf16
+num_machines: 1
+num_processes: 4
+fsdp_config:
+  fsdp_version: 1
+  fsdp_sharding_strategy: FULL_SHARD # params + grads + optimizer (ZeRO-3)
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: <YourTransformerBlock> # repeated block class to shard
+  fsdp_use_orig_params: true # required: optimizer is built pre-prepare
+  fsdp_state_dict_type: FULL_STATE_DICT
+```
+
+Set `fsdp_transformer_layer_cls_to_wrap` to your model's repeated transformer-block class so each
+block is sharded as its own unit. `fsdp_use_orig_params: true` is required because LeRobot builds the
+optimizer before `accelerator.prepare()`.
+
+### FSDP checkpoints
+
+LeRobot gathers the full state dict across all ranks and the main process writes it as a single
+`model.safetensors`, loadable as usual with `Policy.from_pretrained(...)`. Two things to look out for:
+
+- **Checkpoints store fp32 weights.** Under mixed precision (`bf16`/`fp16`) FSDP keeps an fp32 master
+copy, and the checkpoint saves it (~2× the bf16 size on disk) so training can resume consistently
+with the fp32 optimizer state; `from_pretrained` casts back to the policy dtype on load. FSDP-specific
+caveat: an fp32 checkpoint is materialized in full precision on the target device _before_ casting,
+so loading it for inference on a tight GPU can OOM even when the bf16 model would fit — load on CPU
+first, or cast `model.safetensors` to the deployment dtype offline.
+- The sharded optimizer state is gathered into a full (world-size-independent) state dict and saved
+alongside the model in the same `optimizer_state.safetensors` / `optimizer_param_groups.json`
+format as single-GPU training, so **resume-from-checkpoint is supported** with `--resume=true`.
+Resume reshards both the model and the optimizer state to the _current_ FSDP topology, so you can
+resume an FSDP checkpoint on a different number of GPUs. Note that the data sampler is only
+sample-exact when the world size and batch size match the original run (a warning is logged
+otherwise); the optimizer/model state itself is unaffected.
+
 ## Notes

 - The `--policy.use_amp` flag in `lerobot-train` is only used when **not** running with accelerate. When using accelerate, mixed precision is controlled by accelerate's configuration.
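The memory claims in the FSDP section can be checked with rough arithmetic. The sketch below rests on loud assumptions: fp32 parameters and gradients plus two fp32 Adam-style moments (16 bytes per parameter), and no accounting for activations, buffers, or the block that FSDP temporarily gathers during forward and backward. `train_state_bytes` and `checkpoint_bytes` are hypothetical helpers, and the 3-billion-parameter model is an invented example.

```python
BYTES_PER_PARAM_TRAINING = 16  # assumed: 4 (params) + 4 (grads) + 8 (two Adam moments), all fp32


def train_state_bytes(n_params: int, n_gpus: int, sharded: bool) -> int:
    """Per-GPU bytes for params, grads and optimizer state.

    DDP replicates everything on each GPU; FULL_SHARD splits it across GPUs.
    """
    total = n_params * BYTES_PER_PARAM_TRAINING
    return total // n_gpus if sharded else total


def checkpoint_bytes(n_params: int, bytes_per_weight: int) -> int:
    """Size of the saved weights: 4 bytes per weight for fp32, 2 for bf16."""
    return n_params * bytes_per_weight


if __name__ == "__main__":
    n_params = 3_000_000_000  # hypothetical model size
    gib = 2**30
    print(f"DDP, per GPU:        {train_state_bytes(n_params, 4, sharded=False) / gib:.1f} GiB")
    print(f"FSDP (4 GPUs), each: {train_state_bytes(n_params, 4, sharded=True) / gib:.1f} GiB")
    print(f"fp32 checkpoint:     {checkpoint_bytes(n_params, 4) / gib:.1f} GiB")
    print(f"bf16 weights:        {checkpoint_bytes(n_params, 2) / gib:.1f} GiB")
```

The last two lines show why loading an fp32 checkpoint straight onto a tight inference GPU can fail even when the bf16 model would fit: the full-precision copy is twice as large.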