feat(dagger): Add HIL/Dagger/HG-Dagger/RaC style data collection (#2833)

* feat: HIL data collection, RTC interpolator, and action queue improvements

- Add Human-in-the-Loop (HIL) data collection examples (sync + RTC)
- Add HIL data collection documentation
- Add ActionInterpolator for smoother policy control at higher rates
- Integrate interpolator into lerobot-record and eval_with_real_robot
- Add action queue clear() and get_processed_left_over() methods
- Add rtc/__init__.py for cleaner imports

* docs: expand Related Work section with paper summaries

* fix: only record dataset frames at original fps, not at interpolated rate

The interpolator speeds up robot control (e.g. 2x) but dataset frames
should still be recorded at the original fps. Interpolated-only
iterations now only send actions to the robot without writing to the
dataset.
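A minimal, self-contained sketch of the idea (plain Python with toy values, not the actual lerobot-record loop): the robot is driven at base fps times the multiplier, but only the non-interpolated ticks write a dataset frame.

```python
import numpy as np

# Toy illustration: interpolate between consecutive policy actions so the robot
# can be driven at base_fps * multiplier, while dataset frames are still written
# only on the original-fps ticks.
base_fps = 30
multiplier = 2
policy_actions = [np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([2.0, 4.0])]

recorded_frames = []
for i in range(len(policy_actions) - 1):
    start, end = policy_actions[i], policy_actions[i + 1]
    for k in range(multiplier):
        action = start + (end - start) * (k / multiplier)  # interpolated control action
        # robot.send_action(action)  # every tick would drive the robot
        if k == 0:
            recorded_frames.append(action)  # only the original-fps tick is recorded

assert len(recorded_frames) == len(policy_actions) - 1  # frames stay at the base fps
```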

* refactor: merge HIL sync and RTC scripts into single file with --rtc.enabled toggle

Combines hil_data_collection.py and hil_data_collection_rtc.py into one
script. RTC is toggled via --rtc.enabled=true (defaults to off for sync
inference). Deletes the separate hil_data_collection_rtc.py and updates
docs to reflect the single-script usage.

* test: add ActionInterpolator test suite (29 tests)

Covers constructor validation, passthrough (multiplier=1), 2x and 3x
interpolation with exact value checks, reset/episode boundaries,
control interval calculation, multi-dim actions, and simulated
control loop integration.

* test: add ActionQueue + ActionInterpolator integration tests

Verifies the interpolator doesn't interfere with RTC's leftover chunk
tracking: queue consumption rate matches base fps regardless of
multiplier, get_left_over/get_processed_left_over only change on
queue.get(), merge preserves smooth interpolation across chunks,
and interpolator reset is independent of queue state.

* feat: register SO follower/leader configs in HIL script

Adds SOFollowerRobotConfig and SOLeaderTeleopConfig imports so
SO100/SO101 robots can be used via --robot.type=so_follower
and --teleop.type=so_leader. Updates docs accordingly.

Made-with: Cursor

* docs: remove em dashes from HIL documentation

Made-with: Cursor

* refactor: rename examples/rac to examples/hil

Updates directory name and all references in docs and script docstrings.

Made-with: Cursor

* fix: incorporate pr feedback comments

* refactor(tests): enhance ActionInterpolator test structure and add detailed docstrings

* feedback pr and test fix

* fix(test): pass correct real_delay in interpolator delay test

The test was passing real_delay=0 and relying on _check_delays to
silently override it with the index-based diff. Now passes real_delay=3
to match the 3 actions consumed during the simulated inference period.


* fix pr feedback

* ordering

* update hil script

* fix

* default name

* fix(bi_openarm): use kw_only=True to fix dataclass field ordering

BiOpenArmFollowerConfig overrides `id` with a default, making it
positional in the child — non-default `left_arm_config` then follows a
default field, which Python dataclasses forbid. Adding kw_only=True
(matching the parent RobotConfig) removes positional constraints.
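A small illustration of the rule being fixed, using simplified stand-in configs (not the real lerobot classes):

```python
from dataclasses import dataclass


@dataclass(kw_only=True)
class RobotConfig:
    id: str | None = None  # keyword-only in the parent


# If the child were a plain @dataclass, re-declaring `id` with a default would make
# it positional again, and the following non-default `left_arm_config` would raise:
#   TypeError: non-default argument 'left_arm_config' follows default argument
# Adding kw_only=True (matching the parent) removes the positional constraint.
@dataclass(kw_only=True)
class BiFollowerConfig(RobotConfig):
    id: str = "bi_follower"
    left_arm_config: str  # non-default field, fine once keyword-only
    right_arm_config: str


cfg = BiFollowerConfig(left_arm_config="can1", right_arm_config="can0")
```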

Made-with: Cursor

* style: format long line in hil_data_collection.py

Made-with: Cursor

* pr feedback

---------

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Pepijn authored on 2026-04-02 19:53:59 +02:00, committed by GitHub
parent 66fef25ded
commit 818892a38b
13 changed files with 2605 additions and 61 deletions
+2
@@ -17,6 +17,8 @@
title: Train RL in Simulation
- local: multi_gpu_training
title: Multi GPU training
- local: hil_data_collection
title: Human In the Loop Data Collection
- local: peft_training
title: Training with PEFT (e.g., LoRA)
- local: rename_map
+269
@@ -0,0 +1,269 @@
# Human-In-the-Loop Data Collection
Human-In-the-Loop (HIL) data collection lets you improve a trained policy by deploying it on a real robot while a human operator monitors and intervenes when needed. The intervention data (recovery movements and corrections) is recorded alongside autonomous segments, producing a richer training dataset that teaches the policy how to handle failures.
---
## Why Human-In-the-Loop?
Standard behavioral cloning trains policies on successful demonstrations only. During deployment, small errors can compound and push the robot into states never seen during training (distribution shift). HIL data collection addresses this by:
- Running the trained policy on the real robot
- Having a human intervene when the robot is about to fail
- Recording the human's recovery and correction as training data
- Fine-tuning the policy on the combined dataset
This produces a policy that not only knows how to perform the task, but also how to recover when things go wrong.
---
## How It Works
During a HIL session, the human operator follows this loop within each episode:
1. **Watch** the policy run autonomously
2. **Pause** when failure is imminent; the robot holds its position
3. **Take control** and teleoperate the robot back to a good state (recovery), then correct the behavior
4. **Return control to the policy**; the policy resumes autonomous execution
5. Repeat steps 2-4 as many times as needed during the episode
6. **End the episode** when the task is complete, save and move on to the next rollout
Both autonomous and human-controlled segments are recorded. The policy and human can alternate control multiple times within a single episode, and the episode continues from the current state after each handoff (no reset required just because intervention happened). This captures autonomous execution, recovery, and correction in one continuous trajectory. After collection, the combined dataset (original demonstrations + HIL data) is used to fine-tune the policy.
This process can be repeated iteratively: deploy, collect, fine-tune, repeat. Each round targets the current policy's failure modes.
```
┌─────────────────────────────────────────────────────────────────────────┐
│ Policy v0 (trained on demos) │
│ ↓ │
│ HIL Collection (target current failure modes) → Fine-tune → Policy v1 │
│ ↓ │
│ HIL Collection (target new failure modes) → Fine-tune → Policy v2 │
│ ↓ │
│ ... (repeat until satisfactory performance) │
└─────────────────────────────────────────────────────────────────────────┘
```
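The sketch below (plain Python with stand-in values, not the actual collection script) shows how a single episode ends up containing both autonomous and intervention frames, while paused steps record nothing:

```python
# A scripted sequence of control states stands in for operator input
# (keyboard/pedal) so the recording logic is easy to see.
episode = (
    ["policy"] * 50    # 1: watch the policy run autonomously
    + ["paused"] * 10  # 2: pause, robot holds position, nothing recorded
    + ["human"] * 30   # 3: takeover, recovery + correction, recorded
    + ["policy"] * 60  # 4: hand control back, episode continues
)

frames = []
for step, control in enumerate(episode):
    if control == "paused":
        continue  # no frames written during a pause
    action = [0.0] * 6  # placeholder for the teleop or policy action
    frames.append({"step": step, "action": action, "is_intervention": control == "human"})

print(f"{len(frames)} frames recorded, "
      f"{sum(f['is_intervention'] for f in frames)} from human intervention")
```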
---
## Hardware Requirements
### Teleoperator Requirements
The HIL scripts in `examples/hil` require **teleoperators with active motors** that can:
- Enable/disable torque programmatically
- Move to target positions (to mirror the robot state when pausing)
**Compatible teleoperators in the current `examples/hil` scripts:**
- `openarm_mini` - OpenArm Mini
- `so_leader` - SO100 / SO101 leader arm
> [!IMPORTANT]
> The provided `examples/hil` commands default to `bi_openarm_follower` + `openarm_mini`.
> `so_follower` + `so_leader` configs are also registered and can be used via CLI flags.
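As a rough sketch (method names below are illustrative assumptions, not the exact lerobot teleoperator API), the capability set the HIL flow relies on looks like this:

```python
from typing import Protocol


class ActiveTeleoperator(Protocol):
    """Capabilities the HIL scripts need from a teleoperator; names are illustrative."""

    def enable_torque(self) -> None:
        """Drive/hold the leader arm, e.g. while mirroring the robot during a pause."""
        ...

    def disable_torque(self) -> None:
        """Free the leader arm so the human can teleoperate during an intervention."""
        ...

    def move_to(self, joint_positions: list[float]) -> None:
        """Move the leader arm to the robot's current pose before handing over control."""
        ...
```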
---
## Script
A single script handles both synchronous and RTC-based inference. Toggle RTC with `--rtc.enabled=true`:
| Mode | Flag | Models |
| ------------------------ | -------------------- | --------------------- |
| Standard (default) | _(no flag needed)_ | ACT, Diffusion Policy |
| Real-Time Chunking (RTC) | `--rtc.enabled=true` | Pi0, Pi0.5, SmolVLA |
---
## Step-by-Step Guide
### Step 1: Pre-train a Base Policy
First, train a policy on your demonstration dataset:
```bash
python src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your-username/demo-dataset \
--policy.type=pi0 \
--output_dir=outputs/pretrain \
--batch_size=32 \
--steps=50000
```
### Step 2: Collect HIL Data
**Standard inference (ACT, Diffusion Policy):**
```bash
python examples/hil/hil_data_collection.py \
--robot.type=bi_openarm_follower \
--robot.left_arm_config.port=can1 \
--robot.left_arm_config.side=left \
--robot.right_arm_config.port=can0 \
--robot.right_arm_config.side=right \
--robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
--teleop.type=openarm_mini \
--teleop.port_left=/dev/ttyACM0 \
--teleop.port_right=/dev/ttyACM1 \
--policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
--dataset.repo_id=your-username/hil-dataset \
--dataset.single_task="Fold the T-shirt properly" \
--dataset.fps=30 \
--dataset.episode_time_s=1000 \
--dataset.num_episodes=50 \
--interpolation_multiplier=2
```
**With RTC for large models (Pi0, Pi0.5, SmolVLA):**
For models with high inference latency, enable RTC for smooth execution:
```bash
python examples/hil/hil_data_collection.py \
--rtc.enabled=true \
--rtc.execution_horizon=20 \
--rtc.max_guidance_weight=5.0 \
--rtc.prefix_attention_schedule=LINEAR \
--robot.type=bi_openarm_follower \
--robot.left_arm_config.port=can1 \
--robot.left_arm_config.side=left \
--robot.right_arm_config.port=can0 \
--robot.right_arm_config.side=right \
--robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
--teleop.type=openarm_mini \
--teleop.port_left=/dev/ttyACM0 \
--teleop.port_right=/dev/ttyACM1 \
--policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
--dataset.repo_id=your-username/hil-rtc-dataset \
--dataset.single_task="Fold the T-shirt properly" \
--dataset.fps=30 \
--dataset.episode_time_s=1000 \
--dataset.num_episodes=50 \
--interpolation_multiplier=3
```
**Controls (Conceptual):**
The interaction model is:
- **Pause input**: pause autonomous policy execution
- **Takeover input**: transfer control to the human operator and record intervention data
- **Return-to-policy input**: hand control back to the policy and continue the same episode
- **Episode control inputs**: save/re-record/stop/reset as needed
Exact key/pedal bindings can differ across scripts and hardware integrations. Use each script's printed controls as the source of truth for the concrete mapping on your setup.
**The HIL Protocol:**
1. Watch the policy run autonomously (teleop is idle/free)
2. When you see imminent failure, trigger the **pause input**
- Policy stops
- Teleoperator moves to match robot position (torque enabled)
- No frames recorded during pause
3. Trigger the **takeover input** to take control
- Teleoperator torque disabled, free to move
- **Recovery**: Teleoperate the robot back to a good state
- **Correction**: Correct the behavior
- All movements are recorded
4. Trigger the **return-to-policy input**
- Policy resumes autonomous execution from the current state
- You can intervene again at any time (repeat steps 2-4)
5. End and save the episode when the task is complete (or episode time limit is reached)
6. **Reset**: the teleoperator moves to the robot's position; you can then move the robot back to the starting position
7. Start the next episode
**Foot Pedal Setup (Linux):**
If using a USB foot pedal (PCsensor FootSwitch), ensure access:
```bash
sudo setfacl -m u:$USER:rw /dev/input/by-id/usb-PCsensor_FootSwitch-event-kbd
```
### Step 3: Fine-tune the Policy
Fine-tune on the **combined** dataset (`demo-dataset` + `hil-dataset` merged together); the example below passes the HIL dataset's repo id, so point `--dataset.repo_id` at whichever repo contains your merged data:
```bash
python src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your-username/hil-dataset \
--policy.type=pi0 \
--policy.pretrained_path=outputs/pretrain/checkpoints/last/pretrained_model \
--output_dir=outputs/hil_finetune \
--steps=20000
```
Then deploy the fine-tuned policy and repeat from Step 2 to target its remaining failure modes.
---
## Tips for Effective HIL Collection
### When to Intervene
Intervene when you see:
- Robot about to make an irreversible mistake
- Robot hesitating or showing uncertain behavior
- Robot deviating from the expected trajectory
### Recovery: Teleoperating Back to a Good State
During recovery, teleoperate the robot back to a state where:
- The robot is in a familiar, in-distribution configuration
- The current subtask can still be completed
- The recovery trajectory itself is informative training data
### Quality of Corrections
During correction:
- Provide **confident, clean** trajectories
- Complete the current subtask fully
- Don't overcorrect or add unnecessary movements
---
## Related Work
This HIL data collection approach builds on ideas from interactive imitation learning:
- **DAgger** (Ross et al., 2011) introduced the core idea: instead of only training on expert demonstrations, query the expert for corrections on states the _learner_ visits. This breaks the compounding-error cycle of standard behavioral cloning by iteratively collecting on-policy data.
- **HG-DAgger** (Kelly et al., 2019) made this practical for robotics: a human expert monitors the robot and only intervenes when needed, rather than labeling every state. The gating between autonomous and human control is exactly the pause → takeover → return-to-policy loop used in the scripts here.
- **RaC** (Hu et al., 2025) scales this loop to long-horizon tasks by explicitly decomposing interventions into **recovery** (teleoperating back to a good state) and **correction** (demonstrating the right behavior from there). This decomposition is the protocol followed by the HIL scripts in `examples/hil`.
- **π0.6/RECAP** (Physical Intelligence, 2025) applies the same iterative collect-and-finetune loop at scale with VLA models, showing that even large pretrained policies benefit substantially from targeted human corrections on their own failure modes. π0.6 is trained using RECAP.
```bibtex
@inproceedings{ross2011dagger,
  title={A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning},
  author={Ross, Stéphane and Gordon, Geoffrey and Bagnell, Drew},
  booktitle={Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
  year={2011}
}
@article{kelly2019hgdagger,
  title={HG-DAgger: Interactive Imitation Learning with Human Experts},
  author={Kelly, Michael and Sidrane, Chelsea and Driggs-Campbell, Katherine and Kochenderfer, Mykel J},
  journal={arXiv preprint arXiv:1810.02890},
  year={2019}
}
@article{hu2025rac,
  title={RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction},
  author={Hu, Zheyuan and Wu, Robyn and Enock, Naveen and Li, Jasmine and Kadakia, Riya and Erickson, Zackory and Kumar, Aviral},
  journal={arXiv preprint arXiv:2509.07953},
  year={2025}
}
@misc{pi2025recap,
  title={π0.6: a VLA That Learns From Experience},
  author={{Physical Intelligence}},
  year={2025}
}
```