Mirror of https://github.com/huggingface/lerobot.git, synced 2026-05-13 23:59:43 +00:00 (commit ca87ccd941).
# Human-In-the-Loop Data Collection

Human-In-the-Loop (HIL) data collection lets you improve a trained policy by deploying it on a real robot while a human operator monitors and intervenes when needed. The intervention data (recovery movements and corrections) is recorded alongside autonomous segments, producing a richer training dataset that teaches the policy how to handle failures.

---

## Why Human-In-the-Loop?

Standard behavioral cloning trains policies on successful demonstrations only. During deployment, small errors can compound and push the robot into states never seen during training (distribution shift). HIL data collection addresses this by:

- Running the trained policy on the real robot
- Having a human intervene when the robot is about to fail
- Recording the human's recovery and correction as training data
- Fine-tuning the policy on the combined dataset


This produces a policy that not only knows how to perform the task, but also how to recover when things go wrong.

---

## How It Works

During a HIL session, the human operator follows this loop within each episode:

1. **Watch** the policy run autonomously
2. **Pause** when failure is imminent; the robot holds its position
3. **Take control** and teleoperate the robot back to a good state (recovery), then correct the behavior
4. **Return control to the policy**; the policy resumes autonomous execution
5. Repeat steps 2–4 as many times as needed during the episode
6. **End the episode** when the task is complete, then save and move on to the next rollout


Both autonomous and human-controlled segments are recorded. The policy and human can alternate control multiple times within a single episode, and the episode continues from the current state after each handoff (no reset required just because an intervention happened). This captures autonomous execution, recovery, and correction in one continuous trajectory. After collection, the combined dataset (original demonstrations + HIL data) is used to fine-tune the policy.

This process can be repeated iteratively: deploy, collect, fine-tune, repeat. Each round targets the current policy's failure modes.

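
The alternating-control loop above can be sketched as a tiny state machine. This is an illustrative toy, not the lerobot API; every name in it is invented for the sketch:

```python
from enum import Enum, auto


class Mode(Enum):
    """Who controls the robot (hypothetical names, not lerobot's)."""
    POLICY = auto()   # policy acts; frames are recorded
    PAUSED = auto()   # robot holds position; nothing is recorded
    HUMAN = auto()    # operator teleoperates; frames are recorded


# Allowed handoffs: pause -> takeover -> return-to-policy, any number of times.
TRANSITIONS = {
    (Mode.POLICY, "pause"): Mode.PAUSED,
    (Mode.PAUSED, "takeover"): Mode.HUMAN,
    (Mode.HUMAN, "return"): Mode.POLICY,
}


def run_episode(events):
    """Replay operator events; return which mode produced each recorded frame."""
    mode = Mode.POLICY
    recorded = []
    for event in events:
        if event == "step":
            # One control step: record a frame unless we are paused.
            if mode is not Mode.PAUSED:
                recorded.append(mode.name)
        else:
            mode = TRANSITIONS[(mode, event)]
    return recorded
```

For example, `run_episode(["step", "pause", "step", "takeover", "step", "return", "step"])` yields one policy frame, no frame during the pause, one human frame, then a policy frame again — the same trajectory structure the recorder captures.
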

```
Policy v0 (trained on demos)
   ↓
HIL Collection (target current failure modes) → Fine-tune → Policy v1
   ↓
HIL Collection (target new failure modes) → Fine-tune → Policy v2
   ↓
... (repeat until satisfactory performance)
```

---

## Hardware Requirements

### Teleoperator Requirements

The `lerobot-rollout --strategy.type=dagger` mode requires **teleoperators with active motors** that can:

- Enable/disable torque programmatically
- Move to target positions (to mirror the robot state when pausing)

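
The two capabilities above can be captured in a small interface sketch. The names here are hypothetical, not lerobot's teleoperator classes; the point is only the shape of the contract dagger mode relies on:

```python
from typing import Protocol


class ActiveTeleoperator(Protocol):
    """What dagger-style collection needs from a teleop (illustrative only)."""

    def set_torque(self, enabled: bool) -> None:
        """Power the motors to hold/mirror a pose, or release them for free movement."""
        ...

    def move_to(self, joint_positions: list[float]) -> None:
        """Drive the teleop arm to the follower's current joint positions."""
        ...


class FakeTeleop:
    """Minimal in-memory stand-in used only to exercise the interface."""

    def __init__(self) -> None:
        self.torque_enabled = False
        self.positions: list[float] = []

    def set_torque(self, enabled: bool) -> None:
        self.torque_enabled = enabled

    def move_to(self, joint_positions: list[float]) -> None:
        self.positions = list(joint_positions)


def mirror_on_pause(teleop: ActiveTeleoperator, robot_positions: list[float]) -> None:
    # On pause: enable torque, then move the teleop to match the robot.
    teleop.set_torque(True)
    teleop.move_to(robot_positions)
```

A passive leader arm without controllable torque cannot implement this contract, which is why only active-motor teleoperators are listed below.
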

**Compatible teleoperators:**

- `openarm_mini` - OpenArm Mini
- `so_leader` - SO100 / SO101 leader arm

> [!IMPORTANT]
> The provided commands default to `bi_openarm_follower` + `openarm_mini`.
> `so_follower` + `so_leader` configs are also registered and can be used via CLI flags.


---

## Script

Use `lerobot-rollout` with `--strategy.type=dagger` for HIL data collection. Select the inference backend with `--inference.type=sync|rtc`:

| Mode                     | Flag                   | Models                |
| ------------------------ | ---------------------- | --------------------- |
| Standard (default)       | _(no flag needed)_     | ACT, Diffusion Policy |
| Real-Time Chunking (RTC) | `--inference.type=rtc` | Pi0, Pi0.5, SmolVLA   |

---

## Step-by-Step Guide

### Step 1: Pre-train a Base Policy

First, train a policy on your demonstration dataset:

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/demo-dataset \
  --policy.type=pi0 \
  --output_dir=outputs/pretrain \
  --batch_size=32 \
  --steps=50000
```

### Step 2: Collect HIL Data

**Standard inference (ACT, Diffusion Policy):**

```bash
lerobot-rollout --strategy.type=dagger \
  --robot.type=bi_openarm_follower \
  --robot.left_arm_config.port=can1 \
  --robot.left_arm_config.side=left \
  --robot.right_arm_config.port=can0 \
  --robot.right_arm_config.side=right \
  --robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
  --teleop.type=openarm_mini \
  --teleop.port_left=/dev/ttyACM0 \
  --teleop.port_right=/dev/ttyACM1 \
  --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
  --dataset.repo_id=your-username/rollout_hil_dataset \
  --dataset.single_task="Fold the T-shirt properly" \
  --dataset.fps=30 \
  --strategy.num_episodes=50 \
  --interpolation_multiplier=2
```
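
To illustrate what an interpolation multiplier does, the sketch below linearly upsamples an action sequence by an integer factor, emitting intermediate commands between consecutive policy outputs. Whether `lerobot-rollout` uses exactly linear interpolation is an assumption to verify against the script; the sketch only shows the idea:

```python
def interpolate_actions(actions, multiplier):
    """Insert (multiplier - 1) linearly interpolated commands between
    consecutive actions, so commands go out at `multiplier` x the policy rate.

    actions: list of equal-length lists, one action vector per policy step.
    """
    if multiplier < 1:
        raise ValueError("multiplier must be >= 1")
    out = []
    for prev, nxt in zip(actions, actions[1:]):
        for k in range(multiplier):
            t = k / multiplier
            # Linear blend between the two policy-rate actions.
            out.append([p + t * (n - p) for p, n in zip(prev, nxt)])
    out.append(list(actions[-1]))  # keep the final action
    return out
```

For example, upsampling the one-joint sequence `[[0.0], [2.0]]` with a multiplier of 2 yields `[[0.0], [1.0], [2.0]]`: the robot receives a smoother command stream without the policy running any faster.
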

**With RTC for large models (Pi0, Pi0.5, SmolVLA):**

For models with high inference latency, enable RTC for smooth execution:

```bash
lerobot-rollout --strategy.type=dagger \
  --inference.type=rtc \
  --inference.rtc.execution_horizon=20 \
  --inference.rtc.max_guidance_weight=5.0 \
  --inference.rtc.prefix_attention_schedule=LINEAR \
  --robot.type=bi_openarm_follower \
  --robot.left_arm_config.port=can1 \
  --robot.left_arm_config.side=left \
  --robot.right_arm_config.port=can0 \
  --robot.right_arm_config.side=right \
  --robot.cameras='{left_wrist: {type: opencv, index_or_path: "/dev/video0", width: 1280, height: 720, fps: 30}, right_wrist: {type: opencv, index_or_path: "/dev/video4", width: 1280, height: 720, fps: 30}, base: {type: opencv, index_or_path: "/dev/video2", width: 640, height: 480, fps: 30}}' \
  --teleop.type=openarm_mini \
  --teleop.port_left=/dev/ttyACM0 \
  --teleop.port_right=/dev/ttyACM1 \
  --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
  --dataset.repo_id=your-username/rollout_hil_rtc_dataset \
  --dataset.single_task="Fold the T-shirt properly" \
  --dataset.fps=30 \
  --strategy.num_episodes=50 \
  --interpolation_multiplier=3
```

**Controls (Conceptual):**

The interaction model is:

- **Pause input**: pause autonomous policy execution
- **Takeover input**: transfer control to the human operator and record intervention data
- **Return-to-policy input**: hand control back to the policy and continue the same episode
- **Episode control inputs**: save/re-record/stop/reset as needed

Exact key/pedal bindings can differ across scripts and hardware integrations. Use each script's printed controls as the source of truth for the concrete mapping on your setup.


**The HIL Protocol:**

1. Watch the policy run autonomously (teleop is idle/free)
2. When you see imminent failure, trigger the **pause input**
   - Policy stops
   - Teleoperator moves to match the robot position (torque enabled)
   - No frames are recorded during the pause
3. Trigger the **takeover input** to take control
   - Teleoperator torque disabled, free to move
   - **Recovery**: teleoperate the robot back to a good state
   - **Correction**: demonstrate the correct behavior
   - All movements are recorded
4. Trigger the **return-to-policy input**
   - Policy resumes autonomous execution from the current state
   - You can intervene again at any time (repeat steps 2–4)
5. End and save the episode when the task is complete (or the episode time limit is reached)
6. **Reset**: the teleop moves to the robot position; move the robot back to the starting position
7. Start the next episode


**Foot Pedal Setup (Linux):**

If you use a USB foot pedal (PCsensor FootSwitch), make sure your user can access its input device:

```bash
sudo setfacl -m u:$USER:rw /dev/input/by-id/usb-PCsensor_FootSwitch-event-kbd
```
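
To verify that the pedal is actually delivering events, Linux evdev events can be decoded with only the standard library. The `llHHI` layout below matches `struct input_event` on 64-bit Linux (two longs of timestamp, then type, code, value), and `EV_KEY` is event type 1. This is a diagnostic sketch, not part of lerobot:

```python
import struct

# struct input_event on 64-bit Linux: time (2 longs), type, code, value.
EVENT_FORMAT = "llHHI"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)
EV_KEY = 0x01  # key/button events


def decode_event(raw: bytes):
    """Decode one input_event; return (code, value) for key events, else None."""
    _sec, _usec, etype, code, value = struct.unpack(EVENT_FORMAT, raw)
    if etype == EV_KEY:
        return code, value  # value: 1 = press, 0 = release, 2 = autorepeat
    return None


def watch(device="/dev/input/by-id/usb-PCsensor_FootSwitch-event-kbd"):
    """Print pedal key presses until interrupted (requires the ACL above)."""
    with open(device, "rb") as f:
        while True:
            event = decode_event(f.read(EVENT_SIZE))
            if event and event[1] == 1:
                print(f"pedal key code {event[0]} pressed")
```

On 32-bit systems the timestamp fields are narrower, so the format string would differ; check `struct input_event` in `linux/input.h` for your platform.
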

### Step 3: Fine-tune the Policy

Fine-tune on the **combined** dataset (`demo-dataset` + `hil-dataset` merged together):

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/hil-dataset \
  --policy.type=pi0 \
  --policy.pretrained_path=outputs/pretrain/checkpoints/last/pretrained_model \
  --output_dir=outputs/hil_finetune \
  --steps=20000
```

Then deploy the fine-tuned policy and repeat from Step 2 to target its remaining failure modes.

---

## Tips for Effective HIL Collection

### When to Intervene

Intervene when you see:

- The robot about to make an irreversible mistake
- The robot hesitating or showing uncertain behavior
- The robot deviating from the expected trajectory

### Recovery: Teleoperating Back to a Good State

During recovery, teleoperate the robot back to a state where:

- The robot is in a familiar, in-distribution configuration
- The current subtask can still be completed
- The recovery trajectory itself is informative training data

### Quality of Corrections

During correction:

- Provide **confident, clean** trajectories
- Complete the current subtask fully
- Don't overcorrect or add unnecessary movements

---

## Related Work

This HIL data collection approach builds on ideas from interactive imitation learning:

- **DAgger** (Ross et al., 2011) introduced the core idea: instead of only training on expert demonstrations, query the expert for corrections on states the _learner_ visits. This breaks the compounding-error cycle of standard behavioral cloning by iteratively collecting on-policy data.

- **HG-DAgger** (Kelly et al., 2019) made this practical for robotics: a human expert monitors the robot and only intervenes when needed, rather than labeling every state. The gating between autonomous and human control is exactly the pause → takeover → return-to-policy loop used in the scripts here.

- **RaC** (Hu et al., 2025) scales this loop to long-horizon tasks by explicitly decomposing interventions into **recovery** (teleoperating back to a good state) and **correction** (demonstrating the right behavior from there). This decomposition is the protocol followed by the DAgger strategy in `lerobot-rollout`.

- **π0.6/RECAP** (Physical Intelligence, 2025) applies the same iterative collect-and-finetune loop at scale with VLA models, showing that even large pretrained policies benefit substantially from targeted human corrections on their own failure modes. π0.6 is trained using RECAP.

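
The core DAgger loop described above fits in a few lines. The sketch below is a schematic with toy stand-ins (a 1-D integer state, a table-lookup "policy", and a scripted "expert" that pushes the state toward zero), not the `lerobot-rollout` implementation:

```python
def dagger(expert, rollout, train, rounds=3):
    """DAgger skeleton: roll out the current policy, label the states it
    visits with the expert, aggregate, and retrain each round."""
    dataset = []           # aggregated (state, expert_action) pairs
    policy = train(dataset)
    for _ in range(rounds):
        states = rollout(policy)                      # states the *learner* visits
        dataset += [(s, expert(s)) for s in states]   # expert corrections
        policy = train(dataset)                       # retrain on the aggregate
    return policy, dataset


# Toy instantiation: the expert always steps toward zero.
def expert(s):
    return -1 if s > 0 else 1


def rollout(policy, start=3, steps=4):
    """Roll the current policy from `start`; return the visited states."""
    states, s = [], start
    for _ in range(steps):
        states.append(s)
        s += policy(s)
    return states


def train(dataset):
    """'Train' a lookup policy: imitate the expert on seen states, else drift +1."""
    table = dict(dataset)
    return lambda s: table.get(s, 1)
```

The drift (`+1` on unseen states) plays the role of compounding error: the untrained policy wanders away from zero, the expert labels exactly those wandered-into states, and after a few rounds the aggregated dataset covers the learner's own visitation distribution.
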

```bibtex
@inproceedings{ross2011dagger,
  title={A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning},
  author={Ross, Stéphane and Gordon, Geoffrey and Bagnell, Drew},
  booktitle={Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
  year={2011}
}

@article{kelly2019hgdagger,
  title={HG-DAgger: Interactive Imitation Learning with Human Experts},
  author={Kelly, Michael and Sidrane, Chelsea and Driggs-Campbell, Katherine and Kochenderfer, Mykel J},
  journal={arXiv preprint arXiv:1810.02890},
  year={2019}
}

@article{hu2025rac,
  title={RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction},
  author={Hu, Zheyuan and Wu, Robyn and Enock, Naveen and Li, Jasmine and Kadakia, Riya and Erickson, Zackory and Kumar, Aviral},
  journal={arXiv preprint arXiv:2509.07953},
  year={2025}
}

@misc{pi2025recap,
  title={π0.6: a VLA That Learns From Experience},
  author={{Physical Intelligence}},
  year={2025}
}
```