mirror of
https://github.com/huggingface/lerobot.git
synced 2026-07-01 15:17:05 +00:00
5ac3b49a5f
* feat(train): add JobConfig group, save_checkpoint_to_hub flag, Hub checkpoint helper
Introduce a JobConfig draccus group on TrainPipelineConfig (--job.target/image/
timeout/detach/tags) whose is_remote property gates remote dispatch, plus a
save_checkpoint_to_hub flag and validation. Add push_checkpoint_to_hub(), which
uploads a saved checkpoint directory to the model repo under checkpoints/<step>/
and creates the repo idempotently (private propagates from policy.private).
* feat(train): run training remotely on HF Jobs via --job.target
When --job.target names a GPU flavor, train() dispatches to lerobot.jobs.submit_to_hf
instead of training locally: it authenticates, ensures the dataset is on the Hub
(pushing a local-only one privately), serializes a pod-compatible train_config.json
(strips client-only fields, points at the model repo), submits via HfApi.run_job
with HF_TOKEN/WANDB_API_KEY secrets, then streams logs and finishes when the model
is pushed. Wires push_checkpoint_to_hub into the training loop behind
save_checkpoint_to_hub, and tags jobs/datasets/model with 'lerobot' + --job.tags.
* docs(train): document remote training on HF Jobs
* test(train): skip remote-dispatch tests without the dataset extra
The module imports lerobot.scripts.lerobot_train, which eagerly pulls in
lerobot.datasets (dataset extra). The base fast-test CI tier runs without
that extra, so collection failed there. Guard with pytest.importorskip,
matching the existing tests/scripts dataset-extra tests.
* refactor(jobs): hoist huggingface_hub imports to module level in hf.py
huggingface_hub is a core dependency, so the per-function dynamic imports
had no lazy-loading rationale. Move them to a single module-level import
and update test monkeypatch targets to lerobot.jobs.hf.* accordingly.
* refactor(jobs): build remote config dict via cfg.to_dict()
TrainPipelineConfig.to_dict() already returns the canonical draccus
encoding, so the StringIO + draccus.dump + json.loads round-trip was
redundant. Use it directly and drop the now-unused io/draccus imports.
* refactor(train): use module-level HfApi import in push_checkpoint_to_hub
huggingface_hub is a core dependency; the in-function import was
unnecessary. Move HfApi to a module-level import and point the test
monkeypatches at lerobot.common.train_utils.HfApi.
* refactor(configs): export JobConfig from the configs package
Re-export JobConfig in lerobot/configs/__init__.py so external callers
import it as `from lerobot.configs import JobConfig`, matching the other
config classes. Adapt the train script and test imports.
* refactor(jobs): check dataset presence with api.repo_exists
Replace the dataset_info try/except RepositoryNotFoundError dance with a
direct api.repo_exists(repo_id, repo_type="dataset") call, dropping the
httpx/RepositoryNotFoundError test scaffolding.
* chore(jobs): annotate ensure_dataset_available api param as HfApi
Add the missing HfApi type hint via a TYPE_CHECKING import.
* refactor(jobs): use HF_LEROBOT_HOME constant for the local cache root
Resolve the local dataset cache via lerobot.utils.constants.HF_LEROBOT_HOME
instead of re-reading the env var by hand, dropping the os/Path imports.
Tests now patch the imported constant and assert on a stable message
substring (the previous "neither" match only passed by accident, matching
the test name embedded in the pytest tmp_path).
* chore(jobs): guard LeRobotDataset import with require_package
Surface a clear "install lerobot[dataset]" error if the datasets extra
is missing, instead of a raw ImportError, before pushing a local dataset.
* docs(configs): clarify the is_remote_target/is_remote split
Add a comment explaining why JobConfig keeps both the staticmethod (tests
a raw target string from argv before a config exists) and the property
(accessor for an existing config instance).
* docs(train): note how to pin a pushed model version for inference
Document --policy.pretrained_revision alongside --policy.path so a
specific Hub-pushed checkpoint (once --save_checkpoint_to_hub has
committed several) can be selected for inference.
* test(jobs): skip dataset import guard in base-deps test
The fast test env installs base deps only, so require_package('datasets')
raised ImportError before the mocked lerobot.datasets import was reached.
Monkeypatch the guard to a no-op so the unit test exercises the upload logic.
* fix(jobs): address claude review findings on remote training
Resolve the claude[bot] review on #3856:
- Reject reward-model training under --job.target with a clear error instead
of crashing on a None policy inside build_remote_config_file.
- Support --policy.path remote runs: validate() no longer requires repo_id for
remote runs (it is auto-generated in submit_to_hf), and repo_id/push_to_hub
are now set after validate() resolves the policy.
- Narrow the bare `except Exception` in _tail_logs/_poll_until_done to
(OSError, httpx.HTTPError) so programming errors surface instead of being
silently retried or counted as job failures.
- Install the SIGINT detach handler only on the main thread.
- Generate model repo timestamps in UTC.
* docs(jobs): document the model-pushed marker contract and orphaned repos
Follow-up to the claude[bot] review on #3856 (non-blocking observations):
- Cross-reference the "Model pushed to <url>" log line between its producer
(PreTrainedPolicy.push_model_to_hub) and the remote-run consumer in
submit_to_hf, noting the contract is an early-finish optimization that
falls back to status polling if it drifts.
- Note in the HF Jobs guide that a failed remote run leaves its model repo
on the Hub (it is not auto-deleted) and how to remove it.
* feat(train): tag each pushed checkpoint with its step
Address review feedback on #3856: pushing a checkpoint to the Hub now
also creates a tag named after the checkpoint step, so a checkpoint can
be recovered with --policy.pretrained_revision=<step> instead of having
to look up its commit sha.
* fix(jobs): hoist ensure_dataset_available to a module-level import
Addresses Caroline's review comment on PR #3856: the local import of
ensure_dataset_available inside submit_to_hf was vestigial. dataset.py
does not import hf.py, so there is no circular-import risk and no extra
load cost (its heavy deps stay lazy), so make it a top-level import.
* refactor(configs): untangle config_path/resume resolution in validate()
Split the re-parse HACK block in TrainPipelineConfig.validate() into focused
helpers (_resolve_pretrained_from_cli, _resolve_resume_checkpoint) that handle
the policy path, reward-model path, and resume config_path as separate,
readable units. Behavior-preserving.
* feat(train): resume training from a Hub checkpoint
Allow --config_path to be a Hub repo id when resuming, not only a local path.
The latest checkpoint under checkpoints/<step>/ is downloaded into a fresh local
run dir and resumed from there (optimizer, scheduler, RNG and data order
restored as for a local resume). TrainPipelineConfig.from_pretrained falls back
to the latest checkpoint's train_config.json when a repo has no root config
(an interrupted run that only pushed checkpoints). The download is skipped when
dispatching remotely so the executor (local machine or HF Jobs pod) performs it.
- add find_latest_hub_checkpoint (utils/hub) and resolve_resume_checkpoint
(common/train_utils), the symmetric download counterpart to
push_checkpoint_to_hub
- unit tests for both helpers and the from_pretrained fallback
* feat(jobs): resume a run on HF Jobs from a checkpoint
When --resume is set with a remote --job.target, submit_to_hf resumes from the
checkpoint repo instead of staging a fresh config. A Hub config_path is resumed
in place (its checkpoint config already targets that repo); a local config_path
has its checkpoint uploaded to a new private repo first and the run is forced to
push back to it. The pod command carries --job.target=local so the checkpoint's
saved job.target can't make the pod re-dispatch itself, and the user's CLI
overrides are forwarded so a remote resume matches the same local command.
ensure_dataset_available is hoisted before the resume/fresh branch since it
applies to both.
* docs(train): document resuming from a Hub checkpoint, locally and on jobs
Show that --config_path accepts a Hub repo id for --resume, and that adding
--job.target resumes on HF Jobs (uploading a local checkpoint/dataset first).
* fix(jobs): default remote job timeout to 2d instead of the platform default
HF Jobs applies its own short 30-minute timeout when none is sent, which
silently kills long training runs. Pass an explicit, generous 2d cap by
default; users can still override --job.timeout to fail fast or extend it.
* fix(jobs): drop --dataset.root on resume + restore keyboard-control docs
Address the latest Claude review on #3856:
- _build_resume_job no longer forwards --dataset.root to the pod (a
host-local path it can't read); the fresh-run path already nulls it in
build_remote_config_file, so this makes resume consistent. Add a unit
test for _pod_forwarded_args covering the drop in both flag forms.
- Restore the display-independent keyboard-control docs (n/r/q letter
equivalents + X11/Wayland/headless Tip) in il_robots.mdx that this
branch was stale on relative to main (#3875).
* fix(jobs): handle str-typed job stage from huggingface_hub
inspect_job's status.stage is an enum (with .value) in some
huggingface_hub versions and a plain str in others. The poller
assumed the enum shape, raising "'str' object has no attribute
'value'" on resume for users on the str-returning version.
Read it via getattr(..., "value", ...) so both shapes work, and
parametrize the poll test over enum and str stages so the str case
is actually exercised (the old mock only ever simulated the enum).
* refactor(jobs): use relative import for ensure_dataset_available
* refactor(train): hoist submit_to_hf import to module top
The `from lerobot.jobs import submit_to_hf` was a function-local import in
train(); it pulls no heavy/optional deps and has no circular-import risk, so
move it to the top-level import block.
* refactor(train): hoist _remote_target_in_argv imports to module top
Move `import sys` and `from lerobot.configs import JobConfig` out of the
function body and into the top-level import block.
* refactor(utils): use relative import for sibling constants in hub.py
`from lerobot.utils.constants import CHECKPOINTS_DIR` was the odd one out in
utils/ — sibling modules there are imported relatively (.constants, .errors,
.utils, ...). Match that convention.
* refactor(jobs): hoist LeRobotDataset import, guard dataset extra at package init
Move the `from lerobot.datasets import LeRobotDataset` import to the top of
dataset.py and relocate the `require_package("datasets", extra="dataset")`
guard to the jobs package __init__, per review feedback.
* test(jobs): skip test_hf if datasets extra is missing
lerobot.configs.train pulls in datasets at import time, so the module
fails to collect without lerobot[dataset]. Guard with importorskip,
matching the convention in tests/training/test_multi_gpu.py.
* test(jobs): skip test_dataset if datasets extra is missing
tests/jobs/test_dataset.py imports lerobot.jobs.dataset, which triggers
the require_package("datasets") guard in lerobot/jobs/__init__.py at
import time. Without lerobot[dataset] the module fails to collect in the
base CI tier. Guard with importorskip, same as test_hf.py.
716 lines
30 KiB
Plaintext
# Imitation Learning on Real-World Robots

This tutorial explains how to train a neural network to control a real robot autonomously.

**You'll learn:**

1. How to record and visualize your dataset.
2. How to train a policy using your data and prepare it for evaluation.
3. How to evaluate your policy and visualize the results.

By following these steps, you'll be able to replicate tasks, such as picking up a Lego block and placing it in a bin with a high success rate, as shown in the video below.

<details>
<summary><strong>Video: pickup lego block task</strong></summary>

<div class="video-container">
<video controls width="600">
  <source
    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot_task.mp4"
    type="video/mp4"
  />
</video>
</div>

</details>

This tutorial isn’t tied to a specific robot: we walk you through the commands and API snippets you can adapt for any supported platform.

During data collection, you’ll use a “teleoperation” device, such as a leader arm or a keyboard, to teleoperate the robot and record its motion trajectories.

Once you’ve gathered enough trajectories, you’ll train a neural network to imitate these trajectories and deploy the trained model so your robot can perform the task autonomously.

If you run into any issues at any point, jump into our [Discord community](https://discord.com/invite/s3KuuzsPFb) for support.

<Tip>

Want to quickly get the right commands for your setup? The [quickstart notebook](https://github.com/huggingface/lerobot/blob/main/examples/notebooks/quickstart.ipynb) ([open in Colab](https://colab.research.google.com/github/huggingface/lerobot/blob/main/examples/notebooks/quickstart.ipynb)) lets you configure your robot once and generates all the commands below, ready to paste.

</Tip>

## Set up and Calibrate

If you haven't yet set up and calibrated your robot and teleop device, please do so by following the robot-specific tutorial.

## Teleoperate

In this example, we’ll demonstrate how to teleoperate the SO101 robot. For each command, we also provide a corresponding API example.

Note that the `id` associated with a robot is used to store the calibration file. It's important to use the same `id` when teleoperating, recording, and evaluating with the same setup.

<hfoptions id="teleoperate_so101">
<hfoption id="Command">
```bash
lerobot-teleoperate \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem58760431541 \
    --robot.id=my_awesome_follower_arm \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm
```
</hfoption>
<hfoption id="API example">

<!-- prettier-ignore-start -->
```python
from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig

robot_config = SO101FollowerConfig(
    port="/dev/tty.usbmodem5AB90687491",
    id="my_follower_arm",
)

teleop_config = SO101LeaderConfig(
    port="/dev/tty.usbmodem5AB90689011",
    id="my_leader_arm",
)

robot = SO101Follower(robot_config)
teleop_device = SO101Leader(teleop_config)
robot.connect()
teleop_device.connect()

# Forward every leader-arm action to the follower until interrupted.
while True:
    action = teleop_device.get_action()
    robot.send_action(action)
```
<!-- prettier-ignore-end -->

</hfoption>
</hfoptions>

The teleoperate command will automatically:

1. Identify any missing calibrations and initiate the calibration procedure.
2. Connect the robot and teleop device and start teleoperation.

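The API loop above runs until the process is killed, so the devices are never disconnected. The sketch below shows one way to add a clean shutdown with `try`/`finally`. `FakeArm` and `teleoperate` are hypothetical names used only for illustration; `FakeArm` merely mimics the `connect`, `disconnect`, `get_action`, and `send_action` calls that appear in this tutorial, and the `gripper.pos` key is a made-up action name.

```python
class FakeArm:
    """Hypothetical stand-in exposing the calls used in this tutorial."""

    def __init__(self):
        self.connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        self.connected = False

    def get_action(self):
        return {"gripper.pos": 0.0}  # made-up action key

    def send_action(self, action):
        return action


def teleoperate(robot, teleop, max_steps=None):
    """Mirror teleop actions onto the robot and always disconnect on exit."""
    robot.connect()
    teleop.connect()
    steps = 0
    try:
        while max_steps is None or steps < max_steps:
            robot.send_action(teleop.get_action())
            steps += 1
    except KeyboardInterrupt:
        pass  # Ctrl+C ends the session without a traceback
    finally:
        teleop.disconnect()
        robot.disconnect()
    return steps
```

With real hardware you would pass the `SO101Follower` and `SO101Leader` instances instead of `FakeArm` and leave `max_steps` unset.
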
## Cameras

To add cameras to your setup, follow this [Guide](./cameras#setup-cameras).

## Teleoperate with cameras

With `rerun`, you can teleoperate again while simultaneously visualizing the camera feeds and joint positions. In this example, we again use the SO101 arm.

<hfoptions id="teleoperate_koch_camera">
<hfoption id="Command">
```bash
lerobot-teleoperate \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem5AB90687491 \
    --robot.id=my_follower_arm \
    --robot.cameras="{front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem5AB90689011 \
    --teleop.id=my_leader_arm \
    --display_data=true
```
</hfoption>
<hfoption id="API example">

<!-- prettier-ignore-start -->
```python
import time
from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data, shutdown_rerun

robot_config = SO101FollowerConfig(
    port="/dev/tty.usbmodem5AB90687491",
    id="my_follower_arm",
    cameras={
        "wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
        "top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
    }
)

teleop_config = SO101LeaderConfig(
    port="/dev/tty.usbmodem5AB90689011",
    id="my_leader_arm",
)

init_rerun(session_name="teleoperation")

robot = SO101Follower(robot_config)
teleop_device = SO101Leader(teleop_config)
robot.connect()
teleop_device.connect()

TARGET_HZ = 30
TIME_PER_FRAME = 1.0 / TARGET_HZ

while True:
    start_time = time.perf_counter()

    observation = robot.get_observation()
    action = teleop_device.get_action()
    robot.send_action(action)
    log_rerun_data(observation=observation, action=action)

    # Sleep for the rest of the frame to hold the loop at TARGET_HZ.
    elapsed_time = time.perf_counter() - start_time
    sleep_time = TIME_PER_FRAME - elapsed_time
    if sleep_time > 0:
        time.sleep(sleep_time)
```
<!-- prettier-ignore-end -->

</hfoption>
</hfoptions>

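The frame pacing at the end of the API example applies to any fixed-rate control loop. The sketch below isolates it using only the standard library. `remaining_sleep` and `run_fixed_rate` are hypothetical helper names, not LeRobot functions, and the `step` callback stands in for the observe, act, and log body of the loop.

```python
import time
from typing import Callable


def remaining_sleep(target_hz: float, elapsed_s: float) -> float:
    """Time left in the current frame, clamped at zero when the frame overran."""
    return max(1.0 / target_hz - elapsed_s, 0.0)


def run_fixed_rate(step: Callable[[], None], target_hz: float, n_steps: int) -> int:
    """Call `step` n_steps times, sleeping after each call to hold the target rate."""
    done = 0
    for _ in range(n_steps):
        start = time.perf_counter()
        step()  # stand-in for get_observation / send_action / logging
        time.sleep(remaining_sleep(target_hz, time.perf_counter() - start))
        done += 1
    return done
```

Clamping at zero means a slow iteration simply starts the next frame immediately; the loop does not try to catch up on lost time.
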
## Record a dataset

Once you're familiar with teleoperation, you can record your first dataset.

We use the Hugging Face Hub features for uploading your dataset. If you haven't previously used the Hub, make sure you can log in via the CLI using a write-access token, which can be generated from the [Hugging Face settings](https://huggingface.co/settings/tokens).

Add your token to the CLI by running this command:

```bash
hf auth login --token ${HUGGINGFACE_TOKEN} --add-to-git-credential
```

Then store your Hugging Face username in a variable:

```bash
HF_USER=$(NO_COLOR=1 hf auth whoami | awk -F': *' 'NR==1 {print $2}')
echo $HF_USER
```

Now you can record a dataset. To record 5 episodes and upload your dataset to the Hub, adapt the code below for your robot and execute the command or API example.

<hfoptions id="record">
<hfoption id="Command">
```bash
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585A0076841 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube" \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2
# Optional extra flag, left disabled here: --dataset.rgb_encoder.vcodec=auto
```
</hfoption>
<hfoption id="API example">

<!-- prettier-ignore-start -->
```python
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.utils.feature_utils import hw_to_dataset_features
from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
from lerobot.teleoperators.so_leader.config_so_leader import SO101LeaderConfig
from lerobot.teleoperators.so_leader.so_leader import SO101Leader
from lerobot.common.control_utils import init_keyboard_listener
from lerobot.utils.utils import log_say
from lerobot.utils.visualization_utils import init_rerun
from lerobot.scripts.lerobot_record import record_loop
from lerobot.processor import make_default_processors

NUM_EPISODES = 5
FPS = 30
EPISODE_TIME_SEC = 60
RESET_TIME_SEC = 10
TASK_DESCRIPTION = "My task description"

def main():
    # Create robot configuration
    robot_config = SO101FollowerConfig(
        port="/dev/tty.usbmodem5AB90687491",
        id="my_follower_arm",
        cameras={
            "wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
            "top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
        }
    )

    teleop_config = SO101LeaderConfig(
        port="/dev/tty.usbmodem5AB90689011",
        id="my_leader_arm",
    )

    # Initialize the robot and teleoperator
    robot = SO101Follower(robot_config)
    teleop = SO101Leader(teleop_config)

    # Configure the dataset features
    action_features = hw_to_dataset_features(robot.action_features, "action")
    obs_features = hw_to_dataset_features(robot.observation_features, "observation")
    dataset_features = {**action_features, **obs_features}

    # Create the dataset
    dataset = LeRobotDataset.create(
        repo_id="<hf_username>/<dataset_repo_id>",
        fps=FPS,
        features=dataset_features,
        robot_type=robot.name,
        use_videos=True,
        image_writer_threads=4,
    )

    # Initialize the keyboard listener and rerun visualization
    _, events = init_keyboard_listener()
    init_rerun(session_name="recording")

    # Connect the robot and teleoperator
    robot.connect()
    teleop.connect()

    # Create the required processors
    teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()

    episode_idx = 0
    while episode_idx < NUM_EPISODES and not events["stop_recording"]:
        log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")

        record_loop(
            robot=robot,
            events=events,
            fps=FPS,
            teleop_action_processor=teleop_action_processor,
            robot_action_processor=robot_action_processor,
            robot_observation_processor=robot_observation_processor,
            teleop=teleop,
            dataset=dataset,
            control_time_s=EPISODE_TIME_SEC,
            single_task=TASK_DESCRIPTION,
            display_data=True,
        )

        # Reset the environment if not stopping or re-recording
        if not events["stop_recording"] and (episode_idx < NUM_EPISODES - 1 or events["rerecord_episode"]):
            log_say("Reset the environment")
            record_loop(
                robot=robot,
                events=events,
                fps=FPS,
                teleop_action_processor=teleop_action_processor,
                robot_action_processor=robot_action_processor,
                robot_observation_processor=robot_observation_processor,
                teleop=teleop,
                control_time_s=RESET_TIME_SEC,
                single_task=TASK_DESCRIPTION,
                display_data=True,
            )

        if events["rerecord_episode"]:
            log_say("Re-recording episode")
            events["rerecord_episode"] = False
            events["exit_early"] = False
            dataset.clear_episode_buffer()
            continue

        dataset.save_episode()
        episode_idx += 1

    # Finalize the dataset
    log_say("Finalizing dataset...")
    dataset.finalize()
    # Clean up
    log_say("Stop recording")
    robot.disconnect()
    teleop.disconnect()
    dataset.push_to_hub()


if __name__ == "__main__":
    main()
```
<!-- prettier-ignore-end -->

</hfoption>
</hfoptions>

#### Dataset upload

Locally, your dataset is stored in this folder: `~/.cache/huggingface/lerobot/{repo-id}`. At the end of data recording, your dataset will be uploaded to your Hugging Face page (e.g. `https://huggingface.co/datasets/${HF_USER}/so101_test`), whose URL you can print by running:

```bash
echo https://huggingface.co/datasets/${HF_USER}/so101_test
```

Your dataset will be automatically tagged with `LeRobot` so the community can find it easily, and you can also add custom tags (for example `tutorial`).

You can look for other LeRobot datasets on the Hub by searching for the `LeRobot` [tag](https://huggingface.co/datasets?other=LeRobot).

You can also push your local dataset to the Hub manually by running:

```bash
hf upload ${HF_USER}/record-test ~/.cache/huggingface/lerobot/{repo-id} --repo-type dataset
```

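If you need the local dataset folder from Python, the path above can be rebuilt with `pathlib`. This is only a sketch under two assumptions: the default cache root is `~/.cache/huggingface/lerobot` as stated above, and an `HF_LEROBOT_HOME` environment variable, if set, overrides it (verify this against your installed version). `local_dataset_dir` is a hypothetical helper, not a LeRobot function.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def local_dataset_dir(repo_id: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the assumed on-disk location of a recorded dataset."""
    env = os.environ if env is None else env
    default_root = Path.home() / ".cache" / "huggingface" / "lerobot"
    # Assumption: HF_LEROBOT_HOME, when present, replaces the default cache root.
    root = Path(env.get("HF_LEROBOT_HOME", default_root)).expanduser()
    return root / repo_id
```

For example, `local_dataset_dir("my_user/record-test")` points at the folder you would pass to `hf upload`.
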
#### Record function

The `record` function provides a suite of tools for capturing and managing data during robot operation:

##### 1. Data Storage

- Data is saved to disk in the `LeRobotDataset` format during recording.
- By default, the dataset is pushed to your Hugging Face page after recording.
  - To disable uploading, use `--dataset.push_to_hub=False`.

##### 2. Checkpointing and Resuming

- Checkpoints are automatically created during recording.
- If an issue occurs or you want to record additional episodes in the same dataset, you can resume by re-running the same command with `--resume=true`. When resuming a recording, `--dataset.num_episodes` must be set to the **number of additional episodes to be recorded**, not to the targeted total number of episodes in the dataset. Also set `--dataset.root="local_path"`: this local path is where the new part of the dataset is saved, and it is required to resume.
- To start recording from scratch, **manually delete** the dataset directory.

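Because `--dataset.num_episodes` counts only the extra episodes when resuming, it helps to compute the value explicitly. The helper below is a hypothetical illustration of that arithmetic and is not part of LeRobot.

```python
def additional_episodes(target_total: int, already_recorded: int) -> int:
    """Value to pass as --dataset.num_episodes when resuming a recording."""
    if target_total < 0 or already_recorded < 0:
        raise ValueError("episode counts must be non-negative")
    # Never return a negative count if the target is already reached.
    return max(target_total - already_recorded, 0)


# 50 episodes wanted, 32 already saved: resume with --dataset.num_episodes=18
print(additional_episodes(50, 32))  # 18
```
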
##### 3. Recording Parameters

Set the flow of data recording using command-line arguments:

- `--dataset.episode_time_s=60`
  Duration of each data recording episode (default: **60 seconds**).
- `--dataset.reset_time_s=60`
  Duration for resetting the environment after each episode (default: **60 seconds**).
- `--dataset.num_episodes=50`
  Total number of episodes to record (default: **50**).

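These three parameters bound how long a recording session can take. The sketch below computes that upper bound, assuming no episode is re-recorded or stopped early and that no reset phase follows the final episode (as in the API example, which skips the reset after the last episode). `session_upper_bound_s` is a hypothetical helper.

```python
def session_upper_bound_s(num_episodes: int, episode_time_s: float, reset_time_s: float) -> float:
    """Longest possible session: every episode and every reset runs to its time limit."""
    if num_episodes <= 0:
        return 0.0
    # One reset between consecutive episodes, none after the last one.
    return float(num_episodes * episode_time_s + (num_episodes - 1) * reset_time_s)


# Defaults from the list above: 50 episodes, 60 s each, 60 s reset.
total_s = session_upper_bound_s(50, 60, 60)
print(f"{total_s / 60:.0f} minutes")  # 99 minutes
```
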
##### 4. Keyboard Controls During Recording

Control the data recording flow using keyboard shortcuts:

- Press **Right Arrow (`→`)** or **`n`**: Stop the current episode or reset period early and move to the next.
- Press **Left Arrow (`←`)** or **`r`**: Cancel the current episode and re-record it.
- Press **Escape (`ESC`)** or **`q`**: Immediately stop the session, encode videos, and upload the dataset.

<Tip>

These control-flow shortcuts work on **X11, Wayland, and headless/SSH** sessions. When a global keyboard backend isn't available (Wayland, a headless machine, or macOS without Accessibility permission), `lerobot-record` automatically reads the same keys from the terminal, so launch it from an interactive terminal and keep it focused. You can also use the letter equivalents **`n`** (next, same as `→`), **`r`** (re-record, same as `←`) and **`q`** (quit, same as `ESC`). No `$DISPLAY` setup is required.

This applies to the recording control flow only. Keyboard **teleoperation** (driving the robot with the keyboard) still needs a global key backend, so it works only on an X11 session, a Windows desktop, or macOS with Accessibility/Input Monitoring granted. It does not work on Wayland or headless sessions.

</Tip>

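The shortcuts above can be pictured as updates to the `events` dictionary used in the recording API example. The sketch below is an assumption about how keys map to the `exit_early`, `rerecord_episode`, and `stop_recording` flags, not the library's actual listener; `apply_key` and the key labels `"right"`, `"left"`, and `"esc"` are hypothetical.

```python
def apply_key(events: dict, key: str) -> dict:
    """Update the recording flags for one key press (illustrative mapping only)."""
    key = key.lower()
    if key in ("right", "n"):
        events["exit_early"] = True  # end the current episode or reset period
    elif key in ("left", "r"):
        events["rerecord_episode"] = True  # discard and redo the episode
        events["exit_early"] = True
    elif key in ("esc", "q"):
        events["stop_recording"] = True  # end the whole session
        events["exit_early"] = True
    return events  # unknown keys leave the flags untouched
```

Under this reading, the recording loop only needs to poll the flags: `exit_early` ends the current phase, and the other two flags decide what happens next.
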
#### Tips for gathering data

Once you're comfortable with data recording, you can create a larger dataset for training. A good starting task is grasping an object at different locations and placing it in a bin. We suggest recording at least 50 episodes, with 10 episodes per location. Keep the cameras fixed and maintain consistent grasping behavior throughout the recordings. Also make sure the object you are manipulating is visible in the camera images. A good rule of thumb is that you should be able to do the task yourself by looking only at the camera images.

In the following sections, you’ll train your neural network. After achieving reliable grasping performance, you can start introducing more variations during data collection, such as additional grasp locations, different grasping techniques, and altered camera positions.

Avoid adding too much variation too quickly, as it may hinder your results.

If you want to dive deeper into this important topic, you can check out the [blog post](https://huggingface.co/blog/lerobot-datasets#what-makes-a-good-dataset) we wrote on what makes a good dataset.

#### Troubleshooting:

- On Linux, the recording control-flow keys (arrow keys, Escape) work on X11, Wayland, and headless/SSH sessions as long as `lerobot-record` runs in an interactive terminal; no `$DISPLAY` setup is needed. If the keys have no effect, make sure you are in an interactive (TTY) terminal, not a piped/non-TTY session, and that it is focused; the letter equivalents `n` / `r` / `q` also work. Keyboard _teleoperation_ (as opposed to the recording control flow) still requires a global key backend (an X11 session, a Windows desktop, or macOS with Accessibility/Input Monitoring granted) and is unavailable on Wayland or headless machines. See [pynput limitations](https://pynput.readthedocs.io/en/latest/limitations.html#linux).

## Visualize a dataset

If you uploaded your dataset to the Hub (the default, controlled by `--dataset.push_to_hub`), you can [visualize your dataset online](https://huggingface.co/spaces/lerobot/visualize_dataset) by copying and pasting your repo id, given by:

```bash
echo ${HF_USER}/so101_test
```

## Replay an episode
|
||
|
||
A useful feature is the `replay` function, which allows you to replay any episode that you've recorded or episodes from any dataset out there. This function helps you test the repeatability of your robot's actions and assess transferability across robots of the same model.
|
||
|
||
You can replay the first episode on your robot with either the command below or with the API example:
|
||
|
||
<hfoptions id="replay">
|
||
<hfoption id="Command">
|
||
```bash
|
||
lerobot-replay \
|
||
--robot.type=so101_follower \
|
||
--robot.port=/dev/tty.usbmodem58760431541 \
|
||
--robot.id=my_awesome_follower_arm \
|
||
--dataset.repo_id=${HF_USER}/record-test \
|
||
--dataset.episode=0 # choose the episode you want to replay
|
||
```
|
||
</hfoption>
|
||
<hfoption id="API example">
|
||
|
||
<!-- prettier-ignore-start -->
|
||
```python
|
||
import time
|
||
|
||
from lerobot.datasets import LeRobotDataset
|
||
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
|
||
from lerobot.utils.robot_utils import precise_sleep
|
||
from lerobot.utils.utils import log_say
|
||
|
||
episode_idx = 0
|
||
|
||
robot_config = SO100FollowerConfig(port="/dev/tty.usbmodem5AB90687491", id="my_follower_arm")
|
||
|
||
robot = SO100Follower(robot_config)
|
||
robot.connect()
|
||
|
||
dataset = LeRobotDataset("<hf_username>/<dataset_repo_id>", episodes=[episode_idx])
|
||
actions = dataset.select_columns("action")
|
||
|
||
log_say(f"Replaying episode {episode_idx}")
|
||
for idx in range(dataset.num_frames):
|
||
t0 = time.perf_counter()
|
||
|
||
action = {
|
||
name: float(actions[idx]["action"][i]) for i, name in enumerate(dataset.features["action"]["names"])
|
||
}
|
||
robot.send_action(action)
|
||
|
||
precise_sleep(max(1.0 / dataset.fps - (time.perf_counter() - t0), 0.0))
|
||
|
||
robot.disconnect()
|
||
```
|
||
<!-- prettier-ignore-end -->
|
||
|
||
</hfoption>
|
||
</hfoptions>
|
||
|
||
Your robot should replicate movements similar to those you recorded. For example, check out [this video](https://x.com/RemiCadene/status/1793654950905680090) where we use `replay` on a Aloha robot from [Trossen Robotics](https://www.trossenrobotics.com).

## Train a policy

To train a policy to control your robot, use the [`lerobot-train`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/scripts/lerobot_train.py) script. A few arguments are required. Here is an example command:

```bash
lerobot-train \
    --dataset.repo_id=${HF_USER}/so101_test \
    --policy.type=act \
    --output_dir=outputs/train/act_so101_test \
    --job_name=act_so101_test \
    --policy.device=cuda \
    --wandb.enable=true \
    --policy.repo_id=${HF_USER}/my_policy
```

Let's explain the command:

1. We provided the dataset as an argument with `--dataset.repo_id=${HF_USER}/so101_test`.
2. We provided the policy with `--policy.type=act`. This loads configurations from [`configuration_act.py`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/act/configuration_act.py). Importantly, this policy automatically adapts to the number of motor states, motor actions and cameras of your robot (e.g. `laptop` and `phone`) that were saved in your dataset.
3. We provided `--policy.device=cuda` since we are training on an NVIDIA GPU, but you could use `--policy.device=mps` to train on Apple silicon.
4. We provided `--wandb.enable=true` to use [Weights and Biases](https://docs.wandb.ai/quickstart) for visualizing training plots. This is optional, but if you use it, make sure you are logged in by running `wandb login`.

Training should take several hours. You will find checkpoints in `outputs/train/act_so101_test/checkpoints`.

To resume training from a checkpoint, use a command like the one below, which resumes from the `last` checkpoint of the `act_so101_test` policy:

```bash
lerobot-train \
    --config_path=outputs/train/act_so101_test/checkpoints/last/pretrained_model/train_config.json \
    --resume=true
```
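
Before resuming, it can help to check which checkpoints a run has actually written. The sketch below is a minimal example under two assumptions taken from the paths used in this guide: checkpoint directories are named after the zero-padded step (as in `010000`), and `last` points at the most recent one. The helper names are hypothetical and not part of LeRobot.

```python
from pathlib import Path


def list_checkpoint_steps(checkpoints_dir):
    """Return the step numbers of saved checkpoints, sorted ascending."""
    root = Path(checkpoints_dir)
    if not root.is_dir():
        return []
    # Only digit-named directories count as step checkpoints; `last` is skipped.
    return sorted(int(p.name) for p in root.iterdir() if p.is_dir() and p.name.isdigit())


def resume_config_path(checkpoints_dir, step=None):
    """Path to train_config.json for a given step, or for `last` when step is None."""
    # Assumption: step directories are zero-padded to six digits.
    name = "last" if step is None else f"{step:06d}"
    return Path(checkpoints_dir) / name / "pretrained_model" / "train_config.json"
```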

`--config_path` also accepts a **Hub repo id**: if a run pushed its checkpoints to the Hub (with `--save_checkpoint_to_hub=true`), you can resume straight from the repo. Its latest checkpoint is downloaded and training continues, restoring the optimizer, scheduler, step counter and data order:

```bash
lerobot-train --config_path=${HF_USER}/my_policy --resume=true
```
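
To see which checkpoints a model repo holds before resuming from it, you can list the repo's files and extract the step numbers. This is only a sketch: it assumes pushed checkpoints live under `checkpoints/<step>/` in the model repo, and the helper names are hypothetical. `HfApi.list_repo_files` is the `huggingface_hub` call used to fetch the file list; check the layout of your own repo before relying on the pattern.

```python
import re

# Assumed layout inside the model repo: checkpoints/<step>/...
_STEP_PATTERN = re.compile(r"^checkpoints/(\d+)/")


def hub_checkpoint_steps(repo_files):
    """Extract the sorted, de-duplicated checkpoint steps from a list of repo file paths."""
    steps = {int(m.group(1)) for path in repo_files if (m := _STEP_PATTERN.match(path))}
    return sorted(steps)


def fetch_hub_checkpoint_steps(repo_id):
    """List the checkpoint steps of a model repo on the Hub (needs network access)."""
    # Imported lazily so the pure helper above has no third-party dependency.
    from huggingface_hub import HfApi

    return hub_checkpoint_steps(HfApi().list_repo_files(repo_id))
```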

If you do not want to push your model to the Hub after training, use `--policy.push_to_hub=false`.

You can also provide extra `tags`, specify a `license` for your model, or make the model repo `private` by adding: `--policy.private=true --policy.tags=\[ppo,rl\] --policy.license=mit`

#### Train using Google Colab

If your local computer doesn't have a powerful GPU, you can use Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).

#### Train using Hugging Face Jobs

Hugging Face Jobs lets you easily select hardware and run the training in the cloud. If you don't have a powerful GPU, need more VRAM, or just want to train a model much faster, use HF Jobs! It's pay-as-you-go, billed per second of use. You can see the pricing and additional information [here](https://huggingface.co/docs/hub/jobs).

> **Tip:** if you just want to launch a standard training run, you can skip building the command below and use the integrated **Train on HF Jobs via `--job.target`** flow described further down: `lerobot-train` then submits the job, uploads a local-only dataset for you, and streams the logs.

To run the training manually, use this command:

<hfoptions id="train_with_hf_jobs">
<hfoption id="Command">
```bash
hf jobs run \
    --flavor a10g-small \
    --timeout 4h \
    --secrets HF_TOKEN \
    huggingface/lerobot-gpu:latest \
    -- \
    python -m lerobot.scripts.lerobot_train \
        --dataset.repo_id=username/dataset \
        --policy.type=act \
        --steps=5000 \
        --batch_size=16 \
        --policy.device=cuda \
        --policy.repo_id=username/your_policy \
        --log_freq=100
```
</hfoption>
<hfoption id="API example">

<!-- prettier-ignore-start -->
```python
from huggingface_hub import run_job, get_token

run_name = "act_so101_hf_jobs"
dataset_id = "username/dataset"
user_hub_id = "username"

# Command executed inside the container: the same arguments as a local lerobot-train run
command_args = [
    "python", "-m", "lerobot.scripts.lerobot_train",
    "--dataset.repo_id", dataset_id,
    "--policy.type", "act",
    "--steps", "5000",
    "--batch_size", "16",
    "--num_workers", "4",
    "--policy.device", "cuda",
    "--log_freq", "100",
    "--save_freq", "1000",
    "--save_checkpoint", "true",
    "--wandb.enable", "false",
    "--policy.repo_id", f"{user_hub_id}/{run_name}"
]

print(f"Submitting job '{run_name}' to Hugging Face Infrastructure...")

# The token is passed as a secret so the job can pull the dataset and push the model
job_info = run_job(
    image="huggingface/lerobot-gpu:latest",
    command=command_args,
    flavor="a10g-small",
    timeout="4h",
    secrets={"HF_TOKEN": get_token()}
)

print("\n🚀 Job successfully launched!")
print(f"🔹 Job ID: {job_info.id}")
print(f"🔗 Live UI Dashboard & Logs: {job_info.url}")
```
<!-- prettier-ignore-end -->

</hfoption>
</hfoptions>

You can modify the `--flavor` to use different hardware, for example: `t4-small`, `a100-large`, `h200`. Use `hf jobs hardware` to see the full list with pricing.
Depending on the model you want to train and the hardware you selected, you can also modify `--batch_size` and `--num_workers`.
For longer training sessions, increase the timeout.

Once the training has started, you can go to [Jobs](https://huggingface.co/settings/jobs) to see whether your job is running, along with all of its outputs. Scheduling a job sometimes takes a few minutes, so be patient.

After training, the model will be pushed to the Hub and you can use it like any other model with LeRobot.
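
Instead of refreshing the Jobs page, you can also poll the job from Python. The sketch below is a hedged example rather than an official recipe: `wait_for_job` is a hypothetical helper, and it assumes that `huggingface_hub.inspect_job` returns an object exposing `status.stage` and that the terminal stage names are the ones listed in `TERMINAL_STAGES`. Verify both against your installed `huggingface_hub` version.

```python
import time

# Assumed names of the stages after which a job no longer changes state.
TERMINAL_STAGES = {"COMPLETED", "CANCELED", "ERROR", "DELETED"}


def wait_for_job(job_id, inspect=None, poll_seconds=30.0, sleep=time.sleep):
    """Poll a job until it reaches a terminal stage and return that stage as a string."""
    if inspect is None:
        # Imported lazily so the function can be exercised with a stub and no network.
        from huggingface_hub import inspect_job

        inspect = inspect_job
    while True:
        stage = inspect(job_id=job_id).status.stage
        # The stage may be an enum member or a plain string; normalize to a string.
        stage = str(getattr(stage, "value", stage))
        if stage in TERMINAL_STAGES:
            return stage
        sleep(poll_seconds)
```

With the API example above, `wait_for_job(job_info.id)` would block until the job finishes, fails, or is cancelled. The `inspect` and `sleep` parameters exist only to make the loop easy to test with stand-ins.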

#### Train on HF Jobs via `--job.target` (integrated CLI)

`lerobot-train` runs locally by default. To run on a Hugging Face GPU without constructing the Docker command yourself, pass `--job.target` with a hardware flavor name:

```bash
lerobot-train \
    --dataset.repo_id=${HF_USER}/so101_test \
    --policy.type=act \
    --policy.repo_id=${HF_USER}/my_policy \
    --job.target=a10g-small
```

List available flavors and pricing with `hf jobs hardware`. The run streams its logs to your terminal; press Ctrl-C to detach (the job keeps running in the cloud). Re-attach or cancel with:

```bash
hf jobs logs <job-id>
hf jobs cancel <job-id>
```

If your dataset exists only locally (not yet on the Hub), it is automatically pushed to a **private** Hub repo so the job can download it by `repo_id` (nothing is made public). The trained model is pushed to the model repo at the end of the run. To also push every intermediate checkpoint to the Hub as it is saved (so you can monitor progress mid-run), add `--save_checkpoint_to_hub=true`. This requires a runtime image that includes this feature.

Every job (and any dataset pushed by the run) is tagged `lerobot` so it's easy to find on the Hub. Add your own tags with `--job.tags '["my-tag"]'`.

By default the job is capped at `2d` (48h) of wall-clock time. Override it with an HF Jobs duration string, e.g. `--job.timeout=4h` to fail faster or `--job.timeout=7d` for a longer run.
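
To make the duration strings concrete, the sketch below converts values such as `2d` or `4h` into seconds. It is an illustration only: the real parsing happens on the HF Jobs side, `duration_to_seconds` is a hypothetical helper, and the accepted units (`s`, `m`, `h`, `d`, with a bare number meaning seconds) are an assumption based on the HF Jobs timeout format.

```python
import re

# Seconds per unit; a bare number is treated as seconds (assumption).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(text):
    """Convert a duration string such as '4h' or '2d' into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd]?)", text.strip())
    if match is None:
        raise ValueError(f"not a duration string: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit or "s"]
```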

> **Note:** the model repo is created up front (it holds the staged training config the job runs from). If a run fails before the model is pushed, that repo is left on the Hub so you can inspect it. It is not deleted automatically, so repeated failures can leave empty repos behind. Remove one with `hf repo delete <repo-id>`.

**Prerequisites:** run `hf auth login` before submitting. For Weights & Biases integration, run `wandb login` or set `WANDB_API_KEY` on your machine; the key is forwarded to the job automatically.

**Resuming on a job.** Adding `--job.target` to a resume command runs the resume in the cloud, so the same command works locally or remotely. The checkpoint repo is the source of truth, and new checkpoints continue the lineage in the same repo:

```bash
# resume a Hub run on a job (its checkpoints are already on the Hub)
lerobot-train --config_path=${HF_USER}/my_policy --resume=true --job.target=a10g-small

# resume a LOCAL run on a job: the checkpoint is uploaded to a private Hub repo first,
# then the job resumes from it (a local-only dataset is uploaded the same way)
lerobot-train \
    --config_path=outputs/train/act_so101_test/checkpoints/last/pretrained_model/train_config.json \
    --resume=true \
    --job.target=a10g-small
```

Job settings come from the current command, so override `--job.target`, `--job.timeout`, etc. as needed. For the resumed run to itself be resumable later, keep `--save_checkpoint_to_hub=true`.

#### Upload policy checkpoints

Once training is done, upload the latest checkpoint with:

```bash
hf upload ${HF_USER}/act_so101_test \
    outputs/train/act_so101_test/checkpoints/last/pretrained_model
```

You can also upload intermediate checkpoints with:

```bash
CKPT=010000
hf upload ${HF_USER}/act_so101_test${CKPT} \
    outputs/train/act_so101_test/checkpoints/${CKPT}/pretrained_model
```
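
The same upload can be scripted from Python. The following is a minimal sketch, assuming the checkpoint layout shown in the commands above: `upload_checkpoint` is a hypothetical helper, while `HfApi.create_repo` and `HfApi.upload_folder` are the `huggingface_hub` calls it relies on. The `api` parameter is there so the function can be tried with a stand-in object before touching the Hub.

```python
from pathlib import Path


def upload_checkpoint(run_dir, repo_id, step="last", api=None):
    """Upload <run_dir>/checkpoints/<step>/pretrained_model to a model repo on the Hub."""
    folder = Path(run_dir) / "checkpoints" / step / "pretrained_model"
    if not folder.is_dir():
        raise FileNotFoundError(f"no pretrained_model folder at {folder}")
    if api is None:
        # Imported lazily so the path check above works without huggingface_hub installed.
        from huggingface_hub import HfApi

        api = HfApi()
    # Create the repo if it does not exist yet, then push the folder contents.
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    return api.upload_folder(repo_id=repo_id, folder_path=str(folder), repo_type="model")
```

For example, `upload_checkpoint("outputs/train/act_so101_test", "<hf_username>/act_so101_test")` would mirror the first `hf upload` command, provided you are logged in with `hf auth login`.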

## Run inference and evaluate your policy

Use `lerobot-rollout` to deploy a trained policy on your robot. You can choose different strategies depending on your needs.

The examples below load the model from `--policy.path`. To pin a specific pushed version (useful once `--save_checkpoint_to_hub=true` has committed several checkpoints), add `--policy.pretrained_revision` with a commit hash, branch, or tag. Each pushed checkpoint is tagged with its step (e.g. `--policy.pretrained_revision=010000`), so you can recover a checkpoint by step without looking up its commit sha.

<hfoptions id="eval">
<hfoption id="Base mode (no recording)">
```bash
lerobot-rollout \
    --strategy.type=base \
    --policy.path=${HF_USER}/my_policy \
    --robot.type=so100_follower \
    --robot.port=/dev/ttyACM1 \
    --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
    --task="Put lego brick into the transparent box" \
    --duration=60
```
</hfoption>
<hfoption id="Sentry mode (with recording)">
```bash
lerobot-rollout \
    --strategy.type=sentry \
    --strategy.upload_every_n_episodes=5 \
    --policy.path=${HF_USER}/my_policy \
    --robot.type=so100_follower \
    --robot.port=/dev/ttyACM1 \
    --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
    --dataset.repo_id=${HF_USER}/eval_so100 \
    --dataset.single_task="Put lego brick into the transparent box" \
    --duration=600
```
</hfoption>
</hfoptions>

The `--strategy.type` flag selects the execution mode:

- `base`: Autonomous rollout with no data recording (useful for quick evaluation)
- `sentry`: Continuous recording with auto-upload (useful for large-scale evaluation)
- `highlight`: Ring buffer recording with keystroke save (useful for capturing interesting events)
- `dagger`: Human-in-the-loop data collection (see [HIL Data Collection](./hil_data_collection))
- `episodic`: Episode-oriented policy recording with reset phases between episodes

All strategies support `--inference.type=rtc` for smooth execution with slow VLA models (Pi0, Pi0.5, SmolVLA).