mirror of https://github.com/huggingface/lerobot.git synced 2026-07-01 15:17:05 +00:00

Nicolas Rabault 5ac3b49a5f feat(train): run training remotely on HF Jobs via --job.target (#3856)

* feat(train): add JobConfig group, save_checkpoint_to_hub flag, Hub checkpoint helper

Introduce a JobConfig draccus group on TrainPipelineConfig (--job.target/image/
timeout/detach/tags) whose is_remote property gates remote dispatch, plus a
save_checkpoint_to_hub flag and validation. Add push_checkpoint_to_hub(), which
uploads a saved checkpoint directory to the model repo under checkpoints/<step>/
and creates the repo idempotently (private propagates from policy.private).

* feat(train): run training remotely on HF Jobs via --job.target

When --job.target names a GPU flavor, train() dispatches to lerobot.jobs.submit_to_hf
instead of training locally: it authenticates, ensures the dataset is on the Hub
(pushing a local-only one privately), serializes a pod-compatible train_config.json
(strips client-only fields, points at the model repo), submits via HfApi.run_job
with HF_TOKEN/WANDB_API_KEY secrets, then streams logs and finishes when the model
is pushed. Wires push_checkpoint_to_hub into the training loop behind
save_checkpoint_to_hub, and tags jobs/datasets/model with 'lerobot' + --job.tags.

* docs(train): document remote training on HF Jobs

* test(train): skip remote-dispatch tests without the dataset extra

The module imports lerobot.scripts.lerobot_train, which eagerly pulls in
lerobot.datasets (dataset extra). The base fast-test CI tier runs without
that extra, so collection failed there. Guard with pytest.importorskip,
matching the existing tests/scripts dataset-extra tests.

* refactor(jobs): hoist huggingface_hub imports to module level in hf.py

huggingface_hub is a core dependency, so the per-function dynamic imports
had no lazy-loading rationale. Move them to a single module-level import
and update test monkeypatch targets to lerobot.jobs.hf.* accordingly.

* refactor(jobs): build remote config dict via cfg.to_dict()

TrainPipelineConfig.to_dict() already returns the canonical draccus
encoding, so the StringIO + draccus.dump + json.loads round-trip was
redundant. Use it directly and drop the now-unused io/draccus imports.

* refactor(train): use module-level HfApi import in push_checkpoint_to_hub

huggingface_hub is a core dependency; the in-function import was
unnecessary. Move HfApi to a module-level import and point the test
monkeypatches at lerobot.common.train_utils.HfApi.

* refactor(configs): export JobConfig from the configs package

Re-export JobConfig in lerobot/configs/__init__.py so external callers
import it as `from lerobot.configs import JobConfig`, matching the other
config classes. Adapt the train script and test imports.

* refactor(jobs): check dataset presence with api.repo_exists

Replace the dataset_info try/except RepositoryNotFoundError dance with a
direct api.repo_exists(repo_id, repo_type="dataset") call, dropping the
httpx/RepositoryNotFoundError test scaffolding.

* chore(jobs): annotate ensure_dataset_available api param as HfApi

Add the missing HfApi type hint via a TYPE_CHECKING import.

* refactor(jobs): use HF_LEROBOT_HOME constant for the local cache root

Resolve the local dataset cache via lerobot.utils.constants.HF_LEROBOT_HOME
instead of re-reading the env var by hand, dropping the os/Path imports.
Tests now patch the imported constant and assert on a stable message
substring (the previous "neither" match only passed by accident, matching
the test name embedded in the pytest tmp_path).

* chore(jobs): guard LeRobotDataset import with require_package

Surface a clear "install lerobot[dataset]" error if the datasets extra
is missing, instead of a raw ImportError, before pushing a local dataset.

* docs(configs): clarify the is_remote_target/is_remote split

Add a comment explaining why JobConfig keeps both the staticmethod (tests
a raw target string from argv before a config exists) and the property
(accessor for an existing config instance).

* docs(train): note how to pin a pushed model version for inference

Document --policy.pretrained_revision alongside --policy.path so a
specific Hub-pushed checkpoint (once --save_checkpoint_to_hub has
committed several) can be selected for inference.

* test(jobs): skip dataset import guard in base-deps test

The fast test env installs base deps only, so require_package('datasets')
raised ImportError before the mocked lerobot.datasets import was reached.
Monkeypatch the guard to a no-op so the unit test exercises the upload logic.

* fix(jobs): address claude review findings on remote training

Resolve the claude[bot] review on #3856:

- Reject reward-model training under --job.target with a clear error instead
  of crashing on a None policy inside build_remote_config_file.
- Support --policy.path remote runs: validate() no longer requires repo_id for
  remote runs (it is auto-generated in submit_to_hf), and repo_id/push_to_hub
  are now set after validate() resolves the policy.
- Narrow the bare `except Exception` in _tail_logs/_poll_until_done to
  (OSError, httpx.HTTPError) so programming errors surface instead of being
  silently retried or counted as job failures.
- Install the SIGINT detach handler only on the main thread.
- Generate model repo timestamps in UTC.

* docs(jobs): document the model-pushed marker contract and orphaned repos

Follow-up to the claude[bot] review on #3856 (non-blocking observations):

- Cross-reference the "Model pushed to <url>" log line between its producer
  (PreTrainedPolicy.push_model_to_hub) and the remote-run consumer in
  submit_to_hf, noting the contract is an early-finish optimization that
  falls back to status polling if it drifts.
- Note in the HF Jobs guide that a failed remote run leaves its model repo
  on the Hub (it is not auto-deleted) and how to remove it.

* feat(train): tag each pushed checkpoint with its step

Address review feedback on #3856: pushing a checkpoint to the Hub now
also creates a tag named after the checkpoint step, so a checkpoint can
be recovered with --policy.pretrained_revision=<step> instead of having
to look up its commit sha.

* fix(jobs): hoist ensure_dataset_available to a module-level import

Addresses Caroline's review comment on PR #3856: the local import of
ensure_dataset_available inside submit_to_hf was vestigial. dataset.py
does not import hf.py, so there is no circular-import risk and no extra
load cost (its heavy deps stay lazy), so make it a top-level import.

* refactor(configs): untangle config_path/resume resolution in validate()

Split the re-parse HACK block in TrainPipelineConfig.validate() into focused
helpers (_resolve_pretrained_from_cli, _resolve_resume_checkpoint) that handle
the policy path, reward-model path, and resume config_path as separate,
readable units. Behavior-preserving.

* feat(train): resume training from a Hub checkpoint

Allow --config_path to be a Hub repo id when resuming, not only a local path.
The latest checkpoint under checkpoints/<step>/ is downloaded into a fresh local
run dir and resumed from there (optimizer, scheduler, RNG and data order
restored as for a local resume). TrainPipelineConfig.from_pretrained falls back
to the latest checkpoint's train_config.json when a repo has no root config
(an interrupted run that only pushed checkpoints). The download is skipped when
dispatching remotely so the executor (local machine or HF Jobs pod) performs it.

- add find_latest_hub_checkpoint (utils/hub) and resolve_resume_checkpoint
  (common/train_utils), the symmetric download counterpart to
  push_checkpoint_to_hub
- unit tests for both helpers and the from_pretrained fallback

* feat(jobs): resume a run on HF Jobs from a checkpoint

When --resume is set with a remote --job.target, submit_to_hf resumes from the
checkpoint repo instead of staging a fresh config. A Hub config_path is resumed
in place (its checkpoint config already targets that repo); a local config_path
has its checkpoint uploaded to a new private repo first and the run is forced to
push back to it. The pod command carries --job.target=local so the checkpoint's
saved job.target can't make the pod re-dispatch itself, and the user's CLI
overrides are forwarded so a remote resume matches the same local command.
ensure_dataset_available is hoisted before the resume/fresh branch since it
applies to both.

* docs(train): document resuming from a Hub checkpoint, locally and on jobs

Show that --config_path accepts a Hub repo id for --resume, and that adding
--job.target resumes on HF Jobs (uploading a local checkpoint/dataset first).

* fix(jobs): default remote job timeout to 2d instead of the platform default

HF Jobs applies its own short 30-minute timeout when none is sent, which
silently kills long training runs. Pass an explicit, generous 2d cap by
default; users can still override --job.timeout to fail fast or extend it.

* fix(jobs): drop --dataset.root on resume + restore keyboard-control docs

Address the latest Claude review on #3856:

- _build_resume_job no longer forwards --dataset.root to the pod (a
  host-local path it can't read); the fresh-run path already nulls it in
  build_remote_config_file, so this makes resume consistent. Add a unit
  test for _pod_forwarded_args covering the drop in both flag forms.
- Restore the display-independent keyboard-control docs (n/r/q letter
  equivalents + X11/Wayland/headless Tip) in il_robots.mdx that this
  branch was stale on relative to main (#3875).

* fix(jobs): handle str-typed job stage from huggingface_hub

inspect_job's status.stage is an enum (with .value) in some
huggingface_hub versions and a plain str in others. The poller
assumed the enum shape, raising "'str' object has no attribute
'value'" on resume for users on the str-returning version.

Read it via getattr(..., "value", ...) so both shapes work, and
parametrize the poll test over enum and str stages so the str case
is actually exercised (the old mock only ever simulated the enum).

* refactor(jobs): use relative import for ensure_dataset_available

* refactor(train): hoist submit_to_hf import to module top

The `from lerobot.jobs import submit_to_hf` was a function-local import in
train(); it pulls no heavy/optional deps and has no circular-import risk, so
move it to the top-level import block.

* refactor(train): hoist _remote_target_in_argv imports to module top

Move `import sys` and `from lerobot.configs import JobConfig` out of the
function body and into the top-level import block.

* refactor(utils): use relative import for sibling constants in hub.py

`from lerobot.utils.constants import CHECKPOINTS_DIR` was the odd one out in
utils/ — sibling modules there are imported relatively (.constants, .errors,
.utils, ...). Match that convention.

* refactor(jobs): hoist LeRobotDataset import, guard dataset extra at package init

Move the `from lerobot.datasets import LeRobotDataset` import to the top of
dataset.py and relocate the `require_package("datasets", extra="dataset")`
guard to the jobs package __init__, per review feedback.

* test(jobs): skip test_hf if datasets extra is missing

lerobot.configs.train pulls in datasets at import time, so the module
fails to collect without lerobot[dataset]. Guard with importorskip,
matching the convention in tests/training/test_multi_gpu.py.

* test(jobs): skip test_dataset if datasets extra is missing

tests/jobs/test_dataset.py imports lerobot.jobs.dataset, which triggers
the require_package("datasets") guard in lerobot/jobs/__init__.py at
import time. Without lerobot[dataset] the module fails to collect in the
base CI tier. Guard with importorskip, same as test_hf.py.

2026-06-29 17:59:33 +02:00

AGENT_GUIDE.md — LeRobot Helper for AI Agents & Users

This file is a practical, copy-paste-friendly companion for any AI agent (Cursor, Claude, ChatGPT, Codex, etc.) helping a user work with LeRobot. It complements AGENTS.md (dev/contributor context) with user-facing guidance: how to start, what to train, how long, how to record, and how to calibrate an SO-101.

1. Start here — ask the user first (MANDATORY)

Before suggesting any command, an agent MUST ask the user at least these questions and wait for answers:

What's your goal? (e.g. "teach my SO-101 to fold a cloth", "train a policy on an existing HF dataset", "contribute a PR", "understand the codebase")
What hardware do you have?
- Robot: none / SO-100 / SO-101 / Koch / LeKiwi / Reachy / other
- Teleop: leader arm / phone / keyboard / gamepad / none
- Cameras: how many, resolution, fixed or moving?
What machine will you train on?
- GPU model + VRAM (e.g. "laptop 3060 6 GB", "RTX 4090 24 GB", "A100 80 GB", "CPU only")
- OS: macOS / Linux / Windows
Skill level & time budget? First time, some ML, experienced? Hours, days, a weekend?
Do you already have a dataset? Yes (HF repo id?) / no / want to record one
How can I help right now? (pick one concrete next step)

Only after you have answers, propose a concrete path. If something is ambiguous, ask again rather than guessing. Bias toward the simplest thing that works for the user's hardware and goal.

2. LeRobot in 60 seconds

LeRobot = datasets + policies + envs + robot control, unified by a small set of strong abstractions.

LeRobotDataset — episode-aware dataset (video or images + actions + state), loadable from the Hub or disk.
Policies (ACT, Diffusion, SmolVLA, π0, π0.5, Wall-X, X-VLA, VQ-BeT, TD-MPC, …) — all inherit PreTrainedPolicy and can be pushed/pulled from the Hub.
Processors — small composable transforms between dataset → policy → robot.
Envs (sim) and Robots (real) — same action/observation contract so code swaps cleanly.
CLI — lerobot-record, lerobot-train, lerobot-eval, lerobot-teleoperate, lerobot-calibrate, lerobot-find-port, lerobot-setup-motors, lerobot-replay.

See AGENTS.md for repo architecture.

3. Quickstart paths (pick one)

Path A — "I have an SO-101 and want my first trained policy"

Go to §4 (SO-101 end-to-end), then §5 (data tips), then §6 (pick a policy — likely ACT), then §7 (how long), then §8 (eval).

Path B — "No hardware, I want to train on an existing dataset"

Skip §4. Pick a policy in §6, pick a duration in §7, then run lerobot-train per §4.9 with a Hub --dataset.repo_id and an --env.type for eval. Finish with §8.

Path C — "I just want to understand the codebase"

Read §2 above, then AGENTS.md "Architecture", then open src/lerobot/policies/act/ and src/lerobot/datasets/lerobot_dataset.py as canonical examples.

4. SO-101 end-to-end cheat-sheet

Full details are in docs/source/so101.mdx and docs/source/il_robots.mdx. The minimum commands follow, in order. Confirm the arms are assembled and powered before running any of them.

4.1 Install

pip install 'lerobot[feetech]'              # SO-100/SO-101 motor stack
# pip install 'lerobot[all]'                # everything
# pip install 'lerobot[aloha,pusht]'        # specific features
# pip install 'lerobot[smolvla]'            # add SmolVLA deps
git lfs install && git lfs pull
hf auth login                               # required to push datasets/policies

Contributors can alternatively use uv sync --locked --extra feetech (see AGENTS.md).

4.2 Find USB ports — run once per arm, unplug when prompted.

lerobot-find-port

macOS: /dev/tty.usbmodem...; Linux: /dev/ttyACM0 (may need sudo chmod 666 /dev/ttyACM0).

4.3 Setup motor IDs & baudrate (one-time, per arm)

lerobot-setup-motors --robot.type=so101_follower --robot.port=<FOLLOWER_PORT>
lerobot-setup-motors --teleop.type=so101_leader  --teleop.port=<LEADER_PORT>

4.4 Calibrate — center all joints, press Enter, sweep each joint through its full range. The id is the calibration key — reuse it everywhere.

lerobot-calibrate --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower
lerobot-calibrate --teleop.type=so101_leader  --teleop.port=<LEADER_PORT>   --teleop.id=my_leader

4.5 Teleoperate (sanity check, no recording)

lerobot-teleoperate \
  --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
  --teleop.type=so101_leader  --teleop.port=<LEADER_PORT>  --teleop.id=my_leader \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --display_data=true

Feetech timeout / comms error on SO-100 / SO-101? Before touching software, check the red motor LEDs on the daisy chain.

All LEDs steady red along the whole chain, from gripper to base → wiring OK.

One or more motors dark / chain stops mid-way → wiring issue: reseat the 3-pin cables, check the controller-board power supply, and make sure each motor is fully clicked in.

LEDs blinking → the motor is in an error state: usually overload (forcing a joint past its limit) or wrong power supply voltage. SO-100 / SO-101 ship in two variants — a 5 V / 7.4 V build and a 12 V build — they are NOT interchangeable. Using a 12 V PSU on a 5 V / 7.4 V arm (or vice-versa) will trip this error; confirm your motor variant before powering up.

Most "timeout" errors are physical, not code.

4.6 Record a dataset — keys: → next, ← redo, ESC finish & upload.

HF_USER=$(NO_COLOR=1 hf auth whoami | awk -F': *' 'NR==1 {print $2}')

lerobot-record \
  --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
  --teleop.type=so101_leader  --teleop.port=<LEADER_PORT>  --teleop.id=my_leader \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --dataset.repo_id=${HF_USER}/my_task \
  --dataset.single_task="<describe the task in one sentence>" \
  --dataset.num_episodes=50 \
  --dataset.episode_time_s=30 \
  --dataset.reset_time_s=10 \
  --display_data=true

4.7 Visualize — always do this before training. Look for missing frames, camera blur, unreachable targets, inconsistent object positions. After upload: https://huggingface.co/spaces/lerobot/visualize_dataset → paste ${HF_USER}/my_task. Works for any LeRobot-formatted Hub dataset — use it to scout other datasets, inspect episode quality, or debug your own data before retraining.

4.8 Replay an episode (sanity check)

lerobot-replay --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
  --dataset.repo_id=${HF_USER}/my_task --dataset.episode=0

4.9 Train (default: ACT — fastest, lowest memory). Apple silicon: --policy.device=mps. No local GPU? Add --job.target=<flavor> (e.g. a10g-small, list them with hf jobs hardware) to run on Hugging Face Jobs instead. See §6/§7 for policy and duration.

lerobot-train \
  --dataset.repo_id=${HF_USER}/my_task \
  --policy.type=act \
  --policy.device=cuda \
  --output_dir=outputs/train/act_my_task \
  --job_name=act_my_task \
  --batch_size=8 \
  --wandb.enable=true \
  --policy.repo_id=${HF_USER}/act_my_task

4.10 Evaluate on the real robot — compare success rate to a teleoperated baseline.

lerobot-record \
  --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --dataset.repo_id=${HF_USER}/eval_my_task \
  --dataset.single_task="<same task description as training>" \
  --dataset.num_episodes=10 \
  --policy.path=${HF_USER}/act_my_task

5. Data collection tips (beginner → reliable policy)

Good data beats clever models. Adopt these defaults and deviate only with evidence.

5.1 Setup & ergonomics

Fix the rig and cameras before touching the software. If the rig vibrates or the operator gets frustrated, fix that first — more bad data won't help.
Lighting matters more than resolution. Diffuse, consistent light. Avoid moving shadows.
"Can you do the task from the camera view alone?" If no, your cameras are wrong. Fix before recording.
Enable action interpolation for rollouts when available for smoother trajectories.

5.2 Practice before you record

Do 5–10 demos without recording. Build a deliberate, repeatable strategy.
Hesitant or inconsistent demos teach the model hesitation.

5.3 Quality over speed

Deliberate, high-quality execution beats fast sloppy runs. Optimize for speed only after strategy is dialed in — never trade quality for it.

5.4 Consistency within and across episodes

Same grasp, approach vector, and timing. Coherent strategies are much easier to learn than wildly varying movements.

5.5 Start small, then extend (the golden rule)

First 50 episodes = constrained version of the task: one object, fixed position, fixed camera setup, one operator.
Train a quick ACT model. See what fails.
Then add diversity along one axis at a time: more positions → more lighting → more objects → more operators.
Don't try to collect the "perfect dataset" on day one. Iterate.

5.6 Policy choice for beginners

Laptop / first time / want results fast → ACT. Works surprisingly well, trains fast even on a laptop GPU.
Bigger GPU / language-conditioned / multi-task → SmolVLA. Unfreezing the vision encoder (see §7) is a big win here.
Defer π0 / π0.5 / Wall-X / X-VLA until you have a proven ACT baseline and a 20+ GB GPU.

5.7 Recommended defaults for your first task

Setting	Value
Episodes	50 to start, scale to 100–300 after first training
Episode length	20–45 s (shorter is fine for grasp/place)
Reset time	10 s
FPS	30
Cameras	2 cameras recommended: 1 fixed front + 1 wrist. Multi-view often outperforms single-view. A single fixed camera also works to keep things simple.
Task description	Short, specific, action-phrased sentence

5.8 Troubleshooting signal

Policy fails at one specific stage → record 10–20 more episodes targeting that stage.
Policy flaps / oscillates → likely inconsistent demos, or need more training; re-record worst episodes (use ← to redo).
Policy ignores the object → camera framing or lighting issue, not a model issue.

6. Which policy should I train?

Match the policy to the user's GPU memory and time budget. Numbers below come from an internal profiling run (one training update per policy). They are indicative only — see caveats.

6.1 Profiling snapshot (indicative)

All policies typically train for 5–10 epochs (see §7).

Human-facing version: the Compute Hardware Guide reuses the table below and adds a cloud-GPU tier guide and a Hugging Face Jobs pointer.

Policy	Batch	Update (ms)	Peak GPU mem (GB)	Best for
`act`	4	83.9	0.94	First-time users, laptops, single-task. Fast and reliable.
`diffusion`	4	168.6	4.94	Multi-modal action distributions; needs mid-range GPU.
`smolvla`	1	357.8	3.93	Language-conditioned, multi-task, small VLA. Unfreeze vision encoder for big gains (see §7).
`xvla`	1	731.6	15.52	Large VLA, multi-task.
`wall_x`	1	716.5	15.95	Large VLA with world-model objective.
`pi0`	1	940.3	15.50	Strong large VLA baseline (Physical Intelligence).
`pi05`	1	1055.8	16.35	Newer π policy; similar footprint to `pi0`.

Critical caveats:

Optimizer: measured with SGD. LeRobot's default is AdamW, which keeps extra optimizer state → peak memory will be noticeably higher with the default, especially for pi0, pi05, wall_x, xvla.
Batch size: the large policies were profiled at batch 1. In practice use a larger batch for stable training (see §7.4). Memory scales roughly linearly with batch.

6.2 Decision rules

< 8 GB VRAM (laptop GPU such as a 6 GB 3060, M-series Mac): → act. Maybe diffusion if you have ~6–8 GB free.
12–16 GB VRAM (4070/4080, A4000): → smolvla with defaults, or act/diffusion with larger batch. pi0/pi05/wall_x/xvla feasible only with small batch + gradient accumulation.
24+ GB VRAM (3090/4090/A5000): → any policy. Prefer smolvla (unfrozen) for multi-task; act for single-task grasp-and-place (still often the best ROI). pi0, pi05, and xvla are also worth experimenting with at this tier.
80 GB (A100/H100): → any, with healthy batch. pi05, xvla, wall_x become comfortable.
CPU only: → don't train here. Use Google Colab (see docs/source/notebooks.mdx) or a rented GPU.
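
The decision rules above can be sketched as a small lookup that an agent might use to turn a VRAM answer into a shortlist. This is only an illustration: the function name `suggest_policies` is hypothetical, and the rules leave the 8–12 GB and 16–24 GB ranges unspecified, so folding them into the 12–16 GB tier is an assumption rather than guidance from LeRobot.

```python
def suggest_policies(vram_gb: float) -> list[str]:
    """Return candidate policy types for a given amount of GPU memory.

    Hypothetical helper mirroring the decision rules in this section.
    The first entry is the default suggestion; later entries are options.
    """
    if vram_gb <= 0:
        # CPU only: do not train locally, use Colab or a rented GPU.
        return []
    if vram_gb < 6:
        return ["act"]
    if vram_gb < 8:
        # diffusion is a "maybe" once roughly 6-8 GB are free.
        return ["act", "diffusion"]
    if vram_gb < 24:
        # The guide names 12-16 GB; treating 8-12 GB and 16-24 GB
        # the same way is an assumption made for this sketch.
        return ["smolvla", "act", "diffusion"]
    if vram_gb < 80:
        # Any policy is feasible; large VLAs are experiments at this tier.
        return ["smolvla", "act", "diffusion", "pi0", "pi05", "xvla"]
    # 80 GB: every policy becomes comfortable with a healthy batch.
    return ["smolvla", "act", "diffusion", "pi0", "pi05", "xvla", "wall_x"]


if __name__ == "__main__":
    for gb in (0, 6, 16, 24, 80):
        print(gb, suggest_policies(gb))
```

Treat the output as a starting point for the conversation with the user, since the profiling caveats (SGD instead of AdamW, batch 1 for the large policies) mean real memory use will be higher.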

7. How long should I train?

Robotics imitation learning usually converges in a few epochs over the dataset, not hundreds of thousands of raw steps. Think epochs first, then translate to steps.

7.1 Rule of thumb

Typical total: 5–10 epochs. Start at 5, eval, then decide if more helps.
Very small datasets (< 30 episodes) may want slightly more epochs — but first, collect more data.
VLAs with a pretrained vision backbone typically need fewer epochs than training from scratch.

7.2 Steps ↔ epochs conversion

total_frames     = sum of frames over all episodes      # e.g. 50 eps × 30 fps × 30 s ≈ 45,000
steps_per_epoch  = ceil(total_frames / batch_size)
total_steps      = epochs × steps_per_epoch

Examples for --batch_size=8:

Dataset size	Frames	Steps / epoch	5 epochs	10 epochs
50 eps × 30 s @ 30 fps	45,000	~5,625	28k	56k
100 eps × 30 s @ 30 fps	90,000	~11,250	56k	113k
300 eps × 30 s @ 30 fps	270,000	~33,750	169k	338k

Pass the resulting total with --steps=<N>; eval at intermediate checkpoints (outputs/train/.../checkpoints/).
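
As a quick check, the conversion above can be written as a few lines of Python. The function name `total_training_steps` is made up for this sketch; the arithmetic is the same as in the formulas, and the sample values reproduce the first row of the table.

```python
import math


def total_training_steps(total_frames: int, batch_size: int, epochs: int) -> int:
    """Translate an epoch budget into a value for --steps."""
    if total_frames < 1 or batch_size < 1 or epochs < 1:
        raise ValueError("total_frames, batch_size and epochs must be positive")
    # One epoch is one pass over every frame; the last batch may be partial.
    steps_per_epoch = math.ceil(total_frames / batch_size)
    return epochs * steps_per_epoch


if __name__ == "__main__":
    frames = 50 * 30 * 30  # 50 episodes x 30 s x 30 fps = 45,000 frames
    print(total_training_steps(frames, batch_size=8, epochs=5))   # 28125
    print(total_training_steps(frames, batch_size=8, epochs=10))  # 56250
```

Round the result to a convenient number before passing it to --steps; the table above rounds to the nearest thousand.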

7.3 Per-policy starting points (single-task, ~50 episodes)

Policy	Batch	Steps (first run)	Notes
`act`	8–16	30k–80k	Usually converges under 50k for single-task.
`diffusion`	8–16	80k–150k	Benefits from longer training than ACT.
`smolvla`	4–8	30k–80k	Pretrained VLM → converges fast.
`pi0` / `pi05`	1–4	30k–80k	Memory-bound; use gradient accumulation for effective batch ≥ 16!

7.4 Batch size guidance

Bigger batch is preferable for stable gradients on teleop data.
If GPU memory is the bottleneck, use gradient accumulation to raise effective batch without raising peak memory.
Scale learning rate gently with batch; most LeRobot defaults work fine for a 2–4× batch change.
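
The relationship behind gradient accumulation is simple arithmetic: effective batch = per-step batch × accumulation steps. The sketch below only computes how many accumulation steps are needed to reach a target; the helper name is hypothetical, and how accumulation is actually enabled in a given training setup is not covered here.

```python
import math


def accumulation_steps(target_effective_batch: int, micro_batch: int) -> int:
    """Number of micro-batches to accumulate before one optimizer update."""
    if target_effective_batch < 1 or micro_batch < 1:
        raise ValueError("batch sizes must be positive")
    # Round up so the effective batch is at least the target.
    return math.ceil(target_effective_batch / micro_batch)


if __name__ == "__main__":
    # A memory-bound policy that only fits batch 2, aiming for effective batch 16.
    steps = accumulation_steps(target_effective_batch=16, micro_batch=2)
    print(steps, "accumulation steps -> effective batch", steps * 2)
```

Peak memory stays at the micro-batch level, but each optimizer update now costs several forward/backward passes, so wall-clock time per update grows accordingly.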

7.5 Scale LR schedule & checkpoints with `--steps`

LeRobot's default schedulers (e.g. SmolVLA's cosine decay) use scheduler_decay_steps=30_000, which is sized for long training runs. When you shorten training (e.g. 5k–10k steps on a small dataset), scale the scheduler down to match — otherwise the LR stays near the peak and never decays. Same for checkpoint frequency.

lerobot-train ... \
  --steps=5000 \
  --policy.scheduler_decay_steps=5000 \
  --save_freq=5000

Rule of thumb: set scheduler_decay_steps ≈ steps, and save_freq to whatever granularity you want for eval (e.g. every 1k–5k steps). Match scheduler_warmup_steps proportionally if your run is very short.
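
To make the rule of thumb concrete, here is a minimal sketch that derives the three overrides from a single step count. The helper name `short_run_overrides` and its `n_checkpoints` parameter are hypothetical; the flags themselves are the ones shown in the command above, and `--policy.scheduler_decay_steps` only applies to policies that expose that setting (for example SmolVLA).

```python
def short_run_overrides(steps: int, n_checkpoints: int = 1) -> list[str]:
    """Build lerobot-train overrides that keep the LR decay matched to --steps."""
    if steps < 1 or n_checkpoints < 1:
        raise ValueError("steps and n_checkpoints must be positive")
    # Spread checkpoints evenly over the run; never go below one step.
    save_freq = max(1, steps // n_checkpoints)
    return [
        f"--steps={steps}",
        f"--policy.scheduler_decay_steps={steps}",  # decay spans the whole run
        f"--save_freq={save_freq}",
    ]


if __name__ == "__main__":
    print(" ".join(short_run_overrides(5000)))
    print(" ".join(short_run_overrides(10_000, n_checkpoints=5)))
```

The first call reproduces the flags from the example command; the second keeps a checkpoint every 2,000 steps for intermediate evals.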

7.6 SmolVLA: unfreeze the vision encoder for real gains

SmolVLA ships with freeze_vision_encoder=True. Unfreezing usually improves performance substantially on specialized tasks, at the cost of more VRAM and slower steps. Enable with:

lerobot-train ... --policy.type=smolvla \
  --policy.freeze_vision_encoder=false \
  --policy.train_expert_only=false

7.7 Signals to stop / keep going

Train loss plateaus → stop, save a Hub checkpoint.
Train loss still dropping and you're under 10 epochs → keep going.

8. Evaluation & benchmarks

Two flavors of evaluation:

8.1 Real-robot eval (SO-101, etc.)

Reuse lerobot-record with --policy.path to run the trained policy on-robot and save the run as an eval dataset. Convention: prefix the dataset with eval_.

lerobot-record \
  --robot.type=so101_follower --robot.port=<FOLLOWER_PORT> --robot.id=my_follower \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
  --dataset.repo_id=${HF_USER}/eval_my_task \
  --dataset.single_task="<same task description used during training>" \
  --dataset.num_episodes=10 \
  --policy.path=${HF_USER}/act_my_task

Report success rate across episodes. Compare to a teleoperated baseline and to an earlier checkpoint to catch regressions.

8.2 Sim-benchmark eval

For policies trained on sim datasets (PushT, Aloha, LIBERO, MetaWorld, RoboCasa, …) use lerobot-eval against the matching env.type:

lerobot-eval \
  --policy.path=${HF_USER}/diffusion_pusht \
  --env.type=pusht \
  --eval.n_episodes=50 \
  --eval.batch_size=10 \
  --policy.device=cuda

Use --policy.path=outputs/train/.../checkpoints/<step>/pretrained_model for local checkpoints.
--eval.n_episodes should be ≥ 50 for a stable success-rate estimate.
Available envs live in src/lerobot/envs/. See docs/source/libero.mdx, metaworld.mdx, robocasa.mdx, vlabench.mdx for specific benchmarks.
To add a new benchmark, see docs/source/adding_benchmarks.mdx and envhub.mdx.
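
To see why about 50 episodes is a sensible floor, it helps to put an uncertainty range around the reported success rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is an illustration added here, not something lerobot-eval reports, and the function name is an assumption.

```python
import math


def wilson_interval(successes: int, n_episodes: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval (z = 1.96) for a success rate."""
    if n_episodes <= 0 or not 0 <= successes <= n_episodes:
        raise ValueError("need 0 <= successes <= n_episodes and n_episodes > 0")
    p = successes / n_episodes
    denom = 1 + z**2 / n_episodes
    center = (p + z**2 / (2 * n_episodes)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / n_episodes + z**2 / (4 * n_episodes**2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed 70% success rate, measured over 10 vs. 50 episodes.
    for successes, n in ((7, 10), (35, 50)):
        low, high = wilson_interval(successes, n)
        print(f"{successes}/{n}: {low:.2f} to {high:.2f}")
```

With 10 episodes a 70% result is compatible with anything from roughly 40% to 89%; with 50 episodes the range narrows to roughly 56% to 81%, which is tight enough to compare checkpoints.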

8.2b Dockerfiles for benchmark eval

Benchmark envs have native dependencies that are painful to install locally. The repo ships pre-baked Dockerfiles for each supported benchmark — use these to run lerobot-eval in a reproducible environment:

Benchmark	Dockerfile
LIBERO	`docker/Dockerfile.benchmark.libero`
LIBERO+	`docker/Dockerfile.benchmark.libero_plus`
MetaWorld	`docker/Dockerfile.benchmark.metaworld`
RoboCasa	`docker/Dockerfile.benchmark.robocasa`
RoboCerebra	`docker/Dockerfile.benchmark.robocerebra`
RoboMME	`docker/Dockerfile.benchmark.robomme`
RoboTwin	`docker/Dockerfile.benchmark.robotwin`
VLABench	`docker/Dockerfile.benchmark.vlabench`

Build and run (adapt to your benchmark):

docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-bench-robomme .
docker run --gpus all --rm -it \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  lerobot-bench-robomme \
  lerobot-eval --policy.path=<your_policy> --env.type=<env> --eval.n_episodes=50

See docker/README.md for base-image details.

8.3 Target success rates

Single-task grasp-and-place with 50 clean episodes: ACT should reach > 70% success on the training configuration. Less → data problem (see §5), not model problem. Expect a drop when generalizing to new positions — scale episodes or diversity to recover.

9. Further reading & resources

Getting started: installation.mdx · il_robots.mdx · What makes a good dataset
Per-policy docs: browse docs/source/*.mdx (policies, hardware, benchmarks, advanced training).
Community: Discord · Hub LeRobot tag · Dataset visualizer

Keep this file current. If you learn a rule that would prevent a class of user mistakes, add it here and in AGENTS.md.
