- Inject discretized proprioceptive state (256 bins, pi05 format) into
low-level action-conditioning prompts in both training
(PI052TextTokenizerStep) and eval (_with_low_level_subtask_prompt),
matching the recipe's documented "[images, subtask, state]" intent.
Higher-level subtask/memory text streams stay state-free.
- Cache the loc-token tokenizer (_get_loc_tokenizer) instead of reloading
it from disk on every _build_text_batch/select_message call (it ran
twice per env per replan and dominated eval runtime).
- Add a KV cache to select_message decode (bit-identical output to the
recompute path) to avoid O(n^2) generation.
Net: pi052 eval ~2.9 s/it -> ~0.1 s/it (~25x).
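For reference, a minimal sketch of the state discretization named in the first bullet. It assumes the state is already normalized to [-1, 1] and that the 256 bins are uniform. The names `discretize_state` and `state_to_prompt_suffix`, and the space-joined rendering, are illustrative and are not the actual PI052TextTokenizerStep code.

```python
from bisect import bisect_right

N_BINS = 256  # bin count taken from the commit message

# Left edges of 256 uniform bins over [-1, 1); assumes a pre-normalized state.
_EDGES = [-1.0 + 2.0 * i / N_BINS for i in range(N_BINS)]


def discretize_state(state):
    """Map each normalized state value to an integer bin index in [0, 255]."""
    bins = []
    for x in state:
        x = min(max(x, -1.0), 1.0)  # clamp out-of-range values into [-1, 1]
        bins.append(bisect_right(_EDGES, x) - 1)
    return bins


def state_to_prompt_suffix(state):
    """Hypothetical rendering: bin indices joined by single spaces."""
    return " ".join(str(b) for b in discretize_state(state))
```

Training and eval would need to share one such function so that the prompt the policy sees at eval time is built the same way as during training.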
Co-authored-by: Cursor <cursoragent@cursor.com>
EMA was on by default, so every training run on the branch (incl. VLA-JEPA
and other non-flow-matching policies) created a full fp32 shadow copy. EMA
only benefits flow-matching/diffusion policies (pi0/pi05/pi052). Make it
opt-in via --ema.enable=true; the pi05/pi052 recipes already pass that flag.
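A minimal sketch of the opt-in behavior, assuming a config object with an `enable` flag. `EMAConfig`, `EMATracker`, and the decay value are hypothetical names and numbers chosen for illustration; only the `--ema.enable` flag comes from this change.

```python
from dataclasses import dataclass


@dataclass
class EMAConfig:
    enable: bool = False  # opt-in, mirroring --ema.enable=true
    decay: float = 0.999  # illustrative value, not taken from the branch


class EMATracker:
    """Keeps a shadow copy of the parameters only when EMA is enabled."""

    def __init__(self, params, cfg):
        self.cfg = cfg
        # The shadow copy is the memory cost; skip it entirely when disabled.
        self.shadow = dict(params) if cfg.enable else None

    def update(self, params):
        if self.shadow is None:
            return  # EMA disabled: nothing to track
        d = self.cfg.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
```

With the default config no shadow copy is allocated, which is the point of making the feature opt-in for policies that do not benefit from it.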
Co-authored-by: Cursor <cursoragent@cursor.com>
LeRobot's RoboCasaEnv used a divergent flat state/action layout vs the
robocasa package (robocasa.utils.env_utils.convert_action) and the openpi
robocasa pipeline. This scrambles I/O when using openpi-convention checkpoints
(e.g. the JAX->PyTorch->LeRobot converted pi05 robocasa model: CloseFridge
20% -> 60% once both orders match openpi).
- convert_action: ee_pos(3)+ee_rot(3)+gripper(1)+base_motion(4)+control_mode(1)
- observation.state: ee_pos_rel(3)+ee_rot_rel(4)+base_pos(3)+base_rot(4)+gripper(2)
Matches openpi examples/robocasa/main.py + RobocasaInputs ordering.
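The two layouts above can be written down as (name, width) tables. The sketch below only illustrates the ordering and widths listed in this message; `split_by_layout` is a hypothetical helper, not part of RoboCasaEnv or robocasa.

```python
# Field order and widths as listed in the commit message.
ACTION_LAYOUT = [("ee_pos", 3), ("ee_rot", 3), ("gripper", 1),
                 ("base_motion", 4), ("control_mode", 1)]
STATE_LAYOUT = [("ee_pos_rel", 3), ("ee_rot_rel", 4), ("base_pos", 3),
                ("base_rot", 4), ("gripper", 2)]


def split_by_layout(vector, layout):
    """Slice a flat vector into named fields following the given layout."""
    expected = sum(width for _, width in layout)
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    fields, offset = {}, 0
    for name, width in layout:
        fields[name] = list(vector[offset:offset + width])
        offset += width
    return fields
```

The action vector is 12 values wide and the state vector 16. A checkpoint trained on one ordering and evaluated on another would read the right number of values but assign them to the wrong fields, which is the scrambling described above.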
Co-authored-by: Cursor <cursoragent@cursor.com>
Eliminate the standalone pi052/pi05_backbone.py by distributing its contents:
- Generic dual-expert transformer machinery -> lerobot/policies/pi_gemma.py
(sdpa_attention_forward, compute_layer_complete, PaliGemmaWithExpertModel,
get_gemma_config; the openpi width/depth config is renamed GemmaConfig ->
GemmaVariantConfig to avoid clashing with transformers' GemmaConfig). These
sit next to the existing PiGemma layer code they already depend on.
- pi052-specific model + helpers -> pi052/modeling_pi052.py (PI05Pytorch,
ActionSelectKwargs, make_att_2d_masks, pad_vector, resize_with_pad_torch,
create_sinusoidal_pos_embedding, sample_beta, get_safe_dtype).
DEFAULT_IMAGE_SIZE is duplicated as a plain constant in pi_gemma to avoid a
pi_gemma -> pi05 import cycle. Additive to pi_gemma; pi0/pi05 unaffected.
Verified bit-exact on pepijn223/pi052_robocasa_full (embed/predict/forward
identical) and all 34 pi052 tests pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
_fast_ce/_shifted_ce were renamed to _fast_lin_ce/_shifted_lin_ce and changed
from logits-based to Liger fused-linear-CE (hidden @ lm_head_weightᵀ). Update
the tests via thin adapters that pass an identity lm_head_weight (so the
computed logits equal the provided ones), run on CUDA (Liger is GPU-only) and
skip otherwise, and loosen the allclose tolerance to absorb GPU-vs-CPU float
noise on the tiny losses.
Co-authored-by: Cursor <cursoragent@cursor.com>
The smolvla branch had modified the shared pi0/pi05 modeling + pi05 config to
support pi052 (SDPA attention, layernorm/lm_head handling, optimizer
foreach/fused/lm_head_lr_scale, embedding scaling). Decouple pi052 instead:
- Vendor the PI0.5 backbone (PaliGemmaWithExpertModel, PI05Pytorch, helpers)
into pi052/pi05_backbone.py (verbatim copy, no PI05Policy).
- Flatten PI052Policy to subclass PreTrainedPolicy directly (no longer
PI05Policy); inline the needed PI05Policy methods.
- Restore optimizer_foreach/fused + get_optimizer_preset on PI052Config.
- Revert pi0, pi0_fast, pi05 modeling and configuration_pi05 to origin/main
(byte-identical), so the shared policies carry no smolvla modifications.
Behavior verified bit-exact on pepijn223/pi052_robocasa_full:
embed_language_tokens, predict_action_chunk, and the fused flow+text+FAST
training loss are identical before/after (max_abs_diff=0). pi052 tests pass
(pre-existing stale-name collection errors unchanged).
Co-authored-by: Cursor <cursoragent@cursor.com>
embed_language_tokens already applies Gemma's sqrt(hidden) normalizer
(GemmaTextScaledWordEmbedding, transformers >=5.4.0). pi052 multiplied FAST
action-token and autoregressive subtask-text embeddings by sqrt(emb_dim) on
top of that, double-scaling them (~2048x). Remove the manual scaling so FAST
and text tokens are single-scaled, consistent with the pi05 fix and OpenPI.
Co-authored-by: Cursor <cursoragent@cursor.com>
transformers >=5.4.0 (PR #44432) makes Gemma's embed_tokens a
GemmaTextScaledWordEmbedding that already multiplies token embeddings by
sqrt(hidden_size). The manual `* sqrt(embed_dim)` applied on top therefore
double-scaled text (~2048x instead of ~45x), breaking VLM alignment for
models trained/run on stock transformers. Remove the manual scaling and rely
on embed_tokens' internal normalizer (matches main #3603). Image features
stay raw (un-normalized), as before.
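The scale factors quoted above follow from simple arithmetic. The sketch assumes hidden_size = 2048, which is the value implied by the ~45x and ~2048x figures rather than one read from the model config.

```python
import math

hidden_size = 2048  # assumed; implied by the ~45x / ~2048x figures above

single_scale = math.sqrt(hidden_size)       # applied once inside embed_tokens
double_scale = single_scale * single_scale  # manual scaling stacked on top

print(f"single: {single_scale:.2f}x, double: {double_scale:.0f}x")
```

Applying the sqrt(hidden_size) normalizer twice multiplies embeddings by hidden_size itself, so text tokens end up about 45x larger than intended.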
Co-authored-by: Cursor <cursoragent@cursor.com>
OpenPI (pi0 and pi0-FAST) multiplies language token embeddings by
sqrt(embed_dim) — the Gemma embedder normalizer — before the transformer.
LeRobot pi0/pi0_fast omitted it, leaving text tokens ~45x under-scaled
relative to the residual stream (same class of bug as the pi05 image
scaling). pi0: applied in embed_prefix's lang_embed_func. pi0_fast:
applied inside embed_language_tokens so prompt, FAST action tokens, and
autoregressive next-token embeds are all scaled consistently.
Co-authored-by: Cursor <cursoragent@cursor.com>
lerobot/pi05_base was trained in the OpenPI/big_vision regime where image
(soft) tokens are NOT multiplied by the Gemma embedder normalizer
(sqrt(hidden_size)) — only text tokens are. Scaling image features here
over-scaled them ~45x, breaking the pretrained vision-language alignment
and yielding ~0% closed-loop success on RoboCasa across all pi05 runs.
Co-authored-by: Cursor <cursoragent@cursor.com>
Bring the authoritative annotation pipeline from the annotation branch.
The annotation surface is forced to EXACTLY match
feat/language-annotation-pipeline (the annotation branch is the source of
truth for annotation code), which also removes smolvla's stale copies:
- deleted: steerable_pipeline/vocabulary.py,
tests/annotations/test_vocabulary.py, prompts/module_0_vocabulary.txt,
module_1_action_record.txt, module_3_vqa.txt, module_1_plan.txt, and the old
module_* prompt names (now plan_*/interjections_*/vqa.txt).
- synced: all of src/lerobot/annotations/, lerobot_annotate.py,
examples/annotations/, tests/annotations/, datasets/language.py,
tests/datasets/test_language.py, docs/annotation_pipeline.mdx.
Non-annotation conflicts resolved by union (keeping both branches' intent):
- pyproject.toml: keep smolvla's pi extra (+sentencepiece) and add the
molmoact2 extra from main.
- policies/factory.py: keep both dataset_repo_id (pi052 FAST tokenizer)
and dataset_meta (both are referenced); union the policy-type docstring.
- scripts/lerobot_train.py: keep smolvla's pi052 / use_relative_actions
processor-rebuild block.
- uv.lock: regenerated from the merged pyproject.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first-token parity check re-tokenized the decoded (stripped) inference
string, so the leading-space SentencePiece variant always mismatched the
training argmax — a false "DIVERGED" alarm. Remove the autoregressive
inference print and parity comparison (and the now-dead per-sample
select_message generation), keeping only the prompt, ground-truth target,
and teacher-forced argmax accuracy.
Co-authored-by: Cursor <cursoragent@cursor.com>
Collapse the remaining multi-line field comments / docstrings in config.py
to single lines (or two where a knob genuinely needs it), keeping the
essential rationale. Comments only — no field or behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the optional structured per-subtask action records — not a feature
we want to ship.
* language.py: remove 'action_record' from CORE_STYLES + PERSISTENT_STYLES
(and the matching assertion in tests/datasets/test_language.py).
* config.py: delete ActionRecordsConfig (verb/grasp vocabularies,
frames_per_subtask, emit_record_row) and the PlanConfig.action_records
field.
* plan_subtasks_memory.py: delete _extract_action_record and the
run_episode block that emitted style='action_record' rows; drop the
now-unused json / to_image_blocks imports.
* remove the plan_action_record.txt prompt.
* run_hf_job.py: drop the action_records comment.
Verified: 40 tests pass; pre-commit (ruff, mypy, bandit) clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tighten the remaining multi-line comment blocks in config.py (derive_task,
frames/window, describe_first, action-record/vqa/vlm fields, video_backend,
repo ids, executor) to 1-3 lines each. Also fix a stale path reference: the
docstring pointed at 'examples/annotation' and now just says HF Jobs.
Comments only; no field or behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two behavior-preserving simplifications:
* plan_subtasks_memory.run_episode: the task_aug 'axes' and free-form
branches built identical deduped rows via copy-pasted seen/append
loops. Collapse to one branch that picks the variant source, then a
shared _task_aug_rows() helper does the dedup + row build (about 25 LOC removed).
* writer: _normalize_persistent_row / _normalize_event_row shared the
same camera-validate + struct construction. Extract _normalize_row(),
keeping the exact key order (the parquet struct schema is inferred
from insertion order, so timestamp must stay between style and camera).
docs: 'Which modules run' is now a table giving each module's on/off flag
(--plan.enabled / --interjections.enabled / --vqa.enabled) and what it
turns off.
Verified: 40 tests pass (incl. test_writer struct round-trip); pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pyproject annotations-extra comment still described the removed
vllm/transformers in-process backends ('vllm preferred ... transformers
fallback', '_make_vllm_client'); rewrite it for the openai-only reality
and trim it. Also condense the conftest lazy-import NOTE. Comments only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dead code (defined but never referenced anywhere in src/tests/examples):
* reader.py: keyframe_indices, episode_frame_timestamps, lookup_data_path,
and the now-orphaned gather_data_paths + episode_offsets_per_path
(lookup_data_path was their only caller).
* staging.py: iter_staged_episodes.
* writer.py: normalize_rows_for_writer.
* config.py VlmConfig: json_mode, batch_size, tensor_parallel_size,
gpu_memory_utilization, trust_remote_code — consumed only by the
in-process vllm/transformers backends that were removed; the openai
auto-serve path carries those vLLM flags via serve_command instead.
Kept max_model_len (still used as the serve-command default).
* config.py TaskAugAxesConfig.total property.
Docs: new 'Key options' section in annotation_pipeline.mdx — grouped
tables (dataset in/out, module toggles, --vlm.*, --plan.*, interjections
+ vqa) describing the flags users actually reach for, with defaults.
config.py: compact the verbose field comments + ActionRecordsConfig /
TaskAugAxesConfig docstrings; fix two stale 'verify' references (the
verify pass was removed — it's describe -> segment now) and the stale
'renders record back to subtask text' note (that path was removed).
vlm_client docstring no longer mentions the removed json_mode field.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (40 passed); pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim the long inline comment blocks (effective_task / task_aug, action
records, plan-boundary rows, plan-update span closing, windowed +
coverage-stitch sections) and the _generate_plan / run_plan_updates
docstrings to a few lines each. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same flags and rationale, condensed — each plan-module flag now has a
short one/two-line comment instead of a paragraph.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-down flow (read episodes → 3 modules fan out → validator → writer →
parquet) with aligned boxes, instead of the cramped bordered version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite annotation_pipeline.mdx in plainer, easier-to-read language
(shorter sentences, active voice, a plain-text intro), add an ASCII
'How it fits together' architecture diagram, and remove the
'Reproducibility via seed and prompt hashes' section. Content/links are
preserved; only wording and structure change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The example already pins '@main'; update the doc step and the script
docstring from 'the branch under test' to 'lerobot (from main)' now that
the pipeline is merging to main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bugs
* validator: don't re-raise on unknown style. The second column_for_style
lookup (used to route persistent vs event) now sits in try/except so an
unknown style is recorded by _check_column_routing and skipped instead
of crashing the whole validation pass.
* general_vqa._target_cameras: when restrict_to_default_camera is set but
the configured camera_key isn't one the provider exposes, warn and fall
back to all cameras instead of returning a phantom key that KeyErrors
deep in frame decode.
* interjections: clamp interjection timestamps to frame_timestamps[0]
rather than a hardcoded 0.0 (datasets can start at non-zero t).
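A minimal sketch of the last two fixes, under the assumption that the provider exposes a plain list of camera keys. `target_cameras` and `clamp_interjection_time` are simplified, hypothetical stand-ins for general_vqa._target_cameras and the interjection clamp; the real signatures may differ.

```python
import logging

logger = logging.getLogger(__name__)


def target_cameras(available, camera_key, restrict_to_default_camera):
    """Pick the cameras to ground on, never returning a key that is absent."""
    if not restrict_to_default_camera:
        return list(available)
    if camera_key in available:
        return [camera_key]
    # Misconfigured key: warn and fall back instead of failing in frame decode.
    logger.warning("camera_key %r not exposed; using all cameras", camera_key)
    return list(available)


def clamp_interjection_time(t, frame_timestamps):
    """Clamp to the first real frame timestamp rather than a hardcoded 0.0."""
    return max(t, frame_timestamps[0])
```

Both helpers fail soft: a bad camera key degrades to the all-camera behavior, and an early timestamp lands on the first frame the dataset actually contains.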
Docs / code drift
* annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase
0 / --vocabulary.* / canonical_vocabulary.json' section (none of it
exists); describe the real describe->segment + coverage-stitch flow.
Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of
this PR' (matches tools.mdx, which already marks the runtime layer as
not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note
the default is now a single h200. Add a 'Contributing new modules'
section inviting module / prompt / quality contributions.
* executor docstring: six phases, no phantom phase 0.
run_hf_job.py
* add the Apache 2.0 license header (was flagged repeatedly).
* default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1
(scale to h200x4 noted in the docstring).
* pin the install to @main instead of the feature branch (won't break
after merge).
Naming / cleanup
* rename dest_repo_id -> new_repo_id across config / script / example /
test to match the LeRobot dataset edit tools.
* rename prompt templates module_N_*.txt -> descriptive (plan_*,
interjections_*, vqa.txt) and update every load_prompt() call.
* remove dead _messages_to_prompt (used only by the removed in-process
backends).
* declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as
real init=False dataclass fields instead of getattr monkey-patches.
* scope bandit B607 to the two ffmpeg subprocess.run sites via
'# nosec B607' and drop it from the global skip list.
Tests
* fix stale canned-VLM markers ('ONE realistic interruption' ->
'compact interjection', 'Update the memory' -> 'compressed semantic
memory') and drop the dead 'concise hierarchical PLAN' plan responders
(plan generation is deterministic now) in run_e2e_smoke,
test_pipeline_recipe_render, test_modules.
* run_e2e_smoke now asserts interjection + speech rows are produced so a
stale marker can't silently pass again.
* drop remaining 'PR 1' / 'PR 2' references from test comments / names.
Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke
(interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit,
prettier) clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
action_record is in PERSISTENT_STYLES but was missing from CORE_STYLES,
so STYLE_REGISTRY (= CORE_STYLES | EXTENDED_STYLES) didn't contain it and
the PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY invariant in
test_style_registry_routes_columns failed. Add it to CORE_STYLES so the
registry, the persistent-set, and column_for_style() stay consistent.
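The invariant can be illustrated with plain sets. The set names match those above, but the members other than 'action_record' are placeholders, and `registry_is_consistent` is a hypothetical helper rather than the test's actual assertion.

```python
# Placeholder memberships; only 'action_record' is taken from this commit.
CORE_STYLES = {"subtask", "plan", "memory", "action_record"}
EXTENDED_STYLES = {"vqa"}
PERSISTENT_STYLES = {"subtask", "plan", "memory", "action_record"}
EVENT_ONLY_STYLES = {"vqa"}

STYLE_REGISTRY = CORE_STYLES | EXTENDED_STYLES


def registry_is_consistent(core, extended, persistent, event_only):
    """Every routable style must also be a registered style."""
    return (persistent | event_only) <= (core | extended)
```

Dropping 'action_record' from the core set while leaving it in the persistent set makes the subset check fail, which is the failure this commit fixes.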
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shipped workflow is Hugging Face Jobs
(examples/annotations/run_hf_job.py): it serves the model with vLLM in the
vllm/vllm-openai image and the pipeline talks to it over the
OpenAI-compatible API. The in-process vllm / transformers local backends
added maintenance surface (and the vllm one pinned an old torch) without
being part of that path, so they're removed for now.
* vlm_client.make_vlm_client: keep only backend='openai' (+ 'stub'
rejected with the usual guidance). Requesting 'vllm'/'transformers'
now raises a clear 'not supported for now — use the HF Jobs flow'
error. Removed _make_vllm_client and _make_transformers_client.
* config: backend docstring updated (openai-only); default model_id
bumped to Qwen/Qwen3.6-27B to match run_hf_job.
* docs/annotation_pipeline.mdx: remove the '## Running locally'
section; the launcher description now says one vLLM server per GPU
over the OpenAI API, and the 'One Qwen-VL pass' note drops the
'vLLM/transformers fallback' wording.
Tests are unaffected (they construct StubVlmClient directly; nothing
referenced the removed backends).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The annotation tests had never actually run in CI (collection failed on
the missing 'datasets' extra); now that they do, three stale assertions
surfaced against the evolved pipeline:
* test_module1_plan_memory_subtask_smoke: the memory canned-responder
marker 'Update the memory' no longer appears in module_1_memory.txt
(now 'compressed semantic memory'), so the stub returned no memory
row and the {subtask,plan,memory} subset check failed. Marker
updated to match the current prompt.
* test_module2_mid_episode_emits_paired_interjection_and_speech: the
interjection marker 'Write ONE interjection' is now 'Write ONE
compact interjection' in module_2_interjection.txt, so 0 interjections
were emitted. Marker updated.
* tests/datasets/test_language.py::test_style_registry_routes_columns:
PERSISTENT_STYLES gained 'action_record' in this PR; add it to the
expected set.
These are test/prompt-marker syncs — no production behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fast Pytest 'dataset' tier failed collecting
tests/datasets/test_video_decoder_cache.py with 'Could not load
libtorchcodec ... undefined symbol: torch_dtype_float4_e2m1fn_x2', a
torch/torchcodec ABI mismatch.
Root cause: the annotations extra's vllm hard-pins an older torch
(via xformers/xgrammar -> torch 2.8). uv resolves a SINGLE unified lock
across all extras, so vllm capped torch to 2.8 for every tier —
including dataset, whose torchcodec 0.11.1 needs torch 2.11. The
result was torch 2.8 + torchcodec 0.11.1 installed together -> ABI break.
(main has no vllm, so it resolves torch 2.11 + torchcodec 0.11.1 cleanly.)
Fix: remove vllm from the annotations extra. It is not needed by
the shipped workflow — examples/annotations/run_hf_job.py gets vllm from
the vllm/vllm-openai image and talks to it over the OpenAI-compatible
API (--vlm.backend=openai), and vlm_client._make_vllm_client imports vllm
lazily. For the in-process --vlm.backend=vllm path, install vllm
separately (the ImportError now says so).
After the fix uv resolves torch 2.11.0 + torchcodec 0.11.1 (matching
main); uv lock --check is clean. The annotations extra still provides
datasets / transformers / openai.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fast Pytest Tests failed at COLLECTION in the base '--extra test' tier
with 'ModuleNotFoundError: No module named datasets':
tests/annotations/conftest.py imported the fixture dataset builder (->
lerobot.datasets -> the HF 'datasets' lib + pandas/pyarrow), which only ship
under the 'dataset' extra, so the whole annotations package crashed.
Fix uses the repo's proven module-level guard pattern (see
tests/datasets/test_language.py), NOT a conftest-level importorskip —
verified empirically that pytest.importorskip raised during conftest
*import* is treated as a collection ERROR (exit 1), while module-level
importorskip is a clean SKIP.
* conftest.py: import build_annotation_dataset LAZILY inside the
fixtures so the conftest itself imports cleanly in every tier.
* test_modules / test_validator / test_writer /
test_pipeline_recipe_render: add module-level
pytest.importorskip('datasets') and pytest.importorskip('pandas') before
the pyarrow / lerobot.* imports (# noqa: E402 to match the existing
convention).
* tests/scripts/test_lerobot_annotate.py: same guard (its _push_to_hub
path imports lerobot.datasets).
Result:
- base / hardware / viz tiers (no dataset extra): annotation tests
skip cleanly; the rest of the suite runs -> exit 0.
- dataset tier: datasets present -> guards pass through -> annotation
tests run with the stub VLM. The pipeline modules import only
stdlib + relative + lerobot.datasets (no module-level datatrove /
vllm / openai), so they import fine there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``fit_fast_tokenizer`` previously called ``LeRobotDataset(repo_id,
episodes=[N])`` per sampled episode, which on v3-format datasets
routes through HF datasets' split lookup and raises ``ValueError:
Instruction "train" corresponds to no data!`` on every episode. On
``pepijn223/robocasa_pretrain_human300_v4`` (32 k episodes) this looped
through 13,293 skipped episodes for ~2.5 h before the NCCL watchdog
killed the run via the 2 h ALLREDUCE timeout (job 22182985).
Switch to reading the ``action`` column directly from the dataset's
``data/chunk-*/file-*.parquet`` shards (same pattern as the audit
scripts). Verified end-to-end on the 32 k-episode dataset: 1000 chunks
collected from 1000 episodes in 70.7 s.
Co-authored-by: Cursor <cursoragent@cursor.com>
Quality-gate fix: ruff-format/markdown prettier hook reflow of the
annotation pipeline doc. No content change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quality-gate fixes after the main merge:
* UP037: drop redundant quotes from PlanConfig forward-ref annotations
(action_records / task_aug_axes) — safe under 'from __future__ import
annotations'.
* ruff format applied to config.py, executor.py, general_vqa.py,
plan_subtasks_memory.py, validator.py, lerobot_annotate.py.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long episodes no longer get sparse subtasks. Previously a long episode
was subsampled to max_video_frames=32 across its whole duration (~1
frame/4s for a 2-min clip). New opt-in windowing keeps a CONSTANT
frames_per_second density by splitting the episode into fixed-length
windows and running the subtask chain per window.
New PlanConfig.subtask_window_seconds (default 0.0 = off). When > 0 and
the episode is longer than one window:
* episode is split into consecutive [w0, w1] windows of this length
* each window's frames are sampled at frames_per_second (so a 32s
window at 1 fps = 32 frames, filling but not exceeding the per-call
context budget)
* the full describe -> segment -> verify chain runs PER window, in
window-relative time [0, L]; spans are offset back to absolute
* all windows' spans are merged, frame-snap-deduped, and stitched into
one contiguous whole-episode cover
Implementation:
* _episode_video_block / _video_message / _describe_episode /
_verify_subtasks gain an optional window=(w0,w1); when set they
embed frames sampled in that absolute range at frames_per_second
(video_url path skipped — it's whole-episode).
* _clean_spans gains bounds= (override clamp range, for window-relative
spans) and dedupe= (skip frame-snap until the merged absolute set).
* new _generate_subtasks_windowed + _subtasks_for_window orchestrate
the loop; _generate_subtasks branches to them when window_s > 0.
run_hf_job.py: --plan.subtask_window_seconds=32 (32s windows at 1 fps).
Cost scales with episode length (chain calls × ceil(duration/window)).
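A minimal sketch of the window split and the offset back to absolute time, assuming spans are (start, end, text) tuples. `split_into_windows` and `to_absolute` are hypothetical helpers, not the _generate_subtasks_windowed implementation.

```python
import math


def split_into_windows(duration_s, window_s):
    """Consecutive (w0, w1) windows covering [0, duration_s]."""
    if window_s <= 0 or duration_s <= window_s:
        return [(0.0, float(duration_s))]  # windowing off, or a single window
    n = math.ceil(duration_s / window_s)
    return [
        (float(i * window_s), float(min((i + 1) * window_s, duration_s)))
        for i in range(n)
    ]


def to_absolute(spans, w0):
    """Offset window-relative (start, end, text) spans back to episode time."""
    return [(start + w0, end + w0, text) for start, end, text in spans]
```

For a 120 s episode with 32 s windows this gives four windows, the last one shorter than the rest, so the chain would run four times for that episode.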
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Swap the annotation VLM from Qwen3.6-35B-A3B (sparse MoE, ~3B active)
to Qwen3.6-27B (dense, 27B all-active). Per Scale's dense-captioning
study, model capacity is the #1 lever and the dominant failure is
visual grounding — both helped by ~9x more active params. Qwen3.6-27B
is a vision-language model (vision encoder, image + video), same family
so the chat template / video handling / enable_thinking=false flag are
unchanged, and at 27B dense it still fits one H200 per server, so the
two-parallel-server layout (TP=1, one per GPU) is preserved — no
throughput-layout change, just a much stronger model.
Kept: parallel_servers=2, num_gpus=2, max-model-len 32768 (the 32-frame
embedded budget is ~10k tokens, well under), gpu-mem 0.8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adopt the one prompt technique Scale's dense-captioning study found
reliably positive: targeted, verb-scoped, visually-grounded
disambiguation rules. Their lesson was that such a rule must fire ONLY
on the spatial situation it names (their narrow 'Stack vs Put' rule
helped; an over-broad directional 'Scoop' rule bled into other verbs
and hurt), so each rule here is phrased visually and scoped to one
confusable pair:
* stack-vs-put (on top of an object vs on a surface)
* insert-vs-put (fitted slot vs surface)
* pick-up/retrieve-vs-put (decide by which way the OBJECT moves:
gripper closes + object moves with hand = pick up; gripper opens +
object stays = put — directly targets Scale's dominant
direction-flip failure)
* pour-vs-put (tilt + flow vs untilted move)
This is the highest-confidence, lowest-risk change from the Scale
findings; our pipeline already aligns with their 'avoid' list (no
temporal tokens, no overlays, no fancy sampling, no sequential context
injection, uniform sampling, describe-don't-predict framing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switching the plan module to embedded frames (use_video_url=false)
exposed a context overflow: at frames_per_second=2.0 with the old
max_video_frames=128 default, a 480x640 episode embeds ~128 frames ≈
33-39k vision tokens, over the model's 32768 context — every plan call
died with 'Input length exceeds maximum context length' (HTTP 400),
crashing the whole annotation job.
The video_url path never hit this because the server downsampled; the
embedded path sends every sampled frame, so the frame count is a hard
token budget.
Fix:
* config default max_video_frames 128 -> 32 (~8-10k vision tokens,
comfortable headroom for the prompt + describe/verify passes).
Frames are still sampled UNIFORMLY across the whole episode, so
longer episodes are subsampled, not truncated — full temporal
coverage preserved, just coarser density.
* run_hf_job.py: frames_per_second 2.0 -> 1.0, explicit
--plan.max_video_frames=32, with a comment explaining the token
budget and the 'do not raise toward 128 with embedded frames' rule.
Only the plan module embeds the full episode; VQA (1 frame/tick) and
interjections (4-frame window) were never at risk.
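A rough sketch of the budget reasoning, assuming uniform sampling at slice midpoints and about 280 vision tokens per frame. That per-frame figure is an assumption back-derived from the 33-39k estimate for 128 frames above, and both function names are illustrative.

```python
def sample_frame_times(duration_s, frames_per_second, max_video_frames):
    """Uniformly spaced timestamps over the whole episode, capped in count."""
    wanted = max(1, int(duration_s * frames_per_second))
    n = min(wanted, max_video_frames)
    step = duration_s / n
    # Midpoints of n equal slices: coverage stays end to end, density drops.
    return [(i + 0.5) * step for i in range(n)]


def estimated_vision_tokens(n_frames, tokens_per_frame=280):
    """Rough budget; tokens_per_frame is an assumed average, not measured."""
    return n_frames * tokens_per_frame
```

Under that assumption 128 frames exceed a 32768-token context before any prompt text is added, while 32 frames stay under 10k tokens.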
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the subtask_full_coverage config flag. Stitching subtask spans
into a contiguous full-episode cover is now always applied in
_generate_subtasks — a sparse / gap-ridden subtask timeline is never
desirable for conditioning, so there's no reason to make it optional.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The verify pass prunes subtasks, which could leave the first subtask
starting after t0 or leave gaps between spans — so the subtask timeline
no longer tiled the episode and frames fell through with no active
subtask label.
New deterministic post-step (no VLM call), default on via
PlanConfig.subtask_full_coverage:
* first subtask start pulled back to the episode's first frame t0
(idle / approach before the first labelled action folds into it)
* each subtask end snapped to the next subtask start (gaps closed)
* last subtask end extended to the last frame t_last
Runs after segment + verify in _generate_subtasks. Starts other than the
first are left as the VLM/verify produced them (already frame-snapped +
distinct), so the cover is contiguous and non-overlapping.
Disable with --plan.subtask_full_coverage=false if a consumer wants
sparse subtasks.
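The three rules above can be sketched as follows, assuming spans are dicts with 'start', 'end', and 'text' keys, already sorted by start with distinct starts. `stitch_full_coverage` is a hypothetical name for the post-step.

```python
def stitch_full_coverage(spans, t0, t_last):
    """Return a contiguous, non-overlapping cover of [t0, t_last]."""
    if not spans:
        return []
    out = [dict(s) for s in spans]  # copy so the input is left untouched
    out[0]["start"] = t0  # fold idle / approach time into the first subtask
    for cur, nxt in zip(out, out[1:]):
        cur["end"] = nxt["start"]  # close the gap up to the next start
    out[-1]["end"] = t_last  # extend the final subtask to the last frame
    return out
```

Because only ends and the very first start move, the relative order and the start boundaries chosen by the VLM are preserved.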
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip PlanConfig.subtask_describe_first and subtask_verify defaults
False -> True. Every subtask annotation now runs the 3-call grounding
+ pruning chain by default, since the single-call path reliably
hallucinates steps from the task text. Costs 2 extra VLM calls/episode;
disable with --plan.subtask_describe_first=false /
--plan.subtask_verify=false on easy datasets where fewer calls matter more
than label fidelity.
run_hf_job.py: drop the now-redundant explicit flags, leave a note that
the chain is default-on and how to opt out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-call 'watch video -> emit subtask JSON' pattern makes the
VLM commit to structured output before reasoning about what it saw, so
it pattern-matches the task text and hallucinates steps. Split it into
an opt-in multi-call chain that grounds first and prunes last.
New PlanConfig flags (both default False -> single-call unchanged):
* subtask_describe_first: a grounding pass narrates ONLY what is
visible in the video (no subtask JSON yet). That description is
injected into the segmentation prompt via a new {observation_block}
placeholder, so the model segments its own grounded observations
instead of the instruction text. +1 VLM call/episode.
* subtask_verify: after segmentation, an adversarial pass re-watches
the video and drops any candidate subtask it cannot see. Can only
PRUNE (never add/rewrite/move) and fails open (keeps un-verified
spans if the call returns nothing). +1 VLM call/episode.
Implementation:
* _generate_subtasks now orchestrates describe -> segment -> verify.
* Factored span cleaning into _clean_spans (shared by segment + verify
outputs); added _describe_episode and _verify_subtasks helpers.
* New prompts module_1_subtask_describe.txt (returns {description})
and module_1_subtask_verify.txt (returns pruned {subtasks}).
* module_1_subtasks.txt gains a {observation_block} slot at the top.
run_hf_job.py enables both for the RoboCasa run (3 VLM calls/episode for
subtasks). Combined with single-camera grounding + the embedded-frame path,
this is the high-quality configuration.
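A minimal sketch of the chain's control flow, assuming a single `call_vlm(stage, **kwargs)` callable as a stand-in for the VLM client. That callable, the stage names, and `generate_subtasks` are hypothetical; span cleaning and prompt loading are left out.

```python
def generate_subtasks(call_vlm, describe_first=False, verify=False):
    """Run describe -> segment -> verify; both extra passes are opt-in."""
    # Grounding pass: narrate what is visible before any structured output.
    observation = call_vlm("describe") if describe_first else ""
    spans = call_vlm("segment", observation_block=observation)
    if verify:
        kept = call_vlm("verify", candidates=spans)
        if kept:  # fail open: an empty reply keeps the unverified spans
            # Prune only: anything the verifier added or rewrote is ignored.
            spans = [s for s in spans if s in kept]
    return spans
```

Filtering the original candidates against the verifier's reply, rather than adopting the reply directly, is one way to enforce the prune-only rule.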
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for 'subtasks describe actions not in the video' plus a way
to focus the whole pipeline on one camera.
ANTI-HALLUCINATION
1. _episode_video_block: when use_video_url is set but clip extraction
fails, FALL BACK to embedded frames instead of returning an empty
block. An empty block left the VLM with zero visual grounding, so
it invented subtasks from the task text alone — the likely root
cause of hallucinated steps. Now logs a warning and embeds frames.
2. module_1_subtasks.txt gains a GROUNDING preamble (overrides all
other rules): label only motion visible in specific frames; never
invent/anticipate/pad; max_steps is a CEILING not a target; atomic
demos may be exactly ONE subtask; the VIDEO is ground truth, not
the instruction text.
SINGLE-CAMERA GROUNDING
* New VqaConfig.restrict_to_default_camera (default False). When True,
the VQA module grounds on only the --vlm.camera_key stream instead
of iterating every camera — matching the plan / interjection
modules, which already use that single camera. Now the whole
pipeline can focus on one view (e.g. observation.images.base).
run_hf_job.py updated:
* use_video_url=false + frames_per_second=2.0 — embed frames directly
(most reliable; no silent text-only failure mode) with dense
grounding.
* vqa.restrict_to_default_camera=true — VQA on the single camera too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the replace_subtask_text option and the
_render_action_record_to_subtask_text renderer. Action records are now
strictly additive: when action_records.enabled=True the module emits
style='action_record' rows (the typed {verb,object,arm,grasp,dest,
mistake} schema) and NEVER rewrites the subtask text the policy
conditions on.
The render-back-to-text path was the source of corrupted subtasks
(navigation tasks produced 'move stove to stove', manipulation tasks
got spurious 'with left arm using pinch grip' suffixes). Reconstructing
natural-language subtasks from hallucinated structured fields is
inherently fragile, so the capability is removed rather than guarded.
Removed:
* ActionRecordsConfig.replace_subtask_text field
* PlanSubtasksMemoryModule._render_action_record_to_subtask_text
* the span['text'] = canonical_text overwrite in run_episode
Updated docstrings + run_hf_job.py comment accordingly. emit_record_row
(default True) is now the feature's only output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three compounding bugs made RoboCasa annotation produce off-task
subtasks ('move stove to stove with left arm') and drifting
augmentations ('wander around the kitchen' for 'Navigate to the stove').
1. action_records.replace_subtask_text now defaults False.
Overwriting the VLM's subtask text with a reconstruction of
hallucinated {verb,object,arm,grasp,dest} fields is high-risk:
navigation / non-manipulation tasks don't fit the schema and render
to nonsense. Records are now additive by default (emit_record_row),
never silently replacing subtask text. Flip replace_subtask_text on
only for manipulation datasets verified to render cleanly.
2. _render_action_record_to_subtask_text drops a degenerate
destination that just echoes the object (verb=move object=stove
destination=stove -> 'move stove' instead of 'move stove to stove').
Also routes 'navigate' through the 'to <dest>' preposition family.
3. module_1_task_aug_axes.txt hardened: variants MUST preserve the
goal/destination. Explicitly forbids 'Navigate to the stove' ->
'wander around the kitchen'. Only wording / arm / orientation /
grasp may vary; verb meaning, object, and destination are fixed.
examples/annotations/run_hf_job.py — corrected for RoboCasa:
* derive_task_from_video=off (was =always). The dataset task string
is authoritative and is what eval conditions on; =always threw it
away, re-derived a hallucinated task from the video, and poisoned
every downstream subtask/plan row. THIS was the dominant cause.
* n_task_rephrasings=0 + task_aug_axes left off — RoboCasa eval uses
exact task strings, so augmentation is unused/harmful.
* action_records left off — manipulation schema doesn't fit atomic /
navigation tasks.
* plan_max_steps=6 to keep atomic-task decomposition tight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The FAST action tokenizer maps action codes to the top of the PaliGemma
vocab (id = vocab_size-1-fast_skip_tokens-t). The lower part of that band
sits just below the reserved <loc> block, so it escaped the existing
suppress_loc_tokens mask and leaked into generated subtask/VQA/memory text
as high-codepoint gibberish. Mask the FAST band on every select_message
call so the high-level head emits clean language.
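A minimal sketch of the band mask, using only the id formula quoted above.
The helper names, the plain-list logits, and the toy sizes are assumptions
for illustration; the real mask lives in select_message and operates on
tensors.

```python
NEG_INF = float("-inf")


def fast_token_id(t, vocab_size, fast_skip_tokens):
    """Vocab id of FAST action code t (formula from the message above)."""
    return vocab_size - 1 - fast_skip_tokens - t


def mask_fast_band(logits, n_fast_codes, fast_skip_tokens):
    """Return a copy of logits with every FAST action id set to -inf.

    logits: per-token scores for one decode step, len == vocab_size.
    n_fast_codes: size of the FAST codebook (assumed known to the caller).
    """
    vocab_size = len(logits)
    masked = list(logits)
    for t in range(n_fast_codes):
        masked[fast_token_id(t, vocab_size, fast_skip_tokens)] = NEG_INF
    return masked


# Toy sizes for illustration only; the real vocab is far larger.
logits = [0.0] * 20
masked = mask_fast_band(logits, n_fast_codes=4, fast_skip_tokens=3)
banned = [i for i, v in enumerate(masked) if v == NEG_INF]
```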
Co-authored-by: Cursor <cursoragent@cursor.com>
VideoFrameProvider derived its default camera and camera list from
meta.camera_keys, which mixes image- and video-stored cameras. The
clip/decode paths read videos/<key>/from_timestamp, which only exists
for video keys, so an image-stored camera sorted first (e.g.
observation.images.wrist) crashed the plan phase with a KeyError.
Restrict the list and default to meta.video_keys. Add a regression test
and point the example job at the dataset's actual video camera. Skip
bandit B607 (ffmpeg/git are intentionally resolved via PATH).
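The camera selection after this fix can be sketched as follows. The helper
pick_cameras is hypothetical and only illustrates filtering by video keys;
it is not the VideoFrameProvider code.

```python
def pick_cameras(camera_keys, video_keys):
    """Return (default_camera, usable_cameras), restricted to video keys."""
    video = set(video_keys)
    # Keep dataset order, but drop image-stored cameras: they have no
    # videos/<key>/from_timestamp entry to decode from.
    usable = [k for k in camera_keys if k in video]
    if not usable:
        raise ValueError("dataset has no video-stored camera")
    return usable[0], usable


camera_keys = ["observation.images.wrist", "observation.images.base"]
video_keys = ["observation.images.base"]
default_camera, usable = pick_cameras(camera_keys, video_keys)
```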
Co-authored-by: Cursor <cursoragent@cursor.com>
EgoMimic-inspired additions to the plan module, both opt-in for back-compat.
1. PHASE 1a + 1b: per-subtask structured action records
* cfg.action_records.enabled=True triggers, after Phase 1 subtask-span
generation, one extra VLM call per subtask to extract a typed record:
{verb, object, arm, grasp_type, destination, mistake}
* A deterministic Python template (_render_action_record_to_subtask_text)
renders the record back to canonical subtask text. When replace_subtask_
text=True (default), this REPLACES the VLM's free-form text — eliminates
cross-episode phrasing drift.
* When emit_record_row=True (default), the structured record is also
emitted as a row with style='action_record' (added to PERSISTENT_STYLES)
so downstream training can consume the typed schema directly.
* Verb + grasp vocabularies are configurable. Out-of-vocab values are
rejected at extraction time.
2. STRUCTURED 5-AXIS TASK AUGMENTATION
* cfg.task_aug_axes.enabled=True replaces the free-form n_task_rephrasings
path with a structured prompt producing variants along 5 named axes:
synonym_paraphrase (3)
omit_arm (3)
omit_orientation (2)
omit_grasp_method (2)
combined_omissions (2)
Total ~12 variants. Axes with nothing to omit emit fewer entries.
* Each variant is emitted as a task_aug row at t=0 (existing style).
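For the typed action record in item 1 above, out-of-vocab rejection could
look like the sketch below. The field names come from the schema listed
above; the field types, the example vocabularies, and the helper
parse_action_record are assumptions, not the module's actual code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies; the real ones are configurable.
VERBS = {"pick up", "place", "open", "close", "push"}
GRASPS = {"pinch", "power"}


@dataclass(frozen=True)
class ActionRecord:
    verb: str
    object: str
    arm: Optional[str]
    grasp_type: Optional[str]
    destination: Optional[str]
    mistake: bool


def parse_action_record(raw, verbs=VERBS, grasps=GRASPS):
    """Build an ActionRecord from a decoded VLM reply, or None if invalid."""
    verb = raw.get("verb")
    grasp = raw.get("grasp_type")
    if verb not in verbs:
        return None  # out-of-vocab verb is rejected
    if grasp is not None and grasp not in grasps:
        return None  # out-of-vocab grasp is rejected
    if not raw.get("object"):
        return None  # a record without an object is unusable
    return ActionRecord(
        verb=verb,
        object=raw["object"],
        arm=raw.get("arm"),
        grasp_type=grasp,
        destination=raw.get("destination"),
        mistake=bool(raw.get("mistake", False)),
    )


good = parse_action_record(
    {"verb": "pick up", "object": "mug", "arm": "left", "grasp_type": "pinch"}
)
bad = parse_action_record({"verb": "wander", "object": "kitchen"})
```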
Inspired by https://github.com/GaTech-RL2/EgoVerse/tree/main/egomimic/scripts/language_process
— they pay Scale AI annotators to fill a structured form and then generate
language via a deterministic prompt. We get the same hallucination-reducing
structure via one extra VLM call per subtask.
Files:
src/lerobot/datasets/language.py
src/lerobot/annotations/steerable_pipeline/config.py
src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
src/lerobot/annotations/steerable_pipeline/prompts/module_1_action_record.txt
src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_aug_axes.txt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- modeling_pi052: per-env low-level subtask generation in select_action so
hierarchical inference is correct for eval.batch_size > 1
- render_messages_processor: always emit a fallback low-level prompt so
observation.language.tokens are produced when recipe annotations are absent
- lerobot_eval: overlay high-level task + predicted subtask onto recorded
rollout videos (render path only; does not affect policy observations)
Co-authored-by: Cursor <cursoragent@cursor.com>
pi052's preprocessor pipelines don't roundtrip through the saved
``policy_preprocessor.json``: ``RenderMessagesStep`` holds a
``TrainingRecipe`` Python object (not JSON-serializable, saved as
``{}``) and ``ActionTokenizerProcessorStep`` saves the fitted FAST
tokenizer's host-only ``~/.cache/lerobot/fast_tokenizers/...`` path.
``PolicyProcessorPipeline.from_pretrained`` then dies with
``RenderMessagesStep.__init__() missing 1 required positional
argument: 'recipe'`` (job 22164494).
The pi052 training path was workable because the recipe-aware steps
were built directly; the runtime path
(``lerobot.scripts.lerobot_pi052_runtime``) sidesteps the loader by
passing ``pretrained_path=None`` to ``make_pre_post_processors`` and
building fresh from ``config.recipe_path``. The standard
``lerobot-eval`` entry point had no such escape hatch.
Two surgical fixes:
* ``factory.make_pre_post_processors``: when ``policy_cfg.type ==
"pi052"`` AND ``pretrained_path`` is set, bypass the generic
``PolicyProcessorPipeline.from_pretrained`` call. Build the
pipelines fresh via ``make_pi052_pre_post_processors`` (same
bootstrap the runtime uses) and transplant the saved stateful
blobs from each step's ``state_file`` reference in the saved JSON
(today: NormalizerProcessorStep + UnnormalizerProcessorStep
quantile stats). Pairing is by ``registry_name`` AND position so
a benign reorder logs a warning instead of silently mis-loading.
* ``PI052Config.use_hf_kernels``: re-add as a deprecated no-op
field. The flag was removed in d70c8104 (Liger kernels became
unconditional), but checkpoints saved before that commit
serialize ``use_hf_kernels: true`` into ``config.json``. Without
this field draccus rejects the load with ``DecodingError: The
fields use_hf_kernels are not valid for PI052Config`` (job
22164492). Mark for removal in a future major bump.
Together these let an external ``lerobot-eval --policy.path=<pi052
checkpoint>`` invocation evaluate the model against any env.
Co-authored-by: Cursor <cursoragent@cursor.com>
The flag gated a process-global, idempotent Liger patch that swaps
in fused Triton rope / geglu / layer_norm kernels (~4.5 % step time
on H100, bench job 22161421). Since liger-kernel is now a hard
dependency of the loss path (``_shifted_lin_ce`` / ``_fast_lin_ce``
in ``modeling_pi052``), gating the same dep behind an opt-in flag
was redundant — every pi052 run pulls the wheel in either way.
* ``PI052Policy.__init__`` calls ``_enable_hf_kernels()``
unconditionally; the function still degrades gracefully if the
wheel happens to be missing (logs a warning, returns).
* Drop ``PI052Config.use_hf_kernels``; the bench numbers and the
``fused_linear_cross_entropy`` pointer to ``_shifted_lin_ce`` /
``_fast_lin_ce`` are kept as comments next to the docstring.
* Update the warning + ``_shifted_lin_ce`` lazy-import comment to
drop stale ``use_hf_kernels`` / ``reduce-overhead`` references.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Replace ``_shifted_ce`` / ``_fast_ce`` with Liger's
``fused_linear_cross_entropy``: the ``(B, T, 257k)`` logits tensor
is no longer materialised — the kernel chunks over the ``(B*T)``
axis and computes matmul + softmax + CE in fused Triton blocks.
~30 % step speedup and ~12 GB of activation memory freed on the
dual-CE pi052 recipe. All four call sites in
``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``
updated; the ``.any().item()`` CPU sync is dropped so the loss
path stays CUDA-graph-capturable.
* DDP-safe FAST tokenizer fit. The cache-hit sentinel previously
looked for ``preprocessor_config.json`` but
``ProcessorMixin.save_pretrained`` writes ``processor_config.json``
— every rank always cache-missed and re-fit, racing on writes and
occasionally producing a stale ``.pyc`` that crashed
``AutoProcessor.from_pretrained`` with ``AttributeError:
UniversalActionProcessor``. Fix the sentinel; gate the fit on the
(local) main process; non-leader ranks poll the cache until the
leader is done. Caught by job 22162549.
* New recipe ``subtask_mem_vqa_robocasa.yaml`` — subtask + memory +
per-camera VQA over the three robocasa camera keys produced by the
port pipeline (``robot0_agentview_left/right``, ``robot0_eye_in_hand``).
The previously-shipped ``subtask_mem_vqa_speech.yaml`` references
``observation.images.front`` / ``wrist`` which don't exist in
robocasa, so VQA never rendered.
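The leader-fits, followers-poll pattern from the DDP-safe FAST tokenizer
fit above can be sketched with the standard library alone. The function
fit_or_wait, its timeout values, and the fake fit are assumptions for
illustration; only the sentinel file name is taken from the message.

```python
import os
import tempfile
import time

SENTINEL = "processor_config.json"  # file name taken from the message above


def fit_or_wait(cache_dir, is_leader, fit_fn, poll_s=0.05, timeout_s=5.0):
    """Leader fits into cache_dir; other ranks poll until the sentinel exists."""
    sentinel = os.path.join(cache_dir, SENTINEL)
    if os.path.exists(sentinel):
        return "cache-hit"
    if is_leader:
        os.makedirs(cache_dir, exist_ok=True)
        fit_fn(cache_dir)  # must write the sentinel as its last step
        return "fitted"
    deadline = time.monotonic() + timeout_s
    while not os.path.exists(sentinel):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no {SENTINEL} in {cache_dir}")
        time.sleep(poll_s)
    return "waited"


def fake_fit(cache_dir):
    # Stand-in for the real tokenizer fit + save_pretrained.
    with open(os.path.join(cache_dir, SENTINEL), "w") as f:
        f.write("{}")


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "fast_tokenizer")
    leader_result = fit_or_wait(cache, is_leader=True, fit_fn=fake_fit)
    follower_result = fit_or_wait(cache, is_leader=False, fit_fn=fake_fit)
```

Checking for the file the save step actually writes is the key point: a
sentinel that is never written makes every rank take the fit branch.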
Co-authored-by: Cursor <cursoragent@cursor.com>
Low-confidence VLM detections were producing many overlapping, loose
boxes per frame (oven + toaster oven + counter + drawer + ...) and
coarse keypoints, hurting downstream policy grounding. Two surgical
fixes:
- module_3_vqa prompt: cap bbox output at 3 high-confidence detections
(prefer 1 tight box), require specific labels and ≤10% padding,
allow empty detections list when nothing meets the bar; keypoint
must be a single pixel-precise feature (handle / button / gripper
tip) rather than a coarse "somewhere on object" point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are
coordinate-regression tasks where sampling noise directly degrades
localization; question phrasing still varies enough at 0.2.
No new config knobs — the count cap lives in the prompt since "top-N
by confidence" is best picked by the VLM itself. Validator already
accepts empty detections.
Co-authored-by: Cursor <cursoragent@cursor.com>
Tighten ``module_1_subtasks.txt`` so the VLM emits one composite
atomic action per subtask instead of decomposing every pick into
``move to X`` / ``grasp X`` / ``lift X``:
- Lock the verb vocabulary to the composite set the low-level
policy actually learns end-to-end: ``pick up`` (approach + grasp +
lift), ``put``/``place`` (transport + release), ``push``, ``pull``,
``turn``, ``press``, ``open``, ``close``, ``pour``, ``insert``.
``go to`` is allowed only as a pure relocation between phases.
- Add an explicit ``Forbidden ultra-fine splits`` block enumerating
the patterns the VLM was tempted to emit (``move to X``,
``reach for X``, ``grasp X``, ``lift X``, ``release X``) and
instructing it to fold each into its parent composite.
- Rewrite the Good/Bad examples to match the composite contract;
the previous ``"move to blue cube" / "grasp blue cube" / "lift
blue cube"`` Good list was actively encouraging the over-
segmentation pattern this prompt is supposed to prevent.
- Tighten the duration rule: candidates shorter than
``min_subtask_seconds`` must be merged into a neighbour rather
than emitted. Pairs with bumping the runtime floor to 3 s so
composites have room to land.
Pure prompt change — no code or schema change. Existing canonical-
vocabulary retry path is unaffected (the new verb whitelist lives
in prose, not in the validator).
Co-authored-by: Cursor <cursoragent@cursor.com>
Subtask prompt (``module_1_subtasks.txt``):
- Lock the verb vocabulary to composite atomic actions (``pick up``,
``put``/``place``, ``push``/``pull``, ``turn``, ``press``, ``open``/
``close``, ``pour``, ``insert``, ``go to``).
- Add an explicit ``Forbidden ultra-fine splits`` block instructing
the VLM to fold ``move to X`` / ``reach for X`` / ``grasp X`` /
``lift X`` / ``release X`` into the parent composite. Previous
examples actively encouraged the over-segmentation pattern.
- Rewrite the Good/Bad examples around the composite contract.
Job config (``examples/annotations/run_hf_job.py``):
- Point at ``pepijn223/robocasa_smoke_2atomic_v3`` on ``h200x4``.
- ``--vlm.camera_key=robot0_agentview_left`` (real key for the
dataset; the prior ``observation.images.wrist`` did not exist
and would have silenced the VQA module).
- ``--vlm.serve_command`` ``--max-model-len 131072`` (4x): keeps
90 s @ 1 Hz episode video blocks under context even at full
Qwen vision resolution. On 1x H200 (144 GB) the 35B-FP8 model
has plenty of room for the bigger KV cache.
- ``--vocabulary.enabled=false`` — heterogeneous dataset, no
benefit from a single canonical vocabulary.
- ``--plan.derive_task_from_video=off``, ``--plan.n_task_rephrasings=0``
— reuse the dataset's own ``episode_task`` strings as-is.
- ``--plan.min_subtask_seconds=3.0``, ``--plan.plan_max_steps=6`` —
give the new composite-action rules room to land (1.5 s floor
was too small to host a full grasp-or-place composite).
- ``--vqa.vqa_emission_hz=3.0`` — denser VQA grounding.
- Timeout 24h, episode_parallelism=64, client_concurrency=256 to
scale to the 25k-trajectory regime when the same recipe is
pointed at a larger dataset.
Co-authored-by: Cursor <cursoragent@cursor.com>
Heterogeneous datasets (different tasks/scenes across episodes) don't
share a single small subtask + memory vocabulary, so the canonical
vocabulary phase narrowed every episode to the wrong target distribution.
Flip the example to free-form generation by default and document the
``--vocabulary.enabled=true`` switch for homogeneous datasets where the
canonical vocabulary still helps the downstream policy.
No pipeline-code changes: ``VocabularyConfig.enabled`` already gates
phase 0 (see ``executor.py:_run_vocabulary_phase`` and
``VocabularyConfig`` docstring) and falls back to free-form generation.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces the per-layer ``modeling_gemma.eager_attention_forward`` call
with ``torch.nn.functional.scaled_dot_product_attention`` in
``compute_layer_complete`` (pi05) and ``_compute_layer_ki`` (pi052).
PyTorch SDPA picks the memory-efficient kernel for the
block-bidirectional 4D additive mask the dual-expert model uses (FA2 /
FA3 reject it because they only accept causal / sliding-window / varlen
patterns). The shared ``sdpa_attention_forward`` helper mirrors the
eager signature so the call sites are unchanged.
Selective AC: removes the redundant outer ``_apply_checkpoint(forward_func, ...)``
wrap in ``PI05Pytorch.forward``. Per-layer checkpointing inside
``PaliGemmaWithExpertModel.forward`` already handles activation
recompute; the outer wrap was double-recomputing the whole backbone.
+14% steps/sec on its own (job 22161405 vs 22161398, 1xH100).
groot: drop ``@strict`` on ``GR00TN15Config`` — newer ``huggingface_hub``
rejects ``@strict`` on non-dataclass ``PretrainedConfig`` subclasses,
which was blocking imports of any sibling policy through
``lerobot.policies.factory``.
New ``examples/benchmark/bench_pi052_step.py`` (+ slurm sweeps v1..v8)
times PI052Policy.forward+backward (optionally with AdamW) on
synthetic inputs. Headline numbers on 1xH100 with KI=True, GC=True,
L=512, 4.14 B trainable params, AdamW state in bf16:
config                        step time  memory    throughput
pre-SDPA eager BS=8           610 ms     19.5 GiB  13.1 samples/s
sdpa BS=8 + compile=default   413 ms     19.5 GiB  19.3 samples/s
sdpa BS=16 + compile=default  715 ms     37.3 GiB  22.4 samples/s
sdpa BS=32 + compile=default  1325 ms    44.8 GiB  24.2 samples/s
sdpa BS=40 + compile=default  1665 ms    48.6 GiB  24.0 samples/s
Parity tests in ``tests/policies/pi052/test_pi052_sdpa_attention.py``
cover fp32 / bf16 / GQA / MHA forward + backward — output and grads
match the eager path within bf16 tolerance.
Also ships ``examples/benchmark/fsdp_pi052.yaml`` (FSDP2 accelerate
config wrapping GemmaDecoderLayer + SiglipEncoderLayer) for the
follow-up multi-GPU memory sharding work.
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds ``PI052Config.use_hf_kernels`` (default off). When enabled,
``PI052Policy.__init__`` calls ``apply_liger_kernel_to_paligemma``
before the backbone is built so PaliGemma / Gemma / Siglip layers
pick up Liger's fused Triton forwards.
Measured at BS=16 / L=512 / H100 80GB with KI+GC on (bench job
22161421, see ``examples/benchmark/bench_pi052_kernels.slurm``):
rope only → -2.5% step time
geglu only → -2.2% step time
layer_norm only → -1.1% step time
all three → -4.5% step time, peak_mem unchanged
``cross_entropy`` / ``fused_linear_cross_entropy`` are deliberately
skipped — pi052 calls ``F.cross_entropy`` directly and bypasses
``PaliGemmaForConditionalGeneration.forward``, so neither patch
fires without invasive model-code changes (left for a follow-up).
``rms_norm`` measured as noise on this workload (GC dominates),
so it stays off to keep the patch surface minimal.
Requires ``pip install liger-kernel``; falls back to a warning if
missing so the default path is unaffected.
Co-authored-by: Cursor <cursoragent@cursor.com>
Flip EMAConfig.enable default from False -> True. Every training run
now maintains an EMA shadow of the policy and uses it for eval + W&B
example dumps. Disable per-run with --ema.enable=false for short or
memory-constrained training.
Rationale:
* openpi (JAX, official) ships EMA on for every shipped config,
decay=0.99 by default and 0.999 for pi05_libero. The openpi
PyTorch port explicitly lists EMA as unsupported, a gap LeRobot
main inherited. Flipping the default closes that gap for every
LeRobot policy that ships through lerobot-train.
* EMA is established best practice for diffusion / flow-matching
policies (Diffusion Policy §V.D; standard in DDPM/EDM/Stable
Diffusion training recipes). For autoregressive policies the
extra cost is real but the safety net (smoother eval, better
final checkpoint) doesn't hurt.
Trade-offs to be aware of:
* Memory: 1x model params in fp32 shadow (~13 GB for pi052's
3.3B params; <500 MB for ACT/Diffusion-Policy class). Memory-
constrained users on consumer GPUs may need --ema.enable=false.
* Checkpoint disk: extra .pt file in training_state/, size ~=
pretrained_model/model.safetensors. Over a 100k-step run with
save_freq=20000 that's 5x the model size in extra disk.
* Eval scores will now reflect EMA model instead of live model -
expected to be 1-3% higher on closed-loop tasks per the
diffusion-policy literature; might surprise users who memorize
their last run's numbers.
Opt out:
--ema.enable=false # disable entirely
--ema.use_for_eval=false # keep EMA but eval reflects live
--ema.use_for_wandb_examples=false # keep EMA but W&B reflects live
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the 250-line src/lerobot/utils/ema.py with a direct dependency
on ema-pytorch (lucidrains' canonical PyTorch EMA library). Same
semantics, decay=0.999 default unchanged, but offloads the maintenance
burden to a maintained library used by every diffusion repo.
Why ema-pytorch:
* Standard PyTorch EMA library; battle-tested across diffusion +
speech + image-gen codebases.
* Tiny pure-python dep (no compiled code).
* Cleaner consumer-side API: ema.ema_model is a full nn.Module
clone of the policy, so eval / wandb just pass it through instead
of context-managed swap/restore on the live model.
What changed mechanically:
* pyproject.toml: add 'ema-pytorch>=0.7.7,<1.0.0' to core deps.
* deleted src/lerobot/utils/ema.py (the custom ModelEMA).
* scripts/lerobot_train.py:
- import EMA from ema_pytorch
- instantiate with beta=cfg.ema.decay,
update_after_step=cfg.ema.warmup_steps, update_every=1,
include_online_model=False (accelerator owns live model
lifecycle; double-registration would double-count params).
- ema.update() (no args) — library tracks the online model
internally.
- Eval block: pass eval_target_policy = ema.ema_model (when
cfg.ema.use_for_eval) instead of swap context manager.
- W&B examples: same pattern.
- Save: torch.save(ema.state_dict(), .../ema_state.pt) instead
of custom safetensors writer. .pt format is consistent with
the rest of training_state which already mixes safetensors +
json + (now) pt.
- Resume: ema.load_state_dict(torch.load(.../ema_state.pt)).
- WandB observability: ema/step (count of ema.update calls),
ema/initted (bool from library), ema/beta (constant from
cfg).
* configs/default.py: EMAConfig.decay stays 0.999 (matches
openpi's pi05_libero); docstring updated to reflect ema-pytorch
semantics for warmup_steps (now maps to update_after_step — a hard
skip, not a smooth decay ramp).
Behavior preserved:
* Defaults: enable=False, decay=0.999, warmup_steps=0,
use_for_eval=True, use_for_wandb_examples=True.
* Same CLI: --ema.enable=true, --ema.decay=X, etc.
* Same checkpoint layout (training_state/ema_state.pt next to
optimizer_state.safetensors etc.); resumes silently if present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Exponential Moving Average of trainable policy parameters with
warmup, eval-time swap, checkpoint save/resume, and wandb observability.
For diffusion / flow-matching policies (pi052's flow expert exactly
qualifies), averaging late-training parameter oscillations yields a
smoother model that generalises substantially better at inference —
~1–3% absolute success-rate improvement on closed-loop tasks per the
diffusion-policy lit (Chi et al. 2023 §V.D; standard in DDPM/EDM).
New module: src/lerobot/utils/ema.py
ModelEMA class with:
* fp32 shadow of every requires_grad parameter
* decay warmup: min(decay, (1+n)/(10+n)) for first warmup_steps updates
* update(model) -> effective_decay (for logging)
* apply_to(model) context manager: temp-swap weights, restore on exit
* copy_to(model): permanent overwrite
* save() / load_from_file(): safetensors + JSON sidecar for metadata
* state_dict() / load_state_dict() for in-process round-tripping
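The decay warmup and the shadow update listed above can be sketched on
plain floats. This is an illustration of the formula only, not the
ModelEMA code; counting updates from n = 0 and the helper names are
assumptions.

```python
def effective_decay(n_updates, decay=0.999, warmup_steps=0):
    """Decay used for update number n_updates (0-based, an assumption)."""
    if n_updates < warmup_steps:
        # Warmup formula from the list above: starts at 0.1 and rises
        # toward 1, capped at the configured decay.
        return min(decay, (1 + n_updates) / (10 + n_updates))
    return decay


def ema_update(shadow, params, d):
    """In-place EMA step: shadow <- d * shadow + (1 - d) * param."""
    for i, p in enumerate(params):
        shadow[i] = d * shadow[i] + (1.0 - d) * p


shadow = [0.0, 0.0]
for n in range(3):
    ema_update(shadow, [1.0, 2.0], effective_decay(n, warmup_steps=1000))
```

With a small effective decay in the first updates, the shadow moves
quickly toward the live parameters instead of staying near its initial
value.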
New config: src/lerobot/configs/default.py EMAConfig + wired into
TrainPipelineConfig as 'ema: EMAConfig'.
Fields:
enable: bool = False (off by default, back-compat)
decay: float = 0.999 (standard; 0.75 for fast Diffusion-Policy)
warmup_steps: int = 0 (no warmup by default)
use_for_eval: bool = True (eval swaps in EMA weights)
use_for_wandb_examples: bool = True
(W&B training-examples table uses EMA
for predicted-action columns -> matches
what eval / deployment would see)
Training loop integration (src/lerobot/scripts/lerobot_train.py):
1. After accelerator.prepare + policy.train(), instantiate ModelEMA
on the main process if cfg.ema.enable. Resume from
checkpoint_path/training_state/ema_state.safetensors if present.
2. After each update_policy() call, ema.update(unwrap_model(policy))
returns the effective decay (logged to wandb during warmup).
3. The save_checkpoint() block also ema.save(...) the shadow next to
the existing optimizer/scheduler/rng training state. Resume picks
it up automatically in (1).
4. The eval block (cfg.env && is_eval_step) wraps eval_policy_all in
ema.apply_to() when use_for_eval=True. Live weights restored
byte-for-byte on context exit.
5. The W&B training-example dump wraps log_training_examples in
ema.apply_to() when use_for_wandb_examples=True so the predicted-
action columns match the eval/deployment behavior.
6. Two new wandb scalars: ema/effective_decay, ema/num_updates.
Cost:
Memory: 1x model params in fp32 (~13 GB for pi052's 3.3B params).
Lives only on main-process GPU. CPU offload available via
ModelEMA(device='cpu') if needed.
Compute: one elementwise update per step (~1% of step time).
Disk: 2x checkpoint files in training_state/ (live optimizer state
+ ema shadow). Negligible relative to model.safetensors.
Usage:
lerobot-train ... --ema.enable=true
lerobot-train ... --ema.enable=true --ema.decay=0.9999 # very slow EMA
lerobot-train ... --ema.enable=true --ema.warmup_steps=1000
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Penalise the log-partition function z = log Σ exp(logits) drifting away
from zero on text-CE supervised positions. Without it, large-vocab
models (PaliGemma's 257k vocab) can let logsumexp grow unboundedly
while CE stays low — a uniform additive logit bias cancels in softmax
but pushes the partition function out of bounds, causing numerical
instability and generation drift.
PaLM appendix B / Chinchilla report z-loss is essential for stable
large-vocab CE. It is especially valuable for pi052 because the recent
default lm_head_lr_scale=5.0 amplifies head-drift risk: the 5x boost
keeps the head pinned to fine-tuning targets, and z-loss caps the
partition function so the head can't just bias all logits high uniformly.
Implementation:
* _shifted_ce(logits, labels, z_loss_weight=0.0) gains the new arg
with default 0.0 (back-compat for any other caller).
* Both call sites in PI052Policy.forward read self.config.text_ce_
z_loss_weight and pass it through.
* PI052Config.text_ce_z_loss_weight defaults to 1e-4 (commonly cited
PaLM value); set to 0 to disable.
Cheap to compute: one extra logsumexp shares the softmax kernel that
F.cross_entropy already runs. No memory overhead beyond a (B*T,) tensor.
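A single-position sketch of why the penalty is needed. It assumes the
PaLM-style squared form, weight * z**2, which this message does not spell
out; the helper names are hypothetical and the real code works on batched
tensors.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def ce_with_z_loss(logits, label, z_loss_weight=1e-4):
    """Return (cross_entropy, z_penalty) for one position."""
    z = logsumexp(logits)
    ce = z - logits[label]  # CE = log-partition minus the target logit
    return ce, z_loss_weight * z * z  # assumed squared penalty on z


base = [2.0, 0.5, -1.0]
shifted = [x + 10.0 for x in base]  # uniform additive bias on every logit
ce_a, z_a = ce_with_z_loss(base, label=0)
ce_b, z_b = ce_with_z_loss(shifted, label=0)
```

The shifted logits give the same cross-entropy as the base logits, so CE
alone cannot see the bias; only the z term grows with it.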
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The base optimizer LR (2.5e-5, cosine to 2.5e-6, 1k warmup, AdamW
(0.9, 0.95), wd 0.01, grad_clip 1.0) is the openpi/π0.5 setting used
for the RoboCasa leaderboard baselines and is well-validated for 3B-
class VLAs with a paligemma backbone. Leave it alone.
The one place pi052 needs to diverge from pi05 is the LM-head LR
multiplier:
* pi05 has no text supervision -> head doesn't get gradients ->
lm_head_lr_scale is moot, stays at 1.0.
* pi052 always has text supervision via the recipe (subtask /
memory / VQA). Under KI, the LM head only sees gradients on
~30-45% of the batch (the text-CE mask share). Under aggressive
cosine decay the head drifts back toward PaliGemma's pretrained
<loc> first-token bias, despite teacher-forced CE staying near 0.
5x is the documented fix (see PI05Config.lm_head_lr_scale docstring
and PI05Policy.get_optim_params, which is already wired to split the
LM head + tied embed_tokens into their own param group while sharing
the same cosine lambda). Flipping the default here lifts the fix from
opt-in to on-by-default for every pi052 run, with zero downside on
text-free recipes (head still gets no gradients to scale).
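The param-group split described above can be sketched as follows. This is
not PI05Policy.get_optim_params: the function name, the name-matching
rule, and the string stand-ins for tensors are assumptions for
illustration.

```python
def split_lr_groups(named_params, base_lr=2.5e-5, lm_head_lr_scale=5.0,
                    head_markers=("lm_head", "embed_tokens")):
    """Split parameters into a base group and a scaled LM-head group."""
    head, rest = [], []
    for name, param in named_params:
        is_head = any(marker in name for marker in head_markers)
        (head if is_head else rest).append(param)
    groups = [{"params": rest, "lr": base_lr}]
    if head:
        # The head group only differs in its base LR; a shared schedule
        # multiplier then scales both groups by the same factor.
        groups.append({"params": head, "lr": base_lr * lm_head_lr_scale})
    return groups


# Strings stand in for tensors so the sketch runs without a framework.
named = [
    ("vision_tower.layer0.weight", "w_vision"),
    ("language_model.embed_tokens.weight", "w_embed"),
    ("lm_head.weight", "w_head"),
    ("action_expert.layer0.weight", "w_action"),
]
groups = split_lr_groups(named)
```

A list in this shape is what per-parameter-group optimizers such as
torch.optim.AdamW accept in place of a flat parameter iterable.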
Other LR knobs reviewed and intentionally NOT changed:
- optimizer_lr=2.5e-5: openpi-validated, matches leaderboard.
- scheduler_warmup_steps=1000: standard for VLA finetuning.
- scheduler_decay_steps=30000: auto-scales for short runs.
- optimizer_betas=(0.9, 0.95): GPT/LLM convention, works for
flow-matching + LM-CE.
- optimizer_weight_decay=0.01, grad_clip=1.0: standard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The training-example wandb.Table dump (camera images + text fields +
GT/predicted action chunk endpoints) was opt-in. Flip defaults so any
run with --wandb.enable=true gets visual training observability for free.
log_examples_freq: 0 -> 5000 (push table every 5k steps)
log_examples_n: 4 -> 4 (unchanged)
log_examples_predict_actions: False -> True (extra forward in eval mode)
Runs without --wandb.enable=true are unaffected (the training loop gate
checks wandb_logger is not None first). Set log_examples_freq=0 to opt
out of the dump even with wandb enabled; set log_examples_predict_actions
=false to skip the extra inference forward pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in cadence for pushing rich training examples to W&B,
independent of the scalar log_freq. Off by default; turn on with
--wandb.log_examples_freq=5000 (one wandb.Table dump every 5k steps).
WandBConfig (configs/default.py):
+ log_examples_freq: int = 0 # 0 disables
+ log_examples_n: int = 4 # batch elements per dump
+ log_examples_predict_actions: bool = False
# opt-in extra forward pass to
# show predicted vs GT action chunk
WandBLogger.log_training_examples (common/wandb_utils.py):
Builds one wandb.Table row per sampled batch element with:
* one wandb.Image column per camera (auto handles CHW/HWC,
uint8/float32 [0,1])
* any text fields present in the batch (task / subtask /
memory / instruction)
* gt_action_first / gt_action_last (chunk endpoints)
* pred_action_first / pred_action_last when
--wandb.log_examples_predict_actions=true (policy.eval() + no_grad;
restores train mode after)
Defensive: per-camera failures don't poison the row;
predict_action_chunk exceptions are logged and the predicted columns
are dropped.
Training loop (scripts/lerobot_train.py):
One new gated block right after the existing scalar log_step clause.
Reads batch + dataset.meta.camera_keys, hands them to
log_training_examples. Wrapped in try/except so a bad sample never
kills the run.
Usage:
lerobot-train ... \
--wandb.enable=true --wandb.project=robocasa_composite_seen \
--wandb.log_examples_freq=5000 \
--wandb.log_examples_n=4 \
--wandb.log_examples_predict_actions=true
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add ATOMIC_TASKS, COMPOSITE_UNSEEN_TASKS and four new --task-set keys
(atomic, composite_unseen, composite_all, composite_atomic) so the same
builder produces the 50-task target benchmark or the 300-task Human300
pretraining slice (via --split=pretrain --task-set=all) without
duplicating logic.
- Stop hardcoding the composite_seen tag on the HF push; tags are now
derived from --split / --source / --task-set so atomic, composite_all,
and pretrain runs land with accurate metadata.
- Refresh module docstring to match the broader scope.
- Add scripts/build_robocasa_smoke.sh: 2-atomic-task smoke dataset
(~1k episodes, ~131k frames) for fast end-to-end training validation
before kicking off Human300-scale runs.
Resolves conflicts from 32 commits on main:
* docs/source/_toctree.yml — keep both new toc entries
(annotation_pipeline + video_encoding_parameters).
* docs/source/language_and_recipes.mdx — adopt main's section
ordering (Layer 2 before "Temporal semantics") and float32
timestamp dtype to match the codebase.
* src/lerobot/configs/__init__.py — keep both export sets
(recipe + video encoder).
* src/lerobot/datasets/dataset_metadata.py — drop redundant lazy
imports (top-level imports cover both LANGUAGE_COLUMNS and
DEFAULT_TOOLS); adopt main's @tools.setter for info.json
write-back.
* src/lerobot/datasets/feature_utils.py — call the real
validate_feature_language() instead of returning "".
* src/lerobot/datasets/language.py — float32 timestamps to match
pa.float32() used in video_utils.py and the rest of the codebase.
* src/lerobot/datasets/language_render.py — adopt main's
unwrap_scalar() helper (drops two hand-rolled .item()/list
unwrappers); float32 in docstring.
* src/lerobot/processor/render_messages_processor.py — drop
PR-local _scalar() helper, use shared unwrap_scalar().
* tests/datasets/test_language.py — adopt main's new float32 dtype
+ validate_feature_language warning tests.
* tests/datasets/test_dataset_metadata.py — adopt main's new
tools.setter persist/clear tests.
* uv.lock — regenerated cleanly from main's resolver.
90 of 92 touched tests pass. Two pre-existing test failures
(test_module1_plan_memory_subtask_smoke,
test_module2_mid_episode_emits_paired_interjection_and_speech in
tests/annotations/test_modules.py) are unrelated to this merge —
that test file doesn't exist on main, so the failures originate on
the branch and are addressed by the 8 newer fix(annotate) commits
already on origin that will land in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three diagnostic surfaces shipped in PR3 that don't belong in a clean
release:
* ``LEROBOT_DUMP_RECIPE_SAMPLES`` env-var dump (~70 LOC in
text_processor_pi052.py): pretty-prints the next N rendered samples
with ``[TGT]...[/TGT]`` markers over supervised spans. One-off
training-inspection tool — no production user, never wired into a
CLI flag, only useful while iterating on the recipe. Drop the module
constants, the ``_is_dump_rank`` / ``_dump_recipe_sample`` helpers,
the call site, and the now-unused ``import os``.
* ``_log_obs_tensors_once()`` in lerobot_pi052_runtime.py: the
docstring literally says "Used to bisect train/inference mismatches"
— a debugging artifact from when the LM head was collapsing on the
live robot. Logged unconditionally at WARNING level from both the
dataset-driven and robot-driven providers, with no ``--verbose``
gate. Drop the function, both call sites, and the ``_logged`` /
``_obs_logged`` flag dicts that fed them. (``_resize_logged`` is
kept — it gates the operationally useful camera-size sanity log.)
* Defensive ``unsqueeze(0)`` block in the dataset observation
provider: papered over an upstream bug where some preprocessor step
could produce an unbatched tensor. ``AddBatchDimensionProcessorStep``
is reliable in the current pipeline — pi052 tests still pass with
the block removed. If the bug ever resurfaces it should be fixed
at the source, not silently re-batched here.
Net: -169 LOC. All 30 ``tests/policies/pi052/`` tests pass.
The ``<loc>`` token plumbing (``register_paligemma_loc_tokens``,
``_loc_token``, ``suppress_loc_tokens`` runtime gate) is left as-is —
it's the actual mechanism for VQA spatial answers, not scaffolding,
and the ``suppress_loc_tokens=True`` callers on subtask/memory/
interjection paths and ``=False`` on the VQA path are intentional
asymmetric behaviour, not a bug-routing knob.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the earlier wrapper (which depended on robocasa.scripts.download
+ dataset_registry) with a self-contained pipeline that:
* downloads each task tarball directly from Box via box_links_ds.json
* converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30
* standardizes camera keys under observation.images.robot0_* and
flattens observation.state by concatenating base/EE/gripper subkeys
when the source dataset stores them separately
* builds per-rank unified shards then aggregates into one dataset
Filter: composite_seen task-set restricts discovery to the 16 multi-step
target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use
--task-set=all to keep every discovered task in the split/source slice;
--tasks=... overrides for arbitrary subsets.
Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task.
Adapted from a battle-tested port_robocasa.py reference shared by the
user; the only semantic addition is the task-set filter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both observation providers in lerobot_pi052_runtime.py ended a sample
dict the same way — strip the runtime-owned language columns and hand
the policy a device-resident ``observation.*``-only subset. Extract
two tiny helpers (``_strip_runtime_owned_language_cols`` and
``_select_observation_to_device``) so the dataset and robot paths
read as a clear linear pipeline. Path-specific concerns (defensive
unsqueeze on the dataset path; camera resize + state-vector sanity
logging on the robot path) stay inline at the call sites.
Behaviour unchanged; all 30 ``tests/policies/pi052/`` tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parallel variant of build_robocasa_composite_seen.py modeled after the
existing slurm_port_shards.py / slurm_aggregate_shards.py pattern.
Two-phase datatrove pipeline:
* Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task),
each worker downloads its assigned tar via RoboCasa's own
download_datasets helper. Network-bound, idempotent.
* Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets
over the 16 extracted directories. Submitted with depends=phase1 so
SLURM only releases it once all 16 downloads succeed.
Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve
helpers from the single-machine script via aliased imports — single
source of truth for 'what does it mean to download a composite_seen
task'.
Local (--slurm 0) mode runs the two phases sequentially in-process for
debugging on a workstation.
Usage on SLURM:
uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \
--output-dir=/scratch/${USER}/robocasa_composite_seen \
--hub-repo-id=${HF_USER}/robocasa_composite_seen \
--logs-dir=/scratch/${USER}/logs/robocasa \
--partition=cpu --push-to-hub
Prereq: uv sync --extra annotations (pulls datatrove)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RoboCasa 1.0 ships its target/human demos in LeRobot format (parquet +
mp4) as lerobot.tar archives distributed via Box. This script wraps
RoboCasa's own download_datasets helper to pull each of the 16
composite_seen tasks, opens each extracted directory as a
LeRobotDataset, and merges them into a single combined dataset via
merge_datasets (a thin wrapper over aggregate_datasets that revalidates
fps/robot_type/features, unifies task indices, concatenates videos and
parquet, and recomputes stats).
The 16-task slice corresponds exactly to the 'Composite-Seen' column of
the published RoboCasa365 leaderboard, so the resulting dataset is the
right substrate for an apples-to-apples pi05 vs pi052 comparison on
multi-step kitchen manipulation.
Usage:
uv run python -m lerobot.scripts.build_robocasa_composite_seen \
--output-dir=/data/lerobot/robocasa_composite_seen \
--hub-repo-id=${HF_USER}/robocasa_composite_seen \
--push-to-hub
Idempotent: re-running skips already-downloaded tasks. Defensive
fallbacks handle RoboCasa API drift in get_ds_path / download_datasets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task picker:
The dataset bootstrap used to silently overwrite args.task with the
canonical training task. Replace that with an interactive picker
(_select_task_interactively) that shows every unique task in
ds_meta.tasks as a numbered menu (canonical task first as default) plus
a 'type a custom task' option. --task on the CLI still skips the
picker, and non-TTY runs fall back to the bootstrap task so scripted
invocations are unchanged.
Action diagnostic removal:
Drop the [act] log block in LowLevelForward.run (|a|_mean / spread /
normalized + unnormalized first/last + state) that was added while
debugging the 'barely moving' issue. Robot motion is now healthy, the
output is noise in steady-state, and it depended on stashing the
postprocessor on runtime.state — also removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MemoryUpdateFwd was importable but never installed, so subtask_change
events fired by HighLevelSubtaskFwd had no listener and current_memory
stayed at its initial None value — the runtime panel always showed
'memory (not set)' even when the policy was trained with the
memory_update recipe (e.g. subtask_mem_vqa_speech.yaml, weight 0.15).
Insert MemoryUpdateFwd between HighLevelSubtaskFwd and AskVQAFwd so
the event is visible the same tick it is emitted, and refresh the
stale comment that claimed memory was not in scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-chunk-boundary counter to HighLevelSubtaskFwd: subtask gen
fires only once every N chunk boundaries (default 1 = current
behavior). Lets the operator run e.g. 5 flow-matching action chunks
per LM-head subtask gen so the subtask doesn't churn every 1.7s while
the previous one is still being executed — saves compute and avoids
re-planning the action trajectory mid-grasp.
--subtask_chunks_per_gen=5 # 5 chunks per subtask refresh
The counter starts at 0 so the very first chunk boundary fires
immediately (no startup delay). Trigger is rearmed when skipping so
a low high_level_hz doesn't lose slots.
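A minimal sketch of the per-chunk-boundary counter described above. The class name ``SubtaskGenGate`` and its method are hypothetical, not the actual HighLevelSubtaskFwd code; only the "fire once every N boundaries, starting at 0" behaviour is taken from this message.

```python
class SubtaskGenGate:
    """Hypothetical gate: allow subtask generation once every N chunk boundaries."""

    def __init__(self, chunks_per_gen: int = 1):
        # N = 1 reproduces the old behaviour (generate at every boundary).
        self.chunks_per_gen = max(1, chunks_per_gen)
        self.count = 0  # starts at 0 so the first boundary fires immediately

    def on_chunk_boundary(self) -> bool:
        fire = self.count % self.chunks_per_gen == 0
        self.count += 1
        return fire


gate = SubtaskGenGate(chunks_per_gen=5)
fired = [gate.on_chunk_boundary() for _ in range(10)]
```

With ``chunks_per_gen=5`` the gate fires on boundaries 0 and 5 and skips the four in between.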
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-chunk log line in LowLevelForward that surfaces what the
action expert actually emits and what the robot receives after the
postprocessor unnormalizes it, so "barely moving" can be diagnosed
at a glance:
[act] T=50 |a|_mean=0.234 spread=0.512
[act] norm first=[0.12, -0.31, ...] last=[0.45, -0.22, ...]
[act] joint first=[3.2, -47.8, ...] last=[12.4, -41.0, ...] state=[0.5, -55.3, ...]
|a|_mean ~ 0.3–0.6 with spread ~ 0.3+ and visible delta from first to
last → healthy trajectory. |a|_mean near 0 across the chunk → model
defaulting to median pose. joint values that don't differ much from
state → safety cap or model output near current state.
Postprocessor is stashed on runtime.state["_postprocessor"] at startup
so the diagnostic can replay the same unnormalize the dispatcher uses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If two consecutive VLM-emitted subtask spans have ``start`` timestamps
that round to the same source frame after ``snap_to_frame`` (e.g. on
short episodes the VLM sometimes nominates two ~adjacent action
boundaries within one 30 Hz step), the writer emits two
``style=subtask`` rows at the identical persistent timestamp. The
training-time renderer's default binding
``subtask: active_at(t, style=subtask)`` then raises:
ValueError: Ambiguous resolver for style='subtask';
add role=..., tool_name=..., or camera=... to disambiguate.
… and the whole training run dies on the first batch.
Observed concretely on ``pepijn223/super_poulain_vocab2`` (job
22159979): episodes 3 and 30 each had two subtask rows at the same
timestamp (``release yellow cube`` + ``retract arm`` snapping to the
same frame).
Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list
and, whenever a snapped start collides with one already used, push the
later span onto the next free frame timestamp. Both subtasks survive
on distinct timestamps; the renderer can now disambiguate. If the
episode genuinely has no later free frame (extremely unlikely — would
require a same-timestamp collision on the very last frame of the
episode), the later span is dropped with a warning rather than left
to poison the render.
New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames``
locks in the contract; full vocabulary suite is 14/14 green.
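The collision handling can be sketched as below. This is an illustration under stated assumptions, not the body of ``_dedupe_starts_to_distinct_frames``: spans are represented as ``(frame_index, label)`` pairs already snapped and sorted, and the function name and return shape are hypothetical.

```python
def dedupe_starts_sketch(spans, n_frames):
    """Push colliding starts onto the next free frame; drop if none is left.

    spans: list of (frame_index, label), sorted by frame_index (assumed).
    n_frames: number of frames in the episode.
    Returns (kept, dropped_labels).
    """
    used = set()
    kept, dropped = [], []
    for frame, label in spans:
        # Walk forward until we find a frame no earlier span has claimed.
        while frame < n_frames and frame in used:
            frame += 1
        if frame >= n_frames:
            dropped.append(label)  # no later free frame: drop (the real code warns)
            continue
        used.add(frame)
        kept.append((frame, label))
    return kept, dropped


kept, dropped = dedupe_starts_sketch(
    [(10, "grasp yellow cube"), (42, "release yellow cube"), (42, "retract arm")],
    n_frames=100,
)
```

Here the second span at frame 42 moves to frame 43, so both subtasks end up on distinct timestamps.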
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The Jaccard-overlap snap was warping VLM output into wrong canonical
labels — e.g. an off-vocab "consult the wizard" span would silently
become "grasp blue cube" if that scored highest. Even with a higher
floor the operator can't tell which subtasks were paraphrases vs
genuine mislabels in the resulting dataset.
Replace with strict exact-match validation + a single targeted retry:
1. Generate subtasks as before.
2. If any returned subtask's normalised form (lowercased, articles
stripped, whitespace collapsed) isn't in the canonical vocab,
fire one retry call naming the offending strings and re-sending
the full canonical list. The retry prompt requires byte-identical
output from the vocab.
3. After the retry, validate again. Spans still off-vocab are
dropped — no fuzzy snapping ever produces a different canonical
label than the VLM actually emitted.
4. If every span ends up off-vocab even after the retry, warn loudly
so the operator extends ``meta/canonical_vocabulary.json`` to
cover the missing phase. The episode is left with empty subtasks
rather than silently fabricated ones (visibility > sweep-under-the-rug).
Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the
normalisation helper out so the retry-validation and the final
canonicalisation share one source of truth.
Tests:
- test_plan_module_accepts_article_only_difference: "grasp the blue
cube" still maps to canonical "grasp blue cube" (article-tolerant).
- test_plan_module_retries_when_subtask_off_vocab: paraphrase
triggers the retry which the VLM corrects in pass 2.
- test_plan_module_drops_off_vocab_subtask_after_retry: VLM that
refuses to correct → bad span dropped, in-vocab span kept.
- test_plan_module_empty_when_all_off_vocab_after_retry: every
span off-vocab → episode left empty (no warping).
All 13 vocabulary tests pass.
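The validate-then-retry flow above can be sketched as follows. All names here are hypothetical, the article list is an assumption, and the retry callable is assumed to return a full replacement list of subtasks; the real module may differ on each of these points.

```python
STRIP_TOKENS = frozenset({"a", "an", "the"})  # assumed article list


def normalize(text):
    """Lowercase, strip articles, collapse whitespace."""
    return " ".join(t for t in text.lower().split() if t not in STRIP_TOKENS)


def validate(subtasks, vocab):
    """Split subtasks into (canonical labels kept, off-vocab strings)."""
    canon = {normalize(v): v for v in vocab}
    kept, off = [], []
    for s in subtasks:
        key = normalize(s)
        if key in canon:
            kept.append(canon[key])
        else:
            off.append(s)
    return kept, off


def plan_with_retry(generate, retry, vocab):
    """One generation, at most one retry, then drop whatever is still off-vocab."""
    kept, off = validate(generate(), vocab)
    if off:
        kept, off = validate(retry(off, vocab), vocab)
    return kept  # may be empty; no fuzzy snapping is attempted


vocab = ["grasp blue cube", "retract arm"]
result = plan_with_retry(
    generate=lambda: ["grasp the blue cube", "consult the wizard"],
    retry=lambda offending, v: ["grasp blue cube", "consult the wizard"],
    vocab=vocab,
)
```

In this example the stubborn off-vocab span is dropped and only the in-vocab span survives.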
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When the canonical vocabulary is enabled and the VLM produces spans
that don't overlap any canonical label, the previous Jaccard-floor
(0.5) dropped them and the episode came out with no subtasks at all
— invisible to the downstream policy. Observed on
``pepijn223/super_poulain_vocab``: some episodes had empty subtask
columns because every VLM-emitted phrase scored below 0.5 against
the discovered vocabulary.
Two-pass canonicalisation:
- First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to
let mild paraphrases through) and drops everything below.
- If that first pass leaves the episode with **zero** subtasks,
fall back to a second pass that always snaps each VLM span to
its nearest canonical label by Jaccard (no floor). The episode
ends up with subtasks even when the vocabulary missed a phase
— a slightly-wrong canonical label is still closer to the right
motion than nothing at all.
- Log loudly when the fallback fires so the operator can spot
coverage gaps in ``meta/canonical_vocabulary.json``.
- Log a per-episode count at INFO when some (but not all) spans
were dropped so it's visible without spamming the run output.
Promote the Jaccard floor + ignore-tokens to class constants so
they're a single edit point. Add ``force=True`` parameter to
``_canonicalize_subtask`` for the no-floor fallback path.
New test ``test_plan_module_snaps_when_all_off_vocab`` covers the
fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is
adjusted to keep at least one in-vocab span so the floor path can
still fire and is exercised. All 12 vocabulary tests pass.
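A rough sketch of the two-pass canonicalisation, assuming plain whitespace tokenisation and omitting the ignore-tokens mentioned above. Function names are illustrative and are not the actual ``_canonicalize_subtask`` implementation.

```python
def jaccard(a, b):
    """Token-set overlap between two phrases."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def canonicalize(span, vocab, floor=0.25, force=False):
    """Return the nearest canonical label, or None if below the floor."""
    best = max(vocab, key=lambda v: jaccard(span, v))
    if force or jaccard(span, best) >= floor:
        return best
    return None


def canonicalize_episode(spans, vocab):
    # First pass: keep only spans that clear the Jaccard floor.
    first = [c for c in (canonicalize(s, vocab) for s in spans) if c is not None]
    if first or not spans:
        return first
    # Second pass: nothing survived, so snap every span with no floor.
    return [canonicalize(s, vocab, force=True) for s in spans]


vocab = ["grasp blue cube", "place cube in bin"]
paraphrase = canonicalize_episode(["pick up blue cube"], vocab)
all_off = canonicalize_episode(["consult the wizard"], vocab)
```

Note that the forced pass returns a label even when the overlap is zero, which is the trade-off this commit accepts in exchange for non-empty episodes.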
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Hardcoding ``n_subtask_target=10`` and ``n_memory_target=6`` baked task
complexity into the config — a simple pick-and-place needs ~6, a
multi-step recipe needs ~20. The VLM already sees the clips, so let it
pick the count itself from what's recurring across episodes.
Drop both knobs from ``VocabularyConfig`` and the ``module_0_vocabulary``
prompt template. The prompt now says "decide the count yourself based
on what you see — the smallest set that still covers every recurring
phase" and adds an "each label must recur across the demos" rule so
the VLM filters out one-off motions.
Update the launcher script + docs to remove the old knobs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Three stale things in the launcher script:
- ``--module_1/2/3.*`` no longer exist; review commit fd18beb renamed
the CLI namespaces to ``--plan/interjections/vqa``. Forwarded all
eight existing args to their new names.
- ``--push_to_hub`` is now a bool; the destination repo lives at
``--dest_repo_id``. Split the single positional into both args.
- ``openai`` was missing from the pip install list, which the prior
review (claude bot, 2026-05-08) flagged. The default vlm backend is
``openai``, so the job would have ImportError'd. Added.
Also expose the new phase 0 (canonical vocabulary discovery) knobs
explicitly: ``--vocabulary.sample_episodes``, ``--n_subtask_target``,
``--n_memory_target``. Defaults are sane (3 / 10 / 6) but worth
flagging in the example so the operator knows what they're running.
Update the docstring + section comments to match the current phase
layout (vocabulary → plan → interjections → vqa → writer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The pipeline previously emitted near-unique subtask + memory phrasings
per episode (free-form LLM rephrasing). On the downstream low-level
policy that collapses the action expert's conditioning to noise: every
episode pairs a different paraphrase with similar motions, so the
expert learns a flat scene-prior that ignores the subtask string —
then at inference the high-level head invents *yet another* paraphrase
and the expert produces tiny "uncertain hover" chunks.
Add a vocabulary-discovery phase (phase 0) that runs once per dataset:
- watches the first ``vocabulary.sample_episodes`` (default 3)
episode videos as one Qwen-VL prompt,
- asks the VLM to derive ~``n_subtask_target`` canonical imperative
subtask labels and ~``n_memory_target`` first-person past-tense
memory milestones that recur across the demos,
- persists them to ``meta/canonical_vocabulary.json`` (human-
inspectable, hand-editable), and
- wires the resulting ``Vocabulary`` into the ``plan`` module so
every per-episode subtask + memory call is constrained to those
exact strings (both as prompt-side instructions *and* post-VLM
validation: paraphrases snap to the closest canonical entry via
token-set overlap; below a 0.5 Jaccard floor the subtask is
dropped rather than warped into something semantically wrong).
Operator workflow:
- first run discovers the vocabulary, writes the JSON, and runs
the ``plan`` module against it,
- subsequent runs reuse the on-disk file (``reuse_existing=True``
default) so hand-edits stick,
- set ``--vocabulary.enabled=False`` to fall back to free-form
generation (the original behaviour).
The discovery prompt forbids gerunds / third-person / adverbs and
caps the lists to the requested counts, matching the Hi-Robot /
π0.6-MEM convention of small per-environment vocabularies. The
``plan`` module's subtask + memory prompts grow a conditional
``{vocabulary_block}`` slot rendered only when a vocabulary is
present; without one the templates collapse to their previous
free-form form.
Tests: 11 new unit tests under tests/annotations/test_vocabulary.py
cover the on-disk round-trip, discovery against the fixture dataset,
``reuse_existing`` short-circuit, paraphrase canonicalisation,
off-vocab subtask dropping, and the no-vocabulary pass-through path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Two runtime fixes that surfaced from on-robot testing.
(1) HighLevelSubtaskFwd was double-gated: HzTrigger fires every period
(e.g. every 5s at --high_level_hz=0.2) AND the step requires the
action queue to be empty. The queue-empty window is brief (~tens of
ms between drain and refill) and almost never coincides with the
low-hz timer, so HL effectively never fired and the subtask shown
in the runtime panel stayed on the dataset's frame-0 annotation.
Add HzTrigger.rearm() and have HighLevelSubtaskFwd call it when
skipping due to queue-non-empty — the trigger stays armed and tries
again on the next tick instead of waiting another full period.
LowLevelForward keeps the original "skip" semantics because chunk_hz
is meant as a true upper bound on chunk-generation rate.
(2) The "robot state at startup" warning in _build_robot_observation_provider
was meant to fire once but wasn't gated by _resize_logged like the
sibling "camera ... live=AxB" warning. Result: it spammed every
observation tick (~1-2s). Gate it on first_call (snapshot of
_resize_logged["done"]) so both logs fire once at session start.
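The rearm behaviour from fix (1) can be sketched as below. This is a simplified stand-in, not the runtime's HzTrigger: the ``should_fire`` method, the explicit ``now`` argument, and ``high_level_step`` are assumptions made so the example is self-contained.

```python
class HzTriggerSketch:
    """Hypothetical periodic trigger with a rearm() escape hatch."""

    def __init__(self, hz):
        self.period = 1.0 / hz
        self.next_fire = 0.0

    def should_fire(self, now):
        if now >= self.next_fire:
            self.next_fire = now + self.period  # consume the slot
            return True
        return False

    def rearm(self):
        # Give the slot back so the trigger fires again on the next tick.
        self.next_fire = float("-inf")


def high_level_step(trigger, now, queue_empty):
    """Return True when subtask generation actually runs on this tick."""
    if not trigger.should_fire(now):
        return False
    if not queue_empty:
        trigger.rearm()  # skipped, but do not wait another full period
        return False
    return True


trigger = HzTriggerSketch(hz=0.2)  # one slot every 5 s
ticks = [
    high_level_step(trigger, 0.0, queue_empty=False),
    high_level_step(trigger, 0.1, queue_empty=False),
    high_level_step(trigger, 0.3, queue_empty=True),
    high_level_step(trigger, 0.4, queue_empty=True),
]
```

Without the ``rearm()`` call the tick at 0.3 would be rejected and the next attempt would only happen around t = 5.0, when the queue is unlikely to be empty again.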
Co-authored-by: Cursor <cursoragent@cursor.com>
With knowledge_insulation=True the LM head only receives gradients on
text-CE samples (e.g. ~45% of the mix for subtask_mem.yaml). Under
aggressive cosine LR decay this is enough for the head's first-token
distribution to drift back toward PaliGemma's pretrained <loc>
detection prior — teacher-forced argmax stays high while autoregressive
generation collapses to <locDDDD> tokens.
Add `lm_head_lr_scale` (default 1.0, no behavior change) on PI05Config.
When != 1.0, PI05Policy.get_optim_params splits the policy into two
param groups: the PaliGemma lm_head projection plus its tied
embed_tokens at lr * lm_head_lr_scale, and the rest at lr. The cosine
scheduler multiplies both groups by the same lambda each step, so the
ratio is preserved across decay.
Recommended starting point for pi052 + subtask_mem.yaml runs: 5.0,
combined with a higher scheduler_decay_lr floor (e.g. 5e-6 instead of
1e-6) so the head doesn't get starved in the second half of training.
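To illustrate why the ratio survives decay, here is a small pure-Python sketch. The schedule function is a generic warmup + cosine multiplier written for this example (not the repository's scheduler), and the numeric defaults are taken from values quoted elsewhere in this log (lr 2.5e-5, 1000 warmup steps, 30000 decay steps, 5e-6 floor).

```python
import math


def cosine_lambda(step, warmup_steps=1000, decay_steps=30000, floor_ratio=0.2):
    """Illustrative LR multiplier shared by every param group."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, decay_steps - warmup_steps))
    return floor_ratio + (1.0 - floor_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))


def group_lrs(step, base_lr=2.5e-5, lm_head_lr_scale=5.0):
    """Effective LR of the two param groups at a given step."""
    lam = cosine_lambda(step)  # same lambda for both groups
    return {
        "rest": base_lr * lam,
        "lm_head": base_lr * lm_head_lr_scale * lam,
    }


ratios = [
    group_lrs(s)["lm_head"] / group_lrs(s)["rest"]
    for s in (0, 500, 1000, 15000, 30000, 50000)
]
```

Because both groups are multiplied by the same lambda, the head stays at 5x the base rate from warmup through the decay floor.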
Co-authored-by: Cursor <cursoragent@cursor.com>
PaliGemma's pretraining puts heavy first-token mass on its <loc0000>..
<loc1023> ids at any "Assistant:" continuation. Our pi052 fine-tunes
with knowledge_insulation=True and a small text-CE budget (~45% of
samples) drift back toward that prior on long runs at low LR:
teacher-forced argmax stays at 100% (CE only measures next-token given
correct prefix) while autoregressive first-token selection collapses onto <loc>.
On the running poulain11 checkpoint at step 8000 this manifests as a
stream of <locDDDD> tokens for every subtask call — confirmed locally
against the saved checkpoint on a dataset frame.
Add a `suppress_loc_tokens` knob to `PI052Policy.select_message` that
masks ids [256000, 257024) to -inf before sampling, and pass it from
the three text-only inference steps (HighLevelSubtaskFwd,
MemoryUpdateFwd, UserInterjectionFwd). VQA steps keep the default
False so spatial answers can still emit locs. Verified end-to-end:
suppressed → "the robot arm moves the blue block to the green basket".
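The masking step can be sketched in plain Python as below. The id range comes from this message; the list-based logits, the helper names, and the illustrative vocabulary length are assumptions (the real code operates on tensors inside ``select_message``).

```python
LOC_START, LOC_END = 256000, 257024  # half-open range of <locDDDD> ids


def suppress_loc_logits(logits):
    """Return a copy of the logits with every <loc> id set to -inf."""
    masked = list(logits)
    for i in range(LOC_START, min(LOC_END, len(masked))):
        masked[i] = float("-inf")
    return masked


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Illustrative logits: a <loc> id wins unless it is suppressed.
logits = [0.0] * (LOC_END + 8)
logits[256879] = 9.0  # a <loc> id with a high score
logits[3387] = 4.0    # an ordinary text token
unsuppressed = argmax(logits)
suppressed = argmax(suppress_loc_logits(logits))
```

VQA callers would skip the mask so spatial answers can still emit loc ids.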
Also fix `_msgs_for_memory`: it was emitting the older
`User: ${task}\nPlan:..\nMemory:..` / `Assistant: ${subtask}` template,
which no longer matches the `memory_update` recipe layout
(`User: ${task}` / `Assistant: Previous memory: ..` /
`User: Completed subtask: ..`). The new prompt mirrors the training
recipe; `HighLevelSubtaskFwd` stashes the just-completed subtask in
`state['prior_subtask']` so the memory prompt can render
`Completed subtask: ..` for `MemoryUpdateFwd`.
Co-authored-by: Cursor <cursoragent@cursor.com>
memory_update was bound to `emitted_at(t, style=memory)`, which requires
the frame's exact timestamp to match a memory annotation. Memory rows are
placed at subtask-boundary timestamps and at 30 fps that's ~1% of frames,
so 99% of memory_update draws couldn't render and silently fell through
to _fallback_low_level_render — injecting task-conditioned low-level
training on ~30% of samples (subtask_mem.yaml).
Switch to `active_at`. At inference `MemoryUpdateFwd` is triggered on
`subtask_change` events, but the model only needs to learn the stateless
mapping (prior_memory, completed_subtask) -> current_memory. active_at
supervises this mapping on every frame inside a subtask interval, against
varied observations; the trigger lives outside the model. Net effect:
memory_update renders on ~87% of frames, the fallback leak drops from
~30% to ~4%, and memory CE gets a meaningful (not 0.3%) training share.
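The difference in render rate between the two resolvers can be illustrated with a toy version. The semantics assumed here (``emitted_at`` needs an exact match, ``active_at`` returns the most recent row at or before the frame) are inferred from this message, and frames are integer indices rather than timestamps to keep the sketch simple.

```python
import bisect


def emitted_at(rows, frame):
    """Exact-match resolver: only frames that carry a row resolve."""
    for ts, text in rows:
        if ts == frame:
            return text
    return None


def active_at(rows, frame):
    """Interval resolver: most recent row at or before the frame (assumed)."""
    starts = [ts for ts, _ in rows]
    i = bisect.bisect_right(starts, frame)
    return rows[i - 1][1] if i > 0 else None


def render_rate(resolver, rows, n_frames):
    hits = sum(resolver(rows, f) is not None for f in range(n_frames))
    return hits / n_frames


memory_rows = [(30, "picked up the cube"), (60, "placed the cube in the bin")]
exact = render_rate(emitted_at, memory_rows, n_frames=90)
interval = render_rate(active_at, memory_rows, n_frames=90)
```

In this toy episode the exact-match resolver renders on 2 of 90 frames, while the interval resolver renders on every frame from the first memory row onward.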
subtask_mem.yaml: rebalance to 0.30 / 0.55 / 0.15 so memory CE is
~13% effective and the freed weight goes to low_level_execution.
subtask_mem_vqa_speech.yaml: keep weights (memory_update=0.10 was
already balanced against the other text-CE branches).
Co-authored-by: Cursor <cursoragent@cursor.com>
When fine-tuning from pi05_base, reuse only the pretrained weights so pi052 still generates recipe text labels and FAST action labels.
Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the PaliGemma LM head in float32 and initialize it from pretrained weights or token embeddings when loading pi05 checkpoints.
Co-authored-by: Cursor <cursoragent@cursor.com>
The parity check in debug_text_predictions was producing false ✗
DIVERGED reports. Root cause: I built the "inference" batch by
zero-masking the attention past the supervised span, but kept the
full 512-token padded sequence. select_message reads the prompt-end
hidden state via ``vlm_out[:, -1:]`` — the LAST position of the
prefix — which in a padded batch is a padding-token hidden state,
not the last prompt token. PaliGemma's prior on those padded
positions reliably argmaxes to <loc0879>, falsely flagging a
training/inference mismatch.
Fix: truncate both tokens AND mask to length == first_sup before
calling select_message, mirroring what the real runtime does
(``tokenizer(prompt)`` returns un-padded ids). Now the parity check
compares like-with-like.
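A toy illustration of the harness bug, using plain lists in place of tensors. The token ids and the padding id are made up for the example; only the "read the last position" versus "truncate to ``first_sup``" contrast reflects the description above.

```python
PAD_ID = 0  # hypothetical padding id


def last_position(ids):
    """Stand-in for reading vlm_out[:, -1:]: always the final position."""
    return ids[-1]


def truncate_to_prompt(ids, mask, first_sup):
    """Cut both the tokens and the mask at the first supervised position."""
    return ids[:first_sup], mask[:first_sup]


prompt = [2, 1645, 235292, 3387]  # illustrative prompt ids
padded_ids = prompt + [PAD_ID] * 4
padded_mask = [1] * 4 + [0] * 4

buggy_read = last_position(padded_ids)  # lands on padding
ids, mask = truncate_to_prompt(padded_ids, padded_mask, first_sup=len(prompt))
fixed_read = last_position(ids)  # lands on the last prompt token
```

Zeroing the mask alone does not help, because the last position of the padded sequence is still a padding slot.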
The actual training argmax in the dump was sensible English
("' move the blue cube into the green bin'" at acc=6/9) — the head
is learning correctly. The "<loc>" salad was purely the harness
reading from the wrong position.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At training time the policy is wrapped by Accelerator/DDP into a
.module attribute and custom methods are NOT proxied through the
wrapper, so ``hasattr(policy, "debug_text_predictions")`` was False
and the periodic dump was silently no-op'ing. Walk through .module
indirection to reach the raw PI052Policy that defines the method.
Also surface why the dump didn't fire (no method / empty supervised
positions / generation error) so users can see what's blocking it
instead of staring at silence.
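The ``.module`` walk can be sketched as follows. The helper name and the depth limit are hypothetical; the only behaviour taken from this message is "follow ``.module`` until an object that defines the method is found".

```python
def unwrap_policy(policy, method="debug_text_predictions", max_depth=8):
    """Follow .module indirection until an object defines `method`.

    Returns None when no wrapped object exposes the method.
    """
    for _ in range(max_depth):
        if hasattr(policy, method):
            return policy
        if not hasattr(policy, "module"):
            return None
        policy = policy.module
    return None


class RawPolicy:
    """Stand-in for the raw policy that defines the debug method."""

    def debug_text_predictions(self):
        return "ok"


class Wrapper:
    """Stand-in for a DDP-style wrapper that does not proxy custom methods."""

    def __init__(self, module):
        self.module = module


wrapped = Wrapper(Wrapper(RawPolicy()))
found = unwrap_policy(wrapped)
```

Returning ``None`` rather than raising lets the caller log why the dump did not fire.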
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the periodic LM-head dump (LEROBOT_DEBUG_PREDS_EVERY) to ALSO
run select_message autoregressively on the same prompt prefix and show:
prompt : '<bos>User: ... Assistant: '
target (ground truth) : ' close the gripper ...'
training argmax (teacher-fed) : ' close the gri lift ...' acc=12/15=80%
inference (autoregressive) : ' close the gripper around ...'
first-token parity : train=3387 (' close') vs infer=3387 (' close') ✓ MATCH
The first-token parity check is decisive: training-side argmax at the
prompt-end position and inference's first generated token both compute
``argmax(lm_head(h_last_prompt))`` on identical context, so they MUST
match. Any divergence signals a training↔inference bug (mask, dtype,
KI routing, embedding scale, etc.). Subsequent tokens can diverge
because training uses teacher forcing while inference free-runs.
debug_text_predictions now also returns an ``inference`` list keyed
by sample, each entry carrying ``first_sup_pos`` and ``decoded``.
Limited to 24 new tokens per sample to keep the dump fast.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in diagnostic that, every N training steps, dumps 5 batch
samples plus the LM head's argmax prediction at every supervised
position alongside the label and a ✓/✗ marker — the cheapest signal
for "is text training actually learning what we expect, or collapsing
to a fixed token". Refills the recipe-sample dump budget on the same
cadence so the raw input shapes are also re-dumped.
Opt in via env var:
LEROBOT_DEBUG_PREDS_EVERY=1000 lerobot-train ...
PI052 implements ``debug_text_predictions`` (mirrors the text-loss
forward but returns argmax instead of CE); other policies are silently
skipped. The dump runs in eval() mode under no_grad, slicing the
current batch to N samples — no extra data fetch, no train-state
mutation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The trained model collapsed to spewing 40+ <loc> tokens for *every*
prompt — subtask, memory, anything — because VQA targets were supervised
to *start* with <loc>. With ~25% of all text samples beginning with a
<loc> token, the LM head learned "Assistant: → <loc>" as a strong
attractor; once one loc is emitted, autoregression chains the rest.
Flip the format so every text target — subtask, memory, speech, AND VQA
— starts with a regular word. The model still learns the <loc>
vocabulary for the spatial portion of the answer, but loc can no
longer be the first generation step out of a clean prompt.
Examples:
point : "green box <loc0162><loc0759>"
bbox : "cube <loc0082>…<loc0409>"
multi : "blue <locs> ; yellow <locs>"
The runtime parser (parse_loc_answer) strips loc tokens and uses the
remainder as label, so it's order-tolerant and works under either
format. Old loc-first checkpoints still parse cleanly at inference;
new training will use label-first.
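An order-tolerant parse of this kind might look like the sketch below. It is not the runtime's ``parse_loc_answer``: the function name, the return shape, and the four-digit pattern are assumptions, and it handles a single labelled object only (not the ``;``-separated multi format).

```python
import re

LOC_RE = re.compile(r"<loc(\d{4})>")


def parse_loc_answer_sketch(text):
    """Split an answer into (label, loc values), regardless of token order."""
    locs = [int(m) for m in LOC_RE.findall(text)]
    # Remove the loc tokens, then collapse leftover whitespace into the label.
    label = " ".join(LOC_RE.sub(" ", text).split())
    return label, locs


label_first = parse_loc_answer_sketch("green box <loc0162><loc0759>")
loc_first = parse_loc_answer_sketch("<loc0162><loc0759> green box")
```

Both formats yield the same label and the same loc values, which is why old loc-first checkpoints keep parsing.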
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
select_message's bf16 cast used next(paligemma.parameters()).dtype,
which lands on a fp32-kept param (norm / embedding) under
to_bfloat16_for_selected_params. Mask stayed fp32 while q/k/v were
bf16 → SDPA still raised "invalid dtype for bias". Read the dtype
from layers[0].self_attn.q_proj.weight instead — q_proj is always
cast with the rest, so its dtype matches what SDPA sees.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_prepare_attention_masks_4d` always returns fp32 (the 0.0 / -inf
literals); with bf16 weights, HF PaliGemma's SDPA path raises
"invalid dtype for bias - should match query's dtype" and
select_message returns empty every step. Cast in both attention
sites: `_compute_layer_ki` (training, when both experts run) and
`select_message` (inference, VLM-only branch). Bf16 training +
bf16 inference now run end to end with no dtype mismatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the production target used in examples/annotations/run_hf_job.py.
Per Scale Labs' dense-captioning ablations, model capacity dominates
prompt-engineering gains; defaulting to the larger model avoids
shipping a worst-tier configuration out of the box.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
THE bug behind the <loc>-salad. PaliGemma's vocab reserves ids
[256000, 257023] for <locDDDD> detection / pointing tokens, but the
stock AutoTokenizer does NOT match them on raw text — it BPE-splits
<loc0162> into SEVEN pieces (<, loc, 0, 1, 6, 2, >). So a VQA target
like "<loc0162><loc0759> green box<eos>" tokenized to 16 pieces, not
5, and training the LM head supervised those generic BPE pieces
instead of one detection-vocab id. The piece logits got pumped up
across ~25% of supervised positions; at inference they dominated
every turn — even subtask prompts produced <loc>-salad followed by
the actual answer.
Register the 1024 <locDDDD> tokens via tokenizer.add_tokens once on
load, in every path the policy uses: PI052TextTokenizerStep (training
encode), _build_text_batch_pi052 (runtime encode), and
select_message's default tokenizer (runtime decode). Verified
empirically with the real PaliGemma tokenizer: VQA target now
tokenizes to 5 ids matching the loc-vocab range (256162, 256759, ...)
with correct offset_mapping.
This unlocks PaliGemma's actual detection prior; <loc>-salad cannot
recur because each <locDDDD> is a single class on the LM head, not a
character sequence the head accidentally learns to extend.
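For reference, a small sketch of the token strings and the id range
they are expected to land on, using only the numbers quoted above. The
variable names are illustrative; the actual registration goes through
tokenizer.add_tokens as described.

```python
LOC_BASE_ID = 256000   # first reserved id, from the range quoted above
NUM_LOC_TOKENS = 1024

# Zero-padded four-digit names: <loc0000> .. <loc1023>.
loc_tokens = [f"<loc{i:04d}>" for i in range(NUM_LOC_TOKENS)]
loc_token_to_id = {tok: LOC_BASE_ID + i for i, tok in enumerate(loc_tokens)}

# The commit passes a list like loc_tokens to tokenizer.add_tokens;
# this sketch only checks the expected string -> id correspondence.
print(loc_tokens[162], loc_token_to_id["<loc0162>"])
```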
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Confirmed empirically on the published dataset: VQA bbox/keypoint
coordinates are Qwen2.5-VL's 0–1000 normalized grounding output, NOT
pixels. Scanning 8207 samples showed x and y both spanning 0..1000
with ~30% of values exceeding the camera's pixel dimensions (which is
impossible if they were pixels).
_vqa_answer_to_loc was dividing by the observation image's H/W, so
e.g. point [742, 158] on a 640x480 wrist cam clamped x to <loc1023>
(the far-right edge) instead of mapping to <loc0760> (~74% across).
Fix: divide by 1000 — the actual Qwen scale. The conversion is now
camera-resolution-independent, so _camera_image_shapes and the
image_shapes plumbing through __call__ / _encode_messages /
_messages_vqa_to_loc are dropped. Tests updated to the new signature
and the 0–1000 round-trip.
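A sketch of the before/after arithmetic for the x coordinate of the
example point. The helper to_loc_bin and its rounding rule are
assumptions, so the fixed bin may differ by one from the figure quoted
above; the clamp-to-1023 failure does not depend on the rounding.

```python
NUM_BINS = 1024          # <loc0000> .. <loc1023>
QWEN_SCALE = 1000.0      # Qwen2.5-VL grounding range, per the commit


def to_loc_bin(value, scale):
    """Map a coordinate on [0, scale] to a loc bin, clamping overflow."""
    frac = min(max(value / scale, 0.0), 1.0)
    return round(frac * (NUM_BINS - 1))  # assumed rounding rule


x = 742
buggy = to_loc_bin(x, 640)         # old behaviour: divide by image width
fixed = to_loc_bin(x, QWEN_SCALE)  # new behaviour: divide by 1000
print(f"<loc{buggy:04d}> vs <loc{fixed:04d}>")
```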
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
base.fit() rejected the data with "Vocab size 1024 is too small for
the range of tokens 9339": the FAST tokenizer was fit on raw
motor-unit actions, whose DCT-token range vastly exceeds the 1024
codebook.
Two problems, one fix. (1) Raw actions blow up the token range. (2) At
training time ActionTokenizerProcessorStep runs after the QUANTILES
NormalizerProcessorStep, so it encodes normalized actions — fitting on
raw actions mismatches that space. Replicate QUANTILES normalization
(per-dim [q01,q99] -> [-1,1], clipped) before base.fit() so the fit and
the training-time encode see the same distribution and the token range
fits the codebook.
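A minimal sketch of the per-dimension mapping described above
([q01, q99] -> [-1, 1], clipped). The function name and the eps guard
are assumptions; the real step should reuse whatever the
NormalizerProcessorStep does so the two stay in sync.

```python
def quantile_normalize(action, q01, q99, eps=1e-8):
    """Map each dimension from [q01, q99] to [-1, 1] and clip outliers."""
    out = []
    for a, lo, hi in zip(action, q01, q99):
        scaled = 2.0 * (a - lo) / max(hi - lo, eps) - 1.0
        out.append(min(max(scaled, -1.0), 1.0))
    return out


# Toy quantiles: dim 0 spans [0, 10], dim 1 spans [-100, 100].
q01, q99 = [0.0, -100.0], [10.0, 100.0]
normalized = quantile_normalize([5.0, 250.0], q01, q99)
print(normalized)
```

After this step every value handed to the fit is bounded, which is what
keeps the DCT token range inside the codebook.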
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fit_fast_tokenizer collected action chunks via ds[i]["action"], which
builds a full training item — delta-timestamp expansion, video decode,
image transforms. A single video-decode failure threw, was swallowed
at debug level, and silently starved the fit of every chunk → "FAST
fit collected zero action chunks", falling back to the universal
tokenizer.
Read the ``action`` column straight from the HF dataset instead: it
carries no video, so it is immune to decode errors and far faster.
Also fail fast with a clear message when the dataset has no ``action``
feature or all episodes are shorter than chunk_size.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_compute_layer_ki called modeling_gemma._gated_residual, but that
adaRMSNorm gated-residual helper lives in lerobot's pi_gemma, not in HF
transformers, so enabling knowledge_insulation crashed with
AttributeError on the first training step. Import _gated_residual from
pi_gemma, matching pi05's own layer code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate
JSON, which fights PaliGemma's detection prior and leaks <loc>-token
salad at inference. Convert them to PaliGemma's native <locNNNN>
vocabulary instead so the LM head reuses that prior.
Training side (text_processor_pi052.py): a target turn whose content
parses as a bbox/keypoint answer is rewritten to <loc> text, using the
camera frame's native (H, W) from the observation and the preceding
image block. Non-spatial answers, subtask/memory targets and SmolVLA2
keep their JSON form — the dataset stays backbone-agnostic.
Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects
<loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized
[0,1] coords with a normalized flag; draw_vqa_overlay denormalizes
against the chosen camera frame's pixel size.
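A sketch of the runtime-side classification and denormalization. The
names parse_loc_coords and denormalize are hypothetical, and both the
bin -> [0, 1] mapping (bin / 1023) and the alternating (y, x) ordering
are assumptions made for illustration.

```python
import re

LOC_RE = re.compile(r"<loc(\d{4})>")


def parse_loc_coords(answer):
    """Return (kind, normalized coords) for a <loc> answer, or None."""
    bins = [int(b) for b in LOC_RE.findall(answer)]
    kind = {2: "keypoint", 4: "bbox"}.get(len(bins))
    if kind is None:
        return None  # not a spatial answer
    return kind, [b / 1023 for b in bins]  # assumed bin -> [0, 1] mapping


def denormalize(coords, height, width):
    """Scale alternating (y, x) normalized coords to pixel units (assumed order)."""
    return [c * (height if i % 2 == 0 else width) for i, c in enumerate(coords)]


kind, coords = parse_loc_coords("<loc0000><loc1023> cube")
pixels = denormalize(coords, height=480, width=640)
print(kind, pixels)
```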
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep action-only samples trainable by rendering the task as a low-level user message when no recipe branch matches.
Co-authored-by: Cursor <cursoragent@cursor.com>
PI052TextTokenizerStep masked text_labels over the assistant turn's
*content only* — the trailing newline was excluded and no EOS token was
ever a supervised label. So the LM head was never given a stop signal:
at inference select_message decoded to max_new_tokens, producing the
runaway subtask paragraphs and the "}"}"}-style VQA tails.
_format_messages now appends the tokenizer's EOS to each supervised
target turn and extends that turn's span to cover it, so the EOS lands
in text_labels. _shifted_ce then trains "<last content token> -> EOS"
and the model learns to terminate; select_message stops on it.
Inference callers (the runtime's _build_text_batch_pi052) pass no
target_indices / eos_token, so no EOS is baked into the prompt — the
model generates it. Verified end-to-end with the PaliGemma tokenizer:
the supervised span is `<content><eos>` and the trailing newline stays
unsupervised.
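A toy sketch of the resulting label layout, assuming a -100 ignore
label and made-up token ids. build_labels is a hypothetical helper, not
_format_messages; it only shows which positions end up supervised.

```python
IGNORE_INDEX = -100  # assumed ignore label
EOS_ID = 1           # toy EOS id for this sketch


def build_labels(prompt_ids, target_ids, newline_id):
    """Supervise <content><eos>; leave the prompt and trailing newline masked."""
    input_ids = prompt_ids + target_ids + [EOS_ID, newline_id]
    labels = (
        [IGNORE_INDEX] * len(prompt_ids)   # prompt: unsupervised
        + target_ids + [EOS_ID]            # target content plus EOS: supervised
        + [IGNORE_INDEX]                   # trailing newline: unsupervised
    )
    return input_ids, labels


ids, labels = build_labels(prompt_ids=[10, 11, 12], target_ids=[20, 21], newline_id=99)
print(ids)
print(labels)
```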
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime's text path was hard-wired to SmolVLA2: _build_text_batch
read policy.config.vlm_model_name (which PI052Config doesn't have) and
built a SmolVLM2 chat-template prompt. PI052/PaliGemma is not
chat-pretrained and trains on a flat `User: ... \nAssistant: ...`
prompt, so the runtime crashed or fed an out-of-distribution prefix.
- _build_text_batch now dispatches on policy.config.type: smolvla2 ->
chat template (renamed _build_text_batch_chat); pi052 -> flat
role-prefixed text via PI052TextTokenizerStep's own _format_messages /
_strip_blocks / _flatten_say_tool_calls, so the inference prefix
matches PI052 training exactly.
- Add a lerobot-pi052-runtime entry point (alias of the same main; the
policy type is read from the checkpoint) so the command name isn't
misleading. argparse prog now defaults to the invoked command name.
PI052's select_message / predict_action_chunk already work with the
runtime; this was the one SmolVLA2-only coupling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first-person memory narrative, task-rephrasing and initial-speech
prompt tweaks belong in the annotation pipeline itself. Applied to
feat/language-annotation-pipeline (#3471); reverting them here to the
merge-base so they drop out of this PR's diff. general_vqa.py keeps its
docstring fix since it references a recipe this PR introduces.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- module_1_memory: rewrite as an explicit first-person, past-tense
narrative ("I picked up...", "I opened...") matching the MEM
(Torne 2026) running-memory style, instead of "one or two short
sentences" with no person/tense guidance.
- module_1_task_rephrasings: bias rephrasings toward short imperative.
- module_2_initial_speech: prefer very short robot acknowledgements.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deterministic-plan rewrite, single-frame VQA (K 3->1), dataset
version tagging, telegraphic-subtask prompt and shorter interjection
prompt belong in the annotation pipeline itself, not in the SmolVLA
training PR. They have been applied to
feat/language-annotation-pipeline (#3471). Reverting these six files
here to the merge-base so
they drop out of this PR's diff; #3491 will inherit the canonical
versions when it next rebases on its base.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port the steerable-pipeline refinements developed on
feat/smolvla-on-steerable back into the annotation pipeline itself:
- module_1_subtasks: imperative verb-first telegraphic labels with a
consistent-object-noun rule and good/bad examples (no hard word cap).
- _generate_plan: drop the VLM round-trip; the plan is now a
deterministic numbered list of still-todo subtasks, re-emitted at
every subtask boundary so it shrinks as work progresses. Removes
module_1_plan.txt.
- VqaConfig.K 3 -> 1: a VQA pair anchors exactly its emission frame, no
stale-label temporal smear.
- lerobot-annotate: tag the pushed dataset with its codebase_version so
LeRobotDataset can resolve a revision and load it.
- module_2_interjection: shorter, more natural mid-task cues.
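The deterministic plan in the _generate_plan item can be sketched as
follows. render_plan is a hypothetical helper; whether numbering
restarts at 1 and whether the subtask that is just starting counts as
still-todo are assumptions of this sketch.

```python
def render_plan(subtasks, current_index):
    """Numbered list of the subtasks that are still to do."""
    remaining = subtasks[current_index:]
    return "\n".join(f"{i}. {s}" for i, s in enumerate(remaining, start=1))


subtasks = ["pick up cube", "open drawer", "place cube"]
# Re-emit the plan at every subtask boundary; it shrinks as work progresses.
for boundary in range(len(subtasks)):
    print(render_plan(subtasks, boundary))
    print("---")
```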
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite module_1_subtasks prompt to produce short imperative commands
("pick up the orange") instead of third-person narration ("the robot
arm moves to the orange"). Drops the verbose "how, not what" rule and
adds a good/bad few-shot table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert PI052 prefix-only attention masks before calling PaliGemma so text-only batches and generation use the same mask shape as fused training.
Co-authored-by: Cursor <cursoragent@cursor.com>
Route full PI05/PI052 fine-tuning through PyTorch's fused AdamW path to avoid the single-tensor Adam denominator allocation near GPU memory limits.
Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid the multi-tensor AdamW temporary that can OOM full PI05/PI052 fine-tuning near GPU memory limits.
Co-authored-by: Cursor <cursoragent@cursor.com>
Select only supervised text and FAST action-code positions before cross-entropy to avoid full-vocabulary loss tensors over padded sequences.
Co-authored-by: Cursor <cursoragent@cursor.com>
Use PI05Policy helpers for action padding and image preprocessing in PI052 fused losses instead of looking them up on the inner PI05Pytorch module.
Co-authored-by: Cursor <cursoragent@cursor.com>
Tokenize batched recipe outputs in PI052 so training batches with nested message lists do not crash before model forward.
Co-authored-by: Cursor <cursoragent@cursor.com>
Mask the FAST auxiliary loss to discrete action-code tokens so wrapper formatting tokens do not affect action co-training.
Co-authored-by: Cursor <cursoragent@cursor.com>
Module 3 anchored each VQA emission tick to K=3 consecutive frames
(~0.1s at 30fps). The VLM grounds the answer — bbox/keypoint
coordinates especially — against the first frame's image, so copying it
onto frames 2-3 smears a stale label over a moving scene.
Default K=1: a VQA pair lands on exactly its emission frame, no
temporal smear. VQA frames get sparser; the WeightedEpisodeAwareSampler
(vqa_target_fraction) is the knob to compensate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The decoder chain tried torchcodec first, then ffmpeg. torchcodec is
not thread-safe: under the executor's 16-wide concurrent decode in the
interjections phase it SIGSEGVs (exit 139) before the ffmpeg fallback
is ever reached — uncatchable, so it kills the whole job.
Default the auto chain to ffmpeg only. Per-frame ffmpeg decode runs in
an isolated child process: crash-safe and concurrency-safe (the plan
phase already proved 16 parallel ffmpeg subprocesses are fine).
torchcodec / pyav remain available via an explicit video_backend.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PyAV segfaulted (exit 139) decoding the AV1 streams modern LeRobot
datasets use — a SIGSEGV that the per-episode try/except cannot catch,
killing the whole job when the interjections phase started.
Replace the PyAV fallback with _decode_frames_ffmpeg, which shells out
to the ffmpeg CLI: a full ffmpeg build decodes AV1, and a child-process
crash is a catchable non-zero exit rather than a segfault. Decoder chain
is now torchcodec -> ffmpeg. _decode_frames_av stays available behind
video_backend="pyav" for callers that want it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- hirobot.yaml -> subtasks_vqa.yaml
- hirobot_memory.yaml -> subtask_mem_vqa_speech.yaml
- pi05_hirobot.yaml -> deleted (stale: uses plan, top-camera names;
superseded by the two recipes above)
- smolvla2_hirobot.yaml -> deleted (was untracked stale junk)
Updated the smolvla2 / pi052 `recipe_path` config defaults, all
docstring / comment references, the annotation-pipeline + recipe docs,
and the three tests that loaded pi05_hirobot.yaml (repointed to the
renamed recipes; the low-level-branch and pipeline-render assertions
now accept a flow-only `low_level` stream as valid supervision, since
the new recipes' low_level_execution has no text-CE target).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pyav fallback routed through lerobot's decode_video_frames(backend=
"pyav"), which uses torchvision.io.VideoReader — removed in torchvision
0.23+. On modern torch stacks (e.g. vllm-openai with torchvision 0.26)
both torchcodec and that path fail, leaving interjection/vqa prompts
without visual context.
Add _decode_frames_av: a self-contained PyAV decoder that picks the
nearest frame per timestamp. It is the always-available tail of the
decoder chain (torchcodec -> pyav) and the target of --video_backend=pyav.
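The nearest-frame selection can be sketched without any video library.
This is an illustration of the lookup only, assuming the frame
timestamps are already known and sorted; nearest_frame_indices is a
hypothetical name, and the actual decoding in _decode_frames_av is not
shown.

```python
from bisect import bisect_left


def nearest_frame_indices(frame_times, query_times):
    """For each query timestamp, return the index of the closest frame time.

    frame_times must be sorted in ascending order and non-empty.
    """
    indices = []
    for t in query_times:
        pos = bisect_left(frame_times, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(frame_times)]
        indices.append(min(candidates, key=lambda i: abs(frame_times[i] - t)))
    return indices


frame_times = [i / 30 for i in range(90)]  # 3 s of video at 30 fps
picked = nearest_frame_indices(frame_times, [0.0, 0.51, 10.0])
print(picked)
```

Timestamps past the end of the stream fall back to the last frame
rather than raising.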
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pi052 had the same text-CE collapse bug smolvla2 had — PaliGemma's
embed_prefix flags the language block att=0, so make_att_2d_masks makes
it fully bidirectional and the text cross-entropy degenerates into a
copy task. Ported the three model-specific fixes:
- _mark_target_span_causal: set att=1 on supervised target language
positions so the text-CE is genuine causal next-token prediction.
Applied in both _compute_all_losses_fused and _compute_text_and_fast_loss.
- flow_loss_weight 10.0 -> 5.0: the paper's α=10 swamps the LM head once
the flow-only low_level recipe fires often (matches SmolVLA2Config).
- _flatten_say_tool_calls in the text tokenizer: serialize `say` tool
calls into a <say>...</say> marker so the spoken reply is tokenized
and supervised (PaliGemma's flat prompt has no structured calls, so
they were dropped entirely).
select_message needed no change: pi052's prefix is [images, language]
with no trailing state token, so it already decodes from the last
language token.
Regression tests mirror the smolvla2 attention-masking + tool-call suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VQA annotations are sparse, so VQA was badly underrepresented in training:
its effective share was weight x density, and blend draws that picked an
ask_vqa* sub-recipe for a non-VQA frame were wasted entirely.
Two pieces:
1. Recipe-side consumption (language_render.py): render_sample now routes
any frame that carries a VQA annotation to a matching ask_vqa* sub-recipe,
regardless of the weighted blend draw. No VQA annotation is wasted and no
draw lands on a non-renderable VQA recipe — VQA's recipe-side share now
equals the VQA-annotation density.
2. Dataset-side oversampling (WeightedEpisodeAwareSampler + vqa_target_fraction):
a new weighted, episode-aware sampler draws frames with replacement by
per-frame weight. When TrainPipelineConfig.vqa_target_fraction is set, the
train script scans language_events, weights VQA frames so they make up
~that fraction of the training stream, and uses the weighted sampler. This
is what actually lets VQA exceed its natural density. Default None keeps
uniform episode-aware sampling unchanged.
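A simplified sketch of the weighting in item 2, assuming a per-frame
boolean VQA flag. vqa_frame_weights is a hypothetical helper, and the
episode-aware part of the real sampler is left out; the sketch only
shows how per-frame weights turn a 5% natural density into a ~30%
share when drawing with replacement.

```python
import random


def vqa_frame_weights(is_vqa, target_fraction):
    """Per-frame weights so VQA frames get ~target_fraction of the draws."""
    n_vqa = sum(is_vqa)
    n_other = len(is_vqa) - n_vqa
    if n_vqa == 0 or n_other == 0:
        return [1.0] * len(is_vqa)  # nothing to rebalance
    w_vqa = target_fraction / n_vqa
    w_other = (1.0 - target_fraction) / n_other
    return [w_vqa if flag else w_other for flag in is_vqa]


is_vqa = [i % 20 == 0 for i in range(1000)]  # 5% natural VQA density
weights = vqa_frame_weights(is_vqa, target_fraction=0.3)
rng = random.Random(0)
draws = rng.choices(range(len(is_vqa)), weights=weights, k=20_000)
share = sum(is_vqa[i] for i in draws) / len(draws)
print(f"VQA share of sampled frames: {share:.2f}")
```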
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VideoFrameProvider decoded keyframes via torchcodec only. Some containers
(e.g. vllm-openai) ship a torchcodec that cannot push packets to the
decoder ("Operation not permitted"), silently degrading interjection/vqa
prompts to no visual context.
_decode now retries with pyav when the default backend raises, and a new
`video_backend` config field lets callers pin the backend explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional `dest_repo_id` to AnnotationPipelineConfig. When set,
`push_to_hub` uploads the annotated dataset there instead of overwriting
the source `repo_id`, restoring separate source/destination repos.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The VLM already sees every camera, so the operator never needs to name
one to ask a question. Move the camera prompt to after generation and
only fire it when the answer actually carries a bounding box / point
(whose pixel coordinates are camera-specific and need a target frame).
Non-spatial answers (count / attribute / spatial / plain text) now skip
the prompt entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handle_vqa_query filtered the observation down to the single chosen
camera before calling the VLM. But training feeds every camera: the
ask_vqa_* recipes' image blocks are stripped before tokenization and
the frames reach the model via OBS_IMAGES_*, where embed_prefix
consumes all config.image_features regardless of the per-camera recipe
tag. Filtering to one camera changed the image-token count in the
prefix (the dropped camera zero-padded with mask=0) — a prefix shape
the model never saw at training.
Now the full observation is passed to select_message; the chosen
camera is used only to pick which frame the bbox/point overlay is
drawn on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Command arguments never needed quotes (`_strip_quotes` only strips a
matching pair if present) — `/question point to the yellow cube` works.
The hints wrongly implied `""` were required; all hints/help now show
`/action <task>` / `/question <text>`.
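The quote handling can be sketched like this. strip_matching_quotes is
a hypothetical stand-in for _strip_quotes, and accepting both single
and double quotes is an assumption of the sketch.

```python
def strip_matching_quotes(text):
    """Strip one surrounding pair of matching quotes, if present."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ('"', "'"):
        return text[1:-1]
    return text


quoted = strip_matching_quotes('"point to the yellow cube"')
bare = strip_matching_quotes("point to the yellow cube")
print(quoted == bare)
```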
Also adds a reference line to the state panel showing the two
overlay-producing VQA prompt shapes:
/question point to the yellow cube -> point overlay
/question detect the blue cube -> bounding-box overlay
plus the same examples in /help.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the startup mode prompt + task picker with a single
command-driven prompt. The runtime now comes up immediately at the
command line in `paused` mode (robot idle) and the operator drives it:
  /action "task"    run the robot on a task (bare = resume, number = timed burst)
  /pause            stop the action loop; robot holds position
  /question "..."   pause and answer one VQA question (camera prompt + overlay)
  /help / stop
- Removed _select_mode_interactively / _select_task_interactively /
_dataset_task_strings (the interactive pickers).
- mode value renamed "question" -> "paused"; --mode choices are now
action|paused (default paused).
- /question takes the question inline and runs it via _handle_slash_command
(pauses first, so the policy isn't used concurrently).
- The ENTER-to-start gate only fires when starting in action mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the operator skip the interactive startup entirely and go straight
to the command line:
- New --mode {action,question} arg; when given, the startup mode prompt
is skipped.
- When --task is passed explicitly on the CLI, the startup task picker
is skipped (the dataset-bootstrap task still shows the picker so you
can override it).
Also adds a timed action burst: /action <seconds> runs the robot for N
seconds, then the autonomous loop auto-reverts to question mode and
clears the action queue. Plain /action stays unlimited.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New recipe alongside hirobot.yaml (kept as the lean baseline). Superset
that adds two text-supervised sub-recipes:
- memory_update: compress progress into a memory note.
- user_interjection_response: reply to a user interjection with a `say`
tool call only (no plan/subtask text). The SmolVLA2 chat tokenizer
flattens the call to a `<say>...</say>` marker the runtime parses back.
Plan is intentionally omitted; memory is the only persistent high-level
state. Weights: low_level 0.40, subtask 0.25, memory 0.10, interjection
0.10, vqa 0.075 x2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a mode prompt at startup, shown before the task picker, so the
operator chooses action (run the robot) vs question (VQA only) up front
instead of having to discover /vlm mid-run.
Also rename the VQA mode from "vlm" to the clearer "question":
- state["mode"] value is now "action" | "question"
- the command is /question (/vlm and /vqa kept as aliases)
- panels, hints and help text updated to match
handle_vqa_query now reports via both push_log and direct stdout, so
VQA answers / overlay paths are visible in autonomous question mode
where the panel redraw is suspended.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autonomous panel redraw cleared the screen every 0.5s, so the "> "
prompt and the one-shot command hint vanished — the operator could not
see what to type or what they were typing, making /vlm unreachable.
- Suspend the timer redraw entirely while in /vlm mode (the action loop
is paused, nothing changes in the background) so the VQA question and
camera prompt stay on a stable screen.
- Re-print the "> " prompt after each redraw so it is always visible.
- Show an always-on command hint in the panel (/vlm, /help, /action)
instead of relying on the startup line that scrolls away.
- Redraw immediately after a slash command so the mode flip is visible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The picker was skipped whenever a task was already resolved — which is
always the case with --dataset.repo_id, since the dataset's canonical
task is auto-filled. The operator never got to choose. Now the picker
always runs on an interactive terminal: the resolved task is shown as
"(current)" and selected by an empty Enter, so the dataset-canonical
default still works while letting the operator pick another task or
type a custom one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses three of CarolinePascal's frames.py comments (the fourth, the
subprocess re-encode, waits on #3611):
- replace the bespoke _decode_pyav_direct PyAV decoder with
lerobot.datasets.video_utils.decode_video_frames (torchcodec backend,
PyAV fallback); the torchvision VideoReader removal concern no longer applies
- frames flow through the provider as torch.Tensor (C, H, W uint8); PIL
is materialised only at the VLM-message boundary in to_image_blocks /
to_video_block, where the chat backends need it
- _decode now returns exactly one frame per timestamp (or [] on failure),
so frames_at pairs them with strict=True
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- name the three modules everywhere (plan / interjections / vqa) instead
of module_1/2/3 — config classes, config fields, executor params,
staging keys and phase names now carry the module name
- rename examples/annotation -> examples/annotations; add the Apache
header to run_hf_job.py
- drop the unused GeneralVqaModule._generate_one
- remove "PR 1" references from comments/docstrings
- frames.py: rely on the always-defined LeRobotDatasetMetadata.camera_keys
- executor.py: read/write meta/info.json via load_info / write_info
- reader.py: load meta/tasks.parquet via io_utils.load_tasks
- make --push_to_hub a bool; push the annotated dataset back to --repo_id
- move the on-disk test dataset builder into tests/fixtures
(build_annotation_dataset); run_e2e_smoke reuses it
- clarify in the docs that the vqa module grounds each pair on a single
frame (K = per-tick anchor count)
- hoist stdlib dynamic imports to module scope
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions to the SmolVLA2 interactive runtime:
1. Startup task picker — when no --task is given, the runtime lists the
dataset's task strings as a numbered menu (plus a custom-task option)
instead of silently waiting for the first stdin line.
2. Mode toggle — /action and /vlm slash commands flip a persistent run
mode. /vlm pauses the whole action loop (HighLevelSubtaskFwd,
LowLevelForward and DispatchAction gate on state["mode"]) and clears
the action queue so the robot holds position; /action resumes it.
The mode is shown in the state panel.
3. Interactive VQA — in /vlm mode a typed line is a VQA question. The
new inference/vqa.py module asks which camera to ground on, runs the
VLM on that single camera, and when the answer is a bbox/keypoint it
draws the overlay, saves a PNG to ./vqa_overlays/ and auto-opens it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat tokenizer passed assistant `tool_calls` straight to
`apply_chat_template`, which renders them as a structured JSON
`<tool_call>` block — so the LM head was trained to emit JSON. But the
inference parser `_split_plan_and_say` looks for a `<say>...</say>`
marker, which the model never saw in training, so the `say` tool never
fired at inference.
`_flatten_say_tool_calls` is the missing training-time serializer (the
one `_split_plan_and_say`'s docstring already assumed existed): it
rewrites a `say` tool call into a `<say>...</say>` marker inside the
content text before the chat template runs, so the template only
tokenizes plain text and the supervised target span trains the model to
emit exactly the marker the runtime parses back (Pi 0.5-style flat
tool-call serialization).
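A sketch of the flattening step, assuming an OpenAI-style tool-call
dict whose arguments carry a "text" field. That schema and the function
body are assumptions for illustration, not the actual
_flatten_say_tool_calls.

```python
def flatten_say_tool_calls(message):
    """Fold `say` tool calls into the content text as <say>...</say> markers."""
    parts = [message.get("content") or ""]
    kept_calls = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        if fn.get("name") == "say":
            text = fn.get("arguments", {}).get("text", "")  # assumed schema
            parts.append(f"<say>{text}</say>")
        else:
            kept_calls.append(call)  # leave other tools untouched
    flat = {k: v for k, v in message.items() if k != "tool_calls"}
    flat["content"] = " ".join(p for p in parts if p)
    if kept_calls:
        flat["tool_calls"] = kept_calls
    return flat


msg = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"type": "function", "function": {"name": "say", "arguments": {"text": "On it."}}}
    ],
}
flat_msg = flatten_say_tool_calls(msg)
print(flat_msg)
```

Because the flattened message has no tool_calls left, the chat template
only ever sees plain text containing the marker.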
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's 1e-4 is safe only because it freezes the language head. SmolVLA2
unfreezes lm_head + the last text layer and fine-tunes the pretrained
SmolVLM2 language weights; 1e-4 is too aggressive there and destabilises
generation into degenerate repetition. Match pi05's 2.5e-5 peak LR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Pi 0.5 α=10 split assumed text is a rare auxiliary task. With the
flow-only `low_level` recipe (~40% of the blend) now rendering, the flow
term fires often and at 10x weight dominates the shared VLM backbone,
starving the text head into degenerate repetition decoding. A 5:1 split
keeps actions primary while leaving the language head enough gradient.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A recipe whose only supervision is the action-expert flow loss (e.g.
`low_level_execution`: `user(${subtask})` with `stream: low_level` and no
`target` turn) was rejected at render time by `_render_message_recipe` and
`_validate_rendered`, both of which required at least one target turn.
The result: every blend draw of the flow-only recipe rendered to `None`,
`predict_actions` was never set, `run_flow` never fired, and the action
expert received no flow loss — leaving it at random init. Both gates now
also accept a `low_level`-stream turn as valid supervision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the text-CE collapse bug fixed in 3cd348ff.
Pure-function tests over ``_mark_target_span_causal`` /
``_locate_lang_range`` / ``make_att_2d_masks`` — no model load, fast.
Pins:
* the target span flips to att=1, prompt/images stay att=0;
* target tokens attend causally among themselves (no peeking at
future targets) — genuine next-token prediction;
* targets still attend bidirectionally to images + the user prompt;
* the action-expert (state) token still attends to every target;
* a no-target subtask (low_level_execution user turn, labels all
-100) leaves the mask bidirectional;
* an explicit test documenting the bug: the raw embed_prefix mask
lets the first target token see the last — the copy-task collapse.
Skips cleanly when transformers isn't installed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of every collapsed inference run. ``embed_prefix`` flags
all language tokens ``att=0``; ``make_att_2d_masks`` turns that into
a single fully BIDIRECTIONAL block. So during the text-loss forward,
a supervised subtask token's hidden state attends to the very tokens
it is trained to predict. The cross-entropy degenerates into a copy
task — ``text_loss → ~3e-5`` not because the model learned to
predict subtasks but because it can see the answer.
At inference ``select_message`` decodes autoregressively (causally):
each token must be predicted WITHOUT seeing it — a task the model
was never actually trained on. Hence the universal collapse: a
coherent first token or two ("grasp the yellow cube"), then a loop
("cover cover cover", "icatorsicators", "the the the").
Fix: ``_mark_target_span_causal`` sets ``att=1`` on the language
positions that are supervised targets (``text_labels != -100``).
With make_att_2d_masks's cumulative-block rule each target token
then attends to images + the user prompt bidirectionally and to
EARLIER target tokens only — genuine causal next-token prediction,
matching select_message. Applied in both ``_compute_text_loss`` and
``_compute_fused_loss``. Per-sample correct: high_level_subtask
targets become causal; low_level_execution subtasks (a user turn,
labels all -100) stay bidirectional so the action expert reads them
as bidirectional context. The action expert is otherwise unaffected
— the suffix has a strictly higher cumsum and still attends to the
whole prefix.
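The effect of the cumulative-block rule can be reproduced in a few
lines. This is a simplified pure-Python re-implementation for
illustration (no padding mask, no batching); cumulative_block_mask and
mark_target_span_causal here are sketches of the rule described above,
not the library functions.

```python
from itertools import accumulate


def cumulative_block_mask(att):
    """Query i may attend to key j iff cumsum(att)[j] <= cumsum(att)[i]."""
    cumsum = list(accumulate(att))
    n = len(att)
    return [[cumsum[j] <= cumsum[i] for j in range(n)] for i in range(n)]


def mark_target_span_causal(att, labels, ignore_index=-100):
    """Set att=1 on supervised target positions, leave the rest unchanged."""
    return [1 if lab != ignore_index else a for a, lab in zip(att, labels)]


# Layout: 2 image tokens, 2 prompt tokens, 3 supervised target tokens.
att = [0] * 7
labels = [-100, -100, -100, -100, 5, 6, 7]

before = cumulative_block_mask(att)
after = cumulative_block_mask(mark_target_span_causal(att, labels))
# (first target sees last target?) before vs after, then two sanity checks.
print(before[4][6], after[4][6], after[6][4], after[4][0])
```

Before the fix the first target position can see the last one; after
it, each target sees only the prefix and earlier targets.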
Requires retraining: this changes the training objective. Existing
checkpoints were all trained on the degenerate copy task and cannot
generate text. Expect ``text_loss`` to settle MUCH higher than 3e-5
after this — that is correct; it is now a real prediction task.
NOTE: pi052's text path (PaliGemma prefix-LM) has the same
bidirectional-block structure and needs the analogous fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``embed_prefix`` lays the prefix out as ``[images, lang, state]`` with
the state token LAST. Training supervises the text head on the
*language* positions (``_compute_text_loss`` / ``_compute_fused_loss``
slice ``prefix_out[lang_start:lang_end]`` and run lm_head there).
But ``select_message`` started AR generation from the full prefix and
read ``prefix_out[:, -1:]`` — the **state token** — to decode the
first subtask token. The state token's hidden state exists for the
action expert to read; the lm_head was never trained to produce
subtask text from it. So inference decoded the high-level head from a
position entirely outside the training distribution: the text head
collapses (``the arm the arm``, ``grasp the surface population``,
``_333 absburg…``) no matter how cleanly ``text_loss`` converged.
Fix: truncate the state token off the prefix before the AR loop, so
``prefix_out[:, -1:]`` is the last language token (right after the
``Assistant:`` generation prompt) — exactly where training supervised.
Inference-only change — no retraining needed; existing checkpoints
benefit immediately. The action path (``predict_action_chunk``) is
untouched: state belongs in the action expert's prefix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLAConfig defaults ``load_vlm_weights=False``. With that and no
``--policy.path``, ``SmolVLMWithExpert.__init__`` builds the VLM via
``SmolVLMForConditionalGeneration(config=...)`` — i.e. a fully
**random-initialised** 500M backbone, including a random ``lm_head``.
For plain SmolVLA that's a deliberate "pre-train the expert" mode.
For SmolVLA2 it's a footgun: the high-level text head *is* the
SmolVLM2 ``lm_head``. Training subtask prediction from a random
language model can only memorise — which is exactly the repetition
collapse seen on the real robot ("the arm the arm the arm …").
SmolVLA2 now defaults ``load_vlm_weights=True`` so every run
fine-tunes the pretrained ``HuggingFaceTB/SmolVLM2-500M-Video-Instruct``
backbone (vision tower + language model + lm_head). The action
expert still trains from scratch on the robot data (standard SmolVLA
fine-tuning); start it from pretrained too by fine-tuning a full
``lerobot/smolvla_base`` checkpoint via ``--policy.path``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tighten the subtask prompt further per real-data feedback. The old
≤5-word cap still produced things like "release the yellow block
into the green bin" (8 words, articles, destination, and "block"
where the task said "cube").
New rules:
* Hard cap ≤ 4 words, ideally 2-3. Form: VERB + (color) + OBJECT.
* No articles, no destinations, no adverbs, no "robot/arm/gripper".
* Must reuse the exact object nouns from the task — no block/cube,
bin/box/container drift across the episode.
* Concrete good/bad examples anchored on the cube task.
Shorter, templated, consistent targets are far more robust for the
autoregressive LM head — fewer tokens to drift on, fewer dominant
n-grams to repetition-collapse into.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``_looks_like_gibberish`` low-unique-token check was gated on
``len(stripped) < 80``, so an LM head that loops an n-gram for the
whole 256-token budget — "the arm the arm … the the the the" —
sailed straight through (``gibberish:0`` in the panel) and the
garbage subtask got accepted and fed to the action expert.
Added a length-independent check: ``>= 8 tokens`` but unique-token
count ``<= max(3, tokens // 10)`` ⇒ repetition collapse. Now the
runtime rejects the looped output and keeps the previous (real)
subtask instead of propagating nonsense.
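A sketch of the length-independent check, assuming "tokens" means
whitespace-separated words (the real ``_looks_like_gibberish`` may
tokenize differently; the function name below is hypothetical):

```python
def looks_like_repetition_collapse(text: str) -> bool:
    """Flag outputs with >= 8 tokens but very few distinct tokens."""
    tokens = text.split()  # assumption: whitespace tokenization
    if len(tokens) < 8:
        return False
    return len(set(tokens)) <= max(3, len(tokens) // 10)


looped = "the arm " * 20          # 40 tokens, 2 unique
short = "grasp yellow cube"       # under the 8-token floor
varied = "pick up the yellow cube and place it in the green bin"
```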
This is a guard, not a cure: the underlying issue is that the LM head
on the current checkpoint is undertrained / collapsed. The cure is to
re-annotate with the short prompts and train longer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The current recipe trains neither plan nor memory, and no inference
step consumes them — ``_msgs_for_subtask`` renders the bare task and
``LowLevelForward`` conditions on the subtask. Bootstrapping
``current_plan`` / ``current_memory`` from the dataset's
``language_persistent`` annotations therefore only placed a stale,
do-nothing plan in the status panel.
Keep seeding ``current_subtask`` — it's a useful first-frame
fallback for ``LowLevelForward`` before ``HighLevelSubtaskFwd``
produces its first subtask.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the previous "condition actions on the task" shortcut.
The action expert is conditioned on the SUBTASK again:
* ``low_level_execution`` recipe back to ``user(${subtask})``.
* ``LowLevelForward`` conditions on ``current_subtask`` (falls back
to the task only on the first frame, before the high-level loop
has produced a subtask).
* ``HighLevelSubtaskFwd`` re-added to the runtime pipeline so the
subtask is actually generated each high-level tick and written to
``current_subtask`` before ``LowLevelForward`` consumes it.
* ``_msgs_for_subtask`` now renders just ``${task}`` (no
``Plan: ``/``Memory: `` lines) to match the current
``high_level_subtask`` recipe, whose user turn is the bare task.
So the loop is: task → HighLevelSubtaskFwd (LM head) → subtask →
LowLevelForward → action chunk conditioned on that subtask.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot runs shook and failed the task despite a low flow loss.
Root cause: train/inference conditioning mismatch — not a flow-loss
bug (``_compute_fused_loss``'s flow path is byte-identical to
``SmolVLAModel.forward``).
At training, ``low_level_execution`` conditioned the action expert
on ``${subtask}``, and every frame's subtask was the correct one
for that frame. At inference the runtime has no high-level subtask
generator (VQA-only pipeline), so ``current_subtask`` was frozen —
the action expert got "move towards the blue cube" for the entire
episode. Once the arm reached the cube, that (image, subtask) pair
never occurred in training → OOD conditioning → incoherent flow
output → shaking.
Fix: ``low_level_execution`` now renders ``user(${task})``. The
task is stable for the whole episode and always available, so the
action expert's conditioning is identical at train and inference
with no high-level loop required. ``LowLevelForward`` updated to
build the same ``[user(task)]`` prompt.
``high_level_subtask`` still trains the text head to predict
subtasks (kept for when a reliable subtask loop is reintroduced) —
it's just no longer on the action expert's critical path.
Requires re-training for the recipe change to take effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scope reduction while the core subtask + action loop is validated:
Recipe (hirobot.yaml)
* Removed ``plan_generation`` sub-recipe entirely.
* Removed the memory tail from ``high_level_subtask`` (the
``new_memory`` binding + the second assistant turn).
* ``high_level_subtask`` user turn is now just ``${task}`` — no
``Plan: …\nMemory: …`` context.
* Weights rebalanced over the four remaining sub-recipes:
high_level_subtask 0.40, low_level_execution 0.40,
ask_vqa_top/wrist 0.10 each.
Runtime (inference/runtime.py)
* Pipeline trimmed to VQA + the action loop:
AskVQAFwd → LowLevelForward → DispatchAction → DispatchToolCalls.
* Dropped HighLevelSubtaskFwd / MemoryUpdateFwd / UserInterjectionFwd
from the default pipeline. They remain importable from
``inference.steps`` for when plan/memory/subtask generation is
brought back. The action expert conditions on the task string
directly via LowLevelForward's ``current_subtask or task``
fallback.
This commit lands on top of a rollback of the previous two commits
(repetition_penalty / no_repeat_ngram_size knobs, and the
deterministic plan-walker) — both were bandaids for the LM-head
repetition collapse that the reduced-scope recipe sidesteps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PI052 and PI0_FAST both load ``physical-intelligence/fast`` as
their action tokenizer. That tokenizer's HF backend requires
``sentencepiece`` (or ``tiktoken``) to instantiate; without either,
``AutoProcessor.from_pretrained`` raises:
    ValueError: Couldn't instantiate the backend tokenizer from one of:
    (1) a tokenizers library serialization file,
    (2) a slow tokenizer instance to convert or
    (3) an equivalent slow tokenizer class to instantiate and convert.
    You need to have sentencepiece or tiktoken installed [...]
``sentencepiece`` wasn't listed in pyproject, so fresh installs missed it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dumper was printing ``stream=None target=None`` for every
message because it read those fields off the message dicts, but
the recipe renderer keeps them in parallel arrays
(``message_streams`` / ``target_message_indices`` in
COMPLEMENTARY_DATA) so the chat template doesn't see unknown
keys. Zip them back into the dump-time dicts so the printed
metadata is accurate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Subtask and memory annotations both feed into the high-level prompt
and the plan rendering, so
keeping them short directly reduces the rendered ``${task}\nPlan:
…\nMemory: …`` prefix the model has to chew through at inference.
Subtasks
* Hard cap: ≤ 5 words. Verb + object only, drop articles/adverbs.
* Concrete good/bad examples to anchor the VLM.
Memory
* Hard cap: ≤ 10 words. Telegraphic noun→location fragments
("bowl in box, lid open"), no past-tense verbs, drop attributes
that don't matter for downstream subtasks.
* Allow empty string when no material change occurred — keeps the
rendered memory line literally blank instead of forcing a no-op
sentence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously a plan was emitted only at t=0 and on interjections, so the
active plan rendered into training carried "done" subtasks until
the next interjection. With the new "plan = remaining subtasks"
summariser this meant the plan was stale between boundaries.
Emit a fresh plan row at every subtask start. ``active_at(t)`` then
returns a plan that contains exactly the subtasks whose start ≥
the current span's start — completed subtasks fall off the plan
the moment the next subtask begins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The plan was being generated by a separate VLM call (one per
episode + one per interjection refresh) with a prompt that asked
the model to "compress the subtasks into a compact hierarchical
plan". In practice the plans came out longer than necessary and
sometimes drifted from the actual subtask sequence the runtime
would execute.
Replaced ``_generate_plan`` with a deterministic numbered list
of the upcoming subtasks. At a refresh time the list shrinks to
subtasks whose start ≥ refresh_t — the plan describes what's
*left* to do, so it gets shorter as work progresses.
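The deterministic summariser can be sketched as follows. The
``(start, text)`` tuple layout, the function name, and the ``"1. ..."``
numbering format are assumptions for illustration:

```python
def remaining_plan(subtasks, refresh_t):
    """Numbered list of the subtasks whose start >= refresh_t.

    `subtasks` is assumed to be a list of (start_time, text) pairs
    sorted by start time.
    """
    upcoming = [text for start, text in subtasks if start >= refresh_t]
    return "\n".join(f"{i}. {text}" for i, text in enumerate(upcoming, start=1))


subtasks = [(0.0, "grasp cube"), (2.5, "lift cube"), (4.0, "place cube")]
full_plan = remaining_plan(subtasks, 0.0)
mid_plan = remaining_plan(subtasks, 2.5)
```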
Saves the per-episode + per-interjection VLM round-trip in the
annotation pipeline and keeps train-time plan text bit-aligned
with the subtask annotations the rest of Module 1 emits.
Removed the now-unused ``prompts/module_1_plan.txt``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two regressions surfaced by the first training run:
1. ``--policy.type=pi052`` failed with ``invalid choice``. PI052Config
wasn't imported in ``policies/__init__.py``, so its
``@register_subclass("pi052")`` decorator never ran and draccus
didn't see it as a valid policy type. Mirror PI05Config /
SmolVLA2Config in the top-level imports + __all__.
2. ``low_level_execution`` (user-only ``${subtask}`` recipe used for
π0.5-style flow conditioning) tripped
``ValueError: Message recipes must contain at least one target
turn.`` The validator was too strict — a recipe with only a
``stream: low_level`` turn still drives meaningful supervision
(flow MSE on the action expert via ``predict_actions=True``).
Allow either ``target: true`` OR ``stream: low_level`` to satisfy
the "supervises something" requirement.
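A sketch of the relaxed validator from item 2, assuming messages are
plain dicts with optional ``target`` / ``stream`` keys. The function
name and the extended error text are hypothetical:

```python
def validate_recipe_messages(messages):
    """Accept a recipe if any turn is a target OR sits on the low_level stream."""
    supervises = any(
        m.get("target") is True or m.get("stream") == "low_level"
        for m in messages
    )
    if not supervises:
        # Hypothetical wording; the original error only mentions target turns.
        raise ValueError(
            "Message recipes must contain at least one target turn "
            "or one low_level stream turn."
        )


user_only_low_level = [
    {"role": "user", "stream": "low_level", "content": "${subtask}"},
]
no_supervision = [{"role": "user", "content": "${task}"}]
```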
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe renders ``"${task}\nPlan: ${plan}\nMemory: ${memory}"``
unconditionally — when a binding resolves to None,
``language_render._substitute`` substitutes an empty string, so the
training-time user turn always contains the literal ``Plan: `` /
``Memory: `` prefixes even with empty values.
The inference message builders were skipping those lines entirely
when ``state['current_plan']`` / ``state['current_memory']`` was
empty, producing a different prompt shape on early frames (before
the plan-generation step runs) and on datasets without plan/memory
annotations.
Factored a shared ``_hirobot_user_head`` helper used by
``_msgs_for_subtask``, ``_msgs_for_memory``, and the legacy
``_control_context_messages`` so they all match training byte-for-
byte regardless of which bindings are populated.
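A sketch of what such a shared helper might look like, assuming the
training renderer substitutes an empty string for ``None`` bindings as
described above (the function name and signature are illustrative, not
the actual ``_hirobot_user_head``):

```python
def user_head_sketch(task, plan=None, memory=None):
    """Always emit the literal 'Plan: ' / 'Memory: ' labels, even when empty."""
    return f"{task}\nPlan: {plan or ''}\nMemory: {memory or ''}"


early_frame = user_head_sketch("stack the cubes")
later_frame = user_head_sketch("stack the cubes", plan="1. grasp cube", memory="cube on table")
```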
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a one-shot debug dumper to both chat processors. When the env
var ``LEROBOT_DUMP_RECIPE_SAMPLES`` is set to a positive integer N,
the next N samples processed (rank-0 only) get pretty-printed:
* the recipe-rendered messages (role / stream / target / content),
* the full tokenized prompt (decoded back),
* inline ``[TGT]...[/TGT]`` markers over the spans the LM head is
supervised on,
* token count + target-token count,
* ``predict_actions`` flag.
Usage:
    LEROBOT_DUMP_RECIPE_SAMPLES=5 sbatch train_smolvla2.slurm
After N dumps the helper becomes a no-op; training continues
unaffected. Works for both smolvla2 (chat-template renderer) and
pi052 (plain ``Role: content`` concat renderer); each processor has
its own copy to avoid cross-package imports.
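A minimal sketch of the one-shot behaviour, assuming the budget is read
once from the environment. The class, its method names, and the printed
line format are hypothetical; the real helper also prints the decoded
prompt and ``[TGT]`` markers:

```python
import os


class RecipeSampleDumper:
    """Hypothetical one-shot dumper: prints the first N samples, then goes quiet."""

    def __init__(self, env_var="LEROBOT_DUMP_RECIPE_SAMPLES"):
        raw = os.environ.get(env_var, "").strip()
        # Anything that is not a positive integer leaves the dumper disabled.
        self.remaining = int(raw) if raw.isdigit() else 0

    def maybe_dump(self, messages, rank=0, write=print):
        if self.remaining <= 0 or rank != 0:
            return False  # no-op once the budget is spent, or off rank 0
        for msg in messages:
            write(
                f"{msg['role']} stream={msg.get('stream')} "
                f"target={msg.get('target')}: {msg['content']}"
            )
        self.remaining -= 1
        return True
```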
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The smolvla2 and pi052 recipe blends had drifted to identical content
twice in a row; collapse them to a single ``recipes/hirobot.yaml``
both policies point at. Each backbone's text tokenizer (chat-template
for SmolVLA2, plain ``Role: content`` for PI052) handles the
rendering differences downstream — the recipe spec is shared.
Audit fixes folded into the same commit:
* **Train/inference prefix mismatch on the action expert**
``_build_text_batch`` always passed ``add_generation_prompt=True``,
appending ``<|im_start|>assistant\n`` tokens that the action
expert never saw at training (the chat tokenizer renders with
``add_generation_prompt=False``). Parameterized the helper and
pass ``False`` from ``LowLevelForward``; ``select_message`` paths
still default to ``True`` for AR text generation.
* **PI052 fallthrough could silently train flow on text-only frames**
When ``text_loss_weight=0`` AND every sample was high-level
(``predict_actions.any()==False``), the previous heuristic
delegated to ``PI05Policy.forward``, which ignores
``predict_actions`` and runs flow on every sample. Reverted to
delegating only on fully unannotated batches.
* **SmolVLA2 silent zero-loss training**
``forward`` returned ``loss=0`` (no error) when neither flow nor
text path fired. Now raises ``RuntimeError`` with the weights and
routing flags — fails loud like PI052 already does.
* **PI052 dropout-seed key**
Was reading ``complementary["dataset_index"]`` (only set by
``MultiDataset`` and means "which sub-dataset", not row index)
with fallback to ``frame_index`` (never set) — every sample got
seed=0, so per-component dropout was deterministic across the
epoch. Switched to ``complementary["index"]`` to match SmolVLA2
and the canonical ``BatchProcessor`` convention.
* **Dead ``DEFAULT_TOOLS`` import**
Removed from ``chat_processor_smolvla2.py`` — unused since the
default-tools list was switched to ``[]`` in the prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL (smolvla2) — the SmolVLM2 chat template was rendering the
``say`` tool's JSON schema as a system message on every training
sample because ``DEFAULT_TOOLS`` was the default in
``SmolVLA2ChatTokenizerStep``. That schema was only relevant to
the now-removed ``user_interjection_response`` recipe; with it
gone the schema is dead weight that polluted every action-expert
prefix AND created a train/inference mismatch (the inference
``_build_text_batch`` doesn't pass ``tools=``). Default is now
``[]``; callers needing tools can still set them via
``with_tools(meta.tools)``.
LIKELY-BUG — ``low_level_execution`` had ``target: true`` on its
assistant turn, so text-CE trained the LM head to predict the
same subtask string the user just stated (trivial "copy previous
turn" supervision that diluted LM head capacity). Dropped the
assistant turn entirely; ``high_level_subtask`` (w=0.50) already
owns subtask prediction from real context.
The chat-tokenizer's ``predict_actions`` detection used to scan
target streams only. With the new no-target low_level recipe it
would mis-fire as False. Switched both
``chat_processor_smolvla2.py`` and ``text_processor_pi052.py`` to
scan all message streams — any ``stream: low_level`` on the
sample is enough to trigger flow loss.
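The difference between the two detection rules, sketched on the
parallel ``message_streams`` / ``target_message_indices`` arrays (the
function names are hypothetical):

```python
def predict_actions_targets_only(message_streams, target_message_indices):
    """Old rule: only target turns can trigger the flow loss."""
    return any(message_streams[i] == "low_level" for i in target_message_indices)


def predict_actions_any_stream(message_streams):
    """New rule: any low_level turn on the sample triggers the flow loss."""
    return any(stream == "low_level" for stream in message_streams)


# The new low_level_execution shape: a single user turn, no targets.
streams = ["low_level"]
targets = []
```

On this sample the old rule returns False and silently drops the flow
loss, while the new rule returns True.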
Inference: the low-level loop sends only ``[user(subtask)]`` now,
matching the new recipe shape.
PI052 — hardened the forward fallthrough so a degenerate batch
where every sample's recipe is text-only AND text supervision is
disabled (text_loss_weight<=0 or text_labels missing) cleanly
delegates to ``PI05Policy.forward`` instead of raising
"nothing to train".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously ``action_execution`` rendered ``task + plan + memory +
subtask`` into one prefix and ran the flow loss on it. That meant
the action expert was conditioned on the full hierarchical context
(closer to π0.7 §V.A), not just the subtask.
The π0.5 paper's hierarchical inference has the action expert see
only the *subtask* (plus images and state). Split the recipe to
match:
    high_level_subtask (0.50)
      user(task + plan + memory) → assistant(subtask)
      [+ assistant(new_memory) at boundary frames]
      All ``stream: high_level`` → text-CE only, no flow loss.
    low_level_execution (0.30)
      user(subtask) → assistant(subtask)
      Both ``stream: low_level`` → flow loss fires; text CE on the
      subtask is a small redundant extra signal. Prefix the action
      expert sees: [images, subtask, state].
    plan_generation (0.10): unchanged.
    ask_vqa_{top,wrist} (0.05 each): unchanged.
Runtime: the low-level loop in ``smolvla2/inference/steps.py``
now sends ``[user(subtask), assistant(subtask)]`` to
``predict_action_chunk`` instead of the full task+plan+memory
context. Falls back to ``state['task']`` when no subtask has been
generated yet so the first frame still has something to condition
on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL (smolvla2) — text-CE was applied to the wrong prefix slice.
``num_state`` was being read from ``state.shape[1]`` (the raw
max_state_dim, ~14-32) instead of the *number of state tokens*
(always 1). Compounded by the trailing-padding issue (state is
not at the end of the padded prefix when ``seq_len < prefix_length``),
the lang slice was landing on image / padding hidden states.
New ``_locate_lang_range`` finds the state position via
``att_masks.nonzero()`` (the only ``1`` in the mask), making the
slice robust to both bugs. Used by ``_compute_text_loss`` and
``_compute_fused_loss``.
LIKELY-BUG (smolvla2) — ``_unfreeze_lm_head`` only re-enabled
``lm_head`` and ``text_model.model.norm.weight``. SmolVLA's parent
ALSO freezes the last 1-2 transformer layers, so text-loss
gradients died in a frozen final block. Now mirrors the parent's
freeze targets and unfreezes the matching ``layers.{N-1}`` (and
``N-2`` when num_vlm % num_expert == 0).
CRITICAL (pi052) — flow and FAST CE were not per-sample masked
under per-sample-routing. Text-only recipe samples
(``plan_generation``, ``ask_vqa_*``) contributed to flow/FAST
loss with prompts that deliberately omit the subtask, corrupting
the signal. Threaded ``predict_actions_t`` through both
``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``;
flow uses ``(per_sample * mask).sum() / mask.sum()``, FAST uses
``shift_valid & sample_mask`` before ``masked_fill(-100)``.
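A pure-Python sketch of the two per-sample masking rules above. The
function names are hypothetical, and the zero-denominator guard is an
assumption (the formula in the message does not say what happens when
no sample predicts actions):

```python
def masked_flow_loss(per_sample, mask):
    """(per_sample * mask).sum() / mask.sum(), on plain lists."""
    denom = sum(mask)
    if denom == 0:
        return 0.0  # assumption: nothing to average
    return sum(loss * m for loss, m in zip(per_sample, mask)) / denom


def mask_fast_targets(targets, shift_valid, sample_mask, ignore_index=-100):
    """Keep a FAST target only if its position is valid AND its sample predicts actions."""
    return [
        [t if (valid and keep) else ignore_index for t, valid in zip(row, valid_row)]
        for row, valid_row, keep in zip(targets, shift_valid, sample_mask)
    ]


flow = masked_flow_loss([1.0, 3.0, 100.0], [1, 1, 0])
fast = mask_fast_targets(
    [[5, 6, 7], [8, 9, 10]],
    [[True, True, False], [True, True, True]],
    [True, False],
)
```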
OTHER
* PI052Policy.forward now falls through to PI05Policy.forward on
unannotated batches (no text_labels, no predict_actions, no FAST).
* fit_fast_tokenizer cache key now includes ``chunk_size`` — changing
the chunk size no longer silently loads a wrongly-fit tokenizer.
* Removed dead ``_compute_text_loss`` / ``_compute_fast_action_loss``
in pi052 (superseded by the fused helpers).
* Fixed stale "no-op stub" docstring on ``knowledge_insulation`` —
it's been fully wired since the per-layer KI forward port.
* Stripped unused ``copy`` / ``resize_with_pad`` imports.
* Extracted ``_shifted_ce`` / ``_mask_per_sample`` / ``_fast_ce``
helpers shared between fused and prefix-only paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New text-only sub-recipe at 0.10 weight on both blends:
user : ${task}
assistant : ${current_plan} (high_level target)
Bound to ``active_at(t, style=plan)`` so it supervises the
currently-active plan on every frame, gated by ``if_present`` to
skip frames without a plan annotation.
Weights rebalanced: action_execution 0.85 → 0.75, plan_generation
0.10, VQA top/wrist 0.075 each (sums to 1.0).
Added matching runtime builder ``_msgs_for_plan`` in
``smolvla2/inference/steps.py`` so the high-level loop can call
``select_message`` with the bare-task prompt at episode start /
replanning events.
Closes a gap vs. Pi 0.7 §V — without this recipe the model could
read ``${plan}`` from the prompt but never had to produce one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a
short header + a one-line note on the boundary-frame memory tail.
Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about
a since-merged experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recipe changes:
* action_execution now bundles the memory update as a second
assistant target gated on a new ``new_memory`` binding (fires
only at subtask-boundary frames). No "Completed subtask: X"
filler — the model emits the new subtask AND the updated
memory back-to-back in one prefix.
* user_interjection_response sub-recipe removed (current
datasets don't have interjection / say() annotations).
* Standalone memory_update sub-recipe removed (folded above).
* Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist
0.075 each (sums to 1.0).
Runtime ``_msgs_for_memory`` updated to match the new
boundary-frame prompt layout.
Modeling:
* SmolVLA2Policy now fuses the flow + text losses into a SINGLE
backbone forward via ``_compute_fused_loss`` (one
vlm_with_expert pass with [prefix, suffix] embeds, then both
lm_head CE on lang slice + action_out_proj MSE on suffix).
Mirrors pi052's existing ``_compute_all_losses_fused`` —
saves one backbone pass per training step.
Examples:
* Removed the two training SLURM scaffolds; they were
out-of-date with the recipe refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.
Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------
One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:
* subtask prediction (text CE on the assistant span via lm_head)
* action chunks (flow MSE on the action expert via
stream: low_level, target: true; plus FAST CE on action tokens
when enable_fast_action_loss=True)
At inference, the *same* prompt structure drives both inference
modes:
* select_message(user_prompt_only) → LM head generates the next
subtask. Matches action_execution's training distribution
exactly (prompt is the user turn, target is the subtask).
* predict_action_chunk(user_prompt + assistant_subtask) → action
expert produces the chunk. Matches action_execution's full
prompt+target.
This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.
Flavor 2 — event-driven text-only recipes
-----------------------------------------
Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.
* memory_update               (10%)   new memory at subtask boundary
* user_interjection_response  (15%)   new plan + say(...) on input
* ask_vqa_top                 (7.5%)  front-camera VQA
* ask_vqa_wrist               (7.5%)  wrist-camera VQA
Total weight = 1.0.
Prompt format consistency
-------------------------
User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.
What changed structurally
-------------------------
- low_level_execution          DROPPED (folded into action_execution)
- high_level_subtask           DROPPED (subtask supervision moved into action_execution)
+ action_execution             NEW     (the fused main recipe)
  memory_update                kept, prompt cleaned up
  user_interjection_response   kept, prompt cleaned up
  ask_vqa_top / ask_vqa_wrist  kept
Runtime compatibility
---------------------
No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did 2 backbone passes when all heads were
active: one for flow (via super().forward) and one for the fused
text+FAST helper. This commit reduces it to **one pass** — same
compute as flow-only training.
New ``_compute_all_losses_fused`` builds:
    prefix = [images, language, FAST (when provided)]
    suffix = [noisy_actions]   (action expert via gemma_expert)
and runs a single ``paligemma_with_expert.forward`` with
``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active
in the same call). Captures *both* prefix_out and suffix_out, slices
each for its respective loss:
    flow MSE ← suffix_out (existing action_out_proj + MSE path)
    text CE  ← prefix_out at language positions (lm_head + CE)
    FAST CE  ← prefix_out at FAST positions (lm_head + CE)
Critical attention mask override
--------------------------------
``make_att_2d_masks`` produces a cumulative-block attention mask in
which suffix tokens (highest cumsum) attend to *every* lower-cumsum
position by default, including FAST tokens. If we let that stand the
action expert reads the discrete FAST tokens and trivially decodes
them back to the same continuous actions the flow head is supposed
to predict from noise — the entire training signal collapses to a
copy operation.
The fix is a single line right after make_att_2d_masks:
    att_2d_masks[:, fast_end:, fast_start:fast_end] = False
Explicitly zeros out *suffix → FAST* attention. Everything else
remains correct under the cumsum semantics:
* prefix images/language stay bidirectional among themselves
* FAST stays causal within itself, attending bidirectionally
to images+language
* FAST cannot see suffix (cumsum < suffix cumsum, default)
* suffix attends bidirectionally among itself, to images+language,
and now NOT to FAST (this override)
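The four properties above can be checked on a toy mask. This is a
pure-Python sketch for a single sample, assuming the cumsum rule
(query ``i`` attends to key ``j`` iff ``cumsum[j] <= cumsum[i]``) and an
assumed ``att`` layout of ``0`` for images/language, ``1`` per FAST
token, and ``[1, 0, ...]`` for the suffix:

```python
from itertools import accumulate


def block_mask(att):
    """Cumulative-block rule on a 1-D att vector (no batch dimension)."""
    cumsum = list(accumulate(att))
    n = len(att)
    return [[cumsum[j] <= cumsum[i] for j in range(n)] for i in range(n)]


# Assumed layout: 3 image/language tokens, 2 FAST tokens, 2 suffix tokens.
att = [0, 0, 0] + [1, 1] + [1, 0]
fast_start, fast_end = 3, 5

mask_before = block_mask(att)

# Equivalent of: att_2d_masks[:, fast_end:, fast_start:fast_end] = False
mask_after = [row[:] for row in mask_before]
for i in range(fast_end, len(att)):
    for j in range(fast_start, fast_end):
        mask_after[i][j] = False
```

Only the suffix rows change; the image/language and FAST rows are
identical before and after the override.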
Bit-equivalent to the previous separated forward path for text+FAST
losses (the prefix hidden states at language and FAST positions are
unchanged whether suffix is present or not — the prefix doesn't
attend to suffix). For flow loss, suffix→FAST being masked is the
correct behaviour we *want* — if anything the previous separated
path was less correct for production use because the joint
gradient signal through the action expert was missing the prefix
extension.
Forward routing in ``forward()``
--------------------------------
* run_flow=True → _compute_all_losses_fused (one forward, all
three losses)
* run_flow=False, run_text or run_fast → _compute_text_and_fast_loss
(one prefix-only forward, two CE losses, no
suffix → cheaper than fusion)
* none of the three → RuntimeError (explicit; all losses disabled)
Wall-time per step
------------------
Before this commit: flow + (text+FAST fused) = 2 forwards
After this commit: (flow+text+FAST fused) = 1 forward
Compute parity with flow-only training when all three heads are active.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same bug we fixed for high_level_subtask, just on the other
subtask-supervised sub-recipe. ``low_level_execution`` targets
``${subtask}`` (the current active span) but had no
``if_present`` guard. When ``active_at(t, style=subtask)`` returned
None at a frame (gaps in the annotation, or the very first/last
frames of an episode if the annotator's spans don't fully tile),
the assistant message rendered with empty content. The chat
tokenizer still included it in ``target_message_indices`` → text CE
supervised whatever the chat-template's empty assistant turn
decoded to (usually a single ``\n``). That trains the LM head's
prior at the first generation position toward ``\n``, the same
collapse we observed with the original ``${next_subtask}`` target.
Fix: ``if_present: subtask`` on the assistant target in
``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and
``pi052_hirobot.yaml``.
Side effect: frames without an active subtask span no longer
contribute to the flow loss either (the only ``low_level`` target
is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))``
becomes False). For a well-annotated dataset where subtask spans
tile the whole episode this is a no-op. For datasets with gaps,
those gap frames lose flow supervision — strictly better than the
degenerate text-CE alternative.
Sub-recipe audit summary (no other changes needed):
* memory_update — all if_present guards present, OK
* user_interjection_response — all if_present guards present, OK
* high_level_subtask — fixed earlier, OK
* low_level_execution — fixed by this commit
* ask_vqa_top / ask_vqa_wrist — query+answer both guarded, OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did three backbone passes per training step
when all heads were active: one for flow (via super().forward), one
for text CE, and one for FAST CE. That's ~3× the compute of
flow-only training.
The text and FAST losses share their prefix forward exactly — both
are CE on the LM head, evaluated at different slices of the same
hidden states. Adding FAST tokens after language in the prefix is
bit-equivalent for the text loss because the mask_ar convention in
``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal
block: language tokens never see FAST, so their hidden states are
unchanged.
New ``_compute_text_and_fast_loss``:
* embeds [images, language] once
* optionally appends [FAST] (when run_fast is True)
* one backbone forward
* slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for
language hidden states (or ``vlm_out[:, -lang_len:]`` when no
FAST) → text CE
* slices ``vlm_out[:, -fast_len:]`` for FAST hidden states →
FAST CE
* returns both losses, either of which can be None when the
caller doesn't want that head.
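The slicing above, sketched on a plain list standing in for the
sequence axis of ``vlm_out`` (the function name is hypothetical and
``lang_len > 0`` is assumed). It also shows why the no-FAST case needs
its own branch: with ``fast_len == 0`` the general slice ends at ``-0``,
which Python treats as ``0``, so it comes back empty:

```python
def split_lang_fast(seq, lang_len, fast_len):
    """Return (language slice, FAST slice) from the tail of the prefix output."""
    if fast_len > 0:
        lang = seq[-(fast_len + lang_len):-fast_len]
        fast = seq[-fast_len:]
    else:
        lang = seq[-lang_len:]  # seq[-lang_len:-0] would be empty
        fast = []
    return lang, fast


with_fast = ["img0", "img1", "l0", "l1", "l2", "f0", "f1"]
without_fast = ["img0", "img1", "l0", "l1", "l2"]

lang_a, fast_a = split_lang_fast(with_fast, lang_len=3, fast_len=2)
lang_b, fast_b = split_lang_fast(without_fast, lang_len=3, fast_len=0)
naive_no_fast = without_fast[-(0 + 3):-0]
```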
forward() now calls this fused helper instead of running the two
separate ``_compute_text_loss`` / ``_compute_fast_action_loss``
methods. Those remain in the file for callers that only want one
head (e.g. ablations).
Why flow isn't fused
--------------------
Flow MSE comes from the action-expert (suffix) hidden states, which
attend to the prefix. If we just concat FAST onto the prefix and let
the action expert attend to it, the expert can trivially decode FAST
back to continuous actions — overfitting via shortcut. Preventing
that requires a custom segment-aware attention mask (action expert
can attend to images+language but NOT to subtask/FAST), which is
what pi05_full does in ``compute_layer_complete_knowledge_insulation``.
That's the full-fusion path; deferred as a follow-up since the
text+FAST fusion already removes one of the two redundant forwards.
End-to-end forward pass count
-----------------------------
Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards
After: 1 (flow) + 1 (text+FAST fused) = 2 backbone forwards
~33% fewer backbone forwards per training step when all three heads
are active (wall time drops by about as much only if the backbone
dominates the step).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FAST loss changes
-----------------
1. Gate by ``predict_actions`` (same routing as flow loss). The
ActionTokenizerProcessorStep tokenises actions for *every*
sample regardless of which sub-recipe rendered it; for text-only
recipes (high_level_subtask, memory_update, ...) the action
tokens are still in the batch but mustn't be supervised. Skip
the FAST forward+CE entirely when no sample in the batch has
``predict_actions=True``.
2. Switch from "multiply-by-mask" masking to ``ignore_index=-100``.
The old pattern computed per-token CE for all positions, then
zeroed out invalid ones. Two issues: (a) any out-of-vocab target
id at a padded position would have crashed cross_entropy before
the mask got a chance to zero it out, and (b) the pattern is
needlessly clever. Now ``shift_targets.masked_fill(~mask, -100)``
followed by ``ignore_index=-100`` cleanly drops invalid positions.
Matches the smolvla2 text-loss convention.
3. Clean up unused ``bsize`` variable in _compute_fast_action_loss
and expand the attention-mask docstring with the
``make_att_2d_masks`` mask_ar convention spec (causal vs
bidirectional blocks).
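A pure-Python sketch of the ``ignore_index`` pattern from item 2, written without torch so the control flow is explicit. Function names are illustrative; returning 0.0 when every position is ignored is a choice made for this sketch, not a claim about the library's behaviour:

```python
import math

IGNORE_INDEX = -100


def masked_fill_targets(targets, mask):
    """Replace targets at invalid positions with IGNORE_INDEX."""
    return [t if keep else IGNORE_INDEX for t, keep in zip(targets, mask)]


def cross_entropy_ignore(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean CE over positions whose target is not ignore_index.

    Ignored positions are skipped before indexing into the logits, so an
    out-of-vocab id at a padded position is never looked up.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue
        log_z = math.log(sum(math.exp(x) for x in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
raw_targets = [0, 1, 999]  # 999 is out-of-vocab junk at a padded position
valid = [True, True, False]
targets = masked_fill_targets(raw_targets, valid)
loss = cross_entropy_ignore(logits, targets)
```

With the old multiply-by-mask pattern, the lookup for target 999 would have happened before the mask was applied.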
smolvla2 audit (reference review, no code change)
-------------------------------------------------
Compared smolvla2/modeling_smolvla2.py against pi052/modeling_pi052.py
to catch parallel bugs. Findings:
* No ``paligemma.language_model`` vs ``paligemma.model.language_model``
issue — smolvla2 uses SmolVLM (different class, different attribute
layout) so the bug doesn't apply.
* ``fill_kv_cache=True`` is correctly passed to smolvla's
``vlm_with_expert.forward`` — that class *does* accept the kwarg
(unlike pi05's PaliGemmaWithExpertModel.forward, which is why
pi052 must omit it).
* Text-loss alignment is correct: ``_compute_text_loss`` computes
``lang_start`` / ``lang_end`` from the known prefix layout
(``[image_blocks..., lang, state]``) and slices ``prefix_out``
to just the language positions before applying ``lm_head``. The
parallel bug I fixed in pi052 (lm_head over the full prefix,
shape-mismatched against text_labels) was *not* present in
smolvla2.
* Per-sample flow routing via ``predict_actions``: correctly masks
per-sample by calling the parent ``forward(..., reduction='none')``
and applying the predict_actions mask before the mean. pi052 only
has the batch-level any() gate — a parallel improvement for pi052
would require modifying PI05Pytorch.forward to support per-sample
reduction, deferred.
* ``reduction="none"`` returns ``total.expand(bsize)``: identical
scalar-broadcast limitation in both policies. Acknowledged but
low priority (only RA-BC weighting uses the per-sample path and
it's documented as a known approximation in smolvla2).
* Chat tokenizer correctly handles batched/unbatched messages,
pads with -100 for label positions, builds attention masks. No
bugs found.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defaults
--------
* enable_fast_action_loss: False -> True (match paper §III.B-C Eq.1)
* auto_fit_fast_tokenizer: True -> False (opt-in; needs base.fit())
Bug fixes
---------
1. Wrong attribute path on PaliGemma. The KI port copied
pi05_full's ``paligemma.language_model.layers[...]`` literally,
but the production pi05 wrapper exposes the text model at
``paligemma.model.language_model``. With KI enabled, every layer
would have raised AttributeError on first forward. Fixed all
references in _compute_layer_ki + _paligemma_forward_ki.
2. ``fill_kv_cache=True`` passed to PaliGemmaWithExpertModel.forward.
That kwarg is a SmolVLA-only concept; pi05's signature has no
such argument, so every forward call from pi052 (text loss, FAST
loss, select_message) would have crashed with TypeError. Dropped
from all four call sites — pi05's forward already handles the
cache via past_key_values, and re-forwarding the cumulative
sequence each step in select_message is fine for our short
subtask completions.
3. Text-loss shape mismatch. _compute_text_loss applied lm_head to
the *full* vlm_out (image tokens + language tokens), then tried
to cross-entropy that against text_labels which only covers the
language portion — the .view(-1) calls would produce two
tensors of different lengths and CE would fail. Now slices
vlm_out to the last text_labels.shape[1] positions before
running lm_head, matching the [images, language] order
embed_prefix produces.
4. Dead-code conditional in _paligemma_forward_ki's single-expert
fallback. The ``if hasattr(...) else self._pi052_orig_forward``
ternary always took the wrong branch because the attribute is
always set (we save it in PI052Policy.__init__). Simplified to
just call self._pi052_orig_forward directly.
After this commit, pi052 should be runnable end-to-end for the
first time with all three loss heads + KI active. Still worth a
100-step smoke test before kicking off a long run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Pertsch et al. 2025 (FAST paper, [64] in π0.5) and π0.5 §III.C,
the recommended practice is to *fit* the FAST action tokenizer on
the specific dataset's action distribution rather than using the
published universal codebook off the shelf. The universal tokenizer
works on any 6-DoF action sequence but produces suboptimal
compression, which slows CE convergence and wastes vocab capacity.
New utility ``lerobot.policies.pi052.fit_fast_tokenizer``:
* samples N action chunks from the LeRobotDataset (default 1024)
* loads ``physical-intelligence/fast`` as the base
* calls ``.fit(actions)`` (the AutoProcessor API the HF model card
documents) — produces a per-dataset codebook
* saves to ``{cache_dir}/{sha256(dataset, base, n_samples)[:16]}/``
* returns the local path, ready to feed
``ActionTokenizerProcessorStep(action_tokenizer_name=...)``.
Cache is keyed on (dataset, base tokenizer, sample count) so changing
any of them re-runs the fit. Re-running training on the same dataset
re-uses the cache (one fit per dataset per machine).
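One way the cache key could be derived, as a sketch only. The sha256 digest truncated to 16 hex characters comes from the description above; the JSON serialisation of the key tuple and the function name ``fitted_tokenizer_dir`` are assumptions:

```python
import hashlib
import json
from pathlib import Path


def fitted_tokenizer_dir(cache_dir, dataset_repo_id, base_tokenizer, n_samples):
    """Build a cache path keyed on (dataset, base tokenizer, sample count).

    How the three values are serialised before hashing is an assumption
    of this sketch.
    """
    key = json.dumps([dataset_repo_id, base_tokenizer, n_samples])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / digest


base = "physical-intelligence/fast"
a = fitted_tokenizer_dir("/tmp/fast_cache", "user/dataset", base, 1024)
b = fitted_tokenizer_dir("/tmp/fast_cache", "user/dataset", base, 2048)
```

Changing any of the three inputs yields a different directory, so a stale fit is never reused.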
Auto-fit wiring:
* PI052Config gets ``auto_fit_fast_tokenizer`` (default True),
``fast_tokenizer_cache_dir`` (default ~/.cache/lerobot/...),
``fast_tokenizer_fit_samples`` (default 1024).
* make_pi052_pre_post_processors now takes ``dataset_repo_id``;
when ``enable_fast_action_loss`` and ``auto_fit_fast_tokenizer``
are both True and a repo_id is provided, the factory calls
``fit_fast_tokenizer`` before constructing the processor step
and points it at the fitted path.
* ProcessorConfigKwargs gains ``dataset_repo_id``; the global
factory dispatch threads it through for ``pi052`` policies.
* lerobot_train.py populates ``processor_kwargs['dataset_repo_id']``
from ``--dataset.repo_id`` for pi052 runs.
Failure mode: if ``.fit()`` fails (e.g. older transformers without
the method, or no usable action chunks in the dataset), the factory
logs a warning and falls back to the universal base tokenizer.
Training still works; you just lose the compression improvement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions ported from ``pi05_full`` on branch ``feat/add-pi05``,
giving pi052 full paper-§III.B-C training capabilities alongside the
recipe-driven text supervision it already had:
* **Config flags** in PI052Config:
- ``enable_fast_action_loss`` default False
- ``action_tokenizer_name`` default "physical-intelligence/fast"
- ``max_action_tokens`` default 256
- ``fast_skip_tokens`` default 128
- ``fast_action_loss_weight`` default 1.0
- ``knowledge_insulation`` default False
* **Processor wiring** (processor_pi052.py): when
``enable_fast_action_loss=True``, append an
``ActionTokenizerProcessorStep`` after the text tokenizer. It
tokenises the action tensor with the FAST tokenizer and writes
ACTION_TOKENS / ACTION_TOKEN_MASK into ``COMPLEMENTARY_DATA`` —
the existing batch-collation pipeline forwards them as
``batch['action.tokens']`` / ``batch['action.token_mask']``.
* **FAST CE loss** (modeling_pi052.py::_compute_fast_action_loss):
Re-embeds the prefix [images, language], appends the FAST token
embeddings (using PaliGemma's shared embed_language_tokens),
forwards through the backbone, slices the trailing
``fast_len`` positions, applies the LM head, computes shifted
next-token CE with the action-mask gating the loss. The loss is
summed into ``forward()``'s total with ``fast_action_loss_weight``.
* **Knowledge insulation** (modeling_pi052.py::_compute_layer_ki +
_paligemma_forward_ki): port of pi05_full's per-layer attention
that detaches VLM K/V on the action-query path so action loss
gradients cannot flow back into the VLM's K/V projections. Bound
per-instance via ``types.MethodType`` so it doesn't leak into
stock ``pi05`` policies that share PaliGemmaWithExpertModel.
Activated automatically when ``config.knowledge_insulation=True``.
Combined with the existing recipe-driven text head, pi052 now
supports the full three-loss objective:
L = text_w·H(text) + fast_w·H(FAST actions) + flow_w·MSE(flow)
matching Eq. (1) of arxiv:2504.16054 §IV.D (α=10 by default for the
flow term, 1.0 each for text and FAST CE).
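The three-term objective above can be sketched as a weighted sum in which a head that is switched off contributes nothing. The function and parameter names are illustrative; the default weights are the ones quoted above (10 for flow, 1.0 each for text and FAST):

```python
def total_loss(flow, text=None, fast=None, flow_w=10.0, text_w=1.0, fast_w=1.0):
    """Weighted sum of the active loss heads; None means the head is off."""
    total = flow_w * flow
    if text is not None:
        total += text_w * text
    if fast is not None:
        total += fast_w * fast
    return total


all_heads = total_loss(0.5, text=1.0, fast=2.0)  # 10*0.5 + 1.0 + 2.0
flow_only = total_loss(0.5)
```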
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the working SmolVLA2 launch pattern so the two SLURM scripts
are interchangeable:
* literal NUM_PROCESSES / BATCH_SIZE / STEPS (no env-var defaults)
* STEPS=10000 to match the next SmolVLA2 run
* save_freq=$STEPS so only the final checkpoint is saved
* dropouts 0.1/0.1/0.1 (mild — matches the operator's iteration)
* flow_loss_weight / text_loss_weight come from the PI052Config
defaults (10.0 / 1.0 per Pi 0.5 paper §IV.D), no need to pass
them explicitly
Job name and policy_repo_id mirror the SmolVLA2 ``_tool-g2`` naming
so the two runs can be compared side-by-side in WandB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text
CE and flow MSE: actions are the primary output and the flow head
should dominate the gradient signal. SmolVLA2 was defaulting both
weights to 1.0, which inverts that — text CE (~0.5-2.0 nats) ends up
larger than flow MSE (~0.1-1.0), so the action expert gets less
gradient than the LM head despite being the primary task.
Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0.
Same as ``pi052`` (the new full reproduction policy).
Also pin the values explicitly in the SLURM launcher so the choice is
visible and overridable per-run rather than buried in the config
default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New ``lerobot.policies.pi052`` (parallel to ``smolvla2``) that adds
text-prediction + hierarchical-inference on top of the existing π0.5
implementation. Mirrors the paper's §IV.D dual-head training:
L = H(text) + α * ‖ω - a - f_θ_action(...)‖², α = 10
Components:
* ``configuration_pi052.py`` thin PI05Config subclass; adds
recipe_path, text/flow loss weights
(default α=10 per paper), prompt
dropout knobs, ``unfreeze_lm_head``.
* ``text_processor_pi052.py`` PI052TextTokenizerStep — concatenates
rendered messages as ``Role: ...``
plain text (PaliGemma has no chat
template), tokenises with the
PaliGemma tokenizer, builds a label
mask covering supervised target
spans. Includes Pi 0.7 §V.E
per-component prompt dropout.
* ``processor_pi052.py`` make_pi052_pre_post_processors —
Rename + Batch + Relative +
Normalize + RenderMessagesStep +
PI052TextTokenizerStep + Device.
Falls back to π0.5's plain pipeline
when recipe_path is unset.
* ``modeling_pi052.py`` PI052Policy(PI05Policy) — re-enables
PaliGemma ``lm_head``, computes
text_loss via CE on the supervised
span, sums with flow_loss in
forward(), and adds select_message
for AR text generation at inference
(same surface as
SmolVLA2Policy.select_message so
SmolVLA2Runtime drives it unchanged).
Plus the supporting plumbing:
* recipe ``configs/recipes/pi052_hirobot.yaml`` — same Hi-Robot blend
as smolvla2_hirobot.yaml, with the same ``${subtask}`` /
``if_present`` supervision fix (current span at every frame, not
``${next_subtask}``).
* SLURM ``examples/training/pi052_hirobot.slurm`` — full training
command matching the SmolVLA2 launcher.
* factory registration: ``--policy.type=pi052`` resolves to
PI052Policy with the new processor.
Same multi-rate runtime (``lerobot.policies.smolvla2.inference``)
drives this policy too — both expose ``predict_action_chunk`` for the
action expert and ``select_message`` for the LM head.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After _tool-good (2000 steps, 0.50/0.50/0.20 dropout) the LM head's
distribution at position 0 shifted from EOS to subtask-vocabulary
tokens but emitted bag-of-words ("cube arm and") rather than well-
formed sentences. That's the expected mid-fine-tuning phase: token-
level supervision has landed, sequence-level grammar hasn't.
Two changes for the next retrain:
* STEPS=15000 (from 2000) — chat-pretrained backbones need O(10k+)
steps to walk their pretraining priors down far enough to commit
to the fine-tuned distribution structurally, not just at the
token level. _tool-g2's bag-of-words output proves the model is
on the right path; it just needs more gradient signal.
* plan/memory dropout 0.50 -> 0.30 — 0.50 was probably too
aggressive for a small dataset. Half the training samples had
crucial context missing, which slows down learning the full
conditional structure. 0.30 still regularises against prompt
leakage but lets the model learn proper grammar first; the
higher dropout can be revisited once the head is solid.
Subtask dropout stays at 0.20 since subtask isn't in the high-level
prompt anyway (recipe fix removed the "Current subtask:" message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe fix (target=${subtask} instead of ${next_subtask}) shifted
the LM head's failure mode from "emit newlines" to "emit EOS at
position 0". On the new ``_tool-good`` checkpoint inference produces
exactly one token (``<end_of_utterance>``, id 49279) and decodes to
empty. That's the chat-pretrained backbone's short-turn EOS prior
not yet being overridden by 2000 steps of fine-tuning supervision.
Expose three knobs so the operator can probe whether the head has
real subtask-token probability mass *under* the EOS argmax without
recompiling or retraining:
--text_min_new_tokens=N suppress EOS for the first N tokens
--text_temperature=T sample at temperature T
--text_top_p=P nucleus filtering at top-p
These are explicitly off-policy (training was greedy / no min-tokens),
so they shouldn't ship in production runs — but they let us tell
whether the model has *learned* subtask prediction (just under EOS)
or hasn't yet. If forcing min_new_tokens=3 with temperature=0.5
produces a sensible subtask, the model is fine and just needs more
training steps to walk EOS down. If it produces gibberish, training
hasn't progressed.
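A self-contained sketch of what the three knobs do to a single decode step, using toy logits. It is not the ``select_message`` implementation; ``sample_token`` and ``top_p_filter`` are hypothetical names, and the sketch requires ``temperature > 0``:

```python
import math
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_token(logits, step, eos_id, min_new_tokens=0,
                 temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id, suppressing EOS for the first min_new_tokens steps."""
    logits = list(logits)
    if step < min_new_tokens:
        logits[eos_id] = -math.inf
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    z = sum(exps)
    candidates = top_p_filter([e / z for e in exps], top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


# Toy head: id 0 is EOS and dominates; ids 1 and 2 are subtask words.
toy_logits = [5.0, 1.0, 0.5]
first = sample_token(toy_logits, step=0, eos_id=0, min_new_tokens=3,
                     temperature=0.5, top_p=0.9, rng=random.Random(0))
```

While EOS is suppressed, the sample can only come from the mass that sits under the EOS argmax, which is the quantity the probe is meant to expose.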
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the recipe fix (target=${subtask} at every frame) the model
can still reach low text_loss by reading the answer off the plan in
the prompt: at training the prompt contains the 6-step plan, and the
current subtask is one of those steps, so the model just learns
"active step N matches subtask N" and never needs to look at the
image. Symptom at inference: subtask string is set but never updates
because the model isn't really conditioning on the visual progress.
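Drop plan and memory with p=0.50 each. Each is then missing from half
of the training frames, and if the two draws are independent both are
missing from about a quarter of them, where the prompt is just
"${task}" (constant for this dataset) + visual prefix and the image is
the only place the answer can come from. This forces the LM head to
actually use vision.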
``subtask_dropout`` stays at 0.20 because subtask isn't in the
high-level prompt anymore (recipe fix removed the "Current subtask:
X" message); the knob still affects other sub-recipes that reference
it as context.
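A sketch of per-component prompt dropout under stated assumptions: messages are matched by their leading label, target messages are never dropped, and the RNG is seeded from the sample index so a given sample always gets the same decision. The function name and the message representation are hypothetical:

```python
import random

DROP_PROBS = {"Plan:": 0.5, "Memory:": 0.5, "Current subtask:": 0.2}


def apply_prompt_dropout(messages, sample_index, drop_probs=DROP_PROBS):
    """Drop context messages by prefix, deterministically per sample index.

    `messages` is a list of (content, is_target) pairs.
    """
    rng = random.Random(sample_index)
    kept = []
    for content, is_target in messages:
        prob = next(
            (p for prefix, p in drop_probs.items() if content.startswith(prefix)),
            0.0,
        )
        if not is_target and rng.random() < prob:
            continue
        kept.append((content, is_target))
    return kept


messages = [
    ("stack the cubes", False),
    ("Plan: 1. reach 2. grasp 3. place", False),
    ("Memory: first cube placed", False),
    ("grasp the red cube", True),
]
kept = apply_prompt_dropout(messages, sample_index=7)
```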
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Normalize tensor and sequence sample indices before prompt dropout so distributed batched preprocessing does not try to cast full index tensors to scalars.
Co-authored-by: Cursor <cursoragent@cursor.com>
Match the operator's current training command for the _tool6 retrain:
* default DATASET / POLICY_REPO_ID / JOB_NAME point at the tool6
iteration (super_poulain_full_tool3 → smolvla2_hirobot_super_poulain_tool6)
* STEPS default 2000 (short enough to iterate; bump to 10k for full)
* save_freq=$STEPS so the only checkpoint is the final one
* OUTPUT_DIR includes step count so successive runs don't clobber
* Drop the wider augmentation envelope I added earlier — back to
default ColorJitter ranges (brightness ±20% etc) since the
high_level_subtask recipe fix (current-subtask supervision) is
expected to fix the LM-head collapse on its own; the augmentation
is just the standard regulariser, not a load-bearing widener.
* prompt-dropout fractions stay at the original 0.15 / 0.15 / 0.20.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The high_level_subtask recipe targeted ``nth_next(style=subtask, offset=1)``,
which on the last span of any episode resolves to None. The recipe had no
``if_present`` guard on the target, so the renderer emitted an empty
assistant turn and cross-entropy supervised the model on the chat
template's structural newlines (``\n``). Across the dataset this trained
the LM head's argmax at position 0 to collapse to ``\n`` whenever no
transition was imminent (i.e. most frames). Visible failure mode at
inference: the head emits 40+ newlines + ``<end_of_utterance>`` every
chunk boundary while the action expert keeps working — confirmed by
running the dry-run on dataset frame 0 with the dataset's own image
and seeing the same ``\n × 44`` collapse.
Switch to the Pi 0.5 / Pi 0.7 supervision pattern: at every frame, the
assistant target is the *current* active subtask span text (via
``${subtask}`` → ``active_at(t, style=subtask)``). Always non-empty,
always scene-grounded, ``if_present: subtask`` skips frames with no
active span instead of emitting a degenerate empty turn.
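The difference between the two targets can be illustrated on toy spans. The functions below are stand-ins that borrow the names ``active_at`` and ``nth_next`` from the recipe language; their signatures and the span representation are assumptions of this sketch:

```python
from bisect import bisect_right

# Illustrative subtask spans for one episode: (start_time, text), sorted.
SPANS = [
    (0.0, "reach for the cube"),
    (2.0, "grasp the cube"),
    (4.5, "place the cube in the box"),
]


def active_index(spans, t):
    """Index of the span active at time t, or None before the first span."""
    idx = bisect_right([start for start, _ in spans], t) - 1
    return idx if idx >= 0 else None


def active_at(spans, t):
    """Text of the span active at time t (the new supervision target)."""
    idx = active_index(spans, t)
    return None if idx is None else spans[idx][1]


def nth_next(spans, t, offset=1):
    """Text of the span `offset` after the active one (the old target)."""
    idx = active_index(spans, t)
    if idx is None or idx + offset >= len(spans):
        return None
    return spans[idx + offset][1]
```

On the last span, ``nth_next(SPANS, 5.0)`` has nothing to return while ``active_at(SPANS, 5.0)`` still yields text, which is why the new target avoids empty assistant turns.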
Runtime callsite update: ``_msgs_for_subtask`` no longer feeds a
"Current subtask: X" user message into the prompt (that would be
circular — we'd be telling the model the answer). Transition
detection moves into the runtime — when the predicted subtask differs
from ``state['current_subtask']``, the existing ``set_if_changed``
path fires ``subtask_change`` and downstream memory updates. Same
event surface, supervision target is now always meaningful.
Requires re-annotating the dataset and retraining for the fix to land
in the checkpoint, but the recipe + runtime change is what enables it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the dry-run REPL only ticked on user input (empty Enter
just redrew), so the bisection test "does the LM head produce text on
start_frame=0?" required typing something arbitrary to drive a tick.
Just run ``step_once`` at startup — the obs diagnostic *and* the
subtask gen both fire automatically, the diag row populates, and the
operator can read the result before pressing any key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tensor-level comparison between dry-run (dataset frame) and live-
robot inference proved the runtime is bug-free — same shape, dtype,
device, channel order, batch dim, and normalization on both paths.
The remaining variable: front-camera mean brightness was 0.26 live vs
0.39 on the dataset frame, ~33% darker. Training augmentation only
covered ±20% brightness, so the live scene sits just outside the
supervised envelope and the LM head collapses to its dominant prior.
Widen the augmentation knobs for the next retrain:
* brightness 0.8–1.2 → 0.5–1.6 (covers up to 50% darker / 60% lighter)
* contrast 0.8–1.2 → 0.6–1.5
* saturation 0.5–1.5 → 0.3–1.7
* hue ±0.05 → ±0.10
* affine ±5°/±5% → ±15°/±15% (covers cube placement / camera drift)
* max_num_transforms 3 → 4
And bump prompt-component dropout (subtask 0.20 → 0.30) so the LM
can't lean on stale memorised plan/memory at inference.
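The brightness argument is simple arithmetic and can be checked directly. This sketch assumes the jitter factor scales mean pixel brightness multiplicatively; the helper names are illustrative:

```python
def relative_brightness(live_mean, dataset_mean):
    """Brightness factor that maps the dataset frame onto the live frame."""
    return live_mean / dataset_mean


def in_range(factor, low, high):
    """True when the factor lies inside the augmentation range."""
    return low <= factor <= high


factor = relative_brightness(0.26, 0.39)     # about 0.67, i.e. ~33% darker
covered_before = in_range(factor, 0.8, 1.2)  # old brightness range
covered_after = in_range(factor, 0.5, 1.6)   # widened brightness range
```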
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dry-run REPL only fires a tick when the user types, so the
``_log_obs_tensors_once`` diagnostic never reached stdout (the
provider was never called). Probe the provider once at startup —
the result is discarded; we only care about the obs log it triggers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helper that prints (once per provider lifetime) every
``observation.*`` tensor the policy is about to see, with its shape,
dtype, device, and per-channel min/max/mean/std. Wired into both the
dry-run dataset path and the live-robot path.
Now we can bisect train/inference mismatch *at the tensor level* —
if the same checkpoint produces coherent text on one path's tensors
and ``\n`` on the other's, and the printed tensor stats differ
materially, the bug is in the observation prep, not in the model or
the training distribution.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the training-time torchvision-v2 ColorJitter / SharpnessJitter /
RandomAffine pipeline to dataset frames in dry-run, so we can isolate
whether the LM head's collapse to '\n' on live frames is:
* pure scene-content OOD (unaugmented dataset frames work, mildly
augmented ones still work — model has learned the augmentation
distribution, only fails when the scene content itself diverges)
* hyper-specific memorisation (dry-run with augmentation also
collapses to '\n' — head is nailed to the exact unperturbed
training samples and only the retrain helps)
Usage:
lerobot-smolvla2-runtime --no_robot --policy.path=... \
--dataset.repo_id=... --dataset.episode=0 \
--dataset.start_frame=1000 \
--dataset.augment_at_inference
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So the operator can compare live joint values to the dataset's
``observation.state`` mean/std and spot when the robot's home pose is
several σ off the supervised support region. State OOD is the
remaining viable hypothesis for why the live LM head collapses to
``\n`` even though images are pixel-shape-matched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Print one warning the first time the robot observation provider runs
through, showing live camera resolution and the dataset's training
resolution, plus whether we resized. Lets the operator confirm at a
glance that the visual prefix really is being fed at the same shape
the model saw at training — instead of guessing whether the resize
fired silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for the LM head's empty-completion symptom on the live robot
(while the same checkpoint produced sensible subtask/plan/memory in
``--no_robot`` dry-run on dataset frames): the camera observation was
flowing into the model at its native resolution. A Mac/USB webcam
hands us 1280×720 or 1920×1080; the dataset was recorded at the
feature schema's ``observation.images.*['shape']`` resolution
(typically 480×640). SmolVLA's internal ``resize_with_pad(512, 512)``
*does* fit both — but with very different pad geometry, so visual
tokens at each tile carry different content than at training. Action
expert tolerates this; the tightly-supervised LM head goes OOD and
the head's distribution at position 0 collapses to its dominant mode
(``\n`` ×N then ``<end_of_utterance>`` for this checkpoint).
The fix: in ``_build_robot_observation_provider``, pre-compute the
camera-key → (H, W) target from ``ds_features`` and ``cv2.resize``
each live frame to that shape before tensorising. The downstream
``resize_with_pad`` then sees the same input geometry as training and
the LM head returns to producing readable subtask text under plain
greedy decoding — the same as dry-run.
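The geometry mismatch can be made concrete with a little arithmetic. This sketch assumes ``resize_with_pad`` preserves aspect ratio by scaling with the larger of the two size ratios and fills the rest with padding; where the padding is placed is not modelled, and the helper name is hypothetical:

```python
def resize_with_pad_geometry(height, width, target_h=512, target_w=512):
    """Return (resized_h, resized_w, pad_h, pad_w) for an aspect-preserving fit."""
    ratio = max(width / target_w, height / target_h)
    resized_h = int(height / ratio)
    resized_w = int(width / ratio)
    return resized_h, resized_w, target_h - resized_h, target_w - resized_w


webcam_720p = resize_with_pad_geometry(720, 1280)   # (288, 512, 224, 0)
dataset_480p = resize_with_pad_geometry(480, 640)   # (384, 512, 128, 0)
```

A 1280×720 frame fills only 288 of the 512 rows, while a 640×480 frame fills 384, so the same tile position covers different scene content unless the live frame is first resized to the dataset shape.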
Also drops the inference-time patches (``min_new_tokens``,
``temperature``, ``top_p`` overrides) on the four high-level callers.
They were band-aids around the visual-distribution shift, not a real
LM problem, and they drift inference off the training distribution.
Greedy argmax is what training matched. The ``select_message``
signature still accepts the knobs for callers that want them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempt only masked the tokenizer's eos_token_id during the
min_new_tokens prefix. The empty-completion symptom persisted because a
memorised SmolVLM head doesn't just want EOS — its top-1 at position 0
is *some* special token, and when EOS is masked the argmax shifts to a
sibling (``<|im_end|>``, ``<image>``, ``<fake_token_around_image>``,
``<row_X_col_Y>``, …). Those tokens survive generation but then get
stripped by ``decode(skip_special_tokens=True)``, so the runtime still
saw ``last_raw='(empty)'`` every chunk boundary.
Mask the full ``tokenizer.all_special_ids`` set instead. Forces the
head to commit to a normal vocabulary token before it can close or
quietly poison the turn.
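A toy greedy step showing why masking EOS alone is not enough. In real code the set would come from ``tokenizer.all_special_ids``; here the ids and logits are made up, and ``greedy_with_special_mask`` is a hypothetical helper:

```python
import math


def greedy_with_special_mask(logits, step, special_ids, min_new_tokens):
    """Greedy argmax that cannot pick a masked id before min_new_tokens."""
    if step < min_new_tokens:
        logits = [-math.inf if i in special_ids else x
                  for i, x in enumerate(logits)]
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary: ids 0 and 1 are special (say EOS and an image
# placeholder), ids 2 and 3 are ordinary words.
logits = [4.0, 3.5, 1.0, 0.2]
eos_only = greedy_with_special_mask(logits, step=0, special_ids={0}, min_new_tokens=3)
all_special = greedy_with_special_mask(logits, step=0, special_ids={0, 1}, min_new_tokens=3)
```

With only EOS masked the argmax moves to the sibling special token (id 1), which decoding would later strip; masking the full set forces an ordinary token (id 2).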
Also: when decode returns empty but tokens *were* generated, expose
the raw token ids and the special-tokens-included decoded string via
``policy._last_select_message_debug``. The runtime surfaces this in
the scrollback so the operator can see what the head is actually
emitting — distinguishing "head EOS-ing" from "head emitting image
placeholders" from "head emitting chat-template fragments".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot run confirmed the LM head is producing 0 tokens at every
chunk boundary (empty:N counter climbing, no exception in scrollback):
the model EOS-es at decode step 0. That's the memorisation collapse —
training reached text_loss=6e-6 by overfitting one trajectory whose
supervised subtask turn ended in EOS, and at inference the head's
argmax for token 0 is EOS regardless of the actual frame.
Two changes in select_message:
* ``min_new_tokens`` parameter masks the EOS logit to -inf until at
least N real tokens have been decoded. Without this the head's
"EOS first" prior produces an empty completion every single time.
* The runtime callers now pass ``min_new_tokens=5..10`` plus
``temperature=0.4..0.5`` + ``top_p=0.9``. Sampling at moderate
temperature with nucleus filtering also helps break the greedy
argmax collapse — when the model has memorised one continuation,
greedy keeps replaying it; nucleus sampling forces it to commit
to *some* coherent continuation that's well-supported by the
prefix even when greedy's top-1 is degenerate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two improvements for diagnosing why ``last_raw`` stays empty:
1. The autonomous panel-redraw thread calls console.clear() every
0.5 s, wiping any log lines the runtime printed since the last
redraw. So warnings from generation (``[warn] subtask gen failed:
...``, ``[info] subtask gen rejected (gibberish): ...``) flashed
for milliseconds and disappeared, leaving the operator blind.
Capture log_lines from each tick into a bounded scrollback
(last 12 entries) and render them inside the panel itself, below
the diag row. They now stick across redraws until rotated out.
2. ``empty`` counter for subtask gen. Persistent empty completions
are their own failure mode — the LM head EOS-es immediately from
the chat-template generation prompt, distinct from "generated
something but filter rejected it". The diag row now reads:
subtask diag repeat:0 gibberish:0 empty:14 last_raw: '(empty)'
^^^^^^^
plus a periodic log line every 10 empties so the cause is also
surfaced in the scrollback.
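The bounded scrollback from item 1 maps naturally onto ``collections.deque`` with ``maxlen``. A minimal sketch, with a hypothetical ``Scrollback`` class rather than the runtime's actual one:

```python
from collections import deque


class Scrollback:
    """Keep only the most recent log lines so each redraw can re-render them."""

    def __init__(self, max_lines=12):
        self._lines = deque(maxlen=max_lines)

    def extend(self, log_lines):
        """Append the log lines produced by one tick."""
        self._lines.extend(log_lines)

    def render(self):
        """Return the lines to draw inside the panel, oldest first."""
        return list(self._lines)


scrollback = Scrollback()
scrollback.extend(f"[warn] line {i}" for i in range(15))
visible = scrollback.render()  # the oldest three lines have rotated out
```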
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both HighLevelSubtaskFwd and LowLevelForward are gated on
'action queue is empty'. With LowLevelForward listed first, it refilled
the queue on the empty-queue tick before HighLevelSubtaskFwd got to
check — so the gate I added in the previous commit made the high-level
step a permanent no-op after the initial bootstrap. Visible symptom:
subtask string never advances past whatever bootstrap seeded, no
subtask_change events, memory stays unset, and the new overfit
diagnostics never appear on the panel because last_subtask_raw is
never written.
Move all high-level steps (subtask, memory, interjection, vqa) ahead
of LowLevelForward. On an empty-queue tick the subtask refreshes
first, the new string flows into the next chunk's prompt, then
LowLevelForward generates the chunk, then DispatchAction drains it.
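A toy simulation of the ordering bug, with plain functions standing in for the pipeline steps. The state layout and step bodies are illustrative only; what matters is that both gated steps test the same empty-queue condition:

```python
def high_level_subtask(state):
    if not state["queue"]:  # gated on an empty action queue
        state["subtask_refreshes"] += 1


def low_level_forward(state):
    if not state["queue"]:  # same gate: refill when empty
        state["queue"] = ["action"] * state["chunk_size"]


def dispatch_action(state):
    if state["queue"]:
        state["queue"].pop(0)


def simulate(steps, ticks=20, chunk_size=5):
    """Run the steps in order for `ticks` ticks; count subtask refreshes."""
    state = {"queue": [], "chunk_size": chunk_size, "subtask_refreshes": 0}
    for _ in range(ticks):
        for step in steps:
            step(state)
    return state["subtask_refreshes"]


old_order = simulate([low_level_forward, high_level_subtask, dispatch_action])
new_order = simulate([high_level_subtask, low_level_forward, dispatch_action])
```

In the old order the queue is always refilled before the high-level gate is checked, so the refresh count stays at 0; in the new order it fires once per chunk.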
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autonomous-mode panel now surfaces what the model is *actually*
producing at every chunk boundary, not just what got accepted:
* last_subtask_raw most recent generation (accepted or not)
* subtask_repeat_count times the same accepted string regenerated
* subtask_gibberish_count rejections by the gibberish filter
* memory_gibberish_count / plan_gibberish_count for the other heads
These let the operator see memorisation collapse without scrolling
back through logs:
subtask diag repeat:8 gibberish:0 last_raw: '<same string>'
^^^^^^^^^^ → model can't move past current phase
subtask diag repeat:0 gibberish:14 last_raw: 'Ass:::'
^^^^^^^^^^^^^^^^^^^^^^ → LM collapsed to template salad
Also silences the per-action ``Relative goal position magnitude had
to be clamped`` warning. The clamp fires every dispatch tick when the
model emits stale joint targets, flooding the panel at ctrl_hz=30.
Replaced the bare ``logging.warning`` call in robots/utils.py with a
module logger so it can be selectively raised to ERROR. Operators
who need the per-tick clamp detail can use ``-v``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third stdin channel alongside 'task:' and bare interjections:
rephrase: <text>
Swaps state['task'] with the new string while preserving plan/memory/
subtask. Lets the operator probe how robust the model is to wording
variations of the same task — the trained augmentation provided
n_task_rephrasings≈30 task wordings per dataset task, and this is the
direct way to exercise that distribution at inference without
generating a fresh plan via user_interjection_response.
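A sketch of how the three stdin channels could be routed. ``state['task']`` is named above; the other state keys, the return values, and the function name are hypothetical:

```python
def handle_stdin_line(line, state):
    """Route one operator line to the task, rephrase, or interjection channel."""
    text = line.strip()
    if text.startswith("task:"):
        # New task: replace it and clear stale context.
        state["task"] = text[len("task:"):].strip()
        for key in ("plan", "memory", "subtask"):
            state[key] = None
        return "task"
    if text.startswith("rephrase:"):
        # Same task, new wording: plan/memory/subtask stay untouched.
        state["task"] = text[len("rephrase:"):].strip()
        return "rephrase"
    state["pending_interjection"] = text
    return "interjection"


state = {"task": "stack the cubes", "plan": "1. reach", "memory": "none", "subtask": "reach"}
channel = handle_stdin_line("rephrase: put the cubes on top of each other", state)
```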
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both stdin handlers (autonomous mode and rich REPL) gated 'task:' to
'only if no task is set yet' — once the initial task existed, typing
'task: <new task>' silently fell through to the interjection branch.
Make 'task:' always override the active task and clear stale
plan/memory/subtask so the next high-level pass regenerates context
from scratch for the new task.
For rephrasings within the same task, the interjection path
(user_interjection_response recipe) is still the right channel — it
refreshes the plan and emits a paired <say> in one trained call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime is single-threaded. `HighLevelSubtaskFwd` at HzTrigger(1.0)
fires every loop iteration on MPS because each `select_message` call
takes ~2 s, longer than its 1/hz period. The whole tick stretches to
~2.5 s, so `DispatchAction` (HzTrigger 30) only pops a single action per
loop iteration — the queue drains at ~0.4 actions/sec instead of 30 and
the robot barely moves between chunk refreshes.
Two changes, both purely about scheduling — no threading:
* Gate `HighLevelSubtaskFwd` to fire only when the action queue is
empty, matching `LowLevelForward`'s refresh condition. The slow LLM
call now happens during the "think" phase between chunks, not on
every dispatch tick. Restores a clean sense → think → act cycle.
* `DispatchAction` catches up via wall-clock: when the trigger fires
after a stall, pop `round(elapsed * hz)` entries and send only the
most recent. Open-loop chunks are timestamped at ctrl_hz; sending
stale joint targets one-by-one would just lag the robot further
behind. The dynamixel smooths to the latest goal anyway.
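
The wall-clock catch-up can be sketched as follows. This is an illustration only: ``pop_catch_up`` is a hypothetical helper, and the "at least one action per firing" floor is an assumption rather than something the commit states:

```python
from collections import deque


def pop_catch_up(queue: deque, elapsed_s: float, hz: float):
    """Pop the actions that should already have been sent; return the newest."""
    n_due = max(1, round(elapsed_s * hz))  # assumed floor: one action per firing
    latest = None
    for _ in range(min(n_due, len(queue))):
        latest = queue.popleft()  # older entries are discarded as stale
    return latest  # None when the queue was already empty


queue = deque(range(50))  # stand-in for a 50-step action chunk
action = pop_catch_up(queue, elapsed_s=0.5, hz=30.0)  # a 0.5 s stall at 30 Hz
```

With a 0.5 s stall at 30 Hz, fifteen entries are popped and only the last of them is sent, leaving 35 queued.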
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous refresh threshold (refresh once the queue dropped to roughly chunk_size // 2) made each
new chunk *telescope* past the previous one: at queue=25, we kicked
off a new chunk forward from the current observation, but by the
time the new chunk's first action was actually dispatched, the
robot had executed the remaining 25 actions of the previous chunk
— so the new chunk was planned from an observation 25+ steps stale.
Canonical sense → think → act loop: execute the full chunk, then
re-observe and replan. Refresh only when the queue is empty. Every
step of every chunk still gets dispatched to the robot (no
behaviour change there), but each chunk is now planned from an
observation that's at most one chunk's worth of dispatch latency
old, not "previous chunk's worth of stale state on top of that".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary regularisers to attack the
``text_loss=6e-6 = memorised one dataset`` failure mode that's
making the model collapse on real-robot input:
1. **Per-component prompt dropout** (Pi0.7 §V.E / plan's
``feat/pi05-prompt-dropout`` follow-up).
``SmolVLA2ChatTokenizerStep`` gains
``plan_dropout_prob`` / ``memory_dropout_prob`` /
``subtask_dropout_prob`` knobs (default 0.0 — opt-in). At training,
non-target messages whose rendered content starts with
``Plan:`` / ``Memory:`` / ``Current subtask:`` etc. are dropped
with their respective probability before tokenisation, with a
deterministic per-sample RNG keyed off the dataset ``index``.
``target_message_indices`` is re-mapped so the supervision still
lands on the right turn. Forces the model to handle missing
plan/memory/subtask context — directly attacks the real-robot
collapse where a stale or empty plan field puts the prompt OOD.
Surfaced on ``SmolVLA2Config`` as three floats so they're
``--policy.<knob>=<value>``-controllable from the train CLI;
plumbed through ``make_smolvla2_pre_post_processors``.
2. **Image augmentation** is already wired in lerobot via
``--dataset.image_transforms.enable=true`` (torchvision v2
ColorJitter + SharpnessJitter + RandomAffine, default 3 of 6
sampled per frame). No code change needed — just a CLI flag.
``examples/training/smolvla2_hirobot.slurm`` shows the full
training command with both enabled. Drop-in replacement for the
ad-hoc SLURM script Pepijn was using locally; same args, plus the
three dropout probs and the image-transforms flag.
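
A rough sketch of the per-component dropout described in item 1, assuming each message's ``content`` is a plain string. ``drop_context_messages`` and the probability values are hypothetical; the real step lives in ``SmolVLA2ChatTokenizerStep`` and may differ in detail:

```python
import random

# Prefix -> dropout probability; the values here are illustrative only.
DROPOUT_PROBS = {"Plan:": 0.3, "Memory:": 0.3, "Current subtask:": 0.3}


def drop_context_messages(messages, target_indices, sample_index, probs=DROPOUT_PROBS):
    """Drop non-target context messages and re-map the target indices."""
    rng = random.Random(sample_index)  # same dataset index -> same decisions
    targets = set(target_indices)
    kept, remap = [], {}
    for old_idx, msg in enumerate(messages):
        prob = next(
            (p for prefix, p in probs.items() if msg["content"].startswith(prefix)),
            0.0,
        )
        if old_idx not in targets and rng.random() < prob:
            continue  # dropped before tokenisation
        remap[old_idx] = len(kept)
        kept.append(msg)
    return kept, [remap[i] for i in target_indices]
```

Target messages are never dropped, so every entry of ``target_indices`` has a new position to map to.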
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``LowLevelForward`` was calling ``select_action()`` once per
``chunk_hz`` tick. SmolVLA's ``select_action`` is a thin queue-pop:
it returns one action per call and only re-runs the expensive
flow-matching forward when its private internal queue empties.
Result: we got one action back per chunk_hz tick (1 Hz by default),
``DispatchAction`` at ctrl_hz=30 popped it instantly, and the queue
then sat empty for ~1 s waiting for the next tick. Net throughput was
1 dispatched action/sec instead of the 30 we wanted.
Switch to ``predict_action_chunk`` and enqueue every step of the
returned ``(batch, n_action_steps, action_dim)`` chunk. Refresh
only when the queue is below half a chunk so we don't burn one
flow-matching forward per chunk_hz tick — saves ~5x inference cost
on this hot path. At ctrl_hz=30, chunk_size=50, the queue drains
in ~1.7s before the next refresh, giving smooth dispatch at the
control rate the robot was trained on.
Side effect: ``state['last_chunk_size']`` records how many actions
the most recent chunk produced — useful for the panel later if we
want to surface "chunks generated" alongside "dispatched".
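
A simplified sketch of the refill logic, using nested lists in place of tensors. ``maybe_refill`` and the ``predict_chunk`` callable are hypothetical stand-ins for the runtime step and the policy's chunk prediction:

```python
from collections import deque


def maybe_refill(queue: deque, predict_chunk, chunk_size: int, state: dict) -> bool:
    """Run one chunk forward only when the queue is below half a chunk."""
    if len(queue) >= chunk_size // 2:
        return False  # enough actions buffered; skip the expensive forward
    chunk = predict_chunk()  # nested list shaped (batch, n_steps, action_dim)
    steps = chunk[0]  # assume a single robot: take batch element 0
    queue.extend(steps)  # enqueue every step, not just the first one
    state["last_chunk_size"] = len(steps)
    return True


def fake_predict_chunk():
    """Stand-in for the policy: one batch element, 50 steps, 6-dim actions."""
    return [[[0.0] * 6 for _ in range(50)]]


queue, state = deque(), {}
refilled_first = maybe_refill(queue, fake_predict_chunk, chunk_size=50, state=state)
refilled_again = maybe_refill(queue, fake_predict_chunk, chunk_size=50, state=state)
```

The second call is a no-op because the queue still holds a full chunk.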
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot run was unreadable for two reasons:
1. The panel surfaced ``queued actions: 0`` (always zero — dispatch
pops faster than chunk_hz generates) and gave no signal that
actions were actually reaching the robot. The only sign of life
was the safety-clamp warning lines scrolling past.
2. The text head consistently collapses to ``the`` / ``Ass``
fragments on real-camera input (memorisation wall). The old
gibberish filter caught ``":":":"`` JSON salad but let
single-token fragments through, and the ``[info] subtask gen
produced no text this tick`` line flooded the panel every second.
Changes:
* ``DispatchAction`` bumps ``state["actions_dispatched"]`` each
tick; panel renders it next to queue depth. Operator can see
the policy IS issuing actions even when text is broken.
* ``_looks_like_gibberish`` now also rejects:
- too few unique alphabetic tokens (``the``, ``the the``, ...)
- chat-template marker leakage (``Assistant:``, ``Ass\n::``)
catching the actual failure mode on real-robot frames.
* Gibberish rejections log only the first occurrence + every 30th
after that, with a count, so the panel stays legible.
* Empty completions no longer log at all (was every tick).
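
The two new rejection rules and the rate-limited logging might look roughly like this. The marker list, the threshold, and the function names are illustrative assumptions; logging on counts 1, 30, 60, ... is one reading of "first occurrence + every 30th":

```python
import re

CHAT_MARKERS = ("Assistant:", "User:", "Ass\n")  # illustrative marker list
MIN_UNIQUE_WORDS = 2  # illustrative threshold


def looks_like_gibberish(text: str) -> bool:
    """Reject chat-template leakage and near-empty single-word output."""
    if any(marker in text for marker in CHAT_MARKERS):
        return True
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return len(words) < MIN_UNIQUE_WORDS


def should_log_rejection(count: int, every: int = 30) -> bool:
    """Log the first rejection and every `every`-th one after it."""
    return count == 1 or count % every == 0
```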
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dry-run REPL had a clean ANSI-clear-+-rich-panel layout via
``_redraw`` showing task / subtask / plan / memory / queued-actions /
pending-tool-calls; autonomous mode just had bare ``> `` plus log
lines scrolling past the user. Same data, two presentations.
Extract ``_make_state_panel_renderer(runtime, mode_label=...)`` and
use it from both ``_run_repl`` (called per user input) and
``_run_autonomous`` (called both on user input *and* on a 0.5s
background timer so subtask / plan / memory refreshes from the
runtime's own loop become visible without the user typing anything).
Title bar shows ``dry-run`` vs ``autonomous`` so it's obvious which
mode you're in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects:
* ``list[block]`` → keep text blocks, drop images
* ``None`` → ``[{type:text, text:""}]``
* ``str`` / other → ``[{type:text, text:str(content)}]``
Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there's no prior
memory, which then renders as no-content / role-marker-only and the
model hallucinates ``Assistant:`` fragments. Subtask gen got further
because its prompt always has at least the task string.
Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.
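
The three normalisation cases listed above can be sketched as below. This is a stand-in written for illustration, not the body of ``_strip_lerobot_blocks``; ``normalize_content`` and ``normalize_messages`` are hypothetical names:

```python
def normalize_content(content):
    """Coerce a message's content into a list of text blocks."""
    if isinstance(content, list):
        # Keep text blocks, drop images and anything else.
        return [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
    if content is None:
        return [{"type": "text", "text": ""}]
    return [{"type": "text", "text": str(content)}]


def normalize_messages(messages):
    """Return copies of the messages with normalised content."""
    return [{**m, "content": normalize_content(m.get("content"))} for m in messages]
```

Running the same function at training and inference time is what keeps the two prompt shapes aligned.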
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLM's chat template (and many other multimodal templates) declares
``message['content']`` as a list of typed blocks and iterates it
expecting dicts with a ``'type'`` field:
    {% for line in message['content'] %}
      {% if line['type'] == 'text' %}{{ line['text'] }}
      {% elif line['type'] == 'image' %}{{ '<image>' }}
      {% endif %}
    {% endfor %}
When the caller passes ``content`` as a plain ``str`` (which we did
throughout ``_msgs_for_subtask`` / ``_msgs_for_memory`` etc.), Jinja
silently iterates the string character-by-character. ``'P'['type']``
returns nothing; neither branch fires; *no text tokens get emitted*.
The model receives a prompt containing only role markers
(``User:<end_of_utterance>\nAssistant:``) and predictably continues by
emitting ``Assistant:`` fragments — the gibberish ``subtask: Ass\n::``
on the runtime panel.
Before calling ``apply_chat_template``, walk the messages and rewrite
any string ``content`` into ``[{'type': 'text', 'text': content}]``.
The template's text branch then fires correctly and the model sees
the actual user/assistant text, not just structural tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``PolicyProcessorPipeline.__call__`` already wraps its input via
``to_transition`` (defaulting to ``batch_to_transition``) before
running the steps, and unwraps via ``to_output`` (defaulting to
``transition_to_batch``) afterwards. The input format is therefore a
*flat batch dict* keyed by ``observation.*`` / ``action`` / etc., not
an ``EnvTransition``.
The previous attempt pre-wrapped the observation into a transition with
``TransitionKey.OBSERVATION`` as the key, then handed *that* to the
pipeline — which fed it to ``batch_to_transition``, which looked for
top-level ``observation.*`` entries, found none (they were nested
inside the enum key), and produced an empty observation. Every step
then bailed with ``ObservationProcessorStep requires an observation
in the transition.``
Pass the flat dict from ``build_inference_frame`` straight to the
preprocessor — it does the wrap/unwrap itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``EnvTransition`` is declared as a ``TypedDict`` keyed by
``TransitionKey.OBSERVATION.value`` (the string ``'observation'``),
but every concrete ``ProcessorStep`` in the pipeline indexes the
transition with the enum *member*
(``transition[TransitionKey.OBSERVATION]`` /
``transition.get(TransitionKey.OBSERVATION)``).
Those are two different keys in a Python dict — string key vs enum
key — so steps couldn't find the observation we'd placed under the
string variant, and bailed every tick with
``ObservationProcessorStep requires an observation in the
transition``.
Build the transition with the enum members directly. Matches how
``BatchProcessor``, ``RelativeActionProcessor``, ``HilProcessor``,
etc. read the dict.
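
The key mismatch is easy to reproduce in isolation. The sketch below uses a hypothetical ``Key`` enum as a stand-in for ``TransitionKey``; it only demonstrates the dict-lookup behaviour, not the pipeline itself:

```python
from enum import Enum


class Key(Enum):  # hypothetical stand-in for the pipeline's key enum
    OBSERVATION = "observation"


by_value = {Key.OBSERVATION.value: {"state": [0.1, 0.2]}}  # string key
by_member = {Key.OBSERVATION: {"state": [0.1, 0.2]}}  # enum-member key

found_in_value_dict = by_value.get(Key.OBSERVATION)  # lookup by member misses
found_in_member_dict = by_member.get(Key.OBSERVATION)  # lookup by member hits
```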
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``robot.get_observation()`` on omx_follower (and most lerobot robots)
returns:
* per-joint scalar floats with ``.pos`` suffix
(``shoulder_pan.pos: 0.123``, ``shoulder_lift.pos: 0.456``, ...)
* per-camera ndarrays keyed by the camera config name (``wrist:
ndarray(H,W,3)``)
But the trained policy expects:
* single ``observation.state: tensor[N_joints]`` vector
* image keys prefixed: ``observation.images.<cam_key>:
tensor[1, 3, H, W]``
``prepare_observation_for_inference`` only handles the tensor /
batch-dim / device step — it crashes on scalar floats with
``expected np.ndarray (got float)``. The right helper is
``build_inference_frame`` which uses the dataset's feature schema
(``ds_meta.features``) to:
1. extract the right raw keys per dataset feature,
2. fold ``shoulder_pan.pos`` / ``shoulder_lift.pos`` / ...
into a single ``observation.state`` ndarray,
3. prefix camera keys with ``observation.images.``,
4. delegate to ``prepare_observation_for_inference`` for the
tensor / batch / device step.
Pass ``ds_meta.features`` into the observation provider and switch
to ``build_inference_frame`` when available; fall back to the bare
``prepare_observation_for_inference`` only when no dataset is
provided (rare — autonomous mode already requires it).
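
Steps 2 and 3 above amount to a reshaping of the raw robot dict. A simplified sketch with plain lists follows; ``fold_observation`` is a hypothetical helper, whereas the real ``build_inference_frame`` reads the names from ``ds_meta.features`` and goes on to produce tensors:

```python
def fold_observation(raw: dict, state_names: list[str], camera_names: list[str]) -> dict:
    """Fold per-joint scalars into one state vector and prefix camera keys."""
    frame = {"observation.state": [float(raw[name]) for name in state_names]}
    for cam in camera_names:
        frame[f"observation.images.{cam}"] = raw[cam]
    return frame


raw = {
    "shoulder_pan.pos": 0.123,
    "shoulder_lift.pos": 0.456,
    "wrist": "<H x W x 3 image placeholder>",
}
frame = fold_observation(
    raw,
    state_names=["shoulder_pan.pos", "shoulder_lift.pos"],
    camera_names=["wrist"],
)
```

The order of ``state_names`` fixes the order of the state vector, which is why it has to come from the dataset schema the policy was trained on.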
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The policy preprocessor pipeline is transition-shaped — its steps
read ``TransitionKey.OBSERVATION`` off an ``EnvTransition`` dict, not
a flat ``RobotObservation`` dict. Passing the raw observation through
made every step bail with
``ObservationProcessorStep requires an observation in the transition``,
which the runtime swallowed at warning level. ``select_message`` then
got called with no ``observation.images.*`` features and crashed
with ``All image features are missing from the batch``.
Mirror ``lerobot-record``'s preamble:
1. ``prepare_observation_for_inference`` → numpy → torch, ``CHW``
image layout, ``[0,1]`` scaling, add batch dim, move to device.
2. Wrap into an ``EnvTransition`` (``{TransitionKey.OBSERVATION.value:
...}`` plus ``COMPLEMENTARY_DATA: {}`` and ``None``s for the rest)
so transition-aware steps see the keys they expect.
3. Run preprocessor.
4. Unwrap the transition's ``OBSERVATION`` slot to get the final
flat dict the policy's ``select_action`` / ``select_message``
consume.
Image features now reach the policy; the autonomous loop produces
real actions instead of swallowing warnings every tick.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but
``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner
value must be the actual ``CameraConfig`` subclass instance for the
chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts
blew up in ``RobotConfig.__post_init__`` with
``AttributeError: 'dict' object has no attribute 'width'`` when it
iterated cameras and tried to read attributes.
Look up the right subclass per-camera by its ``"type"`` field via
``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import
dance we already do for ``RobotConfig``: eagerly walk
``lerobot.cameras``'s submodules so the registry is populated
before lookup). Construct an instance with the rest of the dict's
fields. On an unknown camera type, raise a clean ``ValueError``
listing the available choices.
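
In outline, the per-camera lookup could look like this. ``FakeOpenCVCameraConfig`` and ``CAMERA_REGISTRY`` are hypothetical stand-ins for the real config class and the choice registry; the field names are assumptions for the example:

```python
import json
from dataclasses import dataclass


@dataclass
class FakeOpenCVCameraConfig:  # hypothetical stand-in for a real config class
    index_or_path: int
    width: int
    height: int
    fps: int


CAMERA_REGISTRY = {"opencv": FakeOpenCVCameraConfig}  # stand-in for the registry


def build_camera_configs(raw: dict[str, dict]) -> dict:
    """Turn {name: {"type": ..., **fields}} into {name: config instance}."""
    configs = {}
    for name, spec in raw.items():
        fields = dict(spec)  # copy so the caller's dict is untouched
        cam_type = fields.pop("type", None)
        if cam_type not in CAMERA_REGISTRY:
            choices = ", ".join(sorted(CAMERA_REGISTRY))
            raise ValueError(
                f"Unknown camera type {cam_type!r} for {name!r}; available: {choices}"
            )
        configs[name] = CAMERA_REGISTRY[cam_type](**fields)
    return configs


cli_value = '{"wrist": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30}}'
cameras = build_camera_configs(json.loads(cli_value))
```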
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``RobotConfig._choice_registry`` is populated as a side-effect of
each robot's ``@RobotConfig.register_subclass`` decorator running,
and those decorators only fire when the corresponding
``lerobot.robots.<name>`` module is imported. The package's
``__init__.py`` doesn't import them — instead ``make_robot_from_config``
does it lazily in its big if/elif chain.
``_build_robot`` jumped the gun: it called
``RobotConfig.get_choice_class(robot_type)`` before any robot module
had been imported, so the registry was empty and every
``--robot.type=<X>`` produced ``KeyError: 'X'`` (e.g.
``KeyError: 'omx_follower'``).
Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules`` and
``importlib.import_module`` each one before the lookup. ~200ms on the
first invocation, negligible for an autonomous run. On a real
``KeyError`` (typo / unsupported robot), raise a clean ``ValueError``
listing the registry's available choices instead of a bare KeyError.
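
The eager import walk can be sketched with the standard library alone. Here the stdlib ``logging`` package stands in for ``lerobot.robots``, and ``import_all_submodules`` is a hypothetical helper; skipping underscore-prefixed modules is an assumption made for the example:

```python
import importlib
import pkgutil


def import_all_submodules(package_name: str) -> list[str]:
    """Import every direct submodule of a package so import-time hooks run."""
    package = importlib.import_module(package_name)
    imported = []
    for info in pkgutil.iter_modules(package.__path__):
        if info.name.startswith("_"):
            continue  # skip private modules such as __main__
        full_name = f"{package_name}.{info.name}"
        importlib.import_module(full_name)  # decorators in the module run here
        imported.append(full_name)
    return imported


# The stdlib `logging` package stands in for `lerobot.robots` here.
loaded = import_all_submodules("logging")
```

Any registration decorator at module top level fires during ``import_module``, so a registry lookup made afterwards sees every choice.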
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled action-norm safety clip duplicated what every
``RobotConfig`` already exposes — ``max_relative_target`` — and at
the wrong layer (after postprocess but before send_action, instead
of inside the robot driver where every other lerobot entry point
puts it). The norm clip also rejected entire actions instead of
clipping per-motor relative motion, so a single rogue joint would
kill the whole tick.
Replace with ``--robot.max_relative_target``: a string parsed as
either a bare float (uniform per-motor cap) or a JSON object
mapping motor name → cap. Passed through to
``RobotConfig(max_relative_target=...)`` at robot construction;
the driver's ``send_action`` clips each commanded joint position
relative to the current measured one before issuing it on the bus —
same behaviour ``lerobot-record`` ships.
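
A hedged sketch of the flag parsing and the per-motor clamp. Both helpers are hypothetical and written with plain dicts; the actual clipping lives inside the robot driver's ``send_action``:

```python
import json


def parse_max_relative_target(raw: str):
    """Parse a bare float or a JSON object mapping motor name -> cap."""
    try:
        return float(raw)
    except ValueError:
        caps = json.loads(raw)
        if not isinstance(caps, dict):
            raise ValueError(f"expected a float or a JSON object, got {raw!r}")
        return {name: float(cap) for name, cap in caps.items()}


def clip_to_relative_target(goal: dict, present: dict, cap) -> dict:
    """Clamp each joint's commanded position to present +/- its cap."""
    clipped = {}
    for name, target in goal.items():
        limit = cap[name] if isinstance(cap, dict) else cap
        delta = max(-limit, min(limit, target - present[name]))
        clipped[name] = present[name] + delta
    return clipped
```

Unlike the old norm check, one out-of-range joint is clamped on its own while the other joints pass through unchanged.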
Also lower the ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new
chunk per second is what the trained checkpoint can comfortably
keep up with on common hardware and gives smoother motion than
sub-second chunk regenerations (no RTC interpolation between
chunks yet).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**#1 Plan-update phase reports correct skip count.**
``_run_plan_update_phase`` only ran ``run_plan_updates`` for episodes
with at least one interjection but hardcoded ``episodes_skipped=0``.
The summary undercounted skipped episodes. Now returns
``len(records) - processed`` so processed + skipped == total.
**#2 ``run_hf_job.py`` installs ``openai``.**
The ``CMD`` block does ``pip install --no-deps lerobot[branch]`` then
explicitly lists transitive deps. ``openai`` was missing — and since
``VlmConfig.backend`` defaults to ``"openai"``, the job would have
``ImportError``'d when ``vlm_client._make_openai_client`` ran.
**#3 Dedupe subtask-span reconstruction.**
Module 1's ``_reconstruct_subtasks_from_rows`` (no ``and spans`` guard)
and Module 2's ``_read_subtask_spans`` (with the guard) had near-
identical logic. Promoted to ``reconstruct_subtask_spans`` in
``reader.py`` using the safer guarded form. Both modules now import
the single helper.
**#5 Atomic staging.py JSONL writes.**
Mirroring the parquet-writer fix from an earlier review round:
``EpisodeStaging.write`` now writes to a sibling ``.tmp`` and
``Path.replace`` atomically. A crash mid-write can no longer leave a
half-written JSONL that ``read()`` would then fail to parse.
**#6 Atomic ``info.json`` write.**
Same pattern in ``executor._ensure_annotation_metadata_in_info`` —
``info.json`` is load-bearing for dataset metadata, so partial writes
brick the dataset.
**#7 Writer's role-key guard.**
``_normalize_persistent_row`` and ``_normalize_event_row`` accessed
``row["role"]`` directly while every other field used ``.get()``.
Pre-validate ``"role" in row`` and raise a friendly ``ValueError``
naming the row, so a future module that accidentally drops ``role``
fails with a triagable message instead of a bare KeyError deep in the
writer.
**#8 Last subtask span's ``end`` extends to episode end.**
``reconstruct_subtask_spans`` (the new shared helper) takes an optional
``episode_end_t``. When provided, the final span's ``end`` is closed
to that timestamp instead of equalling its own ``start`` (zero
duration). Both Module 1's plan-update pass and Module 2's interjection
anchoring pass ``record.frame_timestamps[-1]``, so downstream "current
subtask at refresh_t" lookups no longer miss refreshes that land
inside the final span.
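
To illustrate item #8, here is a simplified span reconstruction. The row schema (``timestamp`` / ``content``) and both function names are assumptions made for this sketch, not the signature of ``reconstruct_subtask_spans``:

```python
def reconstruct_spans(rows, episode_end_t=None):
    """Build text/start/end spans from subtask rows."""
    rows = sorted(rows, key=lambda r: r["timestamp"])
    spans = []
    for i, row in enumerate(rows):
        start = row["timestamp"]
        if i + 1 < len(rows):
            end = rows[i + 1]["timestamp"]  # a span ends where the next begins
        elif episode_end_t is not None:
            end = max(start, episode_end_t)  # close the final span at episode end
        else:
            end = start  # old behaviour: zero-duration final span
        spans.append({"text": row["content"], "start": start, "end": end})
    return spans


def subtask_at(spans, t):
    """Return the subtask text whose span contains time t, or None."""
    for i, span in enumerate(spans):
        is_last = i == len(spans) - 1
        if span["start"] <= t < span["end"] or (is_last and t == span["end"]):
            return span["text"]
    return None
```

Without ``episode_end_t`` a lookup at any time after the last row's start finds nothing; with it, the final span covers the rest of the episode.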
Sweep: 66 passed, 0 failed. Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both tests were stale relative to design changes that landed earlier on
this branch. Update the tests to match the current production contract.
**``test_module1_attaches_video_block_to_subtask_prompt``**
The test took ``captured[0]`` and asserted on its content blocks, but
Module 1 issues several sub-prompts and the rephrasings call (which is
text-only, no video block) usually lands first. Two fixes:
* The test's intent is "the subtask prompt carries the video block" —
not "the first prompt carries it". Pick the call by content
(``"atomic subtasks"`` keyword in the text block) so the test is
resilient to future reordering of unrelated sub-prompts.
* Set ``n_task_rephrasings=0`` so the rephrasings call is skipped
entirely — keeps the test focused on ``_generate_subtasks``.
**``test_module2_mid_episode_emits_paired_interjection_and_speech``**
Two issues both rooted in design changes on the branch:
1. ``InterjectionsAndSpeechModule._mid_episode_interjections`` now
anchors interjections on subtask boundaries from Module 1's staging
tree, bailing out with zero rows when no spans exist. The production
executor runs Module 1 first; the test ran Module 2 in isolation.
Reproduce the contract by seeding two ``style=subtask`` rows in the
staging before calling Module 2 — gives it the single ``0 → 1``
boundary it needs.
2. The test's stub responder used the marker ``"ONE realistic
interruption"`` to match the interjection prompt, but that string is
from a previous prompt version. The current
``module_2_interjection.txt`` says ``"Write ONE interjection..."`` —
the old prompt asked for counterfactual interjections (e.g. "skip the
wipe"), the new one anchors on the upcoming subtask. Marker updated
to ``"Write ONE interjection"``; canned response wording aligned to
the new design.
Sweep on the language stack: 66 passed, 0 failed (was 64 passed, 2
failed). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement, so Python parsed it as a nested function
definition that was never reached and never bound to the class.
Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
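
The locking pattern can be sketched on a toy class. ``CachedDecoder`` is hypothetical and its "decoding" is a placeholder; only the lock placement mirrors what the paragraph describes:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CachedDecoder:
    """Hypothetical provider with a lock-guarded cache and one-shot warning."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}
        self._warned = False
        self.warnings = []

    def get(self, key):
        with self._lock:  # guarded read
            if key in self._cache:
                return self._cache[key]
        value = key * 2  # stand-in for slow decoding, done outside the lock
        with self._lock:  # guarded write; the first writer wins
            return self._cache.setdefault(key, value)

    def warn_once(self, message):
        with self._lock:  # check-and-set happens as one step
            if self._warned:
                return
            self._warned = True
        self.warnings.append(message)


decoder = CachedDecoder()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(decoder.get, [1, 2, 3] * 20))
    list(pool.map(decoder.warn_once, ["decode failed"] * 20))
```

Holding the lock only around dictionary access keeps slow decode work from serialising the worker threads, at the cost of occasionally decoding the same key twice.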
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
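
The same write-then-rename pattern, shown here with JSON lines so the sketch runs on the standard library alone (the parquet case would swap in ``pq.write_table(new_table, tmp)``). ``atomic_write_jsonl`` is a hypothetical helper:

```python
import json
import tempfile
from pathlib import Path


def atomic_write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write rows to a sibling .tmp file, then rename it over the target."""
    tmp = path.with_name(path.name + ".tmp")  # same directory, same filesystem
    with tmp.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    tmp.replace(path)  # rename over the old file in one step


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "episode_0.jsonl"
    atomic_write_jsonl(target, [{"role": "assistant", "content": "grasp"}])
    written = target.read_text(encoding="utf-8")
    leftover_tmp = (Path(d) / "episode_0.jsonl.tmp").exists()
```

Keeping the temporary file next to the target matters: a rename is only a single-step swap when both paths are on the same filesystem.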
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five Module 1 sub-prompts (`_derive_task_from_video`,
`_generate_task_rephrasings`, `_generate_subtasks`, `_generate_plan`,
`_generate_memory`) all repeated the same shape:
    result = self.vlm.generate_json([messages])[0]
    if isinstance(result, dict) and isinstance(result.get(<field>), <type>):
        ...
…each spelled with slightly different field names + post-processing.
Three small helpers replace it:
* `_vlm_field(messages, field)` — single VLM call, returns
``result[field]`` or ``None``. Centralizes the
``generate_json([m])[0]`` + ``isinstance(dict)`` dance.
* `_text_message(text)` — wraps a string in the canonical user-message
shape every text-only prompt builds inline.
* `_video_message(record, prompt)` — combines the episode video block
with a prompt; replaces the duplicated video-block construction
inside `_generate_subtasks` (which previously inlined the same
``use_video_url``/``frames_per_second``/``max_video_frames`` branches
that `_episode_video_block` already implements).
Net -35 LOC. Each call site now is 3-5 lines instead of 10-20. The
public method signatures are unchanged so tests don't move.
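
A standalone approximation of the first two helpers. `StubVlm` is a hypothetical client used only so the sketch runs, and the extra `expected_type` argument is an assumption added for illustration (the commit describes `_vlm_field(messages, field)`):

```python
class StubVlm:
    """Hypothetical client: returns one parsed JSON result per conversation."""

    def __init__(self, canned):
        self.canned = canned

    def generate_json(self, batch):
        return [self.canned for _ in batch]


def text_message(text: str) -> list[dict]:
    """Wrap a string in a single user message with one text block."""
    return [{"role": "user", "content": [{"type": "text", "text": text}]}]


def vlm_field(vlm, messages, field, expected_type=object):
    """One VLM call; return result[field] if well-formed, else None."""
    result = vlm.generate_json([messages])[0]
    if (
        isinstance(result, dict)
        and field in result
        and isinstance(result[field], expected_type)
    ):
        return result[field]
    return None
```

A call site then shrinks to something like `vlm_field(vlm, text_message(prompt), "subtasks", list)` followed by its own post-processing.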
Drive-by: `_task_seems_bad` collapsed via SIM103 fix; `zip` in
`run_plan_updates` annotated `strict=True` per ruff B905.
Tests: same 2 pre-existing module-impl failures
(`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`) —
they were failing on `origin/feat/language-annotation-pipeline` before
this commit and continue to do so for the same reasons. 61/63 in the
language stack pass; pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve conflicts and pull in the latest PR 1 fixes.
Conflicts:
- pyproject.toml: PR 1 added `lerobot-rollout` and PR 2 added
`lerobot-annotate` to the same `[project.scripts]` block. Kept both.
- uv.lock: dropped both sides and regenerated against the merged
`pyproject.toml` (PR 2 dropped the `datatrove` dep when distribution
moved to HF Jobs; PR 1's lock didn't have it).
Test follow-up:
- `tests/annotations/test_pipeline_recipe_render.py` — PR 1 deleted
`src/lerobot/configs/recipes/pi05_hirobot.yaml` (review feedback:
remove the canonical-recipe file; recipes are user-supplied). The
cross-PR contract this test guards is "the recipe DSL renders
non-empty messages from pipeline output", which doesn't depend on
any specific YAML, so the test now builds an inline blend recipe
with the same coverage. Passes.
Sweep: 82 passed, 2 failed (pre-existing module-impl bugs:
`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`).
The PR 1 carryover (`test_emitted_at_raises_on_ambiguous_per_camera_vqa`)
is now passing — the merge brought in PR 1's tightened `_select_one`
ambiguity check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The executor previously claimed it would "optionally hand off" to
datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it
already runs phases inline in every code path, and HF Jobs (see
``examples/annotation/run_hf_job.py``) is the actual distribution
strategy. Stop pretending we have an executor selector.
* `executor.py`: drop `select_executor_class`, the "kind" log line, and
the references to LocalPipelineExecutor / SlurmPipelineExecutor.
Module docstring now says distribution is delegated to HF Jobs.
* `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`,
`slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only
`episode_parallelism`. While here, prune the longer "why" docstrings
on every field down to the load-bearing bits — full story moves to
`docs/source/annotation_pipeline.mdx`.
* `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the
`[annotations]` extra; the dep was only there for the (never used)
cluster executors. Comment block notes the new HF-Jobs delegation.
* `reader.py`, `lerobot_annotate.py`: drop their own datatrove /
flavor-namespace mentions.
* `docs/source/annotation_pipeline.mdx`:
- remove the flavor-namespace / sidecar paragraph (out of scope —
"multiple revisions = multiple copies" is dataset-level policy);
- remove the "writer drops the legacy `subtask_index` column" note
(already covered by PR 1's intentional-break call-out);
- remove the chat-template + `apply_chat_template(messages, tools=...)`
line (covered by Tools doc);
- replace the "executor picks Local vs Slurm" paragraph with
`--executor.episode_parallelism` and a pointer to HF Jobs;
- rewrite the style→recipe section to talk about "recipes" generically
instead of pinning a specific YAML;
- add a "Running on Hugging Face Jobs" section pointing at
`examples/annotation/run_hf_job.py`;
- add a "Running locally" example matching the CLI's docstring
(`uv run lerobot-annotate --root=... --vlm.model_id=...`);
- extend the paper-inspirations list with Pi0.7 and Steerable VLA
Policies (Zhao 2025) for Module 3.
Tests: same 3 pre-existing failures as before this commit (2 module
assertions still in flight; 1 carryover from PR 1). 41/44 pass.
Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reword the two callouts in `tools.mdx` to describe the runtime layer
in present tense ("not part of the catalog layer shipped today",
"those modules don't yet exist in the tree") instead of pointing at a
specific follow-up PR. Keeps the doc honest about what works now
without coupling it to a particular release order.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* **Float tolerance in `emitted_at` for persistent styles.** The
``_timestamp(row) == t`` exact-equality check silently missed any
caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``)
even though the parquet timestamp would only differ by ULPs. Added
``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance``
instead, with a docstring explaining why exact equality wasn't
enough and why 0.1 s is safe at typical 30–100 Hz control rates.
Test asserts the new behavior at half-window (matches) and
double-window (no match) using the constant so it stays in sync.
* **`MessageTurn.stream` is required at construction.** It was typed
``MessageStream | None = None`` so YAML could omit ``stream:`` and
pass the dataclass invariant — but ``_validate_rendered`` rejected
``None`` streams later, surfacing the error at the first sample
instead of at recipe load. Now ``__post_init__`` raises
``ValueError`` if ``stream`` is ``None``, with the list of valid
streams in the message. The redundant late-stage check in
``_validate_rendered`` is replaced with a one-line comment that
cites the upstream invariant. Test pins the new construction-time
rejection.
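
The first bullet's tolerance check can be illustrated in a few lines. The row schema and this simplified `emitted_at` are assumptions for the sketch; only the constant's name and value come from the text above:

```python
EMITTED_AT_TOLERANCE_S = 0.1  # value taken from the first bullet above


def emitted_at(rows, t, tolerance=EMITTED_AT_TOLERANCE_S):
    """Return rows whose timestamp lies within `tolerance` seconds of t."""
    return [row for row in rows if abs(row["timestamp"] - t) <= tolerance]


rows = [{"timestamp": 0.3, "content": "plan v1"}]
t = 3 * 0.1  # arithmetic timestamp: not bit-identical to the literal 0.3
exact_hits = [row for row in rows if row["timestamp"] == t]
tolerant_hits = emitted_at(rows, t)
```

Exact equality misses the row because `3 * 0.1` and `0.3` differ in the last bit, while the tolerant check still finds it.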
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* **#2 — dedupe `_PLACEHOLDER_RE`.** The same regex was compiled in
`recipe.py` and `language_render.py`. Promote to module-level
`PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares
template syntax) and import from `language_render.py`.
* **#3 — centralize language column names.** `io_utils.py` had
hardcoded `{"language_persistent", "language_events"}` literals at
two sites. Replace with `LANGUAGE_COLUMNS` import so a future column
rename can't silently desync.
* **#4 — defensive collate preserved-keys.** `lerobot_collate_fn`
silently filtered language fields from samples that didn't have
them, which would hand downstream consumers a preserved list
shorter than the tensor batch. Now: if any sample carries a key,
every sample in the batch must carry it; otherwise raise a
`ValueError` so the upstream rendering bug surfaces at the boundary.
* **#5 — `_scalar` rejects non-singleton lists.** Previously a zero-
or multi-element list fell through and triggered confusing
`float([])` errors downstream. Now raises `ValueError` with the
actual length.
* **#6 — refactor `_extract_complementary_data`.** Replace 11 lines
of `key = {... if ... else {}}` plus an 11-line splat dict with a
single `_COMPLEMENTARY_KEYS` tuple iterated once.
* **#7 — document `EXTENDED_STYLES`.** Was an empty `set()` with no
comment. Add a docstring explaining it's an intentional extension
point: downstream modules append project-local styles before
`column_for_style` is called.
* **#9 — `tools.mdx` notes the runtime layer is future work.** The
page referenced `src/lerobot/tools/`, `registry.py`, and
`get_tools(meta)` — none exist in this PR. Added a callout at the
start of "How to add your own tool" plus a note on the
implementations paragraph.
* **#10 — tests for YAML round-trip, malformed rows, blend
validation.** `test_recipe.py` grew from 1 case to 12 covering:
blend-or-messages exclusivity, target-turn requirement, blend
emptiness, weight presence/positivity, nested-blend rejection,
`from_dict` with nested blends, `from_yaml` / `load_recipe`
agreement, top-level non-mapping rejection. Added a malformed-row
test for `_normalize_rows` that asserts non-dict entries raise
`TypeError`.
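The all-or-none rule from #4 can be sketched as below. This is an illustration under assumptions, not the actual `lerobot_collate_fn`: `collect_preserved` is a hypothetical helper, and only the two column names are taken from the text above.

```python
from __future__ import annotations

from typing import Any

# Column names taken from the message above; the helper itself is hypothetical.
LANGUAGE_COLUMNS = ("language_persistent", "language_events")


def collect_preserved(samples: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Gather language fields, requiring each key on all samples or on none."""
    preserved: dict[str, list[Any]] = {}
    for key in LANGUAGE_COLUMNS:
        present = [key in sample for sample in samples]
        if not any(present):
            continue  # no sample carries this key: nothing to preserve
        if not all(present):
            missing = [i for i, has in enumerate(present) if not has]
            raise ValueError(f"{key!r} missing from samples {missing}")
        preserved[key] = [sample[key] for sample in samples]
    return preserved
```

Each preserved list therefore has exactly one entry per sample, matching the tensor batch size.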
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime CLI was deliberately scoped to dry-run only: it
hard-coded ``robot_executor=None`` and printed a "real-robot
integration is a follow-up" warning even when ``--no_robot`` was
omitted. The runtime *engine* was already structured for real-robot
operation (separate ``LowLevelForward`` chunk-rate generation +
``DispatchAction`` ctrl-rate dispatch with a ``robot_executor``
hook); only the wiring was missing.
Add the wiring:
* ``_load_policy_and_preprocessor`` now also returns the
postprocessor (action denormaliser).
* ``--robot.type`` / ``--robot.port`` / ``--robot.id`` /
``--robot.cameras`` (JSON) build a ``Robot`` via
``make_robot_from_config`` and connect it.
* ``_build_robot_observation_provider`` reads
``robot.get_observation()`` each call, drops the language
columns (runtime drives messages itself), and runs the policy's
preprocessor (rename → batch → device → normalise).
* ``_build_robot_action_executor`` postprocesses the policy's
action tensor (denormalise), converts to the ``{joint: value}``
dict via ``make_robot_action(action, ds_meta.features)``, and
calls ``robot.send_action(...)``. Optional ``--max_action_norm``
safety clip rejects ticks whose action L2 norm exceeds the
threshold (kill-switch when bringing up a new robot).
* ``_run_autonomous`` runs ``runtime.run()`` in a background
thread (the policy must keep generating chunks at chunk_hz and
dispatching at ctrl_hz regardless of stdin) and handles user
interjections / VQA queries from the foreground stdin loop.
Confirmation prompt before start (skip with ``--auto_start``);
Ctrl+C stops the thread and disconnects the robot cleanly.
* Autonomous mode requires ``--dataset.repo_id`` for action stats
/ feature shapes — pass the same dataset the policy was trained
on. The bootstrap path that pulls canonical task / plan / memory
runs in both REPL and autonomous modes so the model's first
prompt matches training distribution.
Dry-run REPL behaviour is unchanged when ``--robot.type`` is not
passed.
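The ``--max_action_norm`` check can be sketched as follows, assuming the action is available as a flat sequence of floats after denormalisation. ``within_norm_limit`` is a hypothetical name; the real executor works on tensors.

```python
import math
from typing import Optional, Sequence


def within_norm_limit(action: Sequence[float], max_action_norm: Optional[float]) -> bool:
    """Return True when the tick may be dispatched to the robot."""
    if max_action_norm is None:  # flag not passed: no check
        return True
    l2 = math.sqrt(sum(a * a for a in action))
    # Reject only when the norm exceeds the threshold.
    return l2 <= max_action_norm
```

A tick that fails the check is skipped rather than clipped, which is what makes the flag usable as a kill-switch.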
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* **`meta.tools` actually reads `info.json["tools"]`.** `DatasetInfo`
had no `tools` field, so `from_dict` dropped the key (it warned
about unknown fields, then discarded them) and the property
always returned `DEFAULT_TOOLS`. Added `tools: list[dict] | None`
to the dataclass; `to_dict()` drops it when unset so existing
datasets keep a clean `info.json`. Fixed the accessor to read
`self.info.tools` (the previous `.get(...)` would have raised
AttributeError on the dataclass anyway). Added regression tests:
fallback when absent, round-trip from disk, and round-trip
through `DatasetInfo.from_dict` / `to_dict`.
* **`motion` is not view-dependent — fix the docs.** The mdx claimed
rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES
= {"vqa", "trace"}` and the validator agrees: motion primitives are
joint/Cartesian-frame, not pixel-space. Updated both call-out
paragraphs in `language_and_recipes.mdx`.
* **Conditional `collate_fn` swap.** Added `meta.has_language_columns`
and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it,
so non-language datasets keep PyTorch's `default_collate`. Also
added a pass-through test in `test_collate.py` that asserts on a
plain tensor batch the custom collate matches `default_collate`
key-for-key, plus a test for the `None`-sample drop path.
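A minimal sketch of the `tools` round-trip, assuming a trimmed stand-in for `DatasetInfo`. `Info`, `resolve_tools`, and the contents of `DEFAULT_TOOLS` are placeholders; treating only `None` as "unset" is also an assumption.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass

DEFAULT_TOOLS: list[dict] = [{"name": "say"}]  # placeholder content


@dataclass
class Info:  # hypothetical, trimmed stand-in for DatasetInfo
    codebase_version: str
    tools: list[dict] | None = None

    def to_dict(self) -> dict:
        data = asdict(self)
        if data["tools"] is None:
            del data["tools"]  # keep existing info.json files unchanged
        return data


def resolve_tools(info: Info) -> list[dict]:
    # Attribute access, since a dataclass has no .get().
    return info.tools if info.tools is not None else DEFAULT_TOOLS
```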
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`lerobot.processor` re-exported `RenderMessagesStep` at the package
level, so importing anything from `lerobot.processor` pulled in
`lerobot.datasets.language` → `lerobot.datasets/__init__.py` →
`require_package("datasets")`, which fails in the Tier 1 base install
that intentionally omits the `[dataset]` extra. The chain bricked
collection for unrelated suites (`tests/policies/pi0_pi05/...`,
`tests/envs/...`, etc.).
* Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The
only consumer (the test) already imports from the submodule.
Document the deliberate omission in the module docstring.
* Add `pytest.importorskip("datasets", ...)` (and `pandas` where
needed) at the top of the four PR-added tests that exercise the
language stack:
- tests/datasets/test_language.py
- tests/datasets/test_language_render.py
- tests/processor/test_render_messages_processor.py
- tests/utils/test_collate.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* `ruff format` on CI (newer version) wants the short `camera=None`
ValueError on a single line.
* `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0`
pin (and picked up upstream `s390x` marker fixes for cuda packages).
CI runs `uv sync --locked` which rejected the divergence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_select_one` previously skipped its ambiguity check whenever any of
`role`/`tool_name`/`camera` was set, on the assumption that the caller
had already pinned down a unique row. That left a real ambiguity hole
for VQA: with two cameras emitting `(vqa, assistant)` at the same
frame, `emitted_at(..., role="assistant")` silently picked the first
sorted row instead of telling the recipe to add `camera=...`. The
existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test
already encoded the desired behavior.
Tighten the check: any time `len(rows) > 1` we now raise with the
selectors echoed back, so users see exactly which fields they passed
and that more is needed to disambiguate.
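The tightened check can be sketched as below. This is a simplified illustration: `select_one` and the row layout are hypothetical, and the real `_select_one` also handles sorting and timestamps.

```python
from __future__ import annotations

from typing import Any


def select_one(rows: list[dict[str, Any]], **selectors: Any) -> dict[str, Any] | None:
    """Return the single matching row, None if there is none, else raise."""
    given = {k: v for k, v in selectors.items() if v is not None}
    matches = [r for r in rows if all(r.get(k) == v for k, v in given.items())]
    if len(matches) > 1:
        # Echo the selectors so the user sees what was passed.
        raise ValueError(
            f"{len(matches)} rows match selectors {given}; "
            "add another selector (e.g. camera=...) to disambiguate"
        )
    return matches[0] if matches else None
```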
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`;
only `emitted_at` actually consults events. The dispatcher in
`_resolve_spec` now passes events conditionally.
* Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a
single `_row_sort_key` and drop the `sort_key` parameter from
`_select_one`. Event rows lack `timestamp` (it is implicit in the
frame) and now default to `0.0` for sort purposes — the
`(style, role)` tiebreaker is unchanged.
* Inline `_select_latest` into `active_at` (its only caller).
* Collapse `emitted_at`'s dual-branch into one `_select_one` call.
* Tighten `_validate_persistent_resolver` to a single
`column_for_style(style) != LANGUAGE_PERSISTENT` check.
* Parameterize `test_per_camera_blend_renders_both_views` over the two
cameras and factor the sub-recipe builder into `_vqa_subrecipe` so
the test no longer hand-rolls two near-identical recipe blocks.
Net -98 LOC; behavior, public resolver names, and test expectations
unchanged.
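A sketch of the unified sort key, assuming rows are plain dicts with optional `timestamp`, `style`, and `role` fields. The name `row_sort_key` and the dict layout are illustrative.

```python
def row_sort_key(row: dict) -> tuple:
    """Sort by timestamp first, then by the (style, role) tiebreaker."""
    # Event rows carry no timestamp (the frame implies it), so default to 0.0.
    timestamp = row.get("timestamp") or 0.0
    return (timestamp, row.get("style", ""), row.get("role", ""))
```

Because event rows within one frame all get `0.0`, their relative order is decided by `(style, role)` alone.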
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs combined to make the brand-new ``_tool3`` dataset
unloadable:
1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated
dataset folder but never creates a codebase-version tag, so
``api/datasets/<repo>/refs`` returns ``"tags": []``. Then
``LeRobotDatasetMetadata`` → ``get_safe_version`` →
``get_repo_versions`` returns empty and the loader raises
``RevisionNotFoundError``.
2. ``RevisionNotFoundError`` itself was unconstructible: its
``HfHubHTTPError.__init__`` indexes ``response.headers``
unconditionally on current ``huggingface_hub`` versions, so
constructing it without a real ``Response`` blew up with
``AttributeError: 'NoneType' object has no attribute 'headers'``,
masking the real "no tag" message.
Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and
call ``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset',
exist_ok=True)`` so the dataset is loadable straight from the Hub on
the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree
``CODEBASE_VERSION`` if info.json is missing/malformed; on tag
creation failure, prints the manual one-liner the user needs.
Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which
inherits ``HfHubHTTPError``) for what is really a config issue, not an
HTTP failure. Raise a plain ``RuntimeError`` with the same message, so
the caller actually sees what's wrong instead of an upstream
attribute error.
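The version lookup in the first fix can be sketched as follows. ``tag_for_upload`` is a hypothetical helper and the ``CODEBASE_VERSION`` value is a placeholder; the Hub call itself is left out so the sketch runs offline.

```python
import json
from pathlib import Path

CODEBASE_VERSION = "v3.0"  # placeholder for the in-tree constant


def tag_for_upload(dataset_root: Path) -> str:
    """Pick the version tag to create on the Hub after an upload."""
    try:
        info = json.loads((dataset_root / "meta" / "info.json").read_text())
        version = info["codebase_version"]
    except (OSError, ValueError, KeyError, TypeError):
        return CODEBASE_VERSION  # missing or malformed info.json
    return version if isinstance(version, str) and version else CODEBASE_VERSION
```

The returned string is what would be passed as ``tag=`` to ``HfApi.create_tag``.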
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``RevisionNotFoundError`` inherits from
``huggingface_hub.HfHubHTTPError`` which made ``response`` a required
keyword-only argument on recent versions. Constructing it with just a
message string blew up with
``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only
argument: 'response'`` instead of surfacing the actual problem (the
dataset/checkpoint repo doesn't exist on the Hub yet).
Pass ``response=None`` explicitly. Fall back to the bare-message form
for older ``huggingface_hub`` versions that don't accept the kwarg.
Also clarify the message to call out the most common cause: typing a
hub repo id that hasn't been pushed yet (instead of just "needs a
version tag").
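The try-then-fall-back construction can be sketched with stand-in exception classes. ``OldStyleError``, ``NewStyleError``, and ``build_error`` are hypothetical; they only mimic the two constructor signatures described above and do not import ``huggingface_hub``.

```python
class OldStyleError(Exception):
    """Stand-in for an older exception class taking only a message."""


class NewStyleError(Exception):
    """Stand-in for a class with a required keyword-only `response`."""

    def __init__(self, message: str, *, response) -> None:
        super().__init__(message)
        self.response = response


def build_error(exc_cls, message: str) -> Exception:
    try:
        # Newer signature: `response` must be passed explicitly.
        return exc_cls(message, response=None)
    except TypeError:
        # Older signature without the `response` kwarg.
        return exc_cls(message)
```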
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and
``executor.episode_parallelism=32``. With 2 cameras per dataset that
produced ~12× the original VQA call volume, all submitted concurrently.
Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's
KV cache pegged at 94% with 800+ in-flight requests, and the
multimodal cache corrupted with ``AssertionError: Expected a cached
item for mm_hash='...'`` (a known vLLM bug under image-heavy
concurrency). Module 1 and 2 ran fine; Module 3 was the bottleneck.
Pull back the multipliers to land in a sustainable spot:
* module_3.K: 3 (kept) — three diverse questions per emission,
where the diversity actually helps the LM head.
* module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original
emission rate. Net VQA volume is now ~3× original (K alone) on
a single camera, ~6× across both cameras — manageable.
* module_2.max_interjections_per_episode: 9 → 6 — still 2× the
default, fewer than the prior 3× to keep total request volume
in check.
* vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the
multimodal request path so the mm_cache doesn't desync.
* executor.episode_parallelism: 32 → 16 — half the episodes
in flight at once, so peak vLLM load is ~half.
n_task_rephrasings stays at 30 (text-only, doesn't load the image
path) and vlm.temperature stays at 0.7. The diversity gains are
preserved; only the throughput knobs come down.
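The volume figures above follow from multiplying the three knobs. A small sketch, assuming the baseline is ``K=1`` at ``1.0`` Hz on one camera (the values before the previous bump); the function name is illustrative.

```python
def vqa_volume_multiplier(k: int, emission_hz: float, cameras: int,
                          base_k: int = 1, base_hz: float = 1.0) -> float:
    """VQA call volume relative to K=1 at 1.0 Hz on a single camera."""
    return (k / base_k) * (emission_hz / base_hz) * cameras


# Previous bump: K=3, 2.0 Hz, two cameras.
previous = vqa_volume_multiplier(k=3, emission_hz=2.0, cameras=2)
# This commit: K=3, 1.0 Hz, two cameras.
current = vqa_volume_multiplier(k=3, emission_hz=1.0, cameras=2)
```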
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following Pi0.7 §V (prompt expansion / diverse context conditioning),
push more atom variants per episode and higher VLM sampling
temperature so the training distribution has enough wording diversity
that the LM head is forced to use its parameters rather than memorise
specific (prompt, target) pairs.
Changes vs prior annotation pass:
* vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call
now produces diverse phrasings; same prompt yields different
completions across emissions.
* module_1.n_task_rephrasings: 10 → 30 — three times as many
``task_aug`` rows in language_persistent. ``${task}`` already
rotates through them deterministically per sample_idx (see
``_resolve_task`` in language_render.py).
* module_2.max_interjections_per_episode: 3 (default) → 9 — more
``user_interjection_response`` training samples + more plan
refresh events.
* module_3.K: 1 → 3 — three VQA pairs per emission tick instead of
one. Combined with the hz bump below, ~6× more VQA samples.
* module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission
rate within each subtask span.
Pushes to a new hub repo (``_tool3``) so the working ``_tool2``
dataset stays intact for comparison. ``${task}`` already wired to
rotate through ``task_aug`` rows, so no renderer change needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Memorised models can collapse to dominant-mode outputs (the
JSON-token salad ``":":":":...`` from VQA training) when the prompt
drifts even slightly from training distribution. Without a guard,
that gibberish lands in ``current_subtask`` / ``current_plan`` /
``current_memory``, which feeds the next tick's prompt and cascades
into worse outputs. The user observed exactly this: a clean run
followed by a tick that wrote ``" " "`` into plan and memory, then
slow recovery several ticks later.
Add ``_looks_like_gibberish`` heuristic (alpha density, repeating
chars, JSON-prefix sniff) and apply it before mutating state in
``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``.
Bad generations are logged inline (``[info] subtask gen rejected
(gibberish): "":":":..."``) so the user can see what was dropped, but
the state stays at its last-known-good value (typically the dataset
bootstrap) instead of being polluted.
VQA path is intentionally exempt — its training targets *are*
JSON-shaped, so the heuristic would false-positive on them.
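A minimal sketch of such a heuristic, covering the three signals named above. The thresholds, the exact rules, and the choice to reject empty strings are assumptions, not the values used in ``_looks_like_gibberish``.

```python
def looks_like_gibberish(text: str, min_alpha: float = 0.5, max_run: int = 6) -> bool:
    stripped = text.strip()
    if not stripped:
        return True  # assumption: nothing usable counts as rejected
    # JSON-prefix sniff: high-level text targets are plain prose.
    if stripped[0] in '{["':
        return True
    # Alpha density: share of alphabetic characters among non-space ones.
    chars = [c for c in stripped if not c.isspace()]
    if sum(c.isalpha() for c in chars) / len(chars) < min_alpha:
        return True
    # Repeating characters: one long run of the same character.
    run = 1
    for prev, cur in zip(stripped, stripped[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_run:
            return True
    return False
```

A caller would check the generation first and leave the state key untouched when the function returns ``True``.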
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user-typed task and the dataset's canonical task differ in
wording (capitalisation, ``green box`` vs ``green bin``, etc.). With
``text_loss`` driven down to ~6e-6 across 78 epochs the model has
memorised the *exact* rendered training prompts: any wording drift
puts the prompt out of distribution and the model collapses to its
dominant training mode (VQA JSON output).
When ``--dataset.repo_id`` is set, automatically:
* read the canonical task string from the chosen episode (and use
it as ``--task`` when the user didn't pass one);
* pull the active ``plan`` / ``memory`` / ``subtask`` rows from the
persistent slice (latest row whose timestamp ≤ start frame's
timestamp — same semantics as the renderer's ``active_at``) and
seed them into the runtime state.
The first prompt the runtime builds at REPL start now mirrors what
the recipe rendered during training (task + active plan + active
memory + optional current subtask). The user can still override any
of these by typing.
Memorisation itself is upstream (training mix collapsed to too few
unique high-level targets); this commit only fixes the inference-side
prompt mismatch that was making the memorisation surface as gibberish.
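The seeding step can be sketched as below, assuming persistent rows are dicts with ``style``, ``timestamp``, and ``content`` fields. The field name ``content`` and both helper names are hypothetical; the "latest row at or before the start timestamp" rule is the one described above.

```python
from __future__ import annotations


def active_row(rows: list[dict], style: str, t: float) -> dict | None:
    """Latest row of `style` whose timestamp is <= t, or None."""
    candidates = [r for r in rows if r["style"] == style and r["timestamp"] <= t]
    return max(candidates, key=lambda r: r["timestamp"], default=None)


def bootstrap_state(rows: list[dict], start_timestamp: float) -> dict:
    """Seed runtime state from the rows active at the start frame."""
    state = {}
    for style in ("plan", "memory", "subtask"):
        row = active_row(rows, style, start_timestamp)
        if row is not None:
            state[f"current_{style}"] = row["content"]
    return state
```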
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The four high-level steps shared one generic
``_control_context_messages`` that jammed task + plan + memory +
completed_subtask into a single user message. The recipes in
``smolvla2_hirobot.yaml`` each have a *specific* multi-message layout
(``memory_update``: ``user(task) → assistant(prev memory) →
user(completed subtask)``; ``high_level_subtask``: ``user(task+plan+
memory) → user(current subtask)``; ``user_interjection_response``:
``user(task) → assistant(prev plan) → user(interjection)``). After
``apply_chat_template`` those layouts produce different prompts than
the runtime's flattened single-user-turn version, and the model fell
back to its dominant training mode (VQA JSON output) — generating
``":":":":":":...`` repetition.
Add four per-recipe prompt builders (``_msgs_for_subtask``,
``_msgs_for_memory``, ``_msgs_for_interjection``, ``_msgs_for_vqa``),
each mirroring its sub-recipe's exact message structure including
the ``if_present`` skips. Wire each high-level step to its matching
builder. Inference prompts now line up with what the model saw in
training, so generation should produce coherent text instead of
repeated tokens.
Generic ``_control_context_messages`` is kept (still used by tests
and the no-recipe fallback path).
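As an illustration of one builder, here is a sketch of the ``memory_update`` layout. Which turns are guarded by ``if_present`` and the exact content strings are assumptions; the real ``_msgs_for_memory`` mirrors the YAML recipe.

```python
from __future__ import annotations


def msgs_for_memory(task: str, prev_memory: str | None,
                    completed_subtask: str | None) -> list[dict]:
    """Build user(task) -> assistant(prev memory) -> user(completed subtask)."""
    msgs = [{"role": "user", "content": task}]
    if prev_memory:  # mirrors an `if_present` skip (assumed)
        msgs.append({"role": "assistant", "content": prev_memory})
    if completed_subtask:  # assumed to be skippable as well
        msgs.append({"role": "user", "content": completed_subtask})
    return msgs
```

The resulting list is what would be handed to ``apply_chat_template``, so the turn boundaries match the training prompt.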
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous rewrite drove generation through ``vlm.generate()`` (the
standard SmolVLM path), which ignores SmolVLA's custom ``embed_prefix``
that interleaves images + lang + state. Result: at inference the model
received a prompt format it had never been trained on and emitted
JSON-fragment gibberish (``" " " ,",","`` ``cube lift {"...``).
Revert to the cumulative-buffer AR loop driven through
``vlm_with_expert.forward`` — the *same* forward call ``_compute_text_loss``
makes during training (``inputs_embeds=[prefix_embs, None],
use_cache=False, fill_kv_cache=True``). With ``fill_kv_cache=True``,
every layer routes through ``forward_attn_layer``, which gracefully
skips ``None`` expert inputs (``if hidden_states is None or layer is
None: continue``); cross-attention layers — which would otherwise hard-
require a non-None expert input — are bypassed entirely.
Inference now sees the same prefix structure as training: images +
lang + state, with new tokens appended to the lang region. The text
distribution matches what the model was trained to produce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's image preprocessor sizes frames to whatever the action
expert was trained on, but SmolVLM's standard vision tower expects
its own default tile grid (e.g. 384/14 → 27×27 patches). The
mismatch surfaces deep in the post-vision reshape as
``RuntimeError: shape '[2, 34, 34, 768]' is invalid for input of
size 1843200`` — the model has 1200 patches but expects 34×34=1156.
Drop ``pixel_values`` from ``vlm.generate(...)`` so SmolVLM runs as
a text-only LM at REPL time. The high-level branches (subtask /
plan / memory) are dominated by their text context anyway, so this
is acceptable for dry-run inference. VQA loses its image grounding
— that will be marked as expected for the dry-run path until a
follow-up either re-processes images through SmolVLM's own
``ImageProcessor`` to match its tile grid, or gives
``vlm_with_expert`` a real AR text decode mode that handles state
and image embeddings the way training does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled AR loop in ``select_message`` was fighting the
underlying ``vlm_with_expert.forward`` design, which assumes the
"prefix-once + suffix-always-via-expert" pattern that ``denoise_step``
uses for action chunks. Cross-attn layers (every other layer with
``attention_mode='cross_attn'`` + ``self_attn_every_n_layers=2``)
hard-require an expert input on every call: passing
``inputs_embeds=[current_embs, None]`` crashed at
``expert_layer.input_layernorm(None)`` with ``'NoneType' object has
no attribute 'dtype'``. Earlier KV-cache attempts ran into the
matching ``[15, 139] vs [15, 1]`` shape mismatch because the cache
gets *overwritten*, not appended, on each ``fill_kv_cache=True`` call
— there's just no AR-text-decode mode in this forward.
Stop fighting it: drive AR text generation through the underlying
SmolVLM via ``vlm.generate(input_ids=..., attention_mask=...,
pixel_values=...)``. KV caching, sampling/greedy, EOS handling all
come from HF's standard implementation. Trade-off: ``state`` drops
out of the prefix at inference (no slot for it on the standard
SmolVLM path), so high-level generations may drift from training
distribution slightly. That's acceptable for the dry-run REPL — the
high-level branches (subtask / plan / memory / vqa) are mostly
vision+language conditioned anyway, and the action expert (where
state actually matters) goes through the unchanged ``select_action``
path.
Image features the runtime merged in (``observation.images.*``) are
stacked into the ``[B, num_images, C, H, W]`` shape SmolVLM expects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's ``vlm_with_expert.forward`` doesn't actually support
incremental KV cache growth — its only ``fill_kv_cache=True`` mode
*overwrites* the cache with the latest call's key/value states, and
its only ``fill_kv_cache=False`` mode concatenates ``cache + new``
into a local ``key_states`` for one matmul without ever updating the
cache itself. The original ``select_message`` decode loop tried to
use ``fill_kv_cache=True`` per step, which clobbered the cache to
1 token after the first decode and threw
``Expected size for first two dimensions of batch2 tensor to be:
[15, 139] but got: [15, 1]`` — the attention mask still expected
139 keys but the cached + new key_states only had 1.
Match the pattern ``denoise_step`` already uses successfully:
maintain a cumulative ``(embs, pad, att)`` buffer that starts as the
prefix and grows by one bool/embedding row per step. Each step
forwards the *full* sequence with ``use_cache=False,
fill_kv_cache=False, past_key_values=None`` so the matmul shapes
always line up. Generated-token rows are tagged ``pad=1, att=1``
which makes them fully causal among themselves while still able to
attend back to the entire prefix (per ``make_att_2d_masks``
semantics: a token can attend to any earlier token whose cumulative
``att`` count is ≤ its own).
Image encoding is still done once via the initial ``embed_prefix``
call — the expensive part doesn't repeat. The remaining cost is
O(n²) text-only transformer forwards, which is fine for the dry-run
REPL's 50–100 token responses.
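The mask rule can be checked with a pure-Python sketch for a single sequence. It assumes an all-zero ``att`` prefix for illustration (the real prefix flags may differ) and uses lists where the model uses tensors; both function names are hypothetical.

```python
from itertools import accumulate


def att_2d_mask(pad: list, att: list) -> list:
    """mask[i][j] is True when query token i may attend to key token j."""
    cum = list(accumulate(att))  # cumulative `att` count per token
    n = len(att)
    return [
        [bool(pad[i] and pad[j] and cum[j] <= cum[i]) for j in range(n)]
        for i in range(n)
    ]


def append_generated(pad: list, att: list) -> None:
    # Each decoded token adds one row tagged pad=1, att=1.
    pad.append(1)
    att.append(1)
```

With a three-token prefix and two generated tokens, the first generated token sees the prefix and itself, and the second sees everything before it.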
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's ``eager_attention_forward`` does
``masked = torch.where(attention_mask[:, None, :, :], ...)``, which
requires a 3D ``[B, query_len, key_len]`` bool tensor so the
broadcast to 4D works. ``select_message``'s prefix forward got this
right (passes ``prefix_2d`` from ``make_att_2d_masks``), but the
KV-cache decoding loop built ``new_attn = torch.ones((bsize,
cur_pos + 1))`` — 2D — and the very first decode step blew up with
``IndexError: too many indices for tensor of dimension 2``.
During KV-cache decoding ``query_len = 1`` and
``key_len = cur_pos + 1`` (prefix + every token already generated),
so the right shape is ``[B, 1, cur_pos + 1]``. Match the layout
SmolVLA's working ``denoise_step`` uses for the equivalent
``prefix_pad_2d_masks`` build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues that combined to make the REPL unusable:
1. ``BatchEncoding.attention_mask`` is a ``Long`` tensor, but SmolVLA's
``eager_attention_forward`` does
``torch.where(attention_mask[..., None, :, :], ...)`` which
requires a *bool* condition. Every forward raised ``where expected
condition to be a boolean tensor, but got a tensor with dtype Long``
and the diagnostic surfaced it cleanly in the REPL — but generation
produced nothing useful. Cast to ``bool`` in ``_build_text_batch``
so the prefix forward goes through.
2. The interactive REPL used ``rich.live.Live`` panels stacked on top
of ``logging.basicConfig(level=DEBUG)`` HTTP request lines from
``httpcore`` / ``httpx`` / ``huggingface_hub``. The two rendering
loops fought each other in the user's terminal and the output was
illegible: hundreds of debug lines interleaved with re-rendered
panels.
Replace ``Live`` with a simple block redraw — clear screen, print
the state block, print any robot log lines, then a single ``> ``
prompt. State changes are visible above the prompt, the way Claude
Code's REPL renders. No flicker, no re-render races.
``_silence_noisy_loggers`` drops the chatty third-party HTTP /
download / model-init loggers to WARNING. ``-v`` still enables
DEBUG on the lerobot loggers; if the user needs the HTTP traces,
they can flip those individually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``tokenizer.apply_chat_template(..., tokenize=True, return_tensors='pt')``
on newer transformers returns a ``BatchEncoding`` (dict-like) rather
than a raw ``Tensor`` — particularly when the underlying call routes
through a processor. ``_build_text_batch`` only handled the ``Tensor``
and ``list`` shapes, so the encoding object reached SmolVLA's
``embed_language_tokens`` and ``F.embedding`` blew up with
``argument 'indices' must be Tensor, not BatchEncoding`` on every
high-level forward.
Normalise the return:
* ``BatchEncoding`` / ``dict`` → take ``input_ids`` (and the encoder's
``attention_mask`` when present, since ``pad_token_id`` can be
``None`` for SmolVLM and the fall-back ``ids != pad_token_id``
breaks then),
* ``list[int]`` / ``list[list[int]]`` → wrap in a long tensor,
* ``Tensor`` → keep as-is.
After unwrapping, ensure shape ``(1, seq)`` and that ``attention_mask``
is a tensor on the same device as ``ids``.
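The three cases can be sketched without torch, using nested Python lists in place of tensors. This is only an illustration of the branching: ``normalise_encoding`` is a hypothetical name, and device placement is omitted.

```python
from collections.abc import Mapping


def _as_2d(value) -> list:
    """Wrap a flat sequence of ids so the result has shape (1, seq)."""
    rows = list(value)
    if rows and not isinstance(rows[0], (list, tuple)):
        rows = [rows]
    return [list(r) for r in rows]


def normalise_encoding(encoded):
    """Return (input_ids, attention_mask or None) as nested lists."""
    mask = None
    if isinstance(encoded, Mapping):  # dict-like encoding
        mask = encoded.get("attention_mask")
        encoded = encoded["input_ids"]
    ids = _as_2d(encoded)
    return ids, (_as_2d(mask) if mask is not None else None)
```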
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two failure modes were combining to make the runtime "look dead":
1. ``_build_text_batch`` produced lang tokens via
``apply_chat_template(return_tensors='pt')`` on CPU, but the policy
sits on the configured device (mps / cuda). The first prefix-embed
inside ``select_message`` then raised a device-mismatch on every
call. The bare ``except Exception`` in ``_generate_with_policy``
swallowed it at debug level — no logs, no chat output, no visible
sign anything had run.
2. Even when generation succeeded but returned an empty string
(greedy EOS, unhappy chat template, etc.), the high-level steps
silently no-op'd, so users saw nothing.
Move tokens to ``policy.config.device`` in ``_build_text_batch`` so
the prefix forward succeeds in the common case. Bump the swallowing
log level to ``warning`` (with optional traceback under ``-v``), and
when ``state`` is given route the same diagnostic into the REPL log
via ``push_log`` so the user sees ``[warn] subtask gen failed: ...``
inline. Also push an ``[info] ... produced no text this tick`` line
when generation runs but yields nothing, so empty completions are
distinguishable from "step never ran". Apply the same surface to
``LowLevelForward.select_action`` failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LowLevelForward was handing the observation provider's output straight
to ``policy.select_action``, but SmolVLA's ``_get_action_chunk``
indexes ``batch[OBS_LANGUAGE_TOKENS]`` and crashes with ``KeyError:
'observation.language.tokens'`` when the key isn't there. Our provider
deliberately strips the dataset's language columns (the runtime drives
messages itself), so nothing else was producing those tokens — the
chunk path crashed on the very first tick after task was set.
Build a low-level prompt from current runtime state inline (task /
plan / memory as the user turn, current subtask appended as a
continuation assistant turn when known), tokenize it with the same
helper the high-level steps use, and merge ``lang_tokens`` /
``lang_masks`` into the observation before the call. Skip the step
when no task is set yet, and swallow ``select_action`` exceptions at
debug level so a missing observation feature doesn't kill the REPL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``_load_hf_dataset`` was building the strict cast schema only from
``meta/info.json["features"]``. Datasets annotated by
``lerobot-annotate`` but still tagged at the older codebase version
(no ``language_persistent`` / ``language_events`` entry in
``info.json``) carry both columns in the parquet itself but not in the
features dict, so ``Dataset.from_parquet`` blew up with
``CastError: column names don't match`` when trying to project a
9-column parquet onto a 7-column schema.
Probe one parquet shard's actual schema; if either language column is
present in the parquet but missing from ``features``, graft it on
using PR 1's ``language_persistent_column_feature`` /
``language_events_column_feature`` helpers. No-op when neither column
is present (fully backwards-compatible with v3.0 datasets), no-op when
both are already registered (fully forwards-compatible with future
v3.1 ``info.json`` writes).
This unblocks dry-run inference on PR 2-annotated datasets that
weren't re-tagged to v3.1 — including the ones in the field today.
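The grafting logic reduces to a small dictionary patch. In this sketch ``feature_for`` is a hypothetical callable standing in for the two column-feature helpers, and features are plain dicts.

```python
LANGUAGE_COLUMNS = ("language_persistent", "language_events")


def graft_language_features(features: dict, parquet_columns: list, feature_for) -> dict:
    """Return features extended with language columns found only in the parquet."""
    patched = dict(features)  # leave the caller's dict untouched
    for name in LANGUAGE_COLUMNS:
        if name in parquet_columns and name not in patched:
            patched[name] = feature_for(name)
    return patched
```

Both no-op cases fall out of the single condition: nothing is added when the column is absent from the parquet or already registered.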
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``PolicyProcessorPipeline.from_pretrained`` reconstructs each saved
step by passing the persisted JSON config back to ``__init__``, but
``RenderMessagesStep.recipe`` (a ``TrainingRecipe``) doesn't survive
the JSON round-trip — the saved entry is ``{}`` and the reconstructor
crashes with ``missing 1 required argument: 'recipe'``.
Bypass the round-trip in the runtime CLI by passing
``pretrained_path=None`` to ``make_pre_post_processors``. That re-runs
``make_smolvla2_pre_post_processors``, which reloads the recipe YAML
referenced by ``cfg.recipe_path`` and wires it back into the step
correctly. ``NormalizerProcessorStep`` still gets stats from
``ds_meta.stats`` so normalization matches training.
The proper fix is to make ``RenderMessagesStep`` serializable (e.g. by
persisting the recipe path / contents); this commit keeps it scoped to
the runtime path so dry-run testing isn't blocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous bound `>=0.1.0,<1.0.0` matched zero published versions —
pocket-tts went straight to 1.0.0 on PyPI, with 0.x never released.
That made `uv sync --extra tools` (and any sync that pulls the `dev` /
`all` superset) fail with "requirements are unsatisfiable" on every
Python version uv tried, including 3.12.
Bump to `>=1.0.0,<3.0.0` so 1.x and 2.x are reachable. SayTool only
touches `TTSModel.load_model()`, `get_state_for_audio_prompt`,
`generate_audio`, and `sample_rate` — small enough surface that 1.x
and 2.x should both work; tighten if a real API break shows up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime CLI's loader was broken — it imported a `make_policy_from_path`
that doesn't exist in `lerobot.policies.factory` — and the high-level text
steps generated plan / subtask / memory / VQA from a text-only batch with
no images or state, so dry-runs drifted from the training distribution.
Switch to the standard `PreTrainedConfig.from_pretrained` +
`make_policy(cfg, ds_meta=...)` flow so `--policy.path` accepts both local
directories and Hub repo ids, and add a `--dataset.repo_id` path that walks
a chosen episode and feeds preprocessed observations into every forward
pass — including the four high-level steps (`HighLevelSubtaskFwd`,
`MemoryUpdateFwd`, `UserInterjectionFwd`, `AskVQAFwd`). Frames are routed
through the saved preprocessor pipeline with `language_persistent` /
`language_events` stripped so the recipe-render step stays a no-op (the
runtime supplies its own messages from current state).
Also wires the rich-based two-zone REPL layout (`ui.py`) that the script
was already importing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep annotated language columns through collation, render batched recipe samples, and make SmolVLA2 text loss robust enough for distributed training on the steerable dataset.
Co-authored-by: Cursor <cursoragent@cursor.com>
Ensure annotated datasets advertise language columns in meta/info.json so non-streaming dataset loads cast against the rewritten parquet schema.
Co-authored-by: Cursor <cursoragent@cursor.com>
Closes the loop on PR 3: SmolVLA2 can now be queried interactively at
inference, dispatching the same five sub-recipe shapes it was trained
on (action chunks, subtask gen, memory updates, plan/speech on
interjection, VQA on questions).
Modeling fixes + additions
--------------------------
- ``_compute_text_loss``: standard next-token CE shift was missing
(logits at position t were CE'd against the label at t — identity-
mapped, learning nothing). Adds ``logits[:, :-1]`` /
``labels[:, 1:]`` shift to match HuggingFace ``LlamaForCausalLM``.
- New ``select_message`` on ``SmolVLA2Policy``: AR text generation
with KV caching, mirroring SmolVLA's ``select_action`` pattern.
Single prefix forward fills the cache, then per-token forwards
reuse it. Greedy + top-p nucleus sampling. Returns the decoded
string with the prompt stripped.
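The shift fix can be illustrated without any framework. This is a minimal sketch, assuming plain Python lists stand in for tensors; ``shifted_cross_entropy`` and ``log_softmax`` are hypothetical helpers, not the actual ``_compute_text_loss`` code:

```python
import math

IGNORE_INDEX = -100  # same sentinel the commit uses for masked labels


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def shifted_cross_entropy(logits, labels):
    """Mean CE where the logits at position t are scored against labels[t + 1].

    logits: list of T rows, each a list of V floats (one sequence).
    labels: list of T ints; IGNORE_INDEX entries are skipped.
    """
    total, count = 0.0, 0
    # Equivalent of pairing logits[:-1] with labels[1:].
    for row, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        total += -log_softmax(row)[target]
        count += 1
    return total / count if count else 0.0
```

Without the shift, each row would be scored against the token it was already conditioned on, which is the identity mapping the commit describes.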
Runtime package — ``src/lerobot/policies/smolvla2/inference/``
-------------------------------------------------------------
- ``triggers.py`` — ``Trigger`` Protocol + ``HzTrigger`` /
``EventTrigger`` + ``TickClock``. The whole runtime ticks at
``max_rate_hz=50`` and each step gates itself off its own
cadence.
- ``runtime_state.py`` — runtime state dict factory plus tiny
helpers (``take_event``, ``set_if_changed``, ``push_log``).
Stable keys are documented at the top of the module.
- ``steps.py`` — :class:`InferenceStep` base + concrete steps:
``LowLevelForward`` / ``DispatchAction`` (action path),
``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` /
``UserInterjectionFwd`` / ``AskVQAFwd`` (text paths),
``DispatchToolCalls`` (tool registry → ``Tool.call``). Each
text step builds a chat-template prompt from current
``RuntimeState`` (task / plan / memory / subtask) matching
what ``smolvla2_hirobot.yaml`` renders during training.
Includes a tiny ``<say>...</say>`` parser for the
``user_interjection_response`` branch's combined plan + speech
output.
- ``runtime.py`` — :class:`SmolVLA2Runtime` composes the pipeline,
drives ticks via ``TickClock``, polls a user-supplied
``event_collector`` per tick, and prints state-change log lines.
- ``repl.py`` — :class:`StdinReader` non-blocking line reader
with simple intent classification: ``stop`` / ``quit`` /
``exit`` → terminate; ``?`` suffix → ``user_vqa_query`` event;
first line → set task; other lines → ``user_interjection``.
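The intent rules listed for ``repl.py`` can be sketched as a pure function. This is an illustration only: ``classify_line`` and the ``noop`` / ``terminate`` / ``set_task`` labels are hypothetical names, and the precedence between the ``?`` rule and the first-line rule is an assumption, since the commit text does not spell it out:

```python
def classify_line(line, task_is_set):
    """Map one stdin line to an (event, payload) pair.

    Assumed order: stop words, then a trailing '?', then the first line
    sets the task, then everything else is an interjection.
    """
    text = line.strip()
    if not text:
        return ("noop", None)
    if text.lower() in {"stop", "quit", "exit"}:
        return ("terminate", None)
    if text.endswith("?"):
        return ("user_vqa_query", text)
    if not task_is_set:
        return ("set_task", text)
    return ("user_interjection", text)
```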
CLI
---
- ``src/lerobot/scripts/lerobot_smolvla2_runtime.py``: console
script ``lerobot-smolvla2-runtime`` that loads a checkpoint,
optionally instantiates ``SayTool`` (pocket-tts), wires up
``SmolVLA2Runtime`` + ``StdinReader``, and runs.
Real-robot wiring (observation_provider / robot_executor) is
intentionally left as a follow-up — v1 is dry-run / language-
only so the REPL works without robot hardware.
Registered in ``pyproject.toml`` ``[project.scripts]``.
Known follow-ups
----------------
- Real-robot integration: today ``LowLevelForward`` only fires when
an observation_provider is wired. The CLI prints a warning if
``--no_robot`` is omitted.
- ``select_message`` runs an extra prefix forward; could share with
the action path's prefix when both are needed in the same tick.
- Tests: no end-to-end runtime test yet (would need a tiny SmolVLM
fixture). The components compile and the public surface is
exercised by the CLI's argument-parsing path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The third and final commit of PR 3's SmolVLA2 work. Wires the actual
training signal through:
* ``predict_actions[i] = True`` → sample i contributes to flow loss
* ``text_labels[i, t] != -100`` → token t of sample i contributes to
LM-head cross-entropy
Both routing knobs come from ``SmolVLA2ChatTokenizerStep`` (previous
commit on this branch), which builds them from the recipe's
``message_streams`` / ``target_message_indices``. The per-sample
``predict_actions`` mask preserves the Pi0.5 convention from the
plan's Section I.7: "True iff any low_level target exists".
Implementation:
- ``forward`` reads ``text_labels`` and ``predict_actions`` from the
batch. When neither is present (vanilla SmolVLA usage with no
recipe), delegates to ``SmolVLAPolicy.forward`` so unannotated
datasets keep training as before — full backward compatibility.
- ``flow_loss``: ``super().forward(reduction="none")`` returns the
per-sample (B,) flow loss; we mask non-action samples with the
``predict_actions`` bool and renormalize by the count of action
samples. ``flow_loss_weight = 0`` in the config disables this
branch entirely (text-only training).
- ``text_loss``: a prefix-only forward through the VLM (no action
expert / suffix), slicing the lang-token range out of the
resulting hidden states (``embed_prefix`` orders the prefix as
``[image_blocks..., lang, state]`` so the slice is unambiguous).
Apply ``vlm.lm_head`` to those hidden states, cross-entropy with
``text_labels`` (ignore_index=-100). ``text_loss_weight = 0``
disables this branch (reverts to flow-only behaviour, matching
SmolVLA exactly).
- The two losses are summed with the config-supplied weights.
Mixed-stream samples (one batch containing both action targets and
text-only sub-recipes) are handled correctly: each sample contributes
where its labels are valid and is masked elsewhere.
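The masking and renormalization described above can be sketched with plain lists. ``combine_losses`` is a hypothetical helper and the zero fallback for a batch with no action samples is an assumption; the real code works on tensors:

```python
def combine_losses(flow_per_sample, predict_actions, text_loss,
                   flow_loss_weight=1.0, text_loss_weight=1.0):
    """Weighted sum of a masked per-sample flow loss and a scalar text loss."""
    # Keep only samples that carry an action target.
    kept = [loss for loss, use in zip(flow_per_sample, predict_actions) if use]
    # Renormalize by the number of action samples; assume 0.0 when there are none.
    flow_loss = sum(kept) / len(kept) if kept else 0.0
    return flow_loss_weight * flow_loss + text_loss_weight * text_loss
```

Dividing by the count of action samples rather than the batch size keeps the flow-loss scale independent of how many text-only samples share the batch.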
Limitations / known follow-ups:
- Text loss runs an additional prefix-only forward separate from the
flow path's prefix forward. The forwards could share their prefix
computation; for clarity of this first commit they don't.
Optimization is straightforward when needed.
- Per-sample loss for ``reduction="none"`` is not yet meaningfully
defined for the dual path — we broadcast the scalar to (B,) for
caller compatibility (e.g. RA-BC weighting will need follow-up).
- Inference ``select_action`` is unchanged from SmolVLA today —
it predicts actions only. A separate "generate text"
``select_message`` path is the natural next step for runtime
use of the LM head (memory updates, plan refreshes, VQA answers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires PR 1's recipe stack into the SmolVLA2 pipeline so multi-target
sub-recipes (memory_update, ask_vqa, user_interjection_response,
high_level_subtask) carry meaningful supervision through to the model.
- New ``chat_processor_smolvla2.py`` with
``SmolVLA2ChatTokenizerStep``: reads ``messages`` /
``message_streams`` / ``target_message_indices`` from the rendered
sample (PR 1 ``RenderMessagesStep``), calls
``apply_chat_template(messages, tools=DEFAULT_TOOLS, ...)`` on the
SmolVLM tokenizer, and writes:
OBS_LANGUAGE_TOKENS / _ATTENTION_MASK ← chat-templated prompt
text_labels ← -100 except target msg tokens
predict_actions ← True iff any low_level target
Builds the label mask robustly by re-rendering the chat through
each target's prefix and reading off the prefix length — same
tokenizer, same tools, so the prefix tokens are guaranteed to be
a prefix of the full sequence. Image/video content blocks
(LeRobot ``feature``-keyed) are stripped before tokenizing; the
actual image tensors flow through SmolVLA's existing
``OBS_IMAGES_*`` channels and ``embed_prefix`` puts them before
the language embeddings, matching the chat-template-stripped
text order.
- ``processor_smolvla2.py``: when ``config.recipe_path`` is set,
build a new pipeline with ``RenderMessagesStep`` +
``SmolVLA2ChatTokenizerStep`` instead of SmolVLA's plain
``TokenizerProcessorStep``. When ``recipe_path`` is ``None``,
fall back to SmolVLA's pipeline so unannotated datasets still
work unchanged. Resolves recipe paths relative to
``src/lerobot/configs/`` so ``recipes/smolvla2_hirobot.yaml``
works directly.
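The prefix-length trick for the label mask can be shown with a toy template. Everything here is an assumption for illustration: ``render`` is a made-up whitespace "chat template", not ``apply_chat_template``, and this sketch labels the whole target message including its role header, which the real step may not do:

```python
IGNORE_INDEX = -100


def render(messages):
    """Toy chat template: a '<role>' header token, the words, then '<end>'."""
    tokens = []
    for msg in messages:
        tokens.append(f"<{msg['role']}>")
        tokens.extend(msg["content"].split())
        tokens.append("<end>")
    return tokens


def build_labels(messages, target_message_indices, vocab):
    """Return (input_ids, labels) with labels = -100 outside target messages."""
    full = render(messages)
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in full]
    labels = [IGNORE_INDEX] * len(full)
    for idx in target_message_indices:
        start = len(render(messages[:idx]))     # tokens before the target
        end = len(render(messages[: idx + 1]))  # tokens through the target
        labels[start:end] = input_ids[start:end]
    return input_ids, labels
```

Because the same renderer produces both the prefix and the full sequence, the prefix length is a valid offset into the full token list.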
The next commit on this branch picks up ``text_labels`` and
``predict_actions`` from the batch and routes them through the
SmolVLM ``lm_head`` for the actual dual-loss training.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the runtime side of the OpenAI-style function-calling stack
introduced in PR 1 (catalog in ``meta/info.json["tools"]``) and PR 2
(annotation pipeline writes the catalog after a run). One file per
tool — heavy deps stay isolated.
Layout:
- ``base.py`` — :class:`Tool` Protocol: ``name``, ``schema``,
``call(arguments)``. Runtime-checkable so tests can use
``isinstance(...)``.
- ``registry.py`` — :data:`TOOL_REGISTRY` (name → class) plus
``get_tools(meta, **kwargs)`` that instantiates every entry whose
``function.name`` is registered. Tools whose name is unknown are
silently skipped — the schema still rides through the chat
template, the model just can't actually invoke that tool at
inference.
- ``say.py`` — :class:`SayTool` wrapping Kyutai's pocket-tts
(CPU-only, ~100M params, ~6× real-time on a MacBook Air M4).
Lazy model load: pocket-tts is imported and the voice state
computed on first ``call(...)`` (or eagerly via ``preload()``).
Returns the PCM tensor; optionally writes a ``.wav`` to
``output_dir`` for offline inspection.
- ``__init__.py`` — re-exports the public surface.
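A minimal sketch of the Protocol-plus-registry layout, under stated assumptions: ``EchoTool`` is a made-up stand-in for a real tool, and this ``get_tools`` takes the tool catalog as a plain list rather than the ``meta`` object the real function receives:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Tool(Protocol):
    name: str
    schema: dict

    def call(self, arguments: dict) -> Any: ...


class EchoTool:
    """Hypothetical tool used only for this illustration."""

    name = "echo"
    schema = {"type": "function", "function": {"name": "echo"}}

    def call(self, arguments: dict) -> Any:
        return arguments.get("text", "")


TOOL_REGISTRY = {"echo": EchoTool}


def get_tools(tool_schemas, **kwargs):
    """Instantiate every catalog entry whose function.name is registered."""
    tools = {}
    for entry in tool_schemas:
        name = entry.get("function", {}).get("name")
        cls = TOOL_REGISTRY.get(name)
        if cls is None:
            continue  # unknown tool: skipped silently, schema stays in the catalog
        tools[name] = cls(**kwargs)
    return tools
```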
Optional install:
pip install lerobot[tools]
The ``[tools]`` extra in ``pyproject.toml`` pulls in ``pocket-tts`` +
``scipy`` (for the wav writer). Adding more tools later means a new
file + a registry entry — no new extras unless the tool brings new
deps.
To add your own tool, follow the three-step guide in
``docs/source/tools.mdx`` (PR 1):
1. Drop ``src/lerobot/tools/<my_tool>.py`` with a ``Tool``-conforming
class.
2. Register the class in ``TOOL_REGISTRY`` (``registry.py``).
3. Pre-populate ``meta/info.json["tools"]`` with the schema (or let
``lerobot-annotate`` add it on the next run).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 3 of the steerable-annotation plan retargeted from Pi0.5 to SmolVLA
because the recipe stack (PR 1 + PR 2) outputs HF/TRL-compatible chat
which a chat-pretrained backbone consumes natively. SmolVLA strips the
SmolVLM ``lm_head`` though, so it can only do flow-matching action
prediction. SmolVLA2 keeps the LM head so the same model can train on
the full Hi Robot / MEM / ECoT blend defined in the plan:
* action-only sub-recipes (low_level_execution): flow loss
* text-only sub-recipes (memory_update / ask_vqa /
user_interjection_response): CE loss on lm_head
* mixed sub-recipes: both summed
This first commit lays down the structural scaffold:
- ``src/lerobot/policies/smolvla2/`` — new package with thin subclasses
of ``SmolVLAConfig`` / ``SmolVLAPolicy`` so we don't fork the 900-line
modeling code. ``SmolVLA2Config`` adds ``recipe_path``,
``apply_chat_template``, ``text_loss_weight``, ``flow_loss_weight``,
and ``unfreeze_lm_head``. ``SmolVLA2Policy`` unfreezes the SmolVLM
``lm_head`` (and the surrounding norm + last text-model layer SmolVLA
freezes) when ``unfreeze_lm_head=True`` and ``text_loss_weight>0``.
- ``factory.py`` registers ``smolvla2`` in ``get_policy_class``,
``make_policy_config``, and the pre/post-processor builder. Important:
the ``smolvla2`` branch lives BEFORE the ``isinstance(config,
SmolVLAConfig)`` check because ``SmolVLA2Config`` subclasses
``SmolVLAConfig`` — without the ordering, SmolVLA2 would silently
pick up SmolVLA's processor.
- ``configs/recipes/smolvla2_hirobot.yaml`` — canonical Hi Robot blend
for SmolVLA2. Same shape as ``pi05_hirobot.yaml`` (PR 1) so the
recipe stack stays uniform across policy backbones.
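The ordering constraint in ``factory.py`` follows from how ``isinstance`` treats subclasses. A small sketch with stand-in classes (these are not the real lerobot classes, and ``pick_processor`` is a hypothetical name):

```python
class SmolVLAConfig:
    """Stand-in for the real config class."""


class SmolVLA2Config(SmolVLAConfig):
    """Stand-in subclass, mirroring the inheritance the commit describes."""


def pick_processor(config):
    # The subclass check must come first: a SmolVLA2Config is also a SmolVLAConfig.
    if isinstance(config, SmolVLA2Config):
        return "smolvla2"
    if isinstance(config, SmolVLAConfig):
        return "smolvla"
    raise ValueError(f"unsupported config type: {type(config).__name__}")
```

Swapping the two branches would make the first check match both config types, so the second branch would never run.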
Behaviour today is identical to SmolVLA: the modeling forward
delegates to ``SmolVLAPolicy.forward`` and the processor delegates to
``make_smolvla_pre_post_processors``. The next commit on this branch
adds the chat-template processor + ``text_labels`` / ``predict_actions``
batch keys; the commit after that wires the actual text-loss path
through ``vlm.lm_head``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After every ``lerobot-annotate`` run, the executor ensures
``meta/info.json["tools"]`` contains at minimum the canonical ``say``
schema, while preserving any tools the user pre-declared on the
dataset. Chat-template consumers (PR 3 SmolVLA2 / Pi0.5 / dataset
visualizer) read the catalog through
``LeRobotDatasetMetadata.tools`` and pass it to
``apply_chat_template(messages, tools=meta.tools, ...)``.
- ``executor.py``: new ``_ensure_tools_in_info`` helper called
after the parquet rewrite. Idempotent and additive — merges by
``function.name``, only writes back if the list changed.
- ``writer.py``: drops the duplicated ``SAY_TOOL_SCHEMA`` /
``DEFAULT_TOOLS`` constants in favour of importing from
``lerobot.datasets.language`` (PR 1's single source of truth).
Re-exported so existing imports keep working.
- ``annotation_pipeline.mdx``: replace the "code constant only" note
with a pointer to the new Tools doc and a description of the
meta/info.json behaviour, including how to pre-declare custom
tools before annotation runs.
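The idempotent, additive merge can be sketched as follows. This is an illustration under assumptions: ``ensure_tools_in_info`` is a simplified stand-in for ``_ensure_tools_in_info``, and the schema below is an abbreviated placeholder rather than the real ``SAY_TOOL_SCHEMA``:

```python
import json
from pathlib import Path

SAY_TOOL_SCHEMA = {  # abbreviated placeholder, not the real schema
    "type": "function",
    "function": {"name": "say", "parameters": {"type": "object"}},
}


def ensure_tools_in_info(info_path, required=(SAY_TOOL_SCHEMA,)):
    """Add missing tool schemas to info.json; return True if the file changed."""
    path = Path(info_path)
    info = json.loads(path.read_text())
    tools = list(info.get("tools", []))
    known = {t.get("function", {}).get("name") for t in tools}
    changed = False
    for schema in required:
        name = schema["function"]["name"]
        if name not in known:
            tools.append(schema)  # additive: user-declared tools are kept
            known.add(name)
            changed = True
    if changed:  # only write back when the list actually changed
        info["tools"] = tools
        path.write_text(json.dumps(info, indent=2))
    return changed
```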
This is the storage half of the tools work; PR 3 ships the runnable
implementations under ``src/lerobot/tools/`` (one file per tool,
first up: ``say.py`` wired to Kyutai's pocket-tts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 1 now produces ``task_aug`` rows (registered in PR 1) so the
PR-1 ``${task}`` resolver can rotate phrasings deterministically per
``sample_idx``. It also adds an opt-in video-derived task that bypasses the
canonical ``meta/tasks.parquet`` task when it's empty, low-quality, or
explicitly disabled; every downstream Module-1 prompt then uses the
derived task as its grounding.
- ``Module1Config``: adds ``n_task_rephrasings`` (default 10) and
``derive_task_from_video`` ∈ ``{off, if_short, always}`` (default
``if_short``: triggers when canonical is empty, < 3 words, or matches
a placeholder string like ``debug`` / ``unnamed`` / ``tbd``).
- ``plan_subtasks_memory.py``: ``run_episode`` now resolves an
``effective_task`` (canonical OR video-derived) and threads it
through ``_generate_subtasks`` / ``_generate_plan`` /
``_generate_memory`` so subtasks, plans, and memory are all grounded
in the same task string. Then generates ``n`` rephrasings of the
effective task and writes them as ``task_aug`` rows at ``t=0`` with
``role=user``. The effective task itself is included as the first
variant so the rotation is guaranteed to cover the source-of-truth
phrasing.
- New prompts: ``module_1_video_task.txt`` (one-shot video → task),
``module_1_task_rephrasings.txt`` (text-only paraphraser, ``n`` per
call).
- ``meta/tasks.parquet`` is NOT modified — derived tasks live only in
``language_persistent``.
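The ``if_short`` trigger can be sketched as a small predicate. ``should_derive_task`` is a hypothetical name, the placeholder set contains only the examples named above, and exact (case-insensitive) matching is an assumption:

```python
PLACEHOLDERS = {"debug", "unnamed", "tbd"}  # examples named in the commit text


def should_derive_task(canonical_task, mode="if_short"):
    """Decide whether to replace the canonical task with a video-derived one."""
    if mode == "off":
        return False
    if mode == "always":
        return True
    if mode != "if_short":
        raise ValueError(f"unknown mode: {mode!r}")
    text = (canonical_task or "").strip()
    if not text:                    # empty canonical task
        return True
    if len(text.split()) < 3:       # fewer than 3 words
        return True
    return text.lower() in PLACEHOLDERS
```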
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
qwen36moe-11 surfaced a deeper semantic problem with mid-episode
interjections: they were generated as *counterfactual* user requests
("actually skip the wipe", "use the blue one instead") but teleop data
is frozen — the robot in the video already executed everything,
including the steps the user "asked to skip". The training signal was
therefore self-contradictory: interjection text said one thing, the
robot's subsequent action stream did the opposite.
Flip the framing. Anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask. The
robot's visible next behavior IS the interjection's effect, so:
interjection text → plan refresh → action stream
are all consistent with the same observed video.
Concretely:
- ``interjections_and_speech.py``: instead of sampling random
timestamps from ``frame_timestamps``, walk Module 1's subtask spans
and sample from the (subtask N → subtask N+1) transitions. Pass both
the just-finished and the upcoming subtask texts into the prompt.
- ``_window_timestamps``: re-center the multi-frame video window on
the boundary itself (half the frames cover the end of the previous
subtask, half cover the start of the next one) so the VLM has the
same visual conditioning the policy will see at training time.
- ``module_2_interjection.txt``: rewritten. The prompt now states
explicitly that this is offline data, the robot already committed to
the next subtask, and the interjection must be a natural request
that aligns with — not contradicts — the next subtask. Removes the
"negative task / situated correction" Hi Robot framing because those
scenarios require online execution to be coherent.
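The re-centered window can be sketched as evenly spaced timestamps around the boundary. ``window_timestamps`` here is a hypothetical reimplementation; the defaults (4 frames over 2 s) are borrowed from the earlier interjection-window commit and the clamping to the episode range is an assumption:

```python
def window_timestamps(boundary_t, window_seconds=2.0, n_frames=4, episode_end=None):
    """Evenly spaced timestamps centered on a subtask boundary."""
    if n_frames < 1:
        raise ValueError("n_frames must be >= 1")
    if n_frames == 1:
        return [boundary_t]
    start = boundary_t - window_seconds / 2
    step = window_seconds / (n_frames - 1)
    upper = episode_end if episode_end is not None else float("inf")
    # Clamp so windows near the episode edges stay inside the video.
    return [min(max(start + i * step, 0.0), upper) for i in range(n_frames)]
```

With an even frame count, half of the timestamps fall before the boundary and half after it.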
Plan-refresh logic from the previous commit (forwarding interjection
text into the refresh prompt) is unchanged and now reinforces the same
direction: the refreshed plan emphasizes the upcoming subtask the
interjection just asked for.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
qwen36moe-10 showed three Module-2 / plan-refresh quality issues that
are not architecture problems — they're prompt-grounding bugs:
1. Interjection prompt passed ``current_subtask = record.episode_task``
(the WHOLE-episode task), not the actual subtask in force at the
chosen timestamp. The VLM had no signal about what was visible at
that moment, so its interjections were generic ("actually skip X"
where X had nothing to do with the visible activity).
2. Interjection prompt only attached a single frame
(``frames_at(record, [t_snap])``). With one frozen image the VLM
couldn't read the ongoing motion. Module 1 already gets the whole
episode video for subtask decomposition, which is why subtasks are
well-grounded; Module 2 was the outlier.
3. The plan-refresh prompt told the model "a plan refresh after a user
interjection at t=X.YZs" but never showed it the interjection
*text*. So the refreshed plan couldn't actually reflect the user's
correction — at best it recombined the same step list.
Fix:
- ``interjections_and_speech.py``: Module 2 reads Module 1's subtask
rows from the same staging tree (executor orders module_1 → module_2
so they're already there) and resolves the actual ``current_subtask``
at each chosen timestamp. Pulls a small clip
(``interjection_window_seconds`` × ``interjection_window_frames``,
defaulting to 4 frames over the leading 2 s) instead of one frame.
Drops the silently-zeroing ``len(candidate_ts) // 4`` cap on the
interjection count.
- ``module_2_interjection.txt``: prompt is rewritten to reference the
multi-frame visual context and require the interjection to mention
something visible OR named in the current subtask, not invented.
- ``plan_subtasks_memory.py``: ``run_plan_updates`` now accepts and
threads through interjection texts. ``_generate_plan(refresh_t,
interjection)`` injects both the current subtask AND the interjection
text into the prompt so the refreshed plan can drop / reorder /
constrain steps to match the user's correction. (Plan still refreshes
ONLY at user interjections — subtask generation runs ~1 Hz at
inference, plan re-emission is event-driven.)
- ``executor.py``: forwards ``interjection_texts`` alongside
``interjection_times`` to ``run_plan_updates``.
- ``Module2Config``: bumps ``max_interjections_per_episode`` default
from 1 to 3 and exposes ``interjection_window_seconds`` /
``interjection_window_frames``.
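Resolving the subtask in force at a chosen timestamp is a sorted-lookup problem. A minimal sketch, assuming Module 1's rows are reduced to ``(start_time, text)`` pairs sorted by start time; ``subtask_at`` is a hypothetical helper:

```python
import bisect


def subtask_at(spans, t):
    """Return the text of the last subtask that starts at or before t.

    spans: list of (start_time, text) tuples sorted by start_time.
    """
    starts = [start for start, _ in spans]
    i = bisect.bisect_right(starts, t) - 1
    return spans[i][1] if i >= 0 else None
```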
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 2 used to write a top-level ``tools`` column on every parquet shard
holding the JSON schema for the ``say`` tool, broadcast identically
across every row. That extends PR 1's schema for no real information
gain — the schema is a fixed code constant, parquet's RLE/dict encoding
collapses it on disk anyway, and HF/TRL chat-template consumers can
just import the constant directly.
PR 2 should fill in PR 1's existing schema, not add to it. So:
- ``writer.py``: stop emitting the ``tools`` column. Strip any legacy
``tools`` column from older shards on rerun so the schema converges to
v3.1. ``SAY_TOOL_SCHEMA`` stays as a public constant (now joined by
``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``); chat-template policies and the
visualizer import them directly.
- ``test_writer.py``: replace the "tools column present" assertion with
one that explicitly checks the column is absent, plus a new test
asserting the constant's shape.
- ``test_pipeline_recipe_render.py``: drop the tools-column read; assert
it's not present in the rewritten parquet.
- ``annotation_pipeline.mdx``: update the writer description to note the
parquet stays small and the schema lives as a code constant.
If multi-tool-set support ever becomes real (datasets with different
tool inventories), the right home is ``meta/info.json["tools"]`` —
adding it later is non-breaking; ripping out a parquet column already
shipped is not.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``lerobot.datasets.video_utils.decode_video_frames`` routes
``backend="pyav"`` through ``decode_video_frames_torchvision`` →
``torchvision.io.VideoReader``, but ``VideoReader`` was removed in
torchvision >= 0.22 (the vllm/vllm-openai:latest container ships with
torchvision 0.25). That made every Module 3 frame decode raise
``AttributeError: module 'torchvision.io' has no attribute 'VideoReader'``,
which the previous catch-all silently turned into an empty image list,
which then made every Module 3 prompt skip via the
``not _has_image_block(messages)`` branch and produce zero VQA rows.
Bypass ``video_utils`` entirely. The annotation pipeline only needs
a handful of PIL frames per (episode, ts), so a direct PyAV decode is
both simpler and insulated from torchvision API churn. ``av`` is already
in the install set, no new dependency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VideoFrameProvider._decode used to swallow every exception silently and
return []. That made Module 3 (VQA) produce zero rows whenever local
video decoding broke (codec, backend, missing file, ...) because every
prompt got skipped via the ``not _has_image_block(messages)`` branch in
general_vqa.py — without any signal in the job log.
Log the first failure with full exception info (subsequent failures
stay quiet to avoid log spam) so this fast-path is debuggable.
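A sketch of the log-once pattern, assuming a wrapper around an arbitrary decode function. ``QuietAfterFirstFailure`` and the logger name are hypothetical; the real change lives inside ``VideoFrameProvider._decode``:

```python
import logging

logger = logging.getLogger("frame_provider_sketch")


class QuietAfterFirstFailure:
    """Wrap a decode function: log the first exception, stay silent afterwards."""

    def __init__(self, decode_fn):
        self._decode_fn = decode_fn
        self._warned = False
        self.failures = 0

    def decode(self, *args):
        try:
            return self._decode_fn(*args)
        except Exception:
            self.failures += 1
            if not self._warned:
                self._warned = True
                # exc_info=True attaches the full traceback to the log record.
                logger.warning("decode failed for %r; later failures are not logged",
                               args, exc_info=True)
            return []
```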
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Print the default and full camera list once at the top of every run so a
silent Module-3-no-op (cam_keys=[]) is visible in the job log instead of
only being discoverable by counting parquet rows after upload.
Also warn loudly when Module 3 is enabled but no cameras resolved, with
a hint about the --vlm.camera_key fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 3 fast-pathed out (50 episodes in 0.6s) when
``frame_provider.camera_keys`` came back empty even though Module 1/2
worked, because they use ``frame_provider.camera_key`` (singular) and
were happy with the explicit ``--vlm.camera_key=...`` override.
Two fixes:
- ``frames.py``: read ``meta.camera_keys`` (covers both video- and
image-stored cameras) instead of ``meta.video_keys`` (video-only),
matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If
metadata still surfaces nothing but the caller explicitly passed
``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by
definition known to exist on the dataset.
- ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees
zero cameras so this never silently produces zero VQA again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A ready-to-run example of launching the annotation pipeline on a
Hugging Face job (h200x2) with two vllm replicas serving
Qwen3.6-35B-A3B-FP8. Lives next to other end-to-end recipes under
examples/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.
- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
and a `camera_key=` argument on `frames_at` / `video_for_episode`.
`VideoFrameProvider` exposes every `observation.images.*` key the
dataset declares (not just the first) and keys its decode cache on
`(episode, camera, timestamp)` so per-camera reads don't collide.
Module 1 / 2 keep their old single-camera behaviour by leaving
`camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
.camera_keys` for each emission tick, builds one prompt per camera,
batches all of them through the VLM, and stamps the resulting rows
with `camera=<that key>`. Empty `camera_keys` (null provider) makes
the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
carry `camera` through and call `validate_camera_field` so the
invariant is enforced at the writer boundary. Event sort key now
includes `camera` for deterministic ordering when several cameras
share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
field; `_check_camera_field` enforces the invariant and cross-checks
every view-dependent row's `camera` against the dataset's known video
keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
`(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
`camera_keys` into the validator so the cross-check uses the actual
dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
`camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
configures two cameras and asserts both are represented, that every
emitted row has a `camera` tag, and that uniqueness holds per
`(timestamp, camera, role)`.
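The per-frame, per-camera uniqueness check can be sketched over plain row dicts. `duplicate_vqa_keys` is a hypothetical helper and the row layout (`timestamp`, `camera`, `role`, `style` keys) is assumed from the description above:

```python
from collections import Counter


def duplicate_vqa_keys(rows):
    """Return (timestamp, camera, role) keys that occur more than once among VQA rows."""
    counts = Counter(
        (row["timestamp"], row["camera"], row["role"])
        for row in rows
        if row.get("style") == "vqa"
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

Two cameras at the same timestamp yield distinct keys, so only a repeated row for the same camera is flagged.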
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Modern LeRobot datasets store videos in AV1, which vllm's libav build
cannot decode (the video processor returns 0 frames and downstream
chokes with ZeroDivisionError). Re-encode each per-episode subclip
with libx264 (preset ultrafast, crf 23) so the resulting mp4 is
universally decodable. Strip audio with -an for a smaller payload.
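A sketch of how the re-encode command line might be assembled. build_reencode_command and its arguments are hypothetical, and the -ss/-t subclip handling is an assumption; only the libx264 / ultrafast / crf 23 / -an settings come from the commit text:

```python
def build_reencode_command(src, dst, start_s, duration_s):
    """Build an ffmpeg argv that cuts a subclip and re-encodes it to H.264."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start_s:.3f}",       # seek to the episode start in the shard
        "-i", str(src),
        "-t", f"{duration_s:.3f}",     # keep only this episode's frames
        "-c:v", "libx264", "-preset", "ultrafast", "-crf", "23",
        "-an",                         # drop audio for a smaller payload
        str(dst),
    ]
```

The list form can be passed to subprocess.run without shell quoting.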
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds VlmConfig.num_gpus so parallel_servers can exceed the physical
GPU count. Replicas are round-robin-assigned to GPUs (e.g.
parallel_servers=4 + num_gpus=2 → replicas pinned to GPUs 0,1,0,1).
Backward-compatible: num_gpus=0 keeps the existing 1-replica-per-GPU
behavior.
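The round-robin pinning can be written in a couple of lines. assign_gpus is a hypothetical helper; it treats num_gpus=0 as "one GPU per replica", matching the backward-compatible behavior described above:

```python
def assign_gpus(parallel_servers, num_gpus=0):
    """Return the GPU index for each replica, assigned round-robin."""
    n = num_gpus if num_gpus > 0 else parallel_servers
    return [i % n for i in range(parallel_servers)]
```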
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets callers pass per-request template flags such as
{"enable_thinking": false} for Qwen3.5/Qwen3.6 models, where the
default thinking preamble otherwise consumes the entire max_new_tokens
budget before any JSON is emitted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.
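Keeping several episodes in flight can be sketched with a thread pool. run_episodes and annotate_fn are hypothetical names; the default of 16 matches the commit text, but the real executor's structure is not shown here:

```python
from concurrent.futures import ThreadPoolExecutor


def run_episodes(episode_indices, annotate_fn, episode_parallelism=16):
    """Run annotate_fn over episodes with a bounded number in flight.

    Results come back in input order, regardless of completion order.
    """
    with ThreadPoolExecutor(max_workers=episode_parallelism) as pool:
        return list(pool.map(annotate_fn, episode_indices))
```

Threads are enough here because each episode spends most of its time waiting on VLM HTTP calls.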
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight server-streaming threads writing chars unsynchronized cause UTF-8
sequences from different servers to interleave mid-character, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.
Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.
Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).
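A sketch of the per-replica launch plan and request rotation. server_plan and round_robin are hypothetical helpers; appending "--port {port}" to the default command and the http://localhost base URL are assumptions for illustration:

```python
import itertools


def server_plan(model_id, parallel_servers, serve_port=8000, serve_command=None):
    """Describe one vllm process per GPU: env, command line, and API base URL."""
    template = serve_command or (
        f"vllm serve {model_id} --tensor-parallel-size 1 "
        "--max-model-len 32768 --uvicorn-log-level warning --port {port}"
    )
    plan = []
    for i in range(parallel_servers):
        port = serve_port + i
        plan.append({
            "env": {"CUDA_VISIBLE_DEVICES": str(i)},  # pin replica i to GPU i
            "command": template.format(port=port),
            "api_base": f"http://localhost:{port}/v1",
        })
    return plan


def round_robin(plan):
    """Endless iterator over the replicas' API base URLs."""
    return itertools.cycle([entry["api_base"] for entry in plan])
```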
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).
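The three-stage extraction can be sketched as follows. extract_json and _first_balanced_object are hypothetical names, and returning None for unparseable or non-dict output is an assumption that mirrors the skip behavior described elsewhere in this log:

```python
import json
import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def _first_balanced_object(text):
    """Return the first balanced {...} substring, ignoring braces inside strings."""
    depth, start = 0, None
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:  # only track strings inside an object
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None


def extract_json(text):
    """Strip think blocks, try json.loads, then fall back to a brace scan."""
    cleaned = _THINK_RE.sub("", text).strip()
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        candidate = _first_balanced_object(cleaned)
        if candidate is None:
            return None
        try:
            result = json.loads(candidate)
        except json.JSONDecodeError:
            return None
    return result if isinstance(result, dict) else None
```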
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.
Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.
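The rate-based frame budget works out as duration times rate, capped. A minimal sketch: sample_timestamps is a hypothetical helper, and the even spacing and the minimum of one frame are assumptions:

```python
def sample_timestamps(duration_s, frames_per_second=1.0, max_video_frames=128):
    """Evenly spaced timestamps: duration * fps frames, capped at max_video_frames."""
    n = int(duration_s * frames_per_second)
    n = max(1, min(n, max_video_frames))  # at least one frame, at most the cap
    step = duration_s / n
    return [i * step for i in range(n)]
```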
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.
Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
and inline it as a base64 data:video/mp4 URL so the server sees the
bytes directly.
Both surface as std::bad_alloc on the server side when wrong, which is
unhelpful but explains what we hit.
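Both fixes can be sketched as small payload builders. video_content_block and request_extra_body are hypothetical names, and the exact content-block shape is an assumption based on the common OpenAI-compatible video_url layout:

```python
import base64
from pathlib import Path


def video_content_block(path):
    """Inline a local mp4 as a base64 data URL instead of a file:// URL."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "video_url",
        "video_url": {"url": f"data:video/mp4;base64,{encoded}"},
    }


def request_extra_body(fps):
    """Carry fps in extra_body.mm_processor_kwargs rather than in the content block."""
    return {"mm_processor_kwargs": {"fps": fps}}
```

Inlining trades a larger request body for independence from the server's file access.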
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transformers serve rescans the HF cache on every /v1/models request,
which takes longer than the 2 s urllib timeout and leaves the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
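As an illustration of the readiness check described in the commit above, the sketch below scans streamed server output for the two marker strings. The function name and the idea of passing any iterable of lines are assumptions; the marker strings are the ones quoted in the commit message.

```python
from typing import Iterable

# Marker strings quoted in the commit message.
READY_MARKERS = ("Uvicorn running", "Application startup complete")


def wait_until_ready(lines: Iterable[str], markers=READY_MARKERS) -> bool:
    """Consume output lines until one contains a readiness marker.

    Returns True as soon as a marker is seen, False if the stream ends first.
    """
    for line in lines:
        if any(marker in line for marker in markers):
            return True
    return False
```

With a child process started through `subprocess.Popen(..., stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)`, `proc.stdout` can be passed directly as `lines`. This sketch has no timeout of its own; a real caller would probably want one.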
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
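The probe-then-spawn decision in the commit above might look roughly like this. Both function names and the injectable `opener` parameter are hypothetical; the `<api_base>/models` probe path is taken from the commit message.

```python
import urllib.error
import urllib.request


def server_is_reachable(api_base: str, timeout_s: float = 2.0,
                        opener=urllib.request.urlopen) -> bool:
    """Return True if GET <api_base>/models answers with a 2xx status."""
    url = api_base.rstrip("/") + "/models"
    try:
        with opener(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error or malformed URL.
        return False


def should_spawn_server(auto_serve: bool, api_base: str, **probe_kwargs) -> bool:
    """Spawn only when auto_serve is on and nothing is already listening."""
    return auto_serve and not server_is_reachable(api_base, **probe_kwargs)
```

The `opener` argument exists only so the probe can be replaced in tests; in normal use the default `urllib.request.urlopen` is enough.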
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.
Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).
A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.
Enable with --module_1.use_video_url=true (and --vlm.backend=openai).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:
- launch the server out-of-process (transformers serve, vllm serve,
ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
--vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...
Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pyav (and sometimes torchcodec) decode can return fewer frames than the
number of timestamps requested when some timestamps fall outside the video file's
content range. Drop the strict=True on the zip and rely on the
None-filter to discard missing frames.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
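A small sketch of the pairing behaviour described in the commit above. The function name is hypothetical, and it assumes that any missing frames are at the tail of the decoded list, so pairing by position stays aligned.

```python
def pair_frames(timestamps, frames):
    """Pair timestamps with decoded frames, tolerating short or sparse output.

    zip() without strict=True stops at the shorter sequence instead of
    raising, and the None-filter drops slots where no frame was decoded.
    Assumption: missing frames are trailing, so positions still line up.
    """
    return [(ts, frame) for ts, frame in zip(timestamps, frames) if frame is not None]
```

For example, `pair_frames([0.0, 1.0, 2.0], ["f0", None])` keeps only the first pair rather than raising a length-mismatch error.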
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc even on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.
LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.
Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.
- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
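One way to express the "try the new name first, then fall back" pattern from the commit above is a generic attribute lookup. The helper `first_available_attr` is hypothetical; it uses only the standard library, so it can be tried without transformers installed.

```python
import importlib


def first_available_attr(module_name: str, candidates):
    """Return the first attribute from `candidates` that the module exposes."""
    module = importlib.import_module(module_name)
    for name in candidates:
        attr = getattr(module, name, None)
        if attr is not None:
            return attr
    raise ImportError(f"none of {list(candidates)} found in {module_name!r}")
```

For the case in the commit, the call would be `first_available_attr("transformers", ["AutoModelForImageTextToText", "AutoModelForVision2Seq"])`, which prefers the newer class and falls back to the older one when it is absent.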
Older draccus versions (e.g. 0.10.x bundled in some envs) lack a decoder
for typing.Literal and raise:
No decoding function for type typing.Literal['vllm', 'transformers', 'stub']
Switching VlmConfig.backend from Literal to str works under every
draccus version. The runtime branch in vlm_client.make_vlm_client
already validates the value and raises ValueError on unknown backends,
so the constraint stays enforced.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
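The trade described in the commit above (a plain `str` field plus a runtime check instead of `typing.Literal`) can be sketched like this. The class and function names mirror those in the commit message, but the bodies, the default value and the return value are illustrative assumptions.

```python
from dataclasses import dataclass

# Backend names taken from the error message quoted in the commit.
KNOWN_BACKENDS = ("vllm", "transformers", "stub")


@dataclass
class VlmConfig:
    # Plain str so config parsers without a Literal decoder can handle it.
    backend: str = "stub"  # default chosen for this sketch only


def make_vlm_client(config: VlmConfig) -> str:
    """Validate the backend at runtime; returns a placeholder client label."""
    if config.backend not in KNOWN_BACKENDS:
        raise ValueError(
            f"unknown backend {config.backend!r}; expected one of {KNOWN_BACKENDS}"
        )
    return f"<{config.backend} client>"
```

The constraint moves from parse time to construction time, so a typo in `--vlm.backend` still fails loudly, just slightly later.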
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks — no stride, no count, no keyframe count knob to
tune.
- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
uniformly across the episode duration; ``_NullProvider`` returns []
for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
the prompt template asks the model to "watch the whole clip, then
segment it" with cut points decided from gripper/contact/regrasp
events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
``max_video_frames: int = 32`` (model-capacity bound, not annotation
logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
of the old "Decompose the demonstration" phrase that no longer
appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
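To illustrate "samples up to ``max_frames`` uniformly across the episode" from the commit above, here is a sketch that works on frame indices. The function name and the index-based formulation are assumptions; the real provider may sample by timestamp instead.

```python
from __future__ import annotations


def uniform_frame_indices(num_frames: int, max_frames: int) -> list[int]:
    """Pick at most `max_frames` indices spread evenly over `num_frames`."""
    if num_frames <= 0 or max_frames <= 0:
        return []
    if num_frames <= max_frames:
        # Short episode: keep every frame.
        return list(range(num_frames))
    if max_frames == 1:
        return [0]
    # Evenly spaced positions that include the first and last frame.
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]
```

With `max_frames=32` this keeps the request size bounded no matter how long the episode is, which matches the "model-capacity bound" framing in the commit.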
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.
- New `frames.py`:
- `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
dataset's first `observation.images.*` stream via
`LeRobotDatasetMetadata.get_video_file_path` and
`decode_video_frames`, with the same `from_timestamp` shift the main
dataset uses.
- Per-process LRU cache so co-timestamped Module 1 plan-update + Module
2 calls share decode work.
- `make_frame_provider` falls back to a null provider when the dataset
has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
prepend image blocks before the text block.
- Module 1 attaches `keyframes_per_episode` keyframes to the subtask
decomposition prompt.
- Module 2 attaches the frame at the interjection timestamp.
- Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
`Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
`--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
attachment behaviour and the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stores OpenAI-style function schemas at ``meta/info.json["tools"]`` so
datasets can declare which tools are available (today: just ``say``;
tomorrow: per-dataset extensions). The ``DEFAULT_TOOLS`` constant
fills in for unannotated datasets so chat-template consumers don't
have to special-case anything.
Three pieces:
- ``language.py``: ``SAY_TOOL_SCHEMA`` and ``DEFAULT_TOOLS``
constants. Single source of truth — PR 2's writer and PR 3's
runtime tool registry will both import from here instead of
duplicating the dict.
- ``dataset_metadata.py``: ``LeRobotDatasetMetadata.tools`` property
reads ``info.json["tools"]`` and falls back to ``DEFAULT_TOOLS``.
Returns deep-copied dicts so callers can mutate the result safely.
- ``docs/source/tools.mdx``: spec page covering the catalog, per-row
invocations, and the three-step "how to add a new tool" workflow
(declare schema, implement, register). Linked from the docs
toctree under the Datasets section.
This lays the groundwork for PR 2's pipeline writing the catalog out
during annotation, and PR 3's ``src/lerobot/tools/`` package shipping
runnable implementations (one file per tool — first up:
``say.py`` wrapping Kyutai's pocket-tts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
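The fallback and deep-copy behaviour described for ``LeRobotDatasetMetadata.tools`` in the commit above can be sketched as a free function. The schema body below is a made-up OpenAI-style example, not the project's actual ``SAY_TOOL_SCHEMA``, and ``resolve_tools`` is a hypothetical name.

```python
import copy

# Hypothetical schema contents, for illustration only.
SAY_TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "say",
        "description": "Speak a short utterance aloud.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}
DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]


def resolve_tools(info: dict) -> list:
    """Return info["tools"] if declared, else the defaults, always deep-copied."""
    tools = info.get("tools")
    if tools is None:
        tools = DEFAULT_TOOLS
    # Deep copy so callers can mutate the result without touching shared state.
    return copy.deepcopy(tools)
```

Returning a deep copy means a caller that appends to or edits the list cannot silently change the defaults seen by the next caller.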
Adds task-prompt diversity (Xiao 2022 / CAST) without touching
``meta/tasks.parquet`` or forcing recipes to opt in. The plan reserved
``task_aug`` as a future style; this lands it now.
- ``language.py``: add ``task_aug`` to ``CORE_STYLES`` and
``PERSISTENT_STYLES``. ``column_for_style("task_aug")`` returns
``language_persistent`` so PR 2 writers route it correctly.
- ``language_render.py``: ``_resolve_task`` now consults the persistent
slice for rows of ``style="task_aug", role="user"``. When any exist
it picks one deterministically by ``sample_idx`` (blake2b-keyed, not
Python's randomized hash) so an epoch sees every rephrasing of every
episode while the same sample still resolves identically across
reruns. Falls back to the canonical ``meta/tasks.parquet`` task when
no rephrasings are present, so existing datasets and unannotated runs
keep their behaviour. Explicit ``task=`` overrides still win.
- Tests: rephrasing coverage across samples, determinism on repeat
``sample_idx``, fallback when persistent has no ``task_aug`` rows,
and explicit override priority.
Recipes get this for free: any ``${task}`` placeholder rotates through
the available rephrasings. Recipes that want the literal canonical task
can override the binding.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
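The deterministic selection described in the commit above can be sketched with `hashlib.blake2b`. The function name, the 8-byte digest size and the way ``sample_idx`` is encoded are assumptions made for this example.

```python
import hashlib


def pick_rephrasing(rephrasings, sample_idx: int) -> str:
    """Pick one rephrasing deterministically from the sample index.

    blake2b gives the same digest in every process, unlike Python's
    built-in hash(), which is randomized per interpreter run for strings.
    """
    if not rephrasings:
        raise ValueError("no rephrasings available")
    digest = hashlib.blake2b(str(sample_idx).encode("utf-8"), digest_size=8).digest()
    return rephrasings[int.from_bytes(digest, "big") % len(rephrasings)]
```

Because the choice depends only on ``sample_idx``, repeated runs resolve each sample to the same rephrasing, while different samples spread across the available options.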
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.
The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.
- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
to both Arrow struct types and the HF datasets feature mappings;
introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
`is_view_dependent_style` and `validate_camera_field` helpers (camera
required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
`_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
Without a `camera` filter, multi-row matches keep raising the existing
ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
`ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
the matching image block), keeping the original 0.20 budget and
documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
`is_view_dependent_style`, `validate_camera_field` (both required and
forbidden directions), per-camera `emitted_at` filtering, and the
ambiguity error when two cameras emit `(vqa, assistant)` at the same
timestamp without a `camera=` filter. RenderMessagesStep + dataset
passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
the per-camera resolver pattern, and the canonical recipe convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
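The "camera required iff style is view-dependent" rule from the commit above can be sketched as below. The signatures are hypothetical, and the style set is the one listed in this commit message.

```python
# Style set as listed in this commit message.
VIEW_DEPENDENT_STYLES = frozenset({"vqa", "motion", "trace"})


def is_view_dependent_style(style: str) -> bool:
    return style in VIEW_DEPENDENT_STYLES


def validate_camera_field(style: str, camera) -> None:
    """Raise ValueError unless camera is set exactly for view-dependent styles."""
    view_dependent = is_view_dependent_style(style)
    if view_dependent and camera is None:
        raise ValueError(f"style {style!r} requires a camera")
    if not view_dependent and camera is not None:
        raise ValueError(f"style {style!r} must not carry a camera")
```

Checking both directions keeps rows unambiguous: a view-dependent row always says which view it was grounded against, and other rows never carry a stray camera value.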
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While these devices are natively integrated into the LeRobot codebase, the library is designed to be extensible. You can easily implement the Robot interface to utilize LeRobot's data collection, training, and visualization tools for your own custom robot.
As with hardware, you can implement your own policy, use LeRobot's data collection, training, and visualization tools, and share your model on the HF Hub.
@@ -135,7 +133,6 @@ Learn how to implement your own simulation environment or benchmark and distribu
- **[Discord](https://discord.gg/q8Dzzpym3f):** Join the `LeRobot` server to discuss with the community.
- **[X](https://x.com/LeRobotHF):** Follow us on X to stay up-to-date with the latest developments.
- **[Robot Learning Tutorial](https://huggingface.co/spaces/lerobot/robot-learning-tutorial):** A free, hands-on course to learn robot learning using LeRobot.
- **[T-Shirt Folding Experiment](https://huggingface.co/spaces/lerobot/robot-folding):** An end-to-end demonstration of folding t-shirts with LeRobot.
## Citation
@@ -143,7 +140,7 @@ If you use LeRobot in your project, please cite the GitHub repository to acknowl
```bibtex
@misc{cadene2024lerobot,
author={Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Meftah, Khalil and Ellerbach, Maxime and Moss, Jess and Wolf, Thomas},
author={Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Moss, Jess and Wolf, Thomas},
title={LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
| `--plan.frames_per_second` | `2.0` | Frame sampling rate for the contact sheets (`2.0` = one frame every 0.5s). |
| `--plan.max_frames_per_prompt` | `60` | Frame budget per VLM call. Episodes whose sampling exceeds this are auto-windowed at the same density, then stitched. |
| `--plan.contact_sheet_columns` | `5` | Columns per contact-sheet grid (`contact_sheet_frames_per_sheet` tiles, time row-major). |
| `--plan.frames_per_second` | `1.0` | How densely the episode video is sampled. |
| `--plan.max_video_frames` | `32` | Hard cap on frames per call (context-budget guard — don't exceed ~32 for a 32k context). |
| `--plan.subtask_window_seconds` | `0` | Split long episodes into fixed windows for constant frame density (`0` = whole episode). |
| `--plan.plan_max_steps` | `8` | Upper bound on subtasks per episode. |
| `--plan.subtask_describe_first` | `true` | Run the describe→segment grounding pass (best subtask quality; +1 call/episode). |
| `--plan.emit_memory` | `true` | Emit the `memory` rows (`false` = subtasks + plan only); symmetric to `emit_plan`. |
| `--plan.n_task_rephrasings` | `10` | How many `task_aug` rephrasings to emit (`0` disables). |
| `--plan.derive_task_from_video` | `if_short` | Use the dataset task as-is (`off`), only when it's missing/short (`if_short`), or always re-derive from video (`always`). |
| `--plan.use_video_url` | `false` | Send a server-side video clip instead of embedded frames. |
The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.
## Blends
Blend recipes select one weighted sub-recipe deterministically from the sample index.
`recipes/subtasks_vqa.yaml` trains the core blend — high-level subtask prediction, low-level execution, and VQA. `recipes/subtask_mem_vqa_speech.yaml` is the fuller variant that also adds memory updates and spoken interjection responses.
## Graceful absence
If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
- Use the one-hot task conditioning for multi-task training (MT10/MT50 conventions) so policies have explicit task context.
- Inspect the dataset task descriptions and the `info["is_success"]` keys when writing post-processing or logging so your success metrics line up with the benchmark.
- Adjust `batch_size`, `steps`, and `env_eval_freq` to match your compute budget.
- Adjust `batch_size`, `steps`, and `eval_freq` to match your compute budget.
# reachy2-sdk caps grpcio<=1.73.1 and protobuf<=6.32.0; quarantined here so downstream users aren't held back. reachy2-sdk is unlikely to release new versions.
reachy2=[
"reachy2_sdk>=1.0.15,<1.1.0",
"grpcio<=1.73.1",
"protobuf<=6.32.0",
]
reachy2=["reachy2_sdk>=1.0.15,<1.1.0"]
# Seeed Studio reBot B601-DM follower (motorbridge / CAN) + StarArm102 / reBot Arm 102
[SmolVLA](https://huggingface.co/papers/2506.01844) is a compact, efficient vision-language-action model that achieves competitive performance at reduced computational costs and can be deployed on consumer-grade hardware.
{% elif model_name == "act" %}
[Action Chunking with Transformers (ACT)](https://huggingface.co/papers/2304.13705) is an imitation-learning method that predicts short action chunks instead of single steps. It learns from teleoperated data and often achieves high success rates.
{% elif model_name == "tdmpc" %}
[TD-MPC](https://huggingface.co/papers/2203.04955) combines model-free and model-based approaches to improve sample efficiency and performance in continuous control tasks by using a learned latent dynamics model and terminal value function.
{% elif model_name == "diffusion" %}
[Diffusion Policy](https://huggingface.co/papers/2303.04137) treats visuomotor control as a generative diffusion process, producing smooth, multi-step action trajectories that excel at contact-rich manipulation.
{% elif model_name == "vqbet" %}
[VQ-BET](https://huggingface.co/papers/2403.03181) combines vector-quantised action tokens with Behaviour Transformers to discretise control and achieve data-efficient imitation across diverse skills.
{% elif model_name == "pi0" %}
[π₀ (Pi0)](https://www.physicalintelligence.company/blog/pi0) is a general-purpose robot foundation model from Physical Intelligence: a generalist Vision-Language-Action policy that understands visual inputs, interprets natural language instructions, and controls a variety of different robots across diverse tasks. The LeRobot implementation is adapted from their open-source OpenPI repository.
**π₀ (Pi0)**
π₀ is a Vision-Language-Action model for general robot control, from Physical Intelligence. The LeRobot implementation is adapted from their open source OpenPI repository.
**Model Overview**
π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by Physical Intelligence. Unlike traditional robots that are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
For more details, see the [Physical Intelligence π₀ blog post](https://www.physicalintelligence.company/blog/pi0).
{% elif model_name == "pi05" %}
[π₀.₅ (Pi05)](https://www.physicalintelligence.company/blog/pi05) is a Vision-Language-Action model from Physical Intelligence designed for open-world generalization: it evolves π₀ to generalize to entirely new environments and situations that were never seen during training. The LeRobot implementation is adapted from their open-source OpenPI repository.
{% elif model_name == "molmoact2" %}
[MolmoAct2](https://allenai.org/blog/molmoact2) is an open robotics foundation model from the Allen Institute for AI (Ai2) that maps camera images and language instructions to robot action chunks. The LeRobot implementation supports training and evaluation of the regular MolmoAct2 model.
{% elif model_name == "vla_jepa" %}
[VLA-JEPA](https://arxiv.org/abs/2602.10098) is a Vision-Language-Action model that combines a Qwen3-VL language backbone with a self-supervised video world model (V-JEPA2) and a flow-matching DiT action head.
**π₀.₅ (Pi05) Policy**
π₀.₅ is a Vision-Language-Action model with open-world generalization, from Physical Intelligence. The LeRobot implementation is adapted from their open source OpenPI repository.
**Model Overview**
π₀.₅ represents a significant evolution from π₀, developed by Physical Intelligence to address a big challenge in robotics: open-world generalization. While robots can perform impressive tasks in controlled environments, π₀.₅ is designed to generalize to entirely new environments and situations that were never seen during training.
For more details, see the [Physical Intelligence π₀.₅ blog post](https://www.physicalintelligence.company/blog/pi05).
{% elif model_name == "gaussian_actor" %}
This is a Gaussian Actor policy (Gaussian policy with a tanh squash) — the policy-side component used by [Soft Actor-Critic (SAC)](https://huggingface.co/papers/1801.01290) and related maximum-entropy continuous-control algorithms.
{% elif model_name == "pi0_fast" %}
[π₀-FAST (Pi0-FAST)](https://www.physicalintelligence.company/research/fast) is a Vision-Language-Action model for general robot control, from Physical Intelligence. It models continuous robot actions with autoregressive next-token prediction using FAST (Frequency-space Action Sequence Tokenization), training up to 5x faster than diffusion-based π₀.
{% elif model_name == "eo1" %}
[EO-1](https://huggingface.co/papers/2508.21112) is a Vision-Language-Action model for general robot control. It pairs a Qwen2.5-VL backbone for vision-language understanding with a continuous flow-matching action head that denoises action chunks.
{% elif model_name == "groot" %}
[GR00T N1.5](https://github.com/NVIDIA/Isaac-GR00T) is an open, cross-embodiment foundation model from NVIDIA for generalized humanoid robot reasoning and skills. It takes language and images as input and uses a flow-matching action transformer to predict actions conditioned on vision, language, and proprioception.
{% elif model_name == "multi_task_dit" %}
[Multi-Task Diffusion Transformer (DiT)](https://huggingface.co/papers/2507.05331) extends Diffusion Policy with a large Diffusion Transformer and text + vision conditioning for multi-task robot learning. It supports both diffusion and flow-matching objectives and reaches high dexterity with only ~450M parameters.
{% elif model_name == "wall_x" %}
[WALL-OSS](https://huggingface.co/papers/2509.11766) is an open-source foundation model for embodied intelligence from XSquare Robot. Built on Qwen2.5-VL, it uses a tightly-coupled multimodal architecture with flow matching to unify semantic reasoning and high-frequency action generation for cross-embodiment control.
{% elif model_name == "xvla" %}
[X-VLA](https://huggingface.co/papers/2510.10274) is a soft-prompted, flow-matching Vision-Language-Action framework that treats each robot or hardware setup as a "task" encoded with a small set of learnable Soft Prompt embeddings, letting a single model reconcile diverse robot morphologies, sensors, and action spaces.
{% else %}
This is a **{{ model_name }}** policy trained with [LeRobot](https://github.com/huggingface/lerobot).
_Model type not recognized — please update this template._
This policy has been trained and pushed to the Hub using [LeRobot](https://github.com/huggingface/lerobot).
{% set policy_docs = {
"act": "act",
"smolvla": "smolvla",
"pi0": "pi0",
"pi0_fast": "pi0fast",
"pi05": "pi05",
"molmoact2": "molmoact2",
"vla_jepa": "vla_jepa",
"eo1": "eo1",
"groot": "groot",
"xvla": "xvla",
"multi_task_dit": "multi_task_dit",
"wall_x": "walloss"
} %}
{% if policy_docs.get(model_name) %}Learn how to train and run it in the [LeRobot {{ model_name }} guide](https://huggingface.co/docs/lerobot/main/en/{{ policy_docs[model_name] }}), or browse the [full documentation](https://huggingface.co/docs/lerobot/index).
{% else %}See the [full LeRobot documentation](https://huggingface.co/docs/lerobot/index).
{% endif %}
See the full documentation at [LeRobot Docs](https://huggingface.co/docs/lerobot/index).
---
## How to Get Started with the Model
For a complete walkthrough, see the [training guide](https://huggingface.co/docs/lerobot/il_robots#train-a-policy).
Below is the short version of how to train and run inference/eval:
--task="{% if dataset and dataset.tasks %}{{ dataset.tasks[0] }}{% else %}<your_task_description>{% endif %}" \
--duration=60
```
Replace the remaining `<...>` placeholders with your own values: `--robot.port` and the camera names/indices are specific to your machine, and the camera names must match the observation keys this policy was trained on.
When `--strategy.type=base` is used, the script does not record episodes. Omitting `--duration` makes the policy run indefinitely. For more information, see the [rollout documentation](https://huggingface.co/docs/lerobot/main/en/inference).
{% if base_model %}### Train your own policy
This policy type is usually fine-tuned from the pretrained base model [{{ base_model }}](https://huggingface.co/{{ base_model }}):
```bash
lerobot-train \
--dataset.repo_id=${HF_USER}/<dataset> \
--policy.path={{ base_model }} \
--output_dir=outputs/train/<policy_repo_id> \
--job_name=lerobot_training \
--policy.device=cuda \
--policy.repo_id=${HF_USER}/<policy_repo_id> \
--wandb.enable=true
```
{% else %}### Train your own policy
```bash
lerobot-train \
--dataset.repo_id=${HF_USER}/<dataset> \
--policy.type={{ model_name }} \
--output_dir=outputs/train/<policy_repo_id> \
--job_name=lerobot_training \
--policy.device=cuda \
--policy.repo_id=${HF_USER}/<policy_repo_id> \
--wandb.enable=true
```
{% endif %}
_Writes checkpoints to `outputs/train/<policy_repo_id>/checkpoints/`._
---
## Evaluation
<!-- Report real-robot results here: run the policy several times per task and count the
successes. Delete the "No evaluation results" line and fill in this table instead:
| Task | Trials | Successes | Success rate |
| ---- | ------ | --------- | ------------ |
| pick the lego brick | 10 | 8 | 80% |
Also worth noting: anything that affects difficulty (new object positions, lighting,
distractors, a different robot of the same type, ...).
-->
_No evaluation results have been provided for this policy yet._
---
## Citation
If you use this policy, please cite the method linked in the description above, along with LeRobot:
```bibtex
@misc{cadene2024lerobot,
author = {Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Moss, Jess and Wolf, Thomas},
title = {LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
Some files were not shown because too many files have changed in this diff