lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-11 20:11:48 +00:00

Author	SHA1	Message	Date
Pepijn	83d0c390da	pi052: drop debug scaffolding left over from training/inference bug hunts Three diagnostic surfaces shipped in PR3 that don't belong in a clean release: * ``LEROBOT_DUMP_RECIPE_SAMPLES`` env-var dump (~70 LOC in text_processor_pi052.py): pretty-prints the next N rendered samples with ``[TGT]...[/TGT]`` markers over supervised spans. One-off training-inspection tool — no production user, never wired into a CLI flag, only useful while iterating on the recipe. Drop the module constants, the ``_is_dump_rank`` / ``_dump_recipe_sample`` helpers, the call site, and the now-unused ``import os``. * ``_log_obs_tensors_once()`` in lerobot_pi052_runtime.py: the docstring literally says "Used to bisect train/inference mismatches" — a debugging artifact from when the LM head was collapsing on the live robot. Logged unconditionally at WARNING level from both the dataset-driven and robot-driven providers, with no ``--verbose`` gate. Drop the function, both call sites, and the ``_logged`` / ``_obs_logged`` flag dicts that fed them. (``_resize_logged`` is kept — it gates the operationally useful camera-size sanity log.) * Defensive ``unsqueeze(0)`` block in the dataset observation provider: papered over an upstream bug where some preprocessor step could produce an unbatched tensor. ``AddBatchDimensionProcessorStep`` is reliable in the current pipeline — pi052 tests still pass with the block removed. If the bug ever resurfaces it should be fixed at the source, not silently re-batched here. Net: -169 LOC. All 30 ``tests/policies/pi052/`` tests pass. The ``<loc>`` token plumbing (``register_paligemma_loc_tokens``, ``_loc_token``, ``suppress_loc_tokens`` runtime gate) is left as-is — it's the actual mechanism for VQA spatial answers, not scaffolding, and the ``suppress_loc_tokens=True`` callers on subtask/memory/ interjection paths and ``=False`` on the VQA path are intentional asymmetric behaviour, not a bug-routing knob. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:07:43 +02:00
Pepijn	1ff10b935c	Merge branch 'feat/language-annotation-pipeline' into feat/smolvla-on-steerable Resolves conflicts from 66 commits on the base branch: * pyproject.toml — keep base's transformers>=5.4.0,<5.6.0; add the sentencepiece-dep entry pi052 (FAST action tokenizer) needs. * policies/__init__.py — keep pi052 export; drop the RewardClassifierConfig export that base removed. * policies/factory.py — docstring list resolution (keep pi052; drop reward_classifier, removed by base). * annotations/steerable_pipeline/executor.py — adopt base's renamed _ensure_annotation_metadata_in_info (it already advertises the say tool); drop pi052's older _ensure_tools_in_info call. * configs/train.py — keep pi052's vqa_target_fraction; adopt base's SampleWeightingConfig (legacy RA-BC inline params already covered by the migration shim base added). * scripts/lerobot_train.py — merge pi052's per-policy processor rebuild + dataset_repo_id pass-through with base's active_cfg / is_reward_model_training tightening, and re-route vqa-weighted sampler to active_cfg.drop_n_last_frames. * datasets/language_render.py — adopt base's _select_one + timestamp tolerance (drops pi052's stale _select_latest / per-style sort_key). * tests — adopt base's parametrized per-camera blend + tolerance test; drop pi052 tests that overlap with base's tighter rewrites; keep pi052's flow-only / VQA-blend coverage; add a test_canonical_recipe_loads check on subtask_mem_vqa_speech.yaml. * policies/pi052/processor_pi052.py — import RenderMessagesStep directly from render_messages_processor (base intentionally dropped it from lerobot.processor's re-exports). * uv.lock — regenerated cleanly from base + pi052's pocket-tts / beartype. All 67 touched tests pass (30 pi052 + 37 recipe / language-render / pipeline / render-messages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:47:09 +02:00
Pepijn	67bdf4690e	examples(port_datasets): rewrite RoboCasa composite_seen builder Replace the earlier wrapper (which depended on robocasa.scripts.download + dataset_registry) with a self-contained pipeline that: * downloads each task tarball directly from Box via box_links_ds.json * converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30 * standardizes camera keys under observation.images.robot0_* and flattens observation.state by concatenating base/EE/gripper subkeys when the source dataset stores them separately * builds per-rank unified shards then aggregates into one dataset Filter: composite_seen task-set restricts discovery to the 16 multi-step target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use --task-set=all to keep every discovered task in the split/source slice; --tasks=... overrides for arbitrary subsets. Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task. Adapted from a battle-tested port_robocasa.py reference shared by the user; the only semantic addition is the task-set filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:27:42 +02:00
Pepijn	8085feab6e	pi052(runtime): factor out shared observation-prep boilerplate Both observation providers in lerobot_pi052_runtime.py ended a sample dict the same way — strip the runtime-owned language columns and hand the policy a device-resident ``observation.*``-only subset. Extract two tiny helpers (``_strip_runtime_owned_language_cols`` and ``_select_observation_to_device``) so the dataset and robot paths read as a clear linear pipeline. Path-specific concerns (defensive unsqueeze on the dataset path; camera resize + state-vector sanity logging on the robot path) stay inline at the call sites. Behaviour unchanged; all 30 ``tests/policies/pi052/`` tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:25:08 +02:00
Pepijn	a088c10c80	examples(port_datasets): SLURM+datatrove RoboCasa composite_seen build Parallel variant of build_robocasa_composite_seen.py modeled after the existing slurm_port_shards.py / slurm_aggregate_shards.py pattern. Two-phase datatrove pipeline: * Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task), each worker downloads its assigned tar via RoboCasa's own download_datasets helper. Network-bound, idempotent. * Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets over the 16 extracted directories. Submitted with depends=phase1 so SLURM only releases it once all 16 downloads succeed. Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve helpers from the single-machine script via aliased imports — single source of truth for 'what does it mean to download a composite_seen task'. Local (--slurm 0) mode runs the two phases sequentially in-process for debugging on a workstation. Usage on SLURM: uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \ --output-dir=/scratch/${USER}/robocasa_composite_seen \ --hub-repo-id=${HF_USER}/robocasa_composite_seen \ --logs-dir=/scratch/${USER}/logs/robocasa \ --partition=cpu --push-to-hub Prereq: uv sync --extra annotations (pulls datatrove) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:10:05 +02:00
Pepijn	9c3d5ab7ce	scripts: build_robocasa_composite_seen — aggregate 16 target tasks RoboCasa 1.0 ships its target/human demos in LeRobot format (parquet + mp4) as lerobot.tar archives distributed via Box. This script wraps RoboCasa's own download_datasets helper to pull each of the 16 composite_seen tasks, opens each extracted directory as a LeRobotDataset, and merges them into a single combined dataset via merge_datasets (a thin wrapper over aggregate_datasets that revalidates fps/robot_type/features, unifies task indices, concatenates videos and parquet, and recomputes stats). The 16-task slice corresponds exactly to the 'Composite-Seen' column of the published RoboCasa365 leaderboard, so the resulting dataset is the right substrate for an apples-to-apples pi05 vs pi052 comparison on multi-step kitchen manipulation. Usage: uv run python -m lerobot.scripts.build_robocasa_composite_seen \ --output-dir=/data/lerobot/robocasa_composite_seen \ --hub-repo-id=${HF_USER}/robocasa_composite_seen \ --push-to-hub Idempotent: re-running skips already-downloaded tasks. Defensive fallbacks handle RoboCasa API drift in get_ds_path / download_datasets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:01:28 +02:00
Pepijn	e84f97a8c1	smolvla2(runtime): interactive task picker + drop action diagnostic Task picker: The dataset bootstrap used to silently overwrite args.task with the canonical training task. Replace that with an interactive picker (_select_task_interactively) that shows every unique task in ds_meta.tasks as a numbered menu (canonical task first as default) plus a 'type a custom task' option. --task on the CLI still skips the picker, and non-TTY runs fall back to the bootstrap task so scripted invocations are unchanged. Action diagnostic removal: Drop the [act] log block in LowLevelForward.run (|a|_mean / spread / normalized + unnormalized first/last + state) that was added while debugging the 'barely moving' issue. Robot motion is now healthy, the output is noise in steady-state, and it depended on stashing the postprocessor on runtime.state (also removed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:59:08 +02:00
Pepijn	6d2b8c80ab	smolvla2(runtime): wire MemoryUpdateFwd into the inference pipeline MemoryUpdateFwd was importable but never installed, so subtask_change events fired by HighLevelSubtaskFwd had no listener and current_memory stayed at its initial None value — the runtime panel always showed 'memory (not set)' even when the policy was trained with the memory_update recipe (e.g. subtask_mem_vqa_speech.yaml, weight 0.15). Insert MemoryUpdateFwd between HighLevelSubtaskFwd and AskVQAFwd so the event is visible the same tick it is emitted, and refresh the stale comment that claimed memory was not in scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:52:44 +02:00
Pepijn	793c7c4ddd	feat(runtime): --subtask_chunks_per_gen throttles HL gen vs action chunks Adds a per-chunk-boundary counter to HighLevelSubtaskFwd: subtask gen fires only once every N chunk boundaries (default 1 = current behavior). Lets the operator run e.g. 5 flow-matching action chunks per LM-head subtask gen so the subtask doesn't churn every 1.7s while the previous one is still being executed — saves compute and avoids re-planning the action trajectory mid-grasp. --subtask_chunks_per_gen=5 # 5 chunks per subtask refresh The counter starts at 0 so the very first chunk boundary fires immediately (no startup delay). Trigger is rearmed when skipping so a low high_level_hz doesn't lose slots. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:34:59 +02:00
Pepijn	db927ab40b	feat(runtime): action chunk diagnostic: log normalized + unnormalized values Adds a per-chunk log line in LowLevelForward that surfaces what the action expert actually emits and what the robot receives after the postprocessor unnormalizes it, so "barely moving" can be diagnosed at a glance: [act] T=50 |a|_mean=0.234 spread=0.512 [act] norm first=[0.12, -0.31, ...] last=[0.45, -0.22, ...] [act] joint first=[3.2, -47.8, ...] last=[12.4, -41.0, ...] state=[0.5, -55.3, ...] |a|_mean ~ 0.3–0.6 with spread ~ 0.3+ and visible delta from first to last → healthy trajectory. |a|_mean near 0 across the chunk → model defaulting to median pose. joint values that don't differ much from state → safety cap or model output near current state. Postprocessor is stashed on runtime.state["_postprocessor"] at startup so the diagnostic can replay the same unnormalize the dispatcher uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:10:52 +02:00
pepijn	471b2b1b1d	fix(annotate): bump same-frame subtasks onto distinct frames If two consecutive VLM-emitted subtask spans have ``start`` timestamps that round to the same source frame after ``snap_to_frame`` (e.g. on short episodes the VLM sometimes nominates two ~adjacent action boundaries within one 30 Hz step), the writer emits two ``style=subtask`` rows at the identical persistent timestamp. The training-time renderer's default binding ``subtask: active_at(t, style=subtask)`` then raises: ValueError: Ambiguous resolver for style='subtask'; add role=..., tool_name=..., or camera=... to disambiguate. … and the whole training run dies on the first batch. Observed concretely on ``pepijn223/super_poulain_vocab2`` (job 22159979): episodes 3 and 30 each had two subtask rows at the same timestamp (``release yellow cube`` + ``retract arm`` snapping to the same frame). Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list and, whenever a snapped start collides with one already used, push the later span onto the next free frame timestamp. Both subtasks survive on distinct timestamps; the renderer can now disambiguate. If the episode genuinely has no later free frame (extremely unlikely — would require a same-timestamp collision on the very last frame of the episode), the later span is dropped with a warning rather than left to poison the render. New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames`` locks in the contract; full vocabulary suite is 14/14 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-23 19:31:44 +00:00
pepijn	a15e16c072	fix(annotate): replace fuzzy subtask snapping with strict match + one-shot retry The Jaccard-overlap snap was warping VLM output into wrong canonical labels — e.g. an off-vocab "consult the wizard" span would silently become "grasp blue cube" if that scored highest. Even with a higher floor the operator can't tell which subtasks were paraphrases vs genuine mislabels in the resulting dataset. Replace with strict exact-match validation + a single targeted retry: 1. Generate subtasks as before. 2. If any returned subtask's normalised form (lowercased, articles stripped, whitespace collapsed) isn't in the canonical vocab, fire one retry call naming the offending strings and re-sending the full canonical list. The retry prompt requires byte-identical output from the vocab. 3. After the retry, validate again. Spans still off-vocab are dropped — no fuzzy snapping ever produces a different canonical label than the VLM actually emitted. 4. If every span ends up off-vocab even after the retry, warn loudly so the operator extends ``meta/canonical_vocabulary.json`` to cover the missing phase. The episode is left with empty subtasks rather than silently fabricated ones — visibility > sweep-under- the-rug. Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the normalisation helper out so the retry-validation and the final canonicalisation share one source of truth. Tests: - test_plan_module_accepts_article_only_difference: "grasp the blue cube" still maps to canonical "grasp blue cube" (article-tolerant). - test_plan_module_retries_when_subtask_off_vocab: paraphrase triggers the retry which the VLM corrects in pass 2. - test_plan_module_drops_off_vocab_subtask_after_retry: VLM that refuses to correct → bad span dropped, in-vocab span kept. - test_plan_module_empty_when_all_off_vocab_after_retry: every span off-vocab → episode left empty (no warping). All 13 vocabulary tests pass. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-23 09:57:27 +00:00
pepijn	336af85c09	fix(annotate): never leave an episode with zero canonical subtasks When the canonical vocabulary is enabled and the VLM produces spans that don't overlap any canonical label, the previous Jaccard-floor (0.5) dropped them and the episode came out with no subtasks at all — invisible to the downstream policy. Observed on ``pepijn223/super_poulain_vocab``: some episodes had empty subtask columns because every VLM-emitted phrase scored below 0.5 against the discovered vocabulary. Two-pass canonicalisation: - First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to let mild paraphrases through) and drops everything below. - If that first pass leaves the episode with zero subtasks, fall back to a second pass that always snaps each VLM span to its nearest canonical label by Jaccard (no floor). The episode ends up with subtasks even when the vocabulary missed a phase — a slightly-wrong canonical label is still closer to the right motion than nothing at all. - Log loudly when the fallback fires so the operator can spot coverage gaps in ``meta/canonical_vocabulary.json``. - Log a per-episode count at INFO when some (but not all) spans were dropped so it's visible without spamming the run output. Promote the Jaccard floor + ignore-tokens to class constants so they're a single edit point. Add ``force=True`` parameter to ``_canonicalize_subtask`` for the no-floor fallback path. New test ``test_plan_module_snaps_when_all_off_vocab`` covers the fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is adjusted to keep at least one in-vocab span so the floor path can still fire and is exercised. All 12 vocabulary tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 12:44:03 +00:00
pepijn	54221ceea2	feat(annotate): let the VLM decide vocabulary size Hardcoding ``n_subtask_target=10`` and ``n_memory_target=6`` baked task complexity into the config — a simple pick-and-place needs ~6, a multi-step recipe needs ~20. The VLM already sees the clips, so let it pick the count itself from what's recurring across episodes. Drop both knobs from ``VocabularyConfig`` and the ``module_0_vocabulary`` prompt template. The prompt now says "decide the count yourself based on what you see — the smallest set that still covers every recurring phase" and adds an "each label must recur across the demos" rule so the VLM filters out one-off motions. Update the launcher script + docs to remove the old knobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:46:31 +00:00
pepijn	369ab17110	fix(annotate): update run_hf_job CLI args for renamed namespaces + phase 0 Three stale things in the launcher script: - ``--module_1/2/3.*`` no longer exist; review commit `fd18beb` renamed the CLI namespaces to ``--plan/interjections/vqa``. Forwarded all eight existing args to their new names. - ``--push_to_hub`` is now a bool; the destination repo lives at ``--dest_repo_id``. Split the single positional into both args. - ``openai`` was missing from the pip install list, which the prior review (claude bot, 2026-05-08) flagged; the default vlm backend is ``openai``, so the job would have ImportError'd. Added. Also expose the new phase 0 (canonical vocabulary discovery) knobs explicitly: ``--vocabulary.sample_episodes``, ``--n_subtask_target``, ``--n_memory_target``. Defaults are sane (3 / 10 / 6) but worth flagging in the example so the operator knows what they're running. Update the docstring + section comments to match the current phase layout (vocabulary → plan → interjections → vqa → writer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:43:06 +00:00
pepijn	86a7edc590	feat(annotate): phase 0 — derive canonical vocabulary from sample episodes The pipeline previously emitted near-unique subtask + memory phrasings per episode (free-form LLM rephrasing). On the downstream low-level policy that collapses the action expert's conditioning to noise: every episode pairs a different paraphrase with similar motions, so the expert learns a flat scene-prior that ignores the subtask string — then at inference the high-level head invents yet another paraphrase and the expert produces tiny "uncertain hover" chunks. Add a vocabulary-discovery phase (phase 0) that runs once per dataset: - watches the first ``vocabulary.sample_episodes`` (default 3) episode videos as one Qwen-VL prompt, - asks the VLM to derive ~``n_subtask_target`` canonical imperative subtask labels and ~``n_memory_target`` first-person past-tense memory milestones that recur across the demos, - persists them to ``meta/canonical_vocabulary.json`` (human- inspectable, hand-editable), and - wires the resulting ``Vocabulary`` into the ``plan`` module so every per-episode subtask + memory call is constrained to those exact strings (both as prompt-side instructions and post-VLM validation: paraphrases snap to the closest canonical entry via token-set overlap; below a 0.5 Jaccard floor the subtask is dropped rather than warped into something semantically wrong). Operator workflow: - first run discovers the vocabulary, writes the JSON, and runs the ``plan`` module against it, - subsequent runs reuse the on-disk file (``reuse_existing=True`` default) so hand-edits stick, - set ``--vocabulary.enabled=False`` to fall back to free-form generation (the original behaviour). The discovery prompt forbids gerunds / third-person / adverbs and caps the lists to the requested counts, matching the Hi-Robot / π0.6-MEM convention of small per-environment vocabularies. 
The ``plan`` module's subtask + memory prompts grow a conditional ``{vocabulary_block}`` slot rendered only when a vocabulary is present; without one the templates collapse to their previous free-form form. Tests: 11 new unit tests under tests/annotations/test_vocabulary.py cover the on-disk round-trip, discovery against the fixture dataset, ``reuse_existing`` short-circuit, paraphrase canonicalisation, off- vocab subtask dropping, and the no-vocabulary pass-through path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:40:05 +00:00
pepijn	77a16db529	fix(smolvla2): make HighLevelSubtaskFwd actually fire at low hz + quiet startup log Two runtime fixes that surfaced from on-robot testing. (1) HighLevelSubtaskFwd was double-gated: HzTrigger fires every period (e.g. every 5s at --high_level_hz=0.2) AND the step requires the action queue to be empty. The queue-empty window is brief (~tens of ms between drain and refill) and almost never coincides with the low-hz timer, so HL effectively never fired and the subtask shown in the runtime panel stayed on the dataset's frame-0 annotation. Add HzTrigger.rearm() and have HighLevelSubtaskFwd call it when skipping due to queue-non-empty — the trigger stays armed and tries again on the next tick instead of waiting another full period. LowLevelForward keeps the original "skip" semantics because chunk_hz is meant as a true upper bound on chunk-generation rate. (2) The "robot state at startup" warning in _build_robot_observation_provider was meant to fire once but wasn't gated by _resize_logged like the sibling "camera ... live=AxB" warning. Result: it spammed every observation tick (~1-2s). Gate it on first_call (snapshot of _resize_logged["done"]) so both logs fire once at session start. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:04:12 +00:00
pepijn	ca1b951e7b	feat(pi05): expose lm_head_lr_scale for stronger text-CE gradient With knowledge_insulation=True the LM head only receives gradients on text-CE samples (e.g. ~45% of the mix for subtask_mem.yaml). Under aggressive cosine LR decay this is enough for the head's first-token distribution to drift back toward PaliGemma's pretrained <loc> detection prior — teacher-forced argmax stays high while autoregressive generation collapses to <locDDDD> tokens. Add `lm_head_lr_scale` (default 1.0, no behavior change) on PI05Config. When != 1.0, PI05Policy.get_optim_params splits the policy into two param groups: the PaliGemma lm_head projection plus its tied embed_tokens at lr * lm_head_lr_scale, and the rest at lr. The cosine scheduler multiplies both groups by the same lambda each step, so the ratio is preserved across decay. Recommended starting point for pi052 + subtask_mem.yaml runs: 5.0, combined with a higher scheduler_decay_lr floor (e.g. 5e-6 instead of 1e-6) so the head doesn't get starved in the second half of training. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 09:56:46 +00:00
pepijn	9d30d91021	fix(pi052,smolvla2): unblock text generation when LM head drifted to <loc> PaliGemma's pretraining puts heavy first-token mass on its <loc0000>.. <loc1023> ids at any "Assistant:" continuation. Our pi052 fine-tunes with knowledge_insulation=True and a small text-CE budget (~45% of samples) drift back toward that prior on long runs at low LR — teacher- forced argmax stays at 100% (CE only measures next-token given correct prefix) while autoregressive first-token selection collapses onto <loc>. On the running poulain11 checkpoint at step 8000 this manifests as a stream of <locDDDD> tokens for every subtask call — confirmed locally against the saved checkpoint on a dataset frame. Add a `suppress_loc_tokens` knob to `PI052Policy.select_message` that masks ids [256000, 257024) to -inf before sampling, and pass it from the three text-only inference steps (HighLevelSubtaskFwd, MemoryUpdateFwd, UserInterjectionFwd). VQA steps keep the default False so spatial answers can still emit locs. Verified end-to-end: suppressed → "the robot arm moves the blue block to the green basket". Also fix `_msgs_for_memory`: it was emitting the older `User: ${task}\nPlan:..\nMemory:..` / `Assistant: ${subtask}` template, which no longer matches the `memory_update` recipe layout (`User: ${task}` / `Assistant: Previous memory: ..` / `User: Completed subtask: ..`). The new prompt mirrors the training recipe; `HighLevelSubtaskFwd` stashes the just-completed subtask in `state['prior_subtask']` so the memory prompt can render `Completed subtask: ..` for `MemoryUpdateFwd`. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 09:50:14 +00:00
pepijn	e050d0fe0a	fix(recipes): use active_at for memory_update, rebalance subtask_mem memory_update was bound to `emitted_at(t, style=memory)`, which requires the frame's exact timestamp to match a memory annotation. Memory rows are placed at subtask-boundary timestamps and at 30 fps that's ~1% of frames, so 99% of memory_update draws couldn't render and silently fell through to _fallback_low_level_render — injecting task-conditioned low-level training on ~30% of samples (subtask_mem.yaml). Switch to `active_at`. At inference `MemoryUpdateFwd` is triggered on `subtask_change` events, but the model only needs to learn the stateless mapping (prior_memory, completed_subtask) -> current_memory. active_at supervises this mapping on every frame inside a subtask interval, against varied observations; the trigger lives outside the model. Net effect: memory_update renders on ~87% of frames, the fallback leak drops from ~30% to ~4%, and memory CE gets a meaningful (not 0.3%) training share. subtask_mem.yaml: rebalance to 0.30 / 0.55 / 0.15 so memory CE is ~13% effective and the freed weight goes to low_level_execution. subtask_mem_vqa_speech.yaml: keep weights (memory_update=0.10 was already balanced against the other text-CE branches). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 14:53:13 +00:00
pepijn	2ca030fa28	fix(pi052): build processors from current config When fine-tuning from pi05_base, reuse only the pretrained weights so pi052 still generates recipe text labels and FAST action labels. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 13:54:29 +00:00
pepijn	36f828221c	fix(pi05): preserve pretrained paligemma lm head Keep the PaliGemma LM head in float32 and initialize it from pretrained weights or token embeddings when loading pi05 checkpoints. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 13:25:24 +00:00
Pepijn	d41d874581	fix(pi052): debug parity harness truncates prompt instead of masking The parity check in debug_text_predictions was producing false ✗ DIVERGED reports. Root cause: I built the "inference" batch by zero-masking the attention past the supervised span, but kept the full 512-token padded sequence. select_message reads the prompt-end hidden state via ``vlm_out[:, -1:]`` — the LAST position of the prefix — which in a padded batch is a padding-token hidden state, not the last prompt token. PaliGemma's prior on those padded positions reliably argmaxes to <loc0879>, falsely flagging a training/inference mismatch. Fix: truncate both tokens AND mask to length == first_sup before calling select_message, mirroring what the real runtime does (``tokenizer(prompt)`` returns un-padded ids). Now the parity check compares like-with-like. The actual training argmax in the dump was sensible English ("' move the blue cube into the green bin'" at acc=6/9) — the head is learning correctly. The "<loc>" salad was purely the harness reading from the wrong position. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 15:09:36 +02:00
Pepijn	efa05f0ada	fix(train): unwrap DDP policy in debug_text_predictions hook At training time the policy is wrapped by Accelerator/DDP into a .module attribute and custom methods are NOT proxied through the wrapper, so ``hasattr(policy, "debug_text_predictions")`` was False and the periodic dump was silently no-op'ing. Walk through .module indirection to reach the raw PI052Policy that defines the method. Also surface why the dump didn't fire (no method / empty supervised positions / generation error) so users can see what's blocking it instead of staring at silence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 13:41:20 +02:00
Pepijn	e98b6f726b	feat(train): debug dump runs inference too, with parity check Extends the periodic LM-head dump (LEROBOT_DEBUG_PREDS_EVERY) to ALSO run select_message autoregressively on the same prompt prefix and show: prompt : '<bos>User: ... Assistant: ' target (ground truth) : ' close the gripper ...' training argmax (teacher-fed) : ' close the gri lift ...' acc=12/15=80% inference (autoregressive) : ' close the gripper around ...' first-token parity : train=3387 (' close') vs infer=3387 (' close') ✓ MATCH The first-token parity check is decisive: training-side argmax at the prompt-end position and inference's first generated token both compute ``argmax(lm_head(h_last_prompt))`` on identical context, so they MUST match. Any divergence signals a training↔inference bug (mask, dtype, KI routing, embedding scale, etc.). Subsequent tokens can diverge because training uses teacher forcing while inference free-runs. debug_text_predictions now also returns an ``inference`` list keyed by sample, each entry carrying ``first_sup_pos`` and ``decoded``. Limited to 24 new tokens per sample to keep the dump fast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:27:32 +02:00
Pepijn	f7747d02a9	feat(train): periodic LM-head prediction dump for live debugging Adds an opt-in diagnostic that, every N training steps, dumps 5 batch samples plus the LM head's argmax prediction at every supervised position alongside the label and a ✓/✗ marker — the cheapest signal for "is text training actually learning what we expect, or collapsing to a fixed token". Refills the recipe-sample dump budget on the same cadence so the raw input shapes are also re-dumped. Opt in via env var: LEROBOT_DEBUG_PREDS_EVERY=1000 lerobot-train ... PI052 implements ``debug_text_predictions`` (mirrors the text-loss forward but returns argmax instead of CE); other policies are silently skipped. The dump runs in eval() mode under no_grad, slicing the current batch to N samples — no extra data fetch, no train-state mutation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:23:05 +02:00
pepijn	86ecd4bc2e	add subtask memory training recipe Add a recipe that blends subtask prediction, low-level execution, and memory update supervision. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 09:56:10 +00:00
pepijn	28b86449a2	fix(pi05): cast attention masks to model dtype Ensure attention masks follow the backbone dtype during bf16 inference to avoid mixed dtype failures. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 09:52:46 +00:00
Pepijn	5bb2da4da6	fix(pi052): VQA target format = "label <loc><loc>" not "<loc><loc> label" The trained model collapsed to spewing 40+ <loc> tokens for every prompt — subtask, memory, anything — because VQA targets were supervised to start with <loc>. With ~25% of all text samples beginning with a <loc> token, the LM head learned "Assistant: → <loc>" as a strong attractor; once one loc is emitted, autoregression chains the rest. Flip the format so every text target — subtask, memory, speech, AND VQA — starts with a regular word. The model still learns the <loc> vocabulary for the spatial portion of the answer, but loc can no longer be the first generation step out of a clean prompt. Examples: point : "green box <loc0162><loc0759>" bbox : "cube <loc0082>…<loc0409>" multi : "blue <locs> ; yellow <locs>" The runtime parser (parse_loc_answer) strips loc tokens and uses the remainder as label, so it's order-tolerant and works under either format. Old loc-first checkpoints still parse cleanly at inference; new training will use label-first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:56:48 +02:00
Pepijn	f7b989ad97	fix(pi052): read backbone dtype from q_proj, not first parameter select_message's bf16 cast used next(paligemma.parameters()).dtype, which lands on a fp32-kept param (norm / embedding) under to_bfloat16_for_selected_params. Mask stayed fp32 while q/k/v were bf16 → SDPA still raised "invalid dtype for bias". Read the dtype from layers[0].self_attn.q_proj.weight instead — q_proj is always cast with the rest, so its dtype matches what SDPA sees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:46:08 +02:00
Pepijn	3b4376aa33	fix(pi052): cast attention bias to model dtype for bf16 inference `_prepare_attention_masks_4d` always returns fp32 (the 0.0 / -inf literals); with bf16 weights, HF PaliGemma's SDPA path raises "invalid dtype for bias - should match query's dtype" and select_message returns empty every step. Cast in both attention sites: `_compute_layer_ki` (training, when both experts run) and `select_message` (inference, VLM-only branch). Bf16 training + bf16 inference now run end to end with no dtype mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:42:26 +02:00
Pepijn	a0233f53f4	feat(annotate): default VLM to Qwen3.6-35B-A3B-FP8 Match the production target used in examples/annotations/run_hf_job.py. Per Scale Labs' dense-captioning ablations, model capacity dominates prompt-engineering gains; defaulting to the larger model avoids shipping a worst-tier configuration out of the box. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 11:46:59 +02:00
Pepijn	34269a5d78	fix(pi052): register PaliGemma <loc> tokens so they tokenize as single ids THE bug behind the <loc>-salad. PaliGemma's vocab reserves ids [256000, 257023] for <locDDDD> detection / pointing tokens, but the stock AutoTokenizer does NOT match them on raw text — it BPE-splits <loc0162> into SEVEN pieces (<, loc, 0, 1, 6, 2, >). So a VQA target like "<loc0162><loc0759> green box<eos>" tokenized to 16 pieces, not 5, and training the LM head supervised those generic BPE pieces instead of one detection-vocab id. The piece logits got pumped up across ~25% of supervised positions; at inference they dominated every turn — even subtask prompts produced <loc>-salad followed by the actual answer. Register the 1024 <locDDDD> tokens via tokenizer.add_tokens once on load, in every path the policy uses: PI052TextTokenizerStep (training encode), _build_text_batch_pi052 (runtime encode), and select_message's default tokenizer (runtime decode). Verified empirically with the real PaliGemma tokenizer: VQA target now tokenizes to 5 ids matching the loc-vocab range (256162, 256759, ...) with correct offset_mapping. This unlocks PaliGemma's actual detection prior; <loc>-salad cannot recur because each <locDDDD> is a single class on the LM head, not a character sequence the head accidentally learns to extend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 11:41:41 +02:00
Pepijn	75507491bf	fix(pi052): VQA <loc> conversion treats coords as 0-1000 normalized Confirmed empirically on the published dataset: VQA bbox/keypoint coordinates are Qwen2.5-VL's 0–1000 normalized grounding output, NOT pixels. Scanning 8207 samples showed x and y both spanning 0..1000 with ~30% of values exceeding the camera's pixel dimensions (which is impossible if they were pixels). _vqa_answer_to_loc was dividing by the observation image's H/W, so e.g. point [742, 158] on a 640x480 wrist cam clamped x to <loc1023> (the far-right edge) instead of mapping to <loc0760> (~74% across). Fix: divide by 1000 — the actual Qwen scale. The conversion is now camera-resolution-independent, so _camera_image_shapes and the image_shapes plumbing through __call__ / _encode_messages / _messages_vqa_to_loc are dropped. Tests updated to the new signature and the 0–1000 round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:21:28 +02:00
Pepijn	88519cb14c	fix(pi052): quantile-normalize actions before FAST tokenizer fit base.fit() rejected the data with "Vocab size 1024 is too small for the range of tokens 9339": the FAST tokenizer was fit on raw motor-unit actions, whose DCT-token range vastly exceeds the 1024 codebook. Two problems, one fix. (1) Raw actions blow up the token range. (2) At training time ActionTokenizerProcessorStep runs after the QUANTILES NormalizerProcessorStep, so it encodes normalized actions — fitting on raw actions mismatches that space. Replicate QUANTILES normalization (per-dim [q01,q99] -> [-1,1], clipped) before base.fit() so the fit and the training-time encode see the same distribution and the token range fits the codebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:02:20 +02:00
Pepijn	bc0c993b25	fix(pi052): FAST tokenizer fit read actions from column, not ds[i] fit_fast_tokenizer collected action chunks via ds[i]["action"], which builds a full training item — delta-timestamp expansion, video decode, image transforms. A single video-decode failure threw, was swallowed at debug level, and silently starved the fit of every chunk → "FAST fit collected zero action chunks", falling back to the universal tokenizer. Read the ``action`` column straight from the HF dataset instead: it carries no video, so it is immune to decode errors and far faster. Also fail fast with a clear message when the dataset has no ``action`` feature or all episodes are shorter than chunk_size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:51:53 +02:00
Pepijn	ddf4bc2063	fix(pi052): knowledge insulation crashed on wrong _gated_residual import _compute_layer_ki called modeling_gemma._gated_residual, but that adaRMSNorm gated-residual helper is a lerobot helper in pi_gemma, not part of HF transformers — so enabling knowledge_insulation crashed with AttributeError on the first training step. Import _gated_residual from pi_gemma, matching pi05's own layer code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:48:02 +02:00
Pepijn	b7317b6c29	test(pi052): round-trip coverage for VQA <loc> conversion Pins JSON pixel coords -> PaliGemma <loc> -> runtime parse back: the conversion preserves coordinate order (JSON x-first, <loc> y-first) and per-axis normalization, losing only <loc>-grid quantization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:24:24 +02:00
Pepijn	c026aed8f8	feat(pi052): train VQA spatial answers in PaliGemma <loc> format Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate JSON, which fights PaliGemma's detection prior and leaks <loc>-token salad at inference. Convert them to PaliGemma's native <locNNNN> vocabulary instead so the LM head reuses that prior. Training side (text_processor_pi052.py): a target turn whose content parses as a bbox/keypoint answer is rewritten to <loc> text, using the camera frame's native (H, W) from the observation and the preceding image block. Non-spatial answers, subtask/memory targets and SmolVLA2 keep their JSON form — the dataset stays backbone-agnostic. Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects <loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized [0,1] coords with a normalized flag; draw_vqa_overlay denormalizes against the chosen camera frame's pixel size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 20:23:46 +02:00
pepijn	e425dfd624	fix(processor): fallback to task message when recipe misses Keep action-only samples trainable by rendering the task as a low-level user message when no recipe branch matches. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 15:32:09 +00:00
Pepijn	15f79b5e5e	fix(pi052): supervise an EOS token at the end of each text target PI052TextTokenizerStep masked text_labels over the assistant turn's content only — the trailing newline was excluded and no EOS token was ever a supervised label. So the LM head was never given a stop signal: at inference select_message decoded to max_new_tokens, producing the runaway subtask paragraphs and the "}"}"}-style VQA tails. _format_messages now appends the tokenizer's EOS to each supervised target turn and extends that turn's span to cover it, so the EOS lands in text_labels. _shifted_ce then trains "<last content token> -> EOS" and the model learns to terminate; select_message stops on it. Inference callers (the runtime's _build_text_batch_pi052) pass no target_indices / eos_token, so no EOS is baked into the prompt — the model generates it. Verified end-to-end with the PaliGemma tokenizer: the supervised span is `<content><eos>` and the trailing newline stays unsupervised. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:22:22 +02:00
pepijn	2ea0da2d9f	fix(annotate): tag uploaded dataset revision Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 12:44:35 +00:00
Pepijn	725ac95b0d	feat(runtime): make the interactive runtime drive PI052 too The runtime's text path was hard-wired to SmolVLA2: _build_text_batch read policy.config.vlm_model_name (which PI052Config doesn't have) and built a SmolVLM2 chat-template prompt. PI052/PaliGemma is not chat-pretrained and trains on a flat `User: ... \nAssistant: ...` prompt, so the runtime crashed or fed an out-of-distribution prefix. - _build_text_batch now dispatches on policy.config.type: smolvla2 -> chat template (renamed _build_text_batch_chat); pi052 -> flat role-prefixed text via PI052TextTokenizerStep's own _format_messages / _strip_blocks / _flatten_say_tool_calls, so the inference prefix matches PI052 training exactly. - Add a lerobot-pi052-runtime entry point (alias of the same main; the policy type is read from the checkpoint) so the command name isn't misleading. argparse prog now defaults to the invoked command name. PI052's select_message / predict_action_chunk already work with the runtime; this was the one SmolVLA2-only coupling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:28:55 +02:00
Pepijn	7b64e5498d	revert(annotate): move memory + speech prompts to base PR (#3471) The first-person memory narrative, task-rephrasing and initial-speech prompt tweaks belong in the annotation pipeline itself. Applied to feat/language-annotation-pipeline (#3471); reverting them here to the merge-base so they drop out of this PR's diff. general_vqa.py keeps its docstring fix since it references a recipe this PR introduces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:17:52 +02:00
Pepijn	134a707c7a	feat(annotate): first-person memory narrative + shorter speech prompts - module_1_memory: rewrite as an explicit first-person, past-tense narrative ("I picked up...", "I opened...") matching the MEM (Torne 2026) running-memory style, instead of "one or two short sentences" with no person/tense guidance. - module_1_task_rephrasings: bias rephrasings toward short imperative. - module_2_initial_speech: prefer very short robot acknowledgements. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:17:30 +02:00
Pepijn	182f10184f	revert(annotate): move pipeline changes to base PR (#3471) The deterministic-plan rewrite, single-frame VQA (K 3->1), dataset version tagging, telegraphic-subtask prompt and shorter interjection prompt belong in the annotation pipeline itself, not in the SmolVLA training PR. They have been applied to feat/language-annotation-pipeline (#3471). Reverting these six files here to the merge-base so they drop out of this PR's diff; #3491 will inherit the canonical versions when it next rebases on its base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:07:23 +02:00
Pepijn	ce47075d6b	feat(annotate): deterministic plan, single-frame VQA, dataset tagging Port the steerable-pipeline refinements developed on feat/smolvla-on-steerable back into the annotation pipeline itself: - module_1_subtasks: imperative verb-first telegraphic labels with a consistent-object-noun rule and good/bad examples (no hard word cap). - _generate_plan: drop the VLM round-trip; the plan is now a deterministic numbered list of still-todo subtasks, re-emitted at every subtask boundary so it shrinks as work progresses. Removes module_1_plan.txt. - VqaConfig.K 3 -> 1: a VQA pair anchors exactly its emission frame, no stale-label temporal smear. - lerobot-annotate: tag the pushed dataset with its codebase_version so LeRobotDataset can resolve a revision and load it. - module_2_interjection: shorter, more natural mid-task cues. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:06:15 +02:00
Pepijn	26013da699	feat(annotations): enforce imperative verb-first subtask phrasing Rewrite module_1_subtasks prompt to produce short imperative commands ("pick up the orange") instead of third-person narration ("the robot arm moves to the orange"). Drops the verbose "how, not what" rule and adds a good/bad few-shot table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:53:20 +02:00
pepijn	bb31988915	fix(pi052): pass 4d masks to prefix-only forwards Convert PI052 prefix-only attention masks before calling PaliGemma so text-only batches and generation use the same mask shape as fused training. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 21:07:13 +00:00
pepijn	2629175d2d	fix(pi05): use fused AdamW by default Route full PI05/PI052 fine-tuning through PyTorch's fused AdamW path to avoid the single-tensor Adam denominator allocation near GPU memory limits. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 19:23:17 +00:00
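
Several of the pi052 commits above (75507491bf, 34269a5d78, 5bb2da4da6) describe the same pipeline: VQA coordinates on a 0-1000 normalized scale are mapped to PaliGemma `<locDDDD>` tokens, emitted y-first, and placed after the label. The following is a minimal sketch of that mapping, not the repository's `_vqa_answer_to_loc`. The function names and the exact rounding rule (`round(frac * 1023)`) are assumptions chosen to reproduce the `<loc0162><loc0759>` example quoted in the log; the real code may round differently.

```python
LOC_BINS = 1024        # <loc0000> .. <loc1023>, per the log
LOC_ID_BASE = 256000   # first reserved vocab id, per the log


def coord_to_bin(value: float, scale: float = 1000.0) -> int:
    """Map a 0..scale coordinate to a <loc> bin index (assumed rounding rule)."""
    frac = min(max(value / scale, 0.0), 1.0)  # clamp out-of-range inputs
    return round(frac * (LOC_BINS - 1))


def loc_token(bin_index: int) -> str:
    """Render a bin index as a zero-padded <locDDDD> token string."""
    return f"<loc{bin_index:04d}>"


def point_to_loc(x: float, y: float) -> str:
    """JSON points are x-first; the <loc> text is y-first."""
    return loc_token(coord_to_bin(y)) + loc_token(coord_to_bin(x))


def vqa_target(label: str, x: float, y: float) -> str:
    """Label-first target, so the first generated token is a regular word."""
    return f"{label} {point_to_loc(x, y)}"


def loc_token_id(bin_index: int) -> int:
    """Vocab id of a <loc> token once it tokenizes as a single id."""
    return LOC_ID_BASE + bin_index


if __name__ == "__main__":
    print(vqa_target("green box", 742, 158))
```

Because the scale is fixed at 1000, the conversion does not depend on the camera resolution, which is the point of the 75507491bf fix.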


1699 Commits