lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-13 04:51:59 +00:00

Author	SHA1	Message	Date
Pepijn	0f5f0e4091	refactor(recipes): rename recipes, drop pi05_hirobot - hirobot.yaml -> subtasks_vqa.yaml - hirobot_memory.yaml -> subtask_mem_vqa_speech.yaml - pi05_hirobot.yaml -> deleted (stale: uses plan, top-camera names; superseded by the two recipes above) - smolvla2_hirobot.yaml -> deleted (was untracked stale junk) Updated the smolvla2 / pi052 `recipe_path` config defaults, all docstring / comment references, the annotation-pipeline + recipe docs, and the three tests that loaded pi05_hirobot.yaml (repointed to the renamed recipes; the low-level-branch and pipeline-render assertions now accept a flow-only `low_level` stream as valid supervision, since the new recipes' low_level_execution has no text-CE target). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:15 +02:00
Pepijn	426d48dbbf	fix(pi052): port the smolvla2 text-head fixes to pi052 pi052 had the same text-CE collapse bug smolvla2 had — PaliGemma's embed_prefix flags the language block att=0, so make_att_2d_masks makes it fully bidirectional and the text cross-entropy degenerates into a copy task. Ported the three model-specific fixes: - _mark_target_span_causal: set att=1 on supervised target language positions so the text-CE is genuine causal next-token prediction. Applied in both _compute_all_losses_fused and _compute_text_and_fast_loss. - flow_loss_weight 10.0 -> 5.0: the paper's a=10 swamps the LM head once the flow-only low_level recipe fires often (matches SmolVLA2Config). - _flatten_say_tool_calls in the text tokenizer: serialize `say` tool calls into a <say>...</say> marker so the spoken reply is tokenized and supervised (PaliGemma's flat prompt has no structured calls, so they were dropped entirely). select_message needed no change: pi052's prefix is [images, language] with no trailing state token, so it already decodes from the last language token. Regression tests mirror the smolvla2 attention-masking + tool-call suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:42:19 +02:00
Pepijn	fbcb9225f5	feat: oversample sparse VQA annotations (recipe consumption + weighted sampler) VQA annotations are sparse, so VQA was badly underrepresented in training: its effective share was weight x density, and blend draws that picked an ask_vqa* sub-recipe for a non-VQA frame were wasted entirely. Two pieces: 1. Recipe-side consumption (language_render.py): render_sample now routes any frame that carries a VQA annotation to a matching ask_vqa* sub-recipe, regardless of the weighted blend draw. No VQA annotation is wasted and no draw lands on a non-renderable VQA recipe — VQA's recipe-side share now equals the VQA-annotation density. 2. Dataset-side oversampling (WeightedEpisodeAwareSampler + vqa_target_fraction): a new weighted, episode-aware sampler draws frames with replacement by per-frame weight. When TrainPipelineConfig.vqa_target_fraction is set, the train script scans language_events, weights VQA frames so they make up ~that fraction of the training stream, and uses the weighted sampler. This is what actually lets VQA exceed its natural density. Default None keeps uniform episode-aware sampling unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:30:00 +02:00
Pepijn	b319ccf688	fix(smolvla2): only prompt for a camera when a VQA overlay is drawn The VLM already sees every camera, so the operator never needs to name one to ask a question. Move the camera prompt to after generation and only fire it when the answer actually carries a bounding box / point (whose pixel coordinates are camera-specific and need a target frame). Non-spatial answers (count / attribute / spatial / plain text) now skip the prompt entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:50:19 +02:00
Pepijn	3174e14bc0	fix(smolvla2): feed all cameras to VQA generation, not just the chosen one handle_vqa_query filtered the observation down to the single chosen camera before calling the VLM. But training feeds every camera: the ask_vqa_* recipes' image blocks are stripped before tokenization and the frames reach the model via OBS_IMAGES_*, where embed_prefix consumes all config.image_features regardless of the per-camera recipe tag. Filtering to one camera changed the image-token count in the prefix (the dropped camera zero-padded with mask=0) — a prefix shape the model never saw at training. Now the full observation is passed to select_message; the chosen camera is used only to pick which frame the bbox/point overlay is drawn on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:38 +02:00
Pepijn	dc530e10fe	feat(smolvla2): VQA example prompts in the panel; drop quotes from hints Command arguments never needed quotes (`_strip_quotes` only strips a matching pair if present) — `/question point to the yellow cube` works. The hints wrongly implied `""` were required; all hints/help now show `/action <task>` / `/question <text>`. Also adds a reference line to the state panel showing the two overlay-producing VQA prompt shapes: /question point to the yellow cube -> point overlay /question detect the blue cube -> bounding-box overlay plus the same examples in /help. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:42:32 +02:00
Pepijn	e7c5613a39	refactor(smolvla2): command-driven runtime — no startup prompts Replace the startup mode prompt + task picker with a single command-driven prompt. The runtime now comes up immediately at the command line in `paused` mode (robot idle) and the operator drives it: /action "task" run the robot on a task (bare = resume, number = timed burst) /pause stop the action loop — robot holds position /question "..." pause and answer one VQA question (camera prompt + overlay) /help / stop - Removed _select_mode_interactively / _select_task_interactively / _dataset_task_strings (the interactive pickers). - mode value renamed "question" -> "paused"; --mode choices are now action\|paused (default paused). - /question takes the question inline and runs it via _handle_slash_command (pauses first, so the policy isn't used concurrently). - The ENTER-to-start gate only fires when starting in action mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:37:51 +02:00
Pepijn	516ffc7687	feat(smolvla2): --mode flag, skip task picker with --task, timed /action Lets the operator skip the interactive startup entirely and go straight to the command line: - New --mode {action,question} arg; when given, the startup mode prompt is skipped. - When --task is passed explicitly on the CLI, the startup task picker is skipped (the dataset-bootstrap task still shows the picker so you can override it). Also adds a timed action burst: /action <seconds> runs the robot for N seconds, then the autonomous loop auto-reverts to question mode and clears the action queue. Plain /action stays unlimited. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:26:12 +02:00
Pepijn	7a68bf13d9	feat(recipes): add hirobot_memory — hirobot + memory + spoken tool-call replies New recipe alongside hirobot.yaml (kept as the lean baseline). Superset that adds two text-supervised sub-recipes: - memory_update: compress progress into a memory note. - user_interjection_response: reply to a user interjection with a `say` tool call only (no plan/subtask text). The SmolVLA2 chat tokenizer flattens the call to a `<say>...</say>` marker the runtime parses back. Plan is intentionally omitted; memory is the only persistent high-level state. Weights: low_level 0.40, subtask 0.25, memory 0.10, interjection 0.10, vqa 0.075 x2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:21:41 +02:00
Pepijn	15229468d0	feat(smolvla2): startup mode prompt; rename /vlm mode to /question Add a mode prompt at startup, shown before the task picker, so the operator chooses action (run the robot) vs question (VQA only) up front instead of having to discover /vlm mid-run. Also rename the VQA mode from "vlm" to the clearer "question": - state["mode"] value is now "action" \| "question" - the command is /question (/vlm and /vqa kept as aliases) - panels, hints and help text updated to match handle_vqa_query now reports via both push_log and direct stdout, so VQA answers / overlay paths are visible in autonomous question mode where the panel redraw is suspended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:17:03 +02:00
Pepijn	a9cea3e8dd	fix(smolvla2): make the autonomous REPL usable for slash commands / VQA The autonomous panel redraw cleared the screen every 0.5s, so the "> " prompt and the one-shot command hint vanished — the operator could not see what to type or what they were typing, making /vlm unreachable. - Suspend the timer redraw entirely while in /vlm mode (the action loop is paused, nothing changes in the background) so the VQA question and camera prompt stay on a stable screen. - Re-print the "> " prompt after each redraw so it is always visible. - Show an always-on command hint in the panel (/vlm, /help, /action) instead of relying on the startup line that scrolls away. - Redraw immediately after a slash command so the mode flip is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:10:13 +02:00
Pepijn	89d4846590	fix(smolvla2): always show the startup task picker on a TTY The picker was skipped whenever a task was already resolved — which is always the case with --dataset.repo_id, since the dataset's canonical task is auto-filled. The operator never got to choose. Now the picker always runs on an interactive terminal: the resolved task is shown as "(current)" and selected by an empty Enter, so the dataset-canonical default still works while letting the operator pick another task or type a custom one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:04:53 +02:00
Pepijn	26cb38a7d0	feat(smolvla2): startup task picker, /vlm mode toggle, interactive VQA overlay Three additions to the SmolVLA2 interactive runtime: 1. Startup task picker — when no --task is given, the runtime lists the dataset's task strings as a numbered menu (plus a custom-task option) instead of silently waiting for the first stdin line. 2. Mode toggle — /action and /vlm slash commands flip a persistent run mode. /vlm pauses the whole action loop (HighLevelSubtaskFwd, LowLevelForward and DispatchAction gate on state["mode"]) and clears the action queue so the robot holds position; /action resumes it. The mode is shown in the state panel. 3. Interactive VQA — in /vlm mode a typed line is a VQA question. The new inference/vqa.py module asks which camera to ground on, runs the VLM on that single camera, and when the answer is a bbox/keypoint it draws the overlay, saves a PNG to ./vqa_overlays/ and auto-opens it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:20:57 +02:00
Pepijn	bfb8cfb432	fix(smolvla2): flatten say tool_calls into <say> marker before tokenizing The chat tokenizer passed assistant `tool_calls` straight to `apply_chat_template`, which renders them as a structured JSON `<tool_call>` block — so the LM head was trained to emit JSON. But the inference parser `_split_plan_and_say` looks for a `<say>...</say>` marker, which the model never saw in training, so the `say` tool never fired at inference. `_flatten_say_tool_calls` is the missing training-time serializer (the one `_split_plan_and_say`'s docstring already assumed existed): it rewrites a `say` tool call into a `<say>...</say>` marker inside the content text before the chat template runs, so the template only tokenizes plain text and the supervised target span trains the model to emit exactly the marker the runtime parses back (Pi 0.5-style flat tool-call serialization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:47:31 +02:00
Pepijn	5e3b9ba82c	tune(smolvla2): override optimizer_lr to 2.5e-5 for pretrained-LM fine-tuning SmolVLA's 1e-4 is safe only because it freezes the language head. SmolVLA2 unfreezes lm_head + the last text layer and fine-tunes the pretrained SmolVLM2 language weights; 1e-4 is too aggressive there and destabilises generation into degenerate repetition. Match pi05's 2.5e-5 peak LR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:41:13 +02:00
Pepijn	083d3cd419	tune(smolvla2): soften flow:text loss split from 10:1 to 5:1 The Pi 0.5 α=10 split assumed text is a rare auxiliary task. With the flow-only `low_level` recipe (~40% of the blend) now rendering, the flow term fires often and at 10x weight dominates the shared VLM backbone, starving the text head into degenerate repetition decoding. A 5:1 split keeps actions primary while leaving the language head enough gradient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 16:00:08 +02:00
Pepijn	bf996c7938	fix(datasets): render flow-only low_level recipes instead of dropping them A recipe whose only supervision is the action-expert flow loss (e.g. `low_level_execution`: `user(${subtask})` with `stream: low_level` and no `target` turn) was rejected at render time by `_render_message_recipe` and `_validate_rendered`, both of which required at least one target turn. The result: every blend draw of the flow-only recipe rendered to `None`, `predict_actions` was never set, `run_flow` never fired, and the action expert received no flow loss — leaving it at random init. Both gates now also accept a `low_level`-stream turn as valid supervision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 13:20:39 +02:00
Pepijn	0d88eaf8eb	test(smolvla2): attention masking of the language target span Regression coverage for the text-CE collapse bug fixed in `3cd348ff`. Pure-function tests over ``_mark_target_span_causal`` / ``_locate_lang_range`` / ``make_att_2d_masks`` — no model load, fast. Pins: * the target span flips to att=1, prompt/images stay att=0; * target tokens attend causally among themselves (no peeking at future targets) — genuine next-token prediction; * targets still attend bidirectionally to images + the user prompt; * the action-expert (state) token still attends to every target; * a no-target subtask (low_level_execution user turn, labels all -100) leaves the mask bidirectional; * an explicit test documenting the bug: the raw embed_prefix mask lets the first target token see the last — the copy-task collapse. Skips cleanly when transformers isn't installed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 18:28:44 +02:00
Pepijn	3cd348ffe2	fix(smolvla2): causal mask on the text-CE target span (THE collapse bug) Root cause of every collapsed inference run. ``embed_prefix`` flags all language tokens ``att=0``; ``make_att_2d_masks`` turns that into a single fully BIDIRECTIONAL block. So during the text-loss forward, a supervised subtask token's hidden state attends to the very tokens it is trained to predict. The cross-entropy degenerates into a copy task — ``text_loss → ~3e-5`` not because the model learned to predict subtasks but because it can see the answer. At inference ``select_message`` decodes autoregressively (causally): each token must be predicted WITHOUT seeing it — a task the model was never actually trained on. Hence the universal collapse: a coherent first token or two ("grasp the yellow cube"), then a loop ("cover cover cover", "icatorsicators", "the the the"). Fix: ``_mark_target_span_causal`` sets ``att=1`` on the language positions that are supervised targets (``text_labels != -100``). With make_att_2d_masks's cumulative-block rule each target token then attends to images + the user prompt bidirectionally and to EARLIER target tokens only — genuine causal next-token prediction, matching select_message. Applied in both ``_compute_text_loss`` and ``_compute_fused_loss``. Per-sample correct: high_level_subtask targets become causal; low_level_execution subtasks (a user turn, labels all -100) stay bidirectional so the action expert reads them as bidirectional context. The action expert is otherwise unaffected — the suffix has a strictly higher cumsum and still attends to the whole prefix. Requires retraining: this changes the training objective. Existing checkpoints were all trained on the degenerate copy task and cannot generate text. Expect ``text_loss`` to settle MUCH higher than 3e-5 after this — that is correct; it is now a real prediction task. NOTE: pi052's text path (PaliGemma prefix-LM) has the same bidirectional-block structure and needs the analogous fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 18:24:44 +02:00
Pepijn	db03fc6dc4	fix(smolvla2): select_message must decode from the language position ``embed_prefix`` lays the prefix out as ``[images, lang, state]`` with the state token LAST. Training supervises the text head on the language positions (``_compute_text_loss`` / ``_compute_fused_loss`` slice ``prefix_out[lang_start:lang_end]`` and run lm_head there). But ``select_message`` started AR generation from the full prefix and read ``prefix_out[:, -1:]`` — the state token — to decode the first subtask token. The state token's hidden state exists for the action expert to read; the lm_head was never trained to produce subtask text from it. So inference decoded the high-level head from a position entirely outside the training distribution: the text head collapses (``the arm the arm``, ``grasp the surface population``, ``_333 absburg…``) no matter how cleanly ``text_loss`` converged. Fix: truncate the state token off the prefix before the AR loop, so ``prefix_out[:, -1:]`` is the last language token (right after the ``Assistant:`` generation prompt) — exactly where training supervised. Inference-only change — no retraining needed; existing checkpoints benefit immediately. The action path (``predict_action_chunk``) is untouched: state belongs in the action expert's prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 15:05:16 +02:00
Pepijn	56068d37ea	fix(smolvla2): default load_vlm_weights=True — don't train from scratch SmolVLAConfig defaults ``load_vlm_weights=False``. With that and no ``--policy.path``, ``SmolVLMWithExpert.__init__`` builds the VLM via ``SmolVLMForConditionalGeneration(config=...)`` — i.e. a fully random-initialised 500M backbone, including a random ``lm_head``. For plain SmolVLA that's a deliberate "pre-train the expert" mode. For SmolVLA2 it's a footgun: the high-level text head is the SmolVLM2 ``lm_head``. Training subtask prediction from a random language model can only memorise — which is exactly the repetition collapse seen on the real robot ("the arm the arm the arm …"). SmolVLA2 now defaults ``load_vlm_weights=True`` so every run fine-tunes the pretrained ``HuggingFaceTB/SmolVLM2-500M-Video-Instruct`` backbone (vision tower + language model + lm_head). The action expert still trains from scratch on the robot data (standard SmolVLA fine-tuning); start it from pretrained too by fine-tuning a full ``lerobot/smolvla_base`` checkpoint via ``--policy.path``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:44:00 +02:00
Pepijn	e727688052	annotate: telegraphic subtasks — ≤4 words, verb+object, consistent nouns Tighten the subtask prompt further per real-data feedback. The old ≤5-word cap still produced things like "release the yellow block into the green bin" (8 words, articles, destination, and "block" where the task said "cube"). New rules: * Hard cap ≤ 4 words, ideally 2-3. Form: VERB + (color) + OBJECT. * No articles, no destinations, no adverbs, no "robot/arm/gripper". * Must reuse the exact object nouns from the task — no block/cube, bin/box/container drift across the episode. * Concrete good/bad examples anchored on the cube task. Shorter, templated, consistent targets are far more robust for the autoregressive LM head — fewer tokens to drift on, fewer dominant n-grams to repetition-collapse into. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 14:14:42 +02:00
Pepijn	f1a0a663cc	fix(inference): gibberish detector catches long repetition collapse The ``_looks_like_gibberish`` low-unique-token check was gated on ``len(stripped) < 80``, so an LM head that loops an n-gram for the whole 256-token budget — "the arm the arm … the the the the" — sailed straight through (``gibberish:0`` in the panel) and the garbage subtask got accepted and fed to the action expert. Added a length-independent check: ``>= 8 tokens`` but unique-token count ``<= max(3, tokens // 10)`` ⇒ repetition collapse. Now the runtime rejects the looped output and keeps the previous (real) subtask instead of propagating nonsense. This is a guard, not a cure — the underlying issue is the LM head on the current checkpoint being undertrained / collapsed; re- annotate with the short prompts and train longer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:52:26 +02:00
Pepijn	6e64c20cf1	runtime: stop seeding plan/memory from the dataset (unused) The current recipe trains neither plan nor memory, and no inference step consumes them — ``_msgs_for_subtask`` renders the bare task and ``LowLevelForward`` conditions on the subtask. Bootstrapping ``current_plan`` / ``current_memory`` from the dataset's ``language_persistent`` annotations therefore only placed a stale, do-nothing plan in the status panel. Keep seeding ``current_subtask`` — it's a useful first-frame fallback for ``LowLevelForward`` before ``HighLevelSubtaskFwd`` produces its first subtask. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:47:33 +02:00
Pepijn	b29cccb37e	runtime: restore the subtask hierarchy — generated subtask drives actions Reverts the previous "condition actions on the task" shortcut. The action expert is conditioned on the SUBTASK again: * ``low_level_execution`` recipe back to ``user(${subtask})``. * ``LowLevelForward`` conditions on ``current_subtask`` (falls back to the task only on the first frame, before the high-level loop has produced a subtask). * ``HighLevelSubtaskFwd`` re-added to the runtime pipeline so the subtask is actually generated each high-level tick and written to ``current_subtask`` before ``LowLevelForward`` consumes it. * ``_msgs_for_subtask`` now renders just ``${task}`` (no ``Plan: ``/``Memory: `` lines) to match the current ``high_level_subtask`` recipe, whose user turn is the bare task. So the loop is: task → HighLevelSubtaskFwd (LM head) → subtask → LowLevelForward → action chunk conditioned on that subtask. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:43:04 +02:00
Pepijn	f161e27e96	recipe+runtime: condition the action expert on the task, not the subtask Real-robot runs shook and failed the task despite a low flow loss. Root cause: train/inference conditioning mismatch — not a flow-loss bug (``_compute_fused_loss``'s flow path is byte-identical to ``SmolVLAModel.forward``). At training, ``low_level_execution`` conditioned the action expert on ``${subtask}``, and every frame's subtask was the correct one for that frame. At inference the runtime has no high-level subtask generator (VQA-only pipeline), so ``current_subtask`` was frozen — the action expert got "move towards the blue cube" for the entire episode. Once the arm reached the cube, that (image, subtask) pair never occurred in training → OOD conditioning → incoherent flow output → shaking. Fix: ``low_level_execution`` now renders ``user(${task})``. The task is stable for the whole episode and always available, so the action expert's conditioning is identical at train and inference with no high-level loop required. ``LowLevelForward`` updated to build the same ``[user(task)]`` prompt. ``high_level_subtask`` still trains the text head to predict subtasks (kept for when a reliable subtask loop is reintroduced) — it's just no longer on the action expert's critical path. Requires re-training for the recipe change to take effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:40:15 +02:00
Pepijn	d5f293a1c9	recipe+runtime: VQA + subtask only — drop plan & memory Scope reduction while the core subtask + action loop is validated: Recipe (hirobot.yaml) * Removed ``plan_generation`` sub-recipe entirely. * Removed the memory tail from ``high_level_subtask`` (the ``new_memory`` binding + the second assistant turn). * ``high_level_subtask`` user turn is now just ``${task}`` — no ``Plan: …\nMemory: …`` context. * Weights rebalanced over the four remaining sub-recipes: high_level_subtask 0.40, low_level_execution 0.40, ask_vqa_top/wrist 0.10 each. Runtime (inference/runtime.py) * Pipeline trimmed to VQA + the action loop: AskVQAFwd → LowLevelForward → DispatchAction → DispatchToolCalls. * Dropped HighLevelSubtaskFwd / MemoryUpdateFwd / UserInterjectionFwd from the default pipeline. They remain importable from ``inference.steps`` for when plan/memory/subtask generation is brought back. The action expert conditions on the task string directly via LowLevelForward's ``current_subtask or task`` fallback. This commit lands on top of a rollback of the previous two commits (repetition_penalty / no_repeat_ngram_size knobs, and the deterministic plan-walker) — both were bandaids for the LM-head repetition collapse that the reduced-scope recipe sidesteps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 08:02:06 +02:00
Pepijn	95033733fc	deps: add sentencepiece to the pi extra (FAST action tokenizer) PI052 and PI0_FAST both load ``physical-intelligence/fast`` as their action tokenizer. That tokenizer's HF backend requires ``sentencepiece`` to instantiate (or ``tiktoken``); without it ``AutoProcessor.from_pretrained`` raises: ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece or tiktoken installed [...] It wasn't listed in pyproject so fresh installs missed it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:52:55 +02:00
Pepijn	c3503b774f	fix(debug): dumper now shows real stream + target flags The dumper was printing ``stream=None target=None`` for every message because it read those fields off the message dicts, but the recipe renderer keeps them in parallel arrays (``message_streams`` / ``target_message_indices`` in COMPLEMENTARY_DATA) so the chat template doesn't see unknown keys. Zip them back into the dump-time dicts so the printed metadata is accurate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:43:51 +02:00
Pepijn	99ebee4d16	annotate: tighter subtask + memory prompts (≤5 / ≤10 words) Both feed into the high-level prompt and the plan rendering, so keeping them short directly reduces the rendered ``${task}\nPlan: …\nMemory: …`` prefix the model has to chew through at inference. Subtasks * Hard cap: ≤ 5 words. Verb + object only, drop articles/adverbs. * Concrete good/bad examples to anchor the VLM. Memory * Hard cap: ≤ 10 words. Telegraphic noun→location fragments ("bowl in box, lid open"), no past-tense verbs, drop attributes that don't matter for downstream subtasks. * Allow empty string when no material change occurred — keeps the rendered memory line literally blank instead of forcing a no-op sentence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:28:09 +02:00
Pepijn	a8ca5128b8	fix(annotate): re-emit plan at every subtask boundary Previously only emitted a plan at t=0 and on interjections, so the active plan rendered into training carried "done" subtasks until the next interjection. With the new "plan = remaining subtasks" summariser this meant the plan was stale between boundaries. Emit a fresh plan row at every subtask start. ``active_at(t)`` then returns a plan that contains exactly the subtasks whose start ≥ the current span's start — completed subtasks fall off the plan the moment the next subtask begins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:26:49 +02:00
Pepijn	dd97c33814	refactor(annotate): plan = summary of still-todo subtasks, drop VLM call The plan was being generated by a separate VLM call (one per episode + one per interjection refresh) with a prompt that asked the model to "compress the subtasks into a compact hierarchical plan". In practice the plans came out longer than necessary and sometimes drifted from the actual subtask sequence the runtime would execute. Replaced ``_generate_plan`` with a deterministic numbered list of the upcoming subtasks. At a refresh time the list shrinks to subtasks whose start ≥ refresh_t — the plan describes what's left to do, so it gets shorter as work progresses. Saves the per-episode + per-interjection VLM round-trip in the annotation pipeline and keeps train-time plan text bit-aligned with the subtask annotations the rest of Module 1 emits. Removed the now-unused ``prompts/module_1_plan.txt``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:55:02 +02:00
Pepijn	fa45ba631b	fix(policies,recipe): register PI052Config + allow flow-only sub-recipes Two regressions surfaced by the first training run: 1. ``--policy.type=pi052`` failed with ``invalid choice``. PI052Config wasn't imported in ``policies/__init__.py``, so its ``@register_subclass("pi052")`` decorator never ran and draccus didn't see it as a valid policy type. Mirror PI05Config / SmolVLA2Config in the top-level imports + __all__. 2. ``low_level_execution`` (user-only ``${subtask}`` recipe used for π0.5-style flow conditioning) tripped ``ValueError: Message recipes must contain at least one target turn.`` The validator was too strict — a recipe with only a ``stream: low_level`` turn still drives meaningful supervision (flow MSE on the action expert via ``predict_actions=True``). Allow either ``target: true`` OR ``stream: low_level`` to satisfy the "supervises something" requirement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:51:47 +02:00
Pepijn	ffd8c92ce5	fix(inference): always emit Plan:/Memory: labels in the high-level prompt The recipe renders ``"\${task}\nPlan: \${plan}\nMemory: \${memory}"`` unconditionally — when a binding resolves to None, ``language_render._substitute`` substitutes an empty string, so the training-time user turn always contains the literal ``Plan: `` / ``Memory: `` prefixes even with empty values. The inference message builders were skipping those lines entirely when ``state['current_plan']`` / ``state['current_memory']`` was empty, producing a different prompt shape on early frames (before the plan-generation step runs) and on datasets without plan/memory annotations. Factored a shared ``_hirobot_user_head`` helper used by ``_msgs_for_subtask``, ``_msgs_for_memory``, and the legacy ``_control_context_messages`` so they all match training byte-for- byte regardless of which bindings are populated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:42:29 +02:00
Pepijn	841d3c47e1	feat(debug): LEROBOT_DUMP_RECIPE_SAMPLES=N dumps the first N rendered samples Adds a one-shot debug dumper to both chat processors. When the env var ``LEROBOT_DUMP_RECIPE_SAMPLES`` is set to a positive integer N, the next N samples processed (rank-0 only) get pretty-printed: * the recipe-rendered messages (role / stream / target / content), * the full tokenized prompt (decoded back), * inline ``[TGT]...[/TGT]`` markers over the spans the LM head is supervised on, * token count + target-token count, * ``predict_actions`` flag. Usage: LEROBOT_DUMP_RECIPE_SAMPLES=5 sbatch train_smolvla2.slurm After N dumps the helper becomes a no-op; training continues unaffected. Works for both smolvla2 (chat-template renderer) and pi052 (plain ``Role: content`` concat renderer); each processor has its own copy to avoid cross-package imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:21:46 +02:00
Pepijn	2c920ab178	refactor(recipes): consolidate to shared hirobot.yaml + audit fixes The smolvla2 and pi052 recipe blends had drifted to identical content twice in a row; collapse them to a single ``recipes/hirobot.yaml`` both policies point at. Each backbone's text tokenizer (chat-template for SmolVLA2, plain ``Role: content`` for PI052) handles the rendering differences downstream — the recipe spec is shared. Audit fixes folded into the same commit: * Train/inference prefix mismatch on the action expert ``_build_text_batch`` always passed ``add_generation_prompt=True``, appending ``<\|im_start\|>assistant\\n`` tokens that the action expert never saw at training (the chat tokenizer renders with ``add_generation_prompt=False``). Parameterized the helper and pass ``False`` from ``LowLevelForward``; ``select_message`` paths still default to ``True`` for AR text generation. * PI052 fallthrough could silently train flow on text-only frames When ``text_loss_weight=0`` AND every sample was high-level (``predict_actions.any()==False``), the previous heuristic delegated to ``PI05Policy.forward``, which ignores ``predict_actions`` and runs flow on every sample. Reverted to delegating only on fully unannotated batches. * SmolVLA2 silent zero-loss training ``forward`` returned ``loss=0`` (no error) when neither flow nor text path fired. Now raises ``RuntimeError`` with the weights and routing flags — fails loud like PI052 already does. * PI052 dropout-seed key Was reading ``complementary["dataset_index"]`` (only set by ``MultiDataset`` and means "which sub-dataset", not row index) with fallback to ``frame_index`` (never set) — every sample got seed=0, so per-component dropout was deterministic across the epoch. Switched to ``complementary["index"]`` to match SmolVLA2 and the canonical ``BatchProcessor`` convention. * Dead ``DEFAULT_TOOLS`` import Removed from ``chat_processor_smolvla2.py`` — unused since the default-tools list was switched to ``[]`` in the prior commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:16:28 +02:00
Pepijn	9f630e2a41	fix(recipes,training): stop tool prompt leak + drop subtask copy-supervision CRITICAL (smolvla2) — the SmolVLM2 chat template was rendering the ``say`` tool's JSON schema as a system message on every training sample because ``DEFAULT_TOOLS`` was the default in ``SmolVLA2ChatTokenizerStep``. That schema was only relevant to the now-removed ``user_interjection_response`` recipe; with it gone the schema is dead weight that polluted every action-expert prefix AND created a train/inference mismatch (the inference ``_build_text_batch`` doesn't pass ``tools=``). Default is now ``[]``; callers needing tools can still set them via ``with_tools(meta.tools)``. LIKELY-BUG — ``low_level_execution`` had ``target: true`` on its assistant turn, so text-CE trained the LM head to predict the same subtask string the user just stated (trivial "copy previous turn" supervision that diluted LM head capacity). Dropped the assistant turn entirely; ``high_level_subtask`` (w=0.50) already owns subtask prediction from real context. The chat-tokenizer's ``predict_actions`` detection used to scan target streams only. With the new no-target low_level recipe it would mis-fire as False. Switched both ``chat_processor_smolvla2.py`` and ``text_processor_pi052.py`` to scan all message streams — any ``stream: low_level`` on the sample is enough to trigger flow loss. Inference: the low-level loop sends only ``[user(subtask)]`` now, matching the new recipe shape. PI052 — hardened the forward fallthrough so a degenerate batch where every sample's recipe is text-only AND text supervision is disabled (text_loss_weight<=0 or text_labels missing) cleanly delegates to ``PI05Policy.forward`` instead of raising "nothing to train". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 14:59:01 +02:00
Pepijn	7a32f8a72a	refactor(recipes): π0.5-style split — action expert conditions on subtask only Previously ``action_execution`` rendered ``task + plan + memory + subtask`` into one prefix and ran the flow loss on it. That meant the action expert was conditioned on the full hierarchical context (closer to π0.7 §V.A), not just the subtask. The π0.5 paper's hierarchical inference has the action expert see only the subtask (plus images and state). Split the recipe to match: high_level_subtask (0.50) user(task + plan + memory) → assistant(subtask) [+ assistant(new_memory) at boundary frames] All ``stream: high_level`` → text-CE only, no flow loss. low_level_execution (0.30) user(subtask) → assistant(subtask) Both ``stream: low_level`` → flow loss fires; text CE on the subtask is a small redundant extra signal. Prefix the action expert sees: [images, subtask, state]. plan_generation (0.10) — unchanged. ask_vqa_{top,wrist} (0.05 each) — unchanged. Runtime: the low-level loop in ``smolvla2/inference/steps.py`` now sends ``[user(subtask), assistant(subtask)]`` to ``predict_action_chunk`` instead of the full task+plan+memory context. Falls back to ``state['task']`` when no subtask has been generated yet so the first frame still has something to condition on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 14:13:07 +02:00
Pepijn	129aa207e3	fix(smolvla2,pi052): training-correctness audit fixes CRITICAL (smolvla2) — text-CE was applied to the wrong prefix slice. ``num_state`` was being read from ``state.shape[1]`` (the raw max_state_dim, ~14-32) instead of the number of state tokens (always 1). Compounded by the trailing-padding issue (state is not at the end of the padded prefix when ``seq_len < prefix_length``), the lang slice was landing on image / padding hidden states. New ``_locate_lang_range`` finds the state position via ``att_masks.nonzero()`` (the only ``1`` in the mask), making the slice robust to both bugs. Used by ``_compute_text_loss`` and ``_compute_fused_loss``. LIKELY-BUG (smolvla2) — ``_unfreeze_lm_head`` only re-enabled ``lm_head`` and ``text_model.model.norm.weight``. SmolVLA's parent ALSO freezes the last 1-2 transformer layers, so text-loss gradients died in a frozen final block. Now mirrors the parent's freeze targets and unfreezes the matching ``layers.{N-1}`` (and ``N-2`` when num_vlm % num_expert == 0). CRITICAL (pi052) — flow and FAST CE were not per-sample masked under per-sample-routing. Text-only recipe samples (``plan_generation``, ``ask_vqa_``) contributed to flow/FAST loss with prompts that deliberately omit the subtask, corrupting the signal. Threaded ``predict_actions_t`` through both ``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``; flow uses ``(per_sample mask).sum() / mask.sum()``, FAST uses ``shift_valid & sample_mask`` before ``masked_fill(-100)``. OTHER * PI052Policy.forward now falls through to PI05Policy.forward on unannotated batches (no text_labels, no predict_actions, no FAST). * fit_fast_tokenizer cache key now includes ``chunk_size`` — changing the chunk size no longer silently loads a wrongly-fit tokenizer. * Removed dead ``_compute_text_loss`` / ``_compute_fast_action_loss`` in pi052 (superseded by the fused helpers). * Fixed stale "no-op stub" docstring on ``knowledge_insulation`` — it's been fully wired since the per-layer KI forward port. * Stripped unused ``copy`` / ``resize_with_pad`` imports. * Extracted ``_shifted_ce`` / ``_mask_per_sample`` / ``_fast_ce`` helpers shared between fused and prefix-only paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 14:08:06 +02:00
Pepijn	e3ad1c59fc	feat(recipes): add plan_generation sub-recipe to smolvla2 + pi052 blends New text-only sub-recipe at 0.10 weight on both blends: user : ${task} assistant : ${current_plan} (high_level target) Bound to ``active_at(t, style=plan)`` so it supervises the currently-active plan on every frame, gated by ``if_present`` to skip frames without a plan annotation. Weights rebalanced: action_execution 0.85 → 0.75, plan_generation 0.10, VQA top/wrist 0.075 each (sums to 1.0). Added matching runtime builder ``_msgs_for_plan`` in ``smolvla2/inference/steps.py`` so the high-level loop can call ``select_message`` with the bare-task prompt at episode start / replanning events. Closes a gap vs. Pi 0.7 §V — without this recipe the model could read ``${plan}`` from the prompt but never had to produce one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:51:37 +02:00
Pepijn	9ff62cb08c	docs(recipes): trim header comments, drop diversity-knobs note in run_hf_job Recipes were over-commented (paper citations, history of removed sub-recipes, inference-time loop walkthroughs). Stripped down to a short header + a one-line note on the boundary-frame memory tail. Also removed the ``_tool3`` diversity-knobs comment block in ``examples/annotation/run_hf_job.py`` — it was a personal note about a since-merged experiment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:55:03 +02:00
Pepijn	b2aa372fcf	refactor(recipes): fold memory into action_execution, drop interjection, fuse smolvla2 forward Recipe changes: * action_execution now bundles the memory update as a second assistant target gated on a new ``new_memory`` binding (fires only at subtask-boundary frames). No "Completed subtask: X" filler — the model emits the new subtask AND the updated memory back-to-back in one prefix. * user_interjection_response sub-recipe removed (current datasets don't have interjection / say() annotations). * Standalone memory_update sub-recipe removed (folded above). * Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist 0.075 each (sums to 1.0). Runtime ``_msgs_for_memory`` updated to match the new boundary-frame prompt layout. Modeling: * SmolVLA2Policy now fuses the flow + text losses into a SINGLE backbone forward via ``_compute_fused_loss`` (one vlm_with_expert pass with [prefix, suffix] embeds, then both lm_head CE on lang slice + action_out_proj MSE on suffix). Mirrors pi052's existing ``_compute_all_losses_fused`` — saves one backbone pass per training step. Examples: * Removed the two training SLURM scaffolds; they were out-of-date with the recipe refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:51:09 +02:00
Pepijn	058b8f3958	refactor(recipes): two-flavor design — one fused action_execution + text-only events Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions) and the hierarchical inference pattern from Pi 0.5 §IV.D. Flavor 1 — action_execution (60% weight, "main path") ----------------------------------------------------- One always-on recipe that fuses all available context (task, plan, memory) into a single user prompt and uses the current subtask as the supervised assistant target. This single recipe supervises both objectives: * subtask prediction (text CE on the assistant span via lm_head) * action chunks (flow MSE on the action expert via stream: low_level, target: true; plus FAST CE on action tokens when enable_fast_action_loss=True) At inference, the same prompt structure drives both inference modes: * select_message(user_prompt_only) → LM head generates the next subtask. Matches action_execution's training distribution exactly (prompt is the user turn, target is the subtask). * predict_action_chunk(user_prompt + assistant_subtask) → action expert produces the chunk. Matches action_execution's full prompt+target. This replaces what used to be a separate high_level_subtask recipe plus a low_level_execution recipe; both were supervising the same subtask text, so collapsing them into one is correct and removes the redundant text-CE gradient. Flavor 2 — event-driven text-only recipes ----------------------------------------- Each of these supervises the LM head to predict a specific kind of text given a specific event-triggered context. ``stream: high_level`` on all targets so they never trigger predict_actions / flow loss. ``if_present`` guards ensure they only fire on frames where the event annotation is present. * memory_update (10%) new memory at subtask boundary * user_interjection_response (15%) new plan + say(...) on input * ask_vqa_top (7.5%) front-camera VQA * ask_vqa_wrist (7.5%) wrist-camera VQA Total weight = 1.0. Prompt format consistency ------------------------- User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}`` matches what ``inference/steps.py::_msgs_for_subtask`` and ``_control_context_messages`` already emit at inference time. No "Task: " prefix — the bare task string is used as the leading content with literal "Plan: " / "Memory: " labels for the subsequent components. What changed structurally ------------------------- - low_level_execution DROPPED (folded into action_execution) - high_level_subtask DROPPED (subtask supervision moved into action_execution) + action_execution NEW (the fused main recipe) memory_update kept, prompt cleaned up user_interjection_response kept, prompt cleaned up ask_vqa_top / ask_vqa_wrist kept Runtime compatibility --------------------- No runtime change needed — ``SmolVLA2Runtime`` and the inference helpers already build their high-level prompt as just the user turn (task + plan + memory) and append a ``current_subtask`` assistant turn for the low-level call. Both match the new ``action_execution`` prompt shape exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:35:51 +02:00
Pepijn	b873fe454c	perf(pi052): full fusion — text + FAST + flow in ONE backbone forward Previously the forward did 2 backbone passes when all heads were active: one for flow (via super().forward) and one for the fused text+FAST helper. This commit reduces it to one pass — same compute as flow-only training. New ``_compute_all_losses_fused`` builds: prefix = [images, language, FAST (when provided)] suffix = [noisy_actions] (action expert via gemma_expert) and runs a single ``paligemma_with_expert.forward`` with ``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active in the same call). Captures both prefix_out and suffix_out, slices each for its respective loss: flow MSE ← suffix_out (existing action_out_proj + MSE path) text CE ← prefix_out at language positions (lm_head + CE) FAST CE ← prefix_out at FAST positions (lm_head + CE) Critical attention mask override -------------------------------- ``make_att_2d_masks`` produces a cumulative-block attention mask in which suffix tokens (highest cumsum) attend to every lower-cumsum position by default, including FAST tokens. If we let that stand the action expert reads the discrete FAST tokens and trivially decodes them back to the same continuous actions the flow head is supposed to predict from noise — the entire training signal collapses to a copy operation. The fix is a single line right after make_att_2d_masks: att_2d_masks[:, fast_end:, fast_start:fast_end] = False Explicitly zeros out suffix → FAST attention. Everything else remains correct under the cumsum semantics: * prefix images/language stay bidirectional among themselves * FAST stays causal within itself, attending bidirectionally to images+language * FAST cannot see suffix (cumsum < suffix cumsum, default) * suffix attends bidirectionally among itself, to images+language, and now NOT to FAST (this override) Bit-equivalent to the previous separated forward path for text+FAST losses (the prefix hidden states at language and FAST positions are unchanged whether suffix is present or not — the prefix doesn't attend to suffix). For flow loss, suffix→FAST being masked is the correct behaviour we want — if anything the previous separated path was less correct for production use because the joint gradient signal through the action expert was missing the prefix extension. Forward routing in ``forward()`` -------------------------------- * run_flow=True → _compute_all_losses_fused (one forward, all three losses) * run_flow=False, run_text or run_fast → _compute_text_and_fast_loss (one prefix-only forward, two CE losses, no suffix → cheaper than fusion) * neither → RuntimeError (explicit; both losses disabled) Wall-time per step ------------------ Before this commit: flow + (text+FAST fused) = 2 forwards After this commit: (flow+text+FAST fused) = 1 forward Compute parity with flow-only training when all three heads active. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:28:38 +02:00
Pepijn	83d7250a22	fix(recipes): low_level_execution needs if_present:subtask guard too Same bug we fixed for high_level_subtask, just on the other subtask-supervised sub-recipe. ``low_level_execution`` targets ``${subtask}`` (the current active span) but had no ``if_present`` guard. When ``active_at(t, style=subtask)`` returned None at a frame (gaps in the annotation, or the very first/last frames of an episode if the annotator's spans don't fully tile), the assistant message rendered with empty content. The chat tokenizer still included it in ``target_message_indices`` → text CE supervised whatever the chat-template's empty assistant turn decoded to (usually a single ``\n``). That trains the LM head's prior at the first generation position toward ``\n``, the same collapse we observed with the original ``${next_subtask}`` target. Fix: ``if_present: subtask`` on the assistant target in ``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and ``pi052_hirobot.yaml``. Side effect: frames without an active subtask span no longer contribute to the flow loss either (the only ``low_level`` target is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))`` becomes False). For a well-annotated dataset where subtask spans tile the whole episode this is a no-op. For datasets with gaps, those gap frames lose flow supervision — strictly better than the degenerate text-CE alternative. Sub-recipe audit summary (no other changes needed): * memory_update — all if_present guards present, OK * user_interjection_response — all if_present guards present, OK * high_level_subtask — fixed earlier, OK * low_level_execution — fixed by this commit * ask_vqa_top / ask_vqa_wrist — query+answer both guarded, OK Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:22:45 +02:00
Pepijn	35f9063a6c	perf(pi052): fuse text + FAST loss into a single prefix forward Previously the forward did three backbone passes per training step when all heads were active: one for flow (via super().forward), one for text CE, and one for FAST CE. That's ~3× the compute of flow-only training. The text and FAST losses share their prefix forward exactly — both are CE on the LM head, evaluated at different slices of the same hidden states. Adding FAST tokens after language in the prefix is bit-equivalent for the text loss because the mask_ar convention in ``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal block: language tokens never see FAST, so their hidden states are unchanged. New ``_compute_text_and_fast_loss``: * embeds [images, language] once * optionally appends [FAST] (when run_fast is True) * one backbone forward * slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for language hidden states (or ``vlm_out[:, -lang_len:]`` when no FAST) → text CE * slices ``vlm_out[:, -fast_len:]`` for FAST hidden states → FAST CE * returns both losses, either of which can be None when the caller doesn't want that head. forward() now calls this fused helper instead of running the two separate ``_compute_text_loss`` / ``_compute_fast_action_loss`` methods. Those remain in the file for callers that only want one head (e.g. ablations). Why flow isn't fused -------------------- Flow MSE comes from the action-expert (suffix) hidden states, which attend to the prefix. If we just concat FAST onto the prefix and let the action expert attend to it, the expert can trivially decode FAST back to continuous actions — overfitting via shortcut. Preventing that requires a custom segment-aware attention mask (action expert can attend to images+language but NOT to subtask/FAST), which is what pi05_full does in ``compute_layer_complete_knowledge_insulation``. That's the full-fusion path; deferred as a follow-up since the text+FAST fusion already recovers most of the compute. End-to-end forward pass count ----------------------------- Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards After: 1 (flow) + 1 (text+FAST fused) = 2 backbone forwards ~33% wall-time reduction per training step when all three heads are active. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:08:34 +02:00
Pepijn	17c0800461	fix(pi052): FAST loss masking + predict_actions gating + smolvla2 review FAST loss changes ----------------- 1. Gate by ``predict_actions`` (same routing as flow loss). The ActionTokenizerProcessorStep tokenises actions for every sample regardless of which sub-recipe rendered it; for text-only recipes (high_level_subtask, memory_update, ...) the action tokens are still in the batch but mustn't be supervised. Skip the FAST forward+CE entirely when no sample in the batch has ``predict_actions=True``. 2. Switch from "multiply-by-mask" masking to ``ignore_index=-100``. The old pattern computed per-token CE for all positions, then zeroed out invalid ones. Two issues: (a) any out-of-vocab target id at a padded position would have crashed cross_entropy before the mask got a chance to zero it out, and (b) the pattern is needlessly clever. Now ``shift_targets.masked_fill(~mask, -100)`` followed by ``ignore_index=-100`` cleanly drops invalid positions. Matches the smolvla2 text-loss convention. 3. Clean up unused ``bsize`` variable in _compute_fast_action_loss and expand the attention-mask docstring with the ``make_att_2d_masks`` mask_ar convention spec (causal vs bidirectional blocks). smolvla2 audit (reference review, no code change) ------------------------------------------------- Compared smolvla2/modeling_smolvla2.py against pi052/modeling_pi052.py to catch parallel bugs. Findings: * No ``paligemma.language_model`` vs ``paligemma.model.language_model`` issue — smolvla2 uses SmolVLM (different class, different attribute layout) so the bug doesn't apply. * ``fill_kv_cache=True`` is correctly passed to smolvla's ``vlm_with_expert.forward`` — that class does accept the kwarg (unlike pi05's PaliGemmaWithExpertModel.forward, which is why pi052 must omit it). * Text-loss alignment is correct: ``_compute_text_loss`` computes ``lang_start`` / ``lang_end`` from the known prefix layout (``[image_blocks..., lang, state]``) and slices ``prefix_out`` to just the language positions before applying ``lm_head``. The parallel bug I fixed in pi052 (lm_head over the full prefix, shape-mismatched against text_labels) was not present in smolvla2. * Per-sample flow routing via ``predict_actions``: correctly masks per-sample by calling the parent ``forward(..., reduction='none')`` and applying the predict_actions mask before the mean. pi052 only has the batch-level any() gate — a parallel improvement for pi052 would require modifying PI05Pytorch.forward to support per-sample reduction, deferred. * ``reduction="none"`` returns ``total.expand(bsize)``: identical scalar-broadcast limitation in both policies. Acknowledged but low priority (only RA-BC weighting uses the per-sample path and it's documented as a known approximation in smolvla2). * Chat tokenizer correctly handles batched/unbatched messages, pads with -100 for label positions, builds attention masks. No bugs found. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:05:37 +02:00
Pepijn	c8763e0ad5	fix(pi052): four real bugs in the modeling code + flip defaults Defaults -------- * enable_fast_action_loss: False -> True (match paper §III.B-C Eq.1) * auto_fit_fast_tokenizer: True -> False (opt-in; needs base.fit()) Bug fixes --------- 1. Wrong attribute path on PaliGemma. The KI port copied pi05_full's ``paligemma.language_model.layers[...]`` literally, but the production pi05 wrapper exposes the text model at ``paligemma.model.language_model``. With KI enabled, every layer would have raised AttributeError on first forward. Fixed all references in _compute_layer_ki + _paligemma_forward_ki. 2. ``fill_kv_cache=True`` passed to PaliGemmaWithExpertModel.forward. That kwarg is a SmolVLA-only concept; pi05's signature has no such argument, so every forward call from pi052 (text loss, FAST loss, select_message) would have crashed with TypeError. Dropped from all four call sites — pi05's forward already handles the cache via past_key_values, and re-forwarding the cumulative sequence each step in select_message is fine for our short subtask completions. 3. Text-loss shape mismatch. _compute_text_loss applied lm_head to the full vlm_out (image tokens + language tokens), then tried to cross-entropy that against text_labels which only covers the language portion — the .view(-1) calls would produce two tensors of different lengths and CE would fail. Now slices vlm_out to the last text_labels.shape[1] positions before running lm_head, matching the [images, language] order embed_prefix produces. 4. Dead-code conditional in _paligemma_forward_ki's single-expert fallback. The ``if hasattr(...) else self._pi052_orig_forward`` ternary always took the wrong branch because the attribute is always set (we save it in PI052Policy.__init__). Simplified to just call self._pi052_orig_forward directly. After this commit, pi052 should be runnable end-to-end for the first time with all three loss heads + KI active. Still worth a 100-step smoke test before kicking off a long run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 11:58:40 +02:00
Pepijn	0f4faddc01	feat(pi052): auto-fit FAST tokenizer per-dataset before training Per Pertsch et al. 2025 (FAST paper, [64] in π0.5) and π0.5 §III.C, the recommended practice is to fit the FAST action tokenizer on the specific dataset's action distribution rather than using the published universal codebook off the shelf. The universal tokenizer works on any 6-DoF action sequence but produces suboptimal compression, which slows CE convergence and wastes vocab capacity. New utility ``lerobot.policies.pi052.fit_fast_tokenizer``: * samples N action chunks from the LeRobotDataset (default 1024) * loads ``physical-intelligence/fast`` as the base * calls ``.fit(actions)`` (the AutoProcessor API the HF model card documents) — produces a per-dataset codebook * saves to ``{cache_dir}/{sha256(dataset, base, n_samples)[:16]}/`` * returns the local path, ready to feed ``ActionTokenizerProcessorStep(action_tokenizer_name=...)``. Cache is keyed on (dataset, base tokenizer, sample count) so changing any of them re-runs the fit. Re-running training on the same dataset re-uses the cache (one fit per dataset per machine). Auto-fit wiring: * PI052Config gets ``auto_fit_fast_tokenizer`` (default True), ``fast_tokenizer_cache_dir`` (default ~/.cache/lerobot/...), ``fast_tokenizer_fit_samples`` (default 1024). * make_pi052_pre_post_processors now takes ``dataset_repo_id``; when ``enable_fast_action_loss`` and ``auto_fit_fast_tokenizer`` are both True and a repo_id is provided, the factory calls ``fit_fast_tokenizer`` before constructing the processor step and points it at the fitted path. * ProcessorConfigKwargs gains ``dataset_repo_id``; the global factory dispatch threads it through for ``pi052`` policies. * lerobot_train.py populates ``processor_kwargs['dataset_repo_id']`` from ``--dataset.repo_id`` for pi052 runs. Failure mode: if ``.fit()`` fails (e.g. older transformers without the method, or no usable action chunks in the dataset), the factory logs a warning and falls back to the universal base tokenizer. Train still works; you just lose the compression improvement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 11:52:31 +02:00
Pepijn	8dc0af3c28	feat(pi052): FAST action CE loss + knowledge insulation + processor wiring Three additions ported from ``pi05_full`` on branch ``feat/add-pi05``, giving pi052 full paper-§III.B-C training capabilities alongside the recipe-driven text supervision it already had: * Config flags in PI052Config: - ``enable_fast_action_loss`` default False - ``action_tokenizer_name`` default "physical-intelligence/fast" - ``max_action_tokens`` default 256 - ``fast_skip_tokens`` default 128 - ``fast_action_loss_weight`` default 1.0 - ``knowledge_insulation`` default False * Processor wiring (processor_pi052.py): when ``enable_fast_action_loss=True``, append an ``ActionTokenizerProcessorStep`` after the text tokenizer. It tokenises the action tensor with the FAST tokenizer and writes ACTION_TOKENS / ACTION_TOKEN_MASK into ``COMPLEMENTARY_DATA`` — the existing batch-collation pipeline forwards them as ``batch['action.tokens']`` / ``batch['action.token_mask']``. * FAST CE loss (modeling_pi052.py::_compute_fast_action_loss): Re-embeds the prefix [images, language], appends the FAST token embeddings (using PaliGemma's shared embed_language_tokens), forwards through the backbone, slices the trailing ``fast_len`` positions, applies the LM head, computes shifted next-token CE with the action-mask gating the loss. The loss is summed into ``forward()``'s total with ``fast_action_loss_weight``. * Knowledge insulation (modeling_pi052.py::_compute_layer_ki + _paligemma_forward_ki): port of pi05_full's per-layer attention that detaches VLM K/V on the action-query path so action loss gradients cannot flow back into the VLM's K/V projections. Bound per-instance via ``types.MethodType`` so it doesn't leak into stock ``pi05`` policies that share PaliGemmaWithExpertModel. Activated automatically when ``config.knowledge_insulation=True``. Combined with the existing recipe-driven text head, pi052 now supports the full three-loss objective: L = text_w·H(text) + fast_w·H(FAST actions) + flow_w·MSE(flow) matching Eq. (1) of arxiv:2504.16054 §IV.D (α=10 by default for the flow term, 1.0 each for text and FAST CE). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 11:46:21 +02:00

1 2 3 4 5 ...

1588 Commits