Under-trained LM heads (small dataset + a few thousand steps on a
chat-pretrained backbone) collapse into n-gram loops under greedy
decoding — observed in a real-robot run as
"the robot arm extends and retracts and retracts from the
beige surface and retracts from the surface"
repeating the same trigram across the whole 256-token budget.
Added the two standard HF generation knobs to
``SmolVLA2Policy.select_message``:
* ``repetition_penalty`` (1.0 = off) — divides positive logits /
multiplies negative logits for already-emitted token ids.
* ``no_repeat_ngram_size`` (0 = off) — hard-bans any token that
would complete an n-gram already present in the generated suffix.
Implemented via a small ``_ngram_banned_ids`` helper that mirrors
HF's ``_get_ngrams`` semantics.
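A minimal sketch of the ban logic (illustrative only; the real helper
works on token-id tensors inside ``select_message``):

    def _ngram_banned_ids(generated_ids: list[int], n: int) -> set[int]:
        """Ids that would complete an n-gram already present in generated_ids."""
        if n <= 0 or len(generated_ids) < n:
            return set()
        prefix = tuple(generated_ids[-(n - 1):]) if n > 1 else tuple()
        banned: set[int] = set()
        for i in range(len(generated_ids) - n + 1):
            ngram = tuple(generated_ids[i:i + n])
            if ngram[:-1] == prefix:
                banned.add(ngram[-1])
        return banned

    # at each decode step: logits[list(banned)] = float("-inf")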
Wired through ``_generate_with_policy`` to all four call sites
(subtask, memory, plan/say, vqa) and exposed as
``--text_repetition_penalty=1.2-1.5`` and
``--text_no_repeat_ngram_size=3`` on the runtime CLI.
Empirically ``--text_no_repeat_ngram_size=3`` alone usually breaks
the trigram-loop failure mode without distorting the next-token
distribution; combine with ``--text_repetition_penalty=1.2`` for
heavier collapses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PI052 and PI0_FAST both load ``physical-intelligence/fast`` as
their action tokenizer. That tokenizer's HF backend requires
``sentencepiece`` to instantiate (or ``tiktoken``); without it
``AutoProcessor.from_pretrained`` raises:
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece or tiktoken installed [...]
It wasn't listed in ``pyproject.toml``, so fresh installs missed it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dumper was printing ``stream=None target=None`` for every
message because it read those fields off the message dicts, but
the recipe renderer keeps them in parallel arrays
(``message_streams`` / ``target_message_indices`` in
COMPLEMENTARY_DATA) so the chat template doesn't see unknown
keys. Zip them back into the dump-time dicts so the printed
metadata is accurate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Subtask and memory annotations both feed into the high-level prompt
and the plan rendering, so
keeping them short directly reduces the rendered ``${task}\nPlan:
…\nMemory: …`` prefix the model has to chew through at inference.
Subtasks
* Hard cap: ≤ 5 words. Verb + object only, drop articles/adverbs.
* Concrete good/bad examples to anchor the VLM.
Memory
* Hard cap: ≤ 10 words. Telegraphic noun→location fragments
("bowl in box, lid open"), no past-tense verbs, drop attributes
that don't matter for downstream subtasks.
* Allow empty string when no material change occurred — keeps the
rendered memory line literally blank instead of forcing a no-op
sentence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the annotation pipeline only emitted a plan at t=0 and on
interjections, so the
active plan rendered into training carried "done" subtasks until
the next interjection. With the new "plan = remaining subtasks"
summariser this meant the plan was stale between boundaries.
Emit a fresh plan row at every subtask start. ``active_at(t)`` then
returns a plan that contains exactly the subtasks whose start ≥
the current span's start — completed subtasks fall off the plan
the moment the next subtask begins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The plan was being generated by a separate VLM call (one per
episode + one per interjection refresh) with a prompt that asked
the model to "compress the subtasks into a compact hierarchical
plan". In practice the plans came out longer than necessary and
sometimes drifted from the actual subtask sequence the runtime
would execute.
Replaced ``_generate_plan`` with a deterministic numbered list
of the upcoming subtasks. At a refresh time the list shrinks to
subtasks whose start ≥ refresh_t — the plan describes what's
*left* to do, so it gets shorter as work progresses.
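Sketch of the replacement (field names on the subtask spans are
assumptions, not the literal code):

    def _plan_from_subtasks(subtasks, refresh_t: float = 0.0) -> str:
        """Plan = numbered list of the subtasks that have not started yet at refresh_t."""
        remaining = [s for s in subtasks if s.start >= refresh_t]
        return "\n".join(f"{i}. {s.text}" for i, s in enumerate(remaining, start=1))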
Saves the per-episode + per-interjection VLM round-trip in the
annotation pipeline and keeps train-time plan text bit-aligned
with the subtask annotations the rest of Module 1 emits.
Removed the now-unused ``prompts/module_1_plan.txt``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two regressions surfaced by the first training run:
1. ``--policy.type=pi052`` failed with ``invalid choice``. PI052Config
wasn't imported in ``policies/__init__.py``, so its
``@register_subclass("pi052")`` decorator never ran and draccus
didn't see it as a valid policy type. Mirror PI05Config /
SmolVLA2Config in the top-level imports + __all__.
2. ``low_level_execution`` (user-only ``${subtask}`` recipe used for
π0.5-style flow conditioning) tripped
``ValueError: Message recipes must contain at least one target
turn.`` The validator was too strict — a recipe with only a
``stream: low_level`` turn still drives meaningful supervision
(flow MSE on the action expert via ``predict_actions=True``).
Allow either ``target: true`` OR ``stream: low_level`` to satisfy
the "supervises something" requirement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe renders ``${task}\nPlan: ${plan}\nMemory: ${memory}``
unconditionally — when a binding resolves to None,
``language_render._substitute`` substitutes an empty string, so the
training-time user turn always contains the literal ``Plan: `` /
``Memory: `` prefixes even with empty values.
The inference message builders were skipping those lines entirely
when ``state['current_plan']`` / ``state['current_memory']`` was
empty, producing a different prompt shape on early frames (before
the plan-generation step runs) and on datasets without plan/memory
annotations.
Factored a shared ``_hirobot_user_head`` helper used by
``_msgs_for_subtask``, ``_msgs_for_memory``, and the legacy
``_control_context_messages`` so they all match training byte-for-
byte regardless of which bindings are populated.
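The helper is essentially (state keys follow the runtime's usage
elsewhere; the exact signature here is illustrative):

    def _hirobot_user_head(state: dict) -> str:
        """Always render the Plan:/Memory: labels, even when the values are
        empty, so the inference prompt matches the training render."""
        plan = state.get("current_plan") or ""
        memory = state.get("current_memory") or ""
        return f"{state.get('task', '')}\nPlan: {plan}\nMemory: {memory}"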
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a one-shot debug dumper to both chat processors. When the env
var ``LEROBOT_DUMP_RECIPE_SAMPLES`` is set to a positive integer N,
the next N samples processed (rank-0 only) get pretty-printed:
* the recipe-rendered messages (role / stream / target / content),
* the full tokenized prompt (decoded back),
* inline ``[TGT]...[/TGT]`` markers over the spans the LM head is
supervised on,
* token count + target-token count,
* ``predict_actions`` flag.
Usage:
LEROBOT_DUMP_RECIPE_SAMPLES=5 sbatch train_smolvla2.slurm
After N dumps the helper becomes a no-op; training continues
unaffected. Works for both smolvla2 (chat-template renderer) and
pi052 (plain ``Role: content`` concat renderer); each processor has
its own copy to avoid cross-package imports.
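The gating is roughly as follows (the env-var name is the real one; the
helper name, message fields, and span representation are illustrative):

    import os

    _dumps_left = int(os.environ.get("LEROBOT_DUMP_RECIPE_SAMPLES", "0"))

    def _maybe_dump(messages, decoded_prompt, target_spans, rank=0):
        """Pretty-print the next N processed samples on rank 0, then become a no-op."""
        global _dumps_left
        if _dumps_left <= 0 or rank != 0:
            return
        _dumps_left -= 1
        for m in messages:
            print(f"[{m['role']}] stream={m.get('stream')} target={m.get('target')}")
            print(f"  {m['content']}")
        for start, end in reversed(target_spans):  # back-to-front keeps offsets valid
            decoded_prompt = decoded_prompt[:end] + "[/TGT]" + decoded_prompt[end:]
            decoded_prompt = decoded_prompt[:start] + "[TGT]" + decoded_prompt[start:]
        print(decoded_prompt)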
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The smolvla2 and pi052 recipe blends had converged to identical content
twice in a row; collapse them to a single ``recipes/hirobot.yaml``
both policies point at. Each backbone's text tokenizer (chat-template
for SmolVLA2, plain ``Role: content`` for PI052) handles the
rendering differences downstream — the recipe spec is shared.
Audit fixes folded into the same commit:
* **Train/inference prefix mismatch on the action expert**
``_build_text_batch`` always passed ``add_generation_prompt=True``,
appending ``<|im_start|>assistant\n`` tokens that the action
expert never saw at training (the chat tokenizer renders with
``add_generation_prompt=False``). Parameterized the helper and
pass ``False`` from ``LowLevelForward``; ``select_message`` paths
still default to ``True`` for AR text generation.
* **PI052 fallthrough could silently train flow on text-only frames**
When ``text_loss_weight=0`` AND every sample was high-level
(``predict_actions.any()==False``), the previous heuristic
delegated to ``PI05Policy.forward``, which ignores
``predict_actions`` and runs flow on every sample. Reverted to
delegating only on fully unannotated batches.
* **SmolVLA2 silent zero-loss training**
``forward`` returned ``loss=0`` (no error) when neither flow nor
text path fired. Now raises ``RuntimeError`` with the weights and
routing flags — fails loud like PI052 already does.
* **PI052 dropout-seed key**
Was reading ``complementary["dataset_index"]`` (only set by
``MultiDataset`` and means "which sub-dataset", not row index)
with fallback to ``frame_index`` (never set) — every sample got
seed=0, so the per-component dropout pattern was identical for every
sample across the epoch. Switched to ``complementary["index"]`` to
match SmolVLA2
and the canonical ``BatchProcessor`` convention.
* **Dead ``DEFAULT_TOOLS`` import**
Removed from ``chat_processor_smolvla2.py`` — unused since the
default-tools list was switched to ``[]`` in the prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL (smolvla2) — the SmolVLM2 chat template was rendering the
``say`` tool's JSON schema as a system message on every training
sample because ``DEFAULT_TOOLS`` was the default in
``SmolVLA2ChatTokenizerStep``. That schema was only relevant to
the now-removed ``user_interjection_response`` recipe; with it
gone the schema is dead weight that polluted every action-expert
prefix AND created a train/inference mismatch (the inference
``_build_text_batch`` doesn't pass ``tools=``). Default is now
``[]``; callers needing tools can still set them via
``with_tools(meta.tools)``.
LIKELY-BUG — ``low_level_execution`` had ``target: true`` on its
assistant turn, so text-CE trained the LM head to predict the
same subtask string the user just stated (trivial "copy previous
turn" supervision that diluted LM head capacity). Dropped the
assistant turn entirely; ``high_level_subtask`` (w=0.50) already
owns subtask prediction from real context.
The chat-tokenizer's ``predict_actions`` detection used to scan
target streams only. With the new no-target low_level recipe it
would mis-fire as False. Switched both
``chat_processor_smolvla2.py`` and ``text_processor_pi052.py`` to
scan all message streams — any ``stream: low_level`` on the
sample is enough to trigger flow loss.
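The check reduces to (message-dict field names as used by the renderer):

    def _predicts_actions(messages: list[dict]) -> bool:
        """Flow loss fires if ANY message on the sample is on the low_level
        stream, whether or not it is a text target."""
        return any(m.get("stream") == "low_level" for m in messages)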
Inference: the low-level loop sends only ``[user(subtask)]`` now,
matching the new recipe shape.
PI052 — hardened the forward fallthrough so a degenerate batch
where every sample's recipe is text-only AND text supervision is
disabled (text_loss_weight<=0 or text_labels missing) cleanly
delegates to ``PI05Policy.forward`` instead of raising
"nothing to train".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously ``action_execution`` rendered ``task + plan + memory +
subtask`` into one prefix and ran the flow loss on it. That meant
the action expert was conditioned on the full hierarchical context
(closer to π0.7 §V.A), not just the subtask.
The π0.5 paper's hierarchical inference has the action expert see
only the *subtask* (plus images and state). Split the recipe to
match:
high_level_subtask (0.50)
  user(task + plan + memory) → assistant(subtask)
  [+ assistant(new_memory) at boundary frames]
  All ``stream: high_level`` → text-CE only, no flow loss.
low_level_execution (0.30)
  user(subtask) → assistant(subtask)
  Both ``stream: low_level`` → flow loss fires; text CE on the
  subtask is a small redundant extra signal. Prefix the action
  expert sees: [images, subtask, state].
plan_generation (0.10) — unchanged.
ask_vqa_{top,wrist} (0.05 each) — unchanged.
Runtime: the low-level loop in ``smolvla2/inference/steps.py``
now sends ``[user(subtask), assistant(subtask)]`` to
``predict_action_chunk`` instead of the full task+plan+memory
context. Falls back to ``state['task']`` when no subtask has been
generated yet so the first frame still has something to condition
on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL (smolvla2) — text-CE was applied to the wrong prefix slice.
``num_state`` was being read from ``state.shape[1]`` (the raw
max_state_dim, ~14-32) instead of the *number of state tokens*
(always 1). Compounded by the trailing-padding issue (state is
not at the end of the padded prefix when ``seq_len < prefix_length``),
the lang slice was landing on image / padding hidden states.
New ``_locate_lang_range`` finds the state position via
``att_masks.nonzero()`` (the only ``1`` in the mask), making the
slice robust to both bugs. Used by ``_compute_text_loss`` and
``_compute_fused_loss``.
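Sketch of the lookup for a single sample (prefix layout
[image_blocks..., lang, state, padding]; ``lang_len`` is assumed to come
from the language pad mask):

    import torch

    def _locate_lang_range(att_masks: torch.Tensor, lang_len: int) -> tuple[int, int]:
        """Language slice ends right before the state token, which is the only
        position carrying a 1 in att_masks."""
        state_pos = int(att_masks.nonzero(as_tuple=False)[0].item())
        return state_pos - lang_len, state_pos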
LIKELY-BUG (smolvla2) — ``_unfreeze_lm_head`` only re-enabled
``lm_head`` and ``text_model.model.norm.weight``. SmolVLA's parent
ALSO freezes the last 1-2 transformer layers, so text-loss
gradients died in a frozen final block. Now mirrors the parent's
freeze targets and unfreezes the matching ``layers.{N-1}`` (and
``N-2`` when num_vlm % num_expert == 0).
CRITICAL (pi052) — flow and FAST CE were not per-sample masked
under per-sample-routing. Text-only recipe samples
(``plan_generation``, ``ask_vqa_*``) contributed to flow/FAST
loss with prompts that deliberately omit the subtask, corrupting
the signal. Threaded ``predict_actions_t`` through both
``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``;
flow uses ``(per_sample * mask).sum() / mask.sum()``, FAST uses
``shift_valid & sample_mask`` before ``masked_fill(-100)``.
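The two masking patterns, sketched (tensor names and shapes are
illustrative; ``predict_actions_t`` is the per-sample bool routing flag):

    import torch
    import torch.nn.functional as F

    def _masked_losses(flow_mse, fast_logits, fast_targets, fast_valid, predict_actions_t):
        # flow: average per-element MSE over action-supervised samples only
        mask = predict_actions_t.float()                               # (B,)
        # .clamp is a guard; the real forward gates on predict_actions.any() earlier
        flow_loss = (flow_mse.mean(dim=(1, 2)) * mask).sum() / mask.sum().clamp(min=1)
        # FAST: shifted next-token CE; padded positions and text-only samples -> -100
        shift_logits = fast_logits[:, :-1]
        shift_targets = fast_targets[:, 1:]
        keep = fast_valid[:, 1:] & predict_actions_t[:, None]
        targets = shift_targets.masked_fill(~keep, -100)
        fast_loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            targets.reshape(-1),
            ignore_index=-100,
        )
        return flow_loss, fast_loss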
OTHER
* PI052Policy.forward now falls through to PI05Policy.forward on
unannotated batches (no text_labels, no predict_actions, no FAST).
* fit_fast_tokenizer cache key now includes ``chunk_size`` — changing
the chunk size no longer silently loads a wrongly-fit tokenizer.
* Removed dead ``_compute_text_loss`` / ``_compute_fast_action_loss``
in pi052 (superseded by the fused helpers).
* Fixed stale "no-op stub" docstring on ``knowledge_insulation`` —
it's been fully wired since the per-layer KI forward port.
* Stripped unused ``copy`` / ``resize_with_pad`` imports.
* Extracted ``_shifted_ce`` / ``_mask_per_sample`` / ``_fast_ce``
helpers shared between fused and prefix-only paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New text-only sub-recipe at 0.10 weight on both blends:
user : ${task}
assistant : ${current_plan} (high_level target)
Bound to ``active_at(t, style=plan)`` so it supervises the
currently-active plan on every frame, gated by ``if_present`` to
skip frames without a plan annotation.
Weights rebalanced: action_execution 0.85 → 0.75, plan_generation
0.10, VQA top/wrist 0.075 each (sums to 1.0).
Added matching runtime builder ``_msgs_for_plan`` in
``smolvla2/inference/steps.py`` so the high-level loop can call
``select_message`` with the bare-task prompt at episode start /
replanning events.
Closes a gap vs. Pi 0.7 §V — without this recipe the model could
read ``${plan}`` from the prompt but never had to produce one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a
short header + a one-line note on the boundary-frame memory tail.
Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about
a since-merged experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recipe changes:
* action_execution now bundles the memory update as a second
assistant target gated on a new ``new_memory`` binding (fires
only at subtask-boundary frames). No "Completed subtask: X"
filler — the model emits the new subtask AND the updated
memory back-to-back in one prefix.
* user_interjection_response sub-recipe removed (current
datasets don't have interjection / say() annotations).
* Standalone memory_update sub-recipe removed (folded above).
* Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist
0.075 each (sums to 1.0).
Runtime ``_msgs_for_memory`` updated to match the new
boundary-frame prompt layout.
Modeling:
* SmolVLA2Policy now fuses the flow + text losses into a SINGLE
backbone forward via ``_compute_fused_loss`` (one
vlm_with_expert pass with [prefix, suffix] embeds, then both
lm_head CE on lang slice + action_out_proj MSE on suffix).
Mirrors pi052's existing ``_compute_all_losses_fused`` —
saves one backbone pass per training step.
Examples:
* Removed the two training SLURM scaffolds; they were
out-of-date with the recipe refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.
Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------
One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:
* subtask prediction (text CE on the assistant span via lm_head)
* action chunks (flow MSE on the action expert via
stream: low_level, target: true; plus FAST CE on action tokens
when enable_fast_action_loss=True)
At inference, the *same* prompt structure drives both modes:
* select_message(user_prompt_only) → LM head generates the next
subtask. Matches action_execution's training distribution
exactly (prompt is the user turn, target is the subtask).
* predict_action_chunk(user_prompt + assistant_subtask) → action
expert produces the chunk. Matches action_execution's full
prompt+target.
This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.
Flavor 2 — event-driven text-only recipes
-----------------------------------------
Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.
* memory_update (10%) new memory at subtask boundary
* user_interjection_response (15%) new plan + say(...) on input
* ask_vqa_top (7.5%) front-camera VQA
* ask_vqa_wrist (7.5%) wrist-camera VQA
Total weight = 1.0.
Prompt format consistency
-------------------------
User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.
What changed structurally
-------------------------
- low_level_execution DROPPED (folded into action_execution)
- high_level_subtask DROPPED (subtask supervision moved into action_execution)
+ action_execution NEW (the fused main recipe)
memory_update kept, prompt cleaned up
user_interjection_response kept, prompt cleaned up
ask_vqa_top / ask_vqa_wrist kept
Runtime compatibility
---------------------
No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did 2 backbone passes when all heads were
active: one for flow (via super().forward) and one for the fused
text+FAST helper. This commit reduces it to **one pass** — same
compute as flow-only training.
New ``_compute_all_losses_fused`` builds:
prefix = [images, language, FAST (when provided)]
suffix = [noisy_actions] (action expert via gemma_expert)
and runs a single ``paligemma_with_expert.forward`` with
``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active
in the same call). Captures *both* prefix_out and suffix_out, slices
each for its respective loss:
flow MSE ← suffix_out (existing action_out_proj + MSE path)
text CE ← prefix_out at language positions (lm_head + CE)
FAST CE ← prefix_out at FAST positions (lm_head + CE)
Critical attention mask override
--------------------------------
``make_att_2d_masks`` produces a cumulative-block attention mask in
which suffix tokens (highest cumsum) attend to *every* lower-cumsum
position by default, including FAST tokens. If we let that stand the
action expert reads the discrete FAST tokens and trivially decodes
them back to the same continuous actions the flow head is supposed
to predict from noise — the entire training signal collapses to a
copy operation.
The fix is a single line right after make_att_2d_masks:
att_2d_masks[:, fast_end:, fast_start:fast_end] = False
Explicitly zeros out *suffix → FAST* attention. Everything else
remains correct under the cumsum semantics:
* prefix images/language stay bidirectional among themselves
* FAST stays causal within itself, attending bidirectionally
to images+language
* FAST cannot see suffix (cumsum < suffix cumsum, default)
* suffix attends bidirectionally among itself, to images+language,
and now NOT to FAST (this override)
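Schematically, the fused helper does (helper and attribute names follow
the surrounding module, but the exact signatures and return structure
here are assumptions, not the literal implementation):

    prefix_embs, pad_masks, att_masks = embed_prefix(images, lang_tokens, fast_tokens)
    suffix_embs = embed_suffix(noisy_actions, time)

    att_2d_masks = make_att_2d_masks(pad_masks, att_masks)
    att_2d_masks[:, fast_end:, fast_start:fast_end] = False  # suffix must not read FAST

    (prefix_out, suffix_out), _ = paligemma_with_expert.forward(
        attention_mask=att_2d_masks,
        inputs_embeds=[prefix_embs, suffix_embs],
    )

    flow_loss = mse(action_out_proj(suffix_out), flow_targets)
    text_loss = ce(lm_head(prefix_out[:, lang_start:lang_end]), text_labels)
    fast_loss = ce(lm_head(prefix_out[:, fast_start:fast_end]), fast_labels)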
Bit-equivalent to the previous separated forward path for text+FAST
losses (the prefix hidden states at language and FAST positions are
unchanged whether suffix is present or not — the prefix doesn't
attend to suffix). For flow loss, suffix→FAST being masked is the
correct behaviour we *want* — if anything the previous separated
path was less correct for production use because the joint
gradient signal through the action expert was missing the prefix
extension.
Forward routing in ``forward()``
--------------------------------
* run_flow=True → _compute_all_losses_fused (one forward, all
three losses)
* run_flow=False, run_text or run_fast → _compute_text_and_fast_loss
(one prefix-only forward, two CE losses, no
suffix → cheaper than fusion)
* neither → RuntimeError (explicit; both losses disabled)
Wall-time per step
------------------
Before this commit: flow + (text+FAST fused) = 2 forwards
After this commit: (flow+text+FAST fused) = 1 forward
Compute parity with flow-only training when all three heads active.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same bug we fixed for high_level_subtask, just on the other
subtask-supervised sub-recipe. ``low_level_execution`` targets
``${subtask}`` (the current active span) but had no
``if_present`` guard. When ``active_at(t, style=subtask)`` returned
None at a frame (gaps in the annotation, or the very first/last
frames of an episode if the annotator's spans don't fully tile),
the assistant message rendered with empty content. The chat
tokenizer still included it in ``target_message_indices`` → text CE
supervised whatever the chat-template's empty assistant turn
decoded to (usually a single ``\n``). That trains the LM head's
prior at the first generation position toward ``\n``, the same
collapse we observed with the original ``${next_subtask}`` target.
Fix: ``if_present: subtask`` on the assistant target in
``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and
``pi052_hirobot.yaml``.
Side effect: frames without an active subtask span no longer
contribute to the flow loss either (the only ``low_level`` target
is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))``
becomes False). For a well-annotated dataset where subtask spans
tile the whole episode this is a no-op. For datasets with gaps,
those gap frames lose flow supervision — strictly better than the
degenerate text-CE alternative.
Sub-recipe audit summary (no other changes needed):
* memory_update — all if_present guards present, OK
* user_interjection_response — all if_present guards present, OK
* high_level_subtask — fixed earlier, OK
* low_level_execution — fixed by this commit
* ask_vqa_top / ask_vqa_wrist — query+answer both guarded, OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did three backbone passes per training step
when all heads were active: one for flow (via super().forward), one
for text CE, and one for FAST CE. That's ~3× the compute of
flow-only training.
The text and FAST losses share their prefix forward exactly — both
are CE on the LM head, evaluated at different slices of the same
hidden states. Adding FAST tokens after language in the prefix is
bit-equivalent for the text loss because the mask_ar convention in
``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal
block: language tokens never see FAST, so their hidden states are
unchanged.
New ``_compute_text_and_fast_loss``:
* embeds [images, language] once
* optionally appends [FAST] (when run_fast is True)
* one backbone forward
* slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for
language hidden states (or ``vlm_out[:, -lang_len:]`` when no
FAST) → text CE
* slices ``vlm_out[:, -fast_len:]`` for FAST hidden states →
FAST CE
* returns both losses, either of which can be None when the
caller doesn't want that head.
forward() now calls this fused helper instead of running the two
separate ``_compute_text_loss`` / ``_compute_fast_action_loss``
methods. Those remain in the file for callers that only want one
head (e.g. ablations).
Why flow isn't fused
--------------------
Flow MSE comes from the action-expert (suffix) hidden states, which
attend to the prefix. If we just concat FAST onto the prefix and let
the action expert attend to it, the expert can trivially decode FAST
back to continuous actions — overfitting via shortcut. Preventing
that requires a custom segment-aware attention mask (action expert
can attend to images+language but NOT to subtask/FAST), which is
what pi05_full does in ``compute_layer_complete_knowledge_insulation``.
That's the full-fusion path; deferred as a follow-up since the
text+FAST fusion already recovers most of the compute.
End-to-end forward pass count
-----------------------------
Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards
After: 1 (flow) + 1 (text+FAST fused) = 2 backbone forwards
~33% wall-time reduction per training step when all three heads
are active.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FAST loss changes
-----------------
1. Gate by ``predict_actions`` (same routing as flow loss). The
ActionTokenizerProcessorStep tokenises actions for *every*
sample regardless of which sub-recipe rendered it; for text-only
recipes (high_level_subtask, memory_update, ...) the action
tokens are still in the batch but mustn't be supervised. Skip
the FAST forward+CE entirely when no sample in the batch has
``predict_actions=True``.
2. Switch from "multiply-by-mask" masking to ``ignore_index=-100``.
The old pattern computed per-token CE for all positions, then
zeroed out invalid ones. Two issues: (a) any out-of-vocab target
id at a padded position would have crashed cross_entropy before
the mask got a chance to zero it out, and (b) the pattern is
needlessly clever. Now ``shift_targets.masked_fill(~mask, -100)``
followed by ``ignore_index=-100`` cleanly drops invalid positions.
Matches the smolvla2 text-loss convention.
3. Clean up unused ``bsize`` variable in _compute_fast_action_loss
and expand the attention-mask docstring with the
``make_att_2d_masks`` mask_ar convention spec (causal vs
bidirectional blocks).
smolvla2 audit (reference review, no code change)
-------------------------------------------------
Compared smolvla2/modeling_smolvla2.py against pi052/modeling_pi052.py
to catch parallel bugs. Findings:
* No ``paligemma.language_model`` vs ``paligemma.model.language_model``
issue — smolvla2 uses SmolVLM (different class, different attribute
layout) so the bug doesn't apply.
* ``fill_kv_cache=True`` is correctly passed to smolvla's
``vlm_with_expert.forward`` — that class *does* accept the kwarg
(unlike pi05's PaliGemmaWithExpertModel.forward, which is why
pi052 must omit it).
* Text-loss alignment is correct: ``_compute_text_loss`` computes
``lang_start`` / ``lang_end`` from the known prefix layout
(``[image_blocks..., lang, state]``) and slices ``prefix_out``
to just the language positions before applying ``lm_head``. The
parallel bug I fixed in pi052 (lm_head over the full prefix,
shape-mismatched against text_labels) was *not* present in
smolvla2.
* Per-sample flow routing via ``predict_actions``: correctly masks
per-sample by calling the parent ``forward(..., reduction='none')``
and applying the predict_actions mask before the mean. pi052 only
has the batch-level any() gate — a parallel improvement for pi052
would require modifying PI05Pytorch.forward to support per-sample
reduction, deferred.
* ``reduction="none"`` returns ``total.expand(bsize)``: identical
scalar-broadcast limitation in both policies. Acknowledged but
low priority (only RA-BC weighting uses the per-sample path and
it's documented as a known approximation in smolvla2).
* Chat tokenizer correctly handles batched/unbatched messages,
pads with -100 for label positions, builds attention masks. No
bugs found.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defaults
--------
* enable_fast_action_loss: False -> True (match paper §III.B-C Eq.1)
* auto_fit_fast_tokenizer: True -> False (opt-in; needs base.fit())
Bug fixes
---------
1. Wrong attribute path on PaliGemma. The KI port copied
pi05_full's ``paligemma.language_model.layers[...]`` literally,
but the production pi05 wrapper exposes the text model at
``paligemma.model.language_model``. With KI enabled, every layer
would have raised AttributeError on first forward. Fixed all
references in _compute_layer_ki + _paligemma_forward_ki.
2. ``fill_kv_cache=True`` passed to PaliGemmaWithExpertModel.forward.
That kwarg is a SmolVLA-only concept; pi05's signature has no
such argument, so every forward call from pi052 (text loss, FAST
loss, select_message) would have crashed with TypeError. Dropped
from all four call sites — pi05's forward already handles the
cache via past_key_values, and re-forwarding the cumulative
sequence each step in select_message is fine for our short
subtask completions.
3. Text-loss shape mismatch. _compute_text_loss applied lm_head to
the *full* vlm_out (image tokens + language tokens), then tried
to cross-entropy that against text_labels which only covers the
language portion — the .view(-1) calls would produce two
tensors of different lengths and CE would fail. Now slices
vlm_out to the last text_labels.shape[1] positions before
running lm_head, matching the [images, language] order
embed_prefix produces.
4. Dead-code conditional in _paligemma_forward_ki's single-expert
fallback. The ``if hasattr(...) else self._pi052_orig_forward``
ternary always took the wrong branch because the attribute is
always set (we save it in PI052Policy.__init__). Simplified to
just call self._pi052_orig_forward directly.
After this commit, pi052 should be runnable end-to-end for the
first time with all three loss heads + KI active. Still worth a
100-step smoke test before kicking off a long run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Pertsch et al. 2025 (FAST paper, [64] in π0.5) and π0.5 §III.C,
the recommended practice is to *fit* the FAST action tokenizer on
the specific dataset's action distribution rather than using the
published universal codebook off the shelf. The universal tokenizer
works on any 6-DoF action sequence but produces suboptimal
compression, which slows CE convergence and wastes vocab capacity.
New utility ``lerobot.policies.pi052.fit_fast_tokenizer``:
* samples N action chunks from the LeRobotDataset (default 1024)
* loads ``physical-intelligence/fast`` as the base
* calls ``.fit(actions)`` (the AutoProcessor API the HF model card
documents) — produces a per-dataset codebook
* saves to ``{cache_dir}/{sha256(dataset, base, n_samples)[:16]}/``
* returns the local path, ready to feed
``ActionTokenizerProcessorStep(action_tokenizer_name=...)``.
Cache is keyed on (dataset, base tokenizer, sample count) so changing
any of them re-runs the fit. Re-running training on the same dataset
re-uses the cache (one fit per dataset per machine).
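Condensed sketch of the flow (the ``.fit()`` call is the API the FAST
model card documents; the cache layout and exact signature here are
illustrative):

    import hashlib
    from pathlib import Path
    from transformers import AutoProcessor

    def fit_fast_tokenizer(action_chunks, dataset_repo_id,
                           base="physical-intelligence/fast",
                           cache_dir="~/.cache/lerobot/fast_tokenizer"):
        key = hashlib.sha256(
            f"{dataset_repo_id}|{base}|{len(action_chunks)}".encode()
        ).hexdigest()[:16]
        out_dir = Path(cache_dir).expanduser() / key
        if not out_dir.exists():
            tokenizer = AutoProcessor.from_pretrained(base, trust_remote_code=True)
            fitted = tokenizer.fit(action_chunks)   # per-dataset codebook
            fitted.save_pretrained(out_dir)
        return str(out_dir)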
Auto-fit wiring:
* PI052Config gets ``auto_fit_fast_tokenizer`` (default True),
``fast_tokenizer_cache_dir`` (default ~/.cache/lerobot/...),
``fast_tokenizer_fit_samples`` (default 1024).
* make_pi052_pre_post_processors now takes ``dataset_repo_id``;
when ``enable_fast_action_loss`` and ``auto_fit_fast_tokenizer``
are both True and a repo_id is provided, the factory calls
``fit_fast_tokenizer`` before constructing the processor step
and points it at the fitted path.
* ProcessorConfigKwargs gains ``dataset_repo_id``; the global
factory dispatch threads it through for ``pi052`` policies.
* lerobot_train.py populates ``processor_kwargs['dataset_repo_id']``
from ``--dataset.repo_id`` for pi052 runs.
Failure mode: if ``.fit()`` fails (e.g. older transformers without
the method, or no usable action chunks in the dataset), the factory
logs a warning and falls back to the universal base tokenizer. Train
still works; you just lose the compression improvement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions ported from ``pi05_full`` on branch ``feat/add-pi05``,
giving pi052 full paper-§III.B-C training capabilities alongside the
recipe-driven text supervision it already had:
* **Config flags** in PI052Config:
- ``enable_fast_action_loss`` default False
- ``action_tokenizer_name`` default "physical-intelligence/fast"
- ``max_action_tokens`` default 256
- ``fast_skip_tokens`` default 128
- ``fast_action_loss_weight`` default 1.0
- ``knowledge_insulation`` default False
* **Processor wiring** (processor_pi052.py): when
``enable_fast_action_loss=True``, append an
``ActionTokenizerProcessorStep`` after the text tokenizer. It
tokenises the action tensor with the FAST tokenizer and writes
ACTION_TOKENS / ACTION_TOKEN_MASK into ``COMPLEMENTARY_DATA`` —
the existing batch-collation pipeline forwards them as
``batch['action.tokens']`` / ``batch['action.token_mask']``.
* **FAST CE loss** (modeling_pi052.py::_compute_fast_action_loss):
Re-embeds the prefix [images, language], appends the FAST token
embeddings (using PaliGemma's shared embed_language_tokens),
forwards through the backbone, slices the trailing
``fast_len`` positions, applies the LM head, computes shifted
next-token CE with the action-mask gating the loss. The loss is
summed into ``forward()``'s total with ``fast_action_loss_weight``.
* **Knowledge insulation** (modeling_pi052.py::_compute_layer_ki +
_paligemma_forward_ki): port of pi05_full's per-layer attention
that detaches VLM K/V on the action-query path so action loss
gradients cannot flow back into the VLM's K/V projections. Bound
per-instance via ``types.MethodType`` so it doesn't leak into
stock ``pi05`` policies that share PaliGemmaWithExpertModel.
Activated automatically when ``config.knowledge_insulation=True``.
Combined with the existing recipe-driven text head, pi052 now
supports the full three-loss objective:
L = text_w·H(text) + fast_w·H(FAST actions) + flow_w·MSE(flow)
matching Eq. (1) of arxiv:2504.16054 §IV.D (α=10 by default for the
flow term, 1.0 each for text and FAST CE).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the working SmolVLA2 launch pattern so the two SLURM scripts
are interchangeable:
* literal NUM_PROCESSES / BATCH_SIZE / STEPS (no env-var defaults)
* STEPS=10000 to match the next SmolVLA2 run
* save_freq=$STEPS so only the final checkpoint is saved
* dropouts 0.1/0.1/0.1 (mild — matches the operator's iteration)
* flow_loss_weight / text_loss_weight come from the PI052Config
defaults (10.0 / 1.0 per Pi 0.5 paper §IV.D), no need to pass
them explicitly
Job name and policy_repo_id mirror the SmolVLA2 ``_tool-g2`` naming
so the two runs can be compared side-by-side in WandB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text
CE and flow MSE: actions are the primary output and the flow head
should dominate the gradient signal. SmolVLA2 was defaulting both
weights to 1.0, which inverts that — text CE (~0.5-2.0 nats) ends up
larger than flow MSE (~0.1-1.0), so the action expert gets less
gradient than the LM head despite being the primary task.
Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0.
Same as ``pi052`` (the new full reproduction policy).
Also pin the values explicitly in the SLURM launcher so the choice is
visible and overridable per-run rather than buried in the config
default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New ``lerobot.policies.pi052`` (parallel to ``smolvla2``) that adds
text-prediction + hierarchical-inference on top of the existing π0.5
implementation. Mirrors the paper's §IV.D dual-head training:
L = H(text) + α * ‖ω - a - f_θ_action(...)‖², α = 10
Components:
* ``configuration_pi052.py``: thin PI05Config subclass; adds recipe_path,
  text/flow loss weights (default α=10 per paper), prompt dropout knobs,
  ``unfreeze_lm_head``.
* ``text_processor_pi052.py``: PI052TextTokenizerStep — concatenates
  rendered messages as ``Role: ...`` plain text (PaliGemma has no chat
  template), tokenises with the PaliGemma tokenizer, builds a label mask
  covering supervised target spans. Includes Pi 0.7 §V.E per-component
  prompt dropout.
* ``processor_pi052.py``: make_pi052_pre_post_processors — Rename + Batch +
  Relative + Normalize + RenderMessagesStep + PI052TextTokenizerStep +
  Device. Falls back to π0.5's plain pipeline when recipe_path is unset.
* ``modeling_pi052.py``: PI052Policy(PI05Policy) — re-enables PaliGemma
  ``lm_head``, computes text_loss via CE on the supervised span, sums with
  flow_loss in forward(), and adds select_message for AR text generation
  at inference (same surface as SmolVLA2Policy.select_message so
  SmolVLA2Runtime drives it unchanged).
Plus the supporting plumbing:
* recipe ``configs/recipes/pi052_hirobot.yaml`` — same Hi-Robot blend
as smolvla2_hirobot.yaml, with the same ``${subtask}`` /
``if_present`` supervision fix (current span at every frame, not
``${next_subtask}``).
* SLURM ``examples/training/pi052_hirobot.slurm`` — full training
command matching the SmolVLA2 launcher.
* factory registration: ``--policy.type=pi052`` resolves to
PI052Policy with the new processor.
Same multi-rate runtime (``lerobot.policies.smolvla2.inference``)
drives this policy too — both expose ``predict_action_chunk`` for the
action expert and ``select_message`` for the LM head.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After _tool-good (2000 steps, 0.50/0.50/0.20 dropout) the LM head's
distribution at position 0 shifted from EOS to subtask-vocabulary
tokens but emitted bag-of-words ("cube arm and") rather than well-
formed sentences. That's the expected mid-fine-tuning phase: token-
level supervision has landed, sequence-level grammar hasn't.
Two changes for the next retrain:
* STEPS=15000 (from 2000) — chat-pretrained backbones need O(10k+)
steps to walk their pretraining priors down far enough to commit
to the fine-tuned distribution structurally, not just at the
token level. _tool-g2's bag-of-words output proves the model is
on the right path; it just needs more gradient signal.
* plan/memory dropout 0.50 -> 0.30 — 0.50 was probably too
aggressive for a small dataset. Half the training samples had
crucial context missing, which slows down learning the full
conditional structure. 0.30 still regularises against prompt
leakage but lets the model learn proper grammar first; the
higher dropout can be revisited once the head is solid.
Subtask dropout stays at 0.20 since subtask isn't in the high-level
prompt anyway (recipe fix removed the "Current subtask:" message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe fix (target=${subtask} instead of ${next_subtask}) shifted
the LM head's failure mode from "emit newlines" to "emit EOS at
position 0". On the new ``_tool-good`` checkpoint inference produces
exactly one token (``<end_of_utterance>``, id 49279) and decodes to
empty. That's the chat-pretrained backbone's short-turn EOS prior
not yet being overridden by 2000 steps of fine-tuning supervision.
Expose three knobs so the operator can probe whether the head has
real subtask-token probability mass *under* the EOS argmax without
recompiling or retraining:
--text_min_new_tokens=N suppress EOS for the first N tokens
--text_temperature=T sample at temperature T
--text_top_p=P nucleus filtering at top-p
These knobs are explicitly off-policy (training matched plain greedy
decoding with no min-token forcing),
so they shouldn't ship in production runs — but they let us tell
whether the model has *learned* subtask prediction (just under EOS)
or hasn't yet. If forcing min_new_tokens=3 with temperature=0.5
produces a sensible subtask, the model is fine and just needs more
training steps to walk EOS down. If it produces gibberish, training
hasn't progressed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the recipe fix (target=${subtask} at every frame) the model
can still reach low text_loss by reading the answer off the plan in
the prompt: at training the prompt contains the 6-step plan, and the
current subtask is one of those steps, so the model just learns
"active step N matches subtask N" and never needs to look at the
image. Symptom at inference: subtask string is set but never updates
because the model isn't really conditioning on the visual progress.
Drop plan and memory with p=0.50 each — half of training frames the
prompt is just "${task}" (constant for this dataset) + visual prefix,
which is the only place the answer can come from. Forces the LM head
to actually use vision.
``subtask_dropout`` stays at 0.20 because subtask isn't in the
high-level prompt anymore (recipe fix removed the "Current subtask:
X" message); the knob still affects other sub-recipes that reference
it as context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Normalize tensor and sequence sample indices before prompt dropout so distributed batched preprocessing does not try to cast full index tensors to scalars.
Co-authored-by: Cursor <cursoragent@cursor.com>
Match the operator's current training command for the _tool6 retrain:
* default DATASET / POLICY_REPO_ID / JOB_NAME point at the tool6
iteration (super_poulain_full_tool3 → smolvla2_hirobot_super_poulain_tool6)
* STEPS default 2000 (short enough to iterate; bump to 10k for full)
* save_freq=$STEPS so the only checkpoint is the final one
* OUTPUT_DIR includes step count so successive runs don't clobber
* Drop the wider augmentation envelope I added earlier — back to
default ColorJitter ranges (brightness ±20% etc) since the
high_level_subtask recipe fix (current-subtask supervision) is
expected to fix the LM-head collapse on its own; the augmentation
is just the standard regulariser, not a load-bearing widener.
* prompt-dropout fractions stay at the original 0.15 / 0.15 / 0.20.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The high_level_subtask recipe targeted ``nth_next(style=subtask, offset=1)``,
which on the last span of any episode resolves to None. The recipe had no
``if_present`` guard on the target, so the renderer emitted an empty
assistant turn and cross-entropy supervised the model on the chat
template's structural newlines (``\n``). Across the dataset this trained
the LM head's argmax at position 0 to collapse to ``\n`` whenever no
transition was imminent (i.e. most frames). Visible failure mode at
inference: the head emits 40+ newlines + ``<end_of_utterance>`` every
chunk boundary while the action expert keeps working — confirmed by
running the dry-run on dataset frame 0 with the dataset's own image
and seeing the same ``\n × 44`` collapse.
Switch to the Pi 0.5 / Pi 0.7 supervision pattern: at every frame, the
assistant target is the *current* active subtask span text (via
``${subtask}`` → ``active_at(t, style=subtask)``). Always non-empty,
always scene-grounded, ``if_present: subtask`` skips frames with no
active span instead of emitting a degenerate empty turn.
Runtime callsite update: ``_msgs_for_subtask`` no longer feeds a
"Current subtask: X" user message into the prompt (that would be
circular — we'd be telling the model the answer). Transition
detection moves into the runtime — when the predicted subtask differs
from ``state['current_subtask']``, the existing ``set_if_changed``
path fires ``subtask_change`` and downstream memory updates. Same
event surface, supervision target is now always meaningful.
Requires re-annotating the dataset and retraining for the fix to land
in the checkpoint, but the recipe + runtime change is what enables it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the dry-run REPL only ticked on user input (empty Enter
just redrew), so the bisection test "does the LM head produce text on
start_frame=0?" required typing something arbitrary to drive a tick.
Just run ``step_once`` at startup — the obs diagnostic *and* the
subtask gen both fire automatically, the diag row populates, and the
operator can read the result before pressing any key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tensor-level comparison between dry-run (dataset frame) and live-
robot inference proved the runtime is bug-free — same shape, dtype,
device, channel order, batch dim, and normalization on both paths.
The remaining variable: front-camera mean brightness was 0.26 live vs
0.39 on the dataset frame, ~33% darker. Training augmentation only
covered ±20% brightness, so the live scene sits just outside the
supervised envelope and the LM head collapses to its dominant prior.
Widen the augmentation knobs for the next retrain:
* brightness 0.8–1.2 → 0.5–1.6 (covers ~30% darker / 60% lighter)
* contrast 0.8–1.2 → 0.6–1.5
* saturation 0.5–1.5 → 0.3–1.7
* hue ±0.05 → ±0.10
* affine ±5°/±5% → ±15°/±15% (covers cube placement / camera drift)
* max_num_transforms 3 → 4
And bump prompt-component dropout (subtask 0.20 → 0.30) so the LM
can't lean on stale memorised plan/memory at inference.
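The equivalent torchvision-v2 ranges (the run actually sets these through
the image-transforms config rather than building the pipeline by hand;
this is only a sketch of the new envelope):

    from torchvision.transforms import v2

    widened = v2.Compose([
        v2.ColorJitter(brightness=(0.5, 1.6), contrast=(0.6, 1.5),
                       saturation=(0.3, 1.7), hue=(-0.10, 0.10)),
        v2.RandomAffine(degrees=15, translate=(0.15, 0.15)),
    ])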
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dry-run REPL only fires a tick when the user types, so the
``_log_obs_tensors_once`` diagnostic never reached stdout (the
provider was never called). Probe the provider once at startup —
the result is discarded; we only care about the obs log it triggers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helper that prints (once per provider lifetime) every
``observation.*`` tensor the policy is about to see, with its shape,
dtype, device, and per-channel min/max/mean/std. Wired into both the
dry-run dataset path and the live-robot path.
Now we can bisect train/inference mismatch *at the tensor level* —
if the same checkpoint produces coherent text on one path's tensors
and ``\n`` on the other's, and the printed tensor stats differ
materially, the bug is in the observation prep, not in the model or
the training distribution.
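The helper is approximately (assumes per-camera tensors in (C, H, W)
layout; names and the once-per-lifetime flag are illustrative):

    import torch

    _logged = False

    def _log_obs_tensors_once(obs: dict) -> None:
        global _logged
        if _logged:
            return
        _logged = True
        for key, t in obs.items():
            if not key.startswith("observation.") or not torch.is_tensor(t):
                continue
            flat = t.float().flatten(start_dim=1) if t.dim() > 1 else t.float()[None]
            print(f"{key}: shape={tuple(t.shape)} dtype={t.dtype} device={t.device}")
            print(f"  min={flat.min(1).values.tolist()} max={flat.max(1).values.tolist()}")
            print(f"  mean={flat.mean(1).tolist()} std={flat.std(1).tolist()}")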
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the training-time torchvision-v2 ColorJitter / SharpnessJitter /
RandomAffine pipeline to dataset frames in dry-run, so we can isolate
whether the LM head's collapse to '\n' on live frames is:
* pure scene-content OOD (unaugmented dataset frames work, mildly
augmented ones still work — model has learned the augmentation
distribution, only fails when the scene content itself diverges)
* hyper-specific memorisation (dry-run with augmentation also
collapses to '\n' — head is nailed to the exact unperturbed
training samples and only the retrain helps)
Usage:
lerobot-smolvla2-runtime --no_robot --policy.path=... \
--dataset.repo_id=... --dataset.episode=0 \
--dataset.start_frame=1000 \
--dataset.augment_at_inference
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So the operator can compare live joint values to the dataset's
``observation.state`` mean/std and spot when the robot's home pose is
several σ off the supervised support region. State OOD is the
remaining viable hypothesis for why the live LM head collapses to
``\n`` even though images are pixel-shape-matched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Print one warning the first time the robot observation provider runs
through, showing live camera resolution and the dataset's training
resolution, plus whether we resized. Lets the operator confirm at a
glance that the visual prefix really is being fed at the same shape
the model saw at training — instead of guessing whether the resize
fired silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for the LM head's empty-completion symptom on the live robot
(while the same checkpoint produced sensible subtask/plan/memory in
``--no_robot`` dry-run on dataset frames): the camera observation was
flowing into the model at its native resolution. A Mac/USB webcam
hands us 1280×720 or 1920×1080; the dataset was recorded at the
feature schema's ``observation.images.*['shape']`` resolution
(typically 480×640). SmolVLA's internal ``resize_with_pad(512, 512)``
*does* fit both — but with very different pad geometry, so visual
tokens at each tile carry different content than at training. Action
expert tolerates this; the tightly-supervised LM head goes OOD and
the head's distribution at position 0 collapses to its dominant mode
(``\n`` ×N then ``<end_of_utterance>`` for this checkpoint).
The fix: in ``_build_robot_observation_provider``, pre-compute the
camera-key → (H, W) target from ``ds_features`` and ``cv2.resize``
each live frame to that shape before tensorising. The downstream
``resize_with_pad`` then sees the same input geometry as training and
the LM head returns to producing readable subtask text under plain
greedy decoding — the same as dry-run.
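The provider-side fix, sketched (assumes the feature schema stores image
shapes as (H, W, C); note ``cv2.resize`` takes (width, height)):

    import cv2
    import numpy as np

    def _to_dataset_resolution(frame: np.ndarray, ds_shape) -> np.ndarray:
        """Resize a live (H, W, C) camera frame to the dataset's recorded
        resolution so resize_with_pad sees training-time geometry."""
        target_h, target_w = int(ds_shape[0]), int(ds_shape[1])
        if frame.shape[:2] != (target_h, target_w):
            frame = cv2.resize(frame, (target_w, target_h), interpolation=cv2.INTER_AREA)
        return frame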
Also drops the inference-time patches (``min_new_tokens``,
``temperature``, ``top_p`` overrides) on the four high-level callers.
They were band-aids around the visual-distribution shift, not a real
LM problem, and they drift inference off the training distribution.
Greedy argmax is what training matched. The ``select_message``
signature still accepts the knobs for callers that want them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempt only masked the tokenizer's eos_token_id during the
min_new_tokens prefix. The empty-completion symptom persisted because a
memorised SmolVLM head doesn't just want EOS — its top-1 at position 0
is *some* special token, and when EOS is masked the argmax shifts to a
sibling (``<|im_end|>``, ``<image>``, ``<fake_token_around_image>``,
``<row_X_col_Y>``, …). Those tokens survive generation but then get
stripped by ``decode(skip_special_tokens=True)``, so the runtime still
saw ``last_raw='(empty)'`` every chunk boundary.
Mask the full ``tokenizer.all_special_ids`` set instead. Forces the
head to commit to a normal vocabulary token before it can close or
quietly poison the turn.
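Sketch of the widened mask inside the decode loop (``all_special_ids`` is
the standard HF tokenizer attribute; the loop variables are illustrative):

    def _mask_specials(logits, tokenizer, n_generated: int, min_new_tokens: int):
        """Before min_new_tokens real tokens exist, ban every special token,
        not just EOS."""
        if n_generated < min_new_tokens:
            logits[..., tokenizer.all_special_ids] = float("-inf")
        return logits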
Also: when decode returns empty but tokens *were* generated, expose
the raw token ids and the special-tokens-included decoded string via
``policy._last_select_message_debug``. The runtime surfaces this in
the scrollback so the operator can see what the head is actually
emitting — distinguishing "head EOS-ing" from "head emitting image
placeholders" from "head emitting chat-template fragments".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot run confirmed the LM head is producing 0 tokens at every
chunk boundary (empty:N counter climbing, no exception in scrollback):
the model EOS-es at decode step 0. That's the memorisation collapse —
training reached text_loss=6e-6 by overfitting one trajectory whose
supervised subtask turn ended in EOS, and at inference the head's
argmax for token 0 is EOS regardless of the actual frame.
Two changes in select_message:
* ``min_new_tokens`` parameter masks the EOS logit to -inf until at
least N real tokens have been decoded. Without this the head's
"EOS first" prior produces an empty completion every single time.
* The runtime callers now pass ``min_new_tokens=5..10`` plus
``temperature=0.4..0.5`` + ``top_p=0.9``. Sampling at moderate
temperature with nucleus filtering also helps break the greedy
argmax collapse — when the model has memorised one continuation,
greedy keeps replaying it; nucleus sampling forces it to commit
to *some* coherent continuation that's well-supported by the
prefix even when greedy's top-1 is degenerate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two improvements for diagnosing why ``last_raw`` stays empty:
1. The autonomous panel-redraw thread calls console.clear() every
0.5 s, wiping any log lines the runtime printed since the last
redraw. So warnings from generation (``[warn] subtask gen failed:
...``, ``[info] subtask gen rejected (gibberish): ...``) flashed
for milliseconds and disappeared, leaving the operator blind.
Capture log_lines from each tick into a bounded scrollback
(last 12 entries) and render them inside the panel itself, below
the diag row. They now stick across redraws until rotated out.
2. ``empty`` counter for subtask gen. Persistent empty completions
are their own failure mode — the LM head EOS-es immediately from
the chat-template generation prompt, distinct from "generated
something but filter rejected it". The diag row now reads:
subtask diag repeat:0 gibberish:0 empty:14 last_raw: '(empty)'
(``empty:14`` is the new counter)
plus a periodic log line every 10 empties so the cause is also
surfaced in the scrollback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both HighLevelSubtaskFwd and LowLevelForward are gated on
'action queue is empty'. With LowLevelForward listed first, it refilled
the queue on the empty-queue tick before HighLevelSubtaskFwd got to
check — so the gate I added in the previous commit made the high-level
step a permanent no-op after the initial bootstrap. Visible symptom:
subtask string never advances past whatever bootstrap seeded, no
subtask_change events, memory stays unset, and the new overfit
diagnostics never appear on the panel because last_subtask_raw is
never written.
Move all high-level steps (subtask, memory, interjection, vqa) ahead
of LowLevelForward. On an empty-queue tick the subtask refreshes
first, the new string flows into the next chunk's prompt, then
LowLevelForward generates the chunk, then DispatchAction drains it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autonomous-mode panel now surfaces what the model is *actually*
producing at every chunk boundary, not just what got accepted:
* last_subtask_raw most recent generation (accepted or not)
* subtask_repeat_count times the same accepted string regenerated
* subtask_gibberish_count rejections by the gibberish filter
* memory_gibberish_count / plan_gibberish_count for the other heads
These let the operator see memorisation collapse without scrolling
back through logs:
subtask diag repeat:8 gibberish:0 last_raw: '<same string>'
(high repeat count → model can't move past current phase)
subtask diag repeat:0 gibberish:14 last_raw: 'Ass:::'
(high gibberish count → LM collapsed to template salad)
Also silences the per-action ``Relative goal position magnitude had
to be clamped`` warning. The clamp fires every dispatch tick when the
model emits stale joint targets, flooding the panel at ctrl_hz=30.
Replaced the bare ``logging.warning`` call in robots/utils.py with a
module logger so it can be selectively raised to ERROR. Operators
who need the per-tick clamp detail can use ``-v``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third stdin channel alongside 'task:' and bare interjections:
rephrase: <text>
Swaps state['task'] with the new string while preserving plan/memory/
subtask. Lets the operator probe how robust the model is to wording
variations of the same task — the trained augmentation provided
n_task_rephrasings≈30 task wordings per dataset task, and this is the
direct way to exercise that distribution at inference without
generating a fresh plan via user_interjection_response.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both stdin handlers (autonomous mode and rich REPL) gated 'task:' to
'only if no task is set yet' — once the initial task existed, typing
'task: <new task>' silently fell through to the interjection branch.
Make 'task:' always override the active task and clear stale
plan/memory/subtask so the next high-level pass regenerates context
from scratch for the new task.
For rephrasings within the same task, the interjection path
(user_interjection_response recipe) is still the right channel — it
refreshes the plan and emits a paired <say> in one trained call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime is single-threaded. `HighLevelSubtaskFwd` at HzTrigger(1.0)
fires every loop iteration on MPS because each `select_message` call
takes ~2 s, longer than its 1/hz period. The whole tick stretches to
~2.5 s, so `DispatchAction` (HzTrigger 30) only pops a single action per
loop iteration — the queue drains at ~0.4 actions/sec instead of 30 and
the robot barely moves between chunk refreshes.
Two changes, both purely about scheduling — no threading:
* Gate `HighLevelSubtaskFwd` to fire only when the action queue is
empty, matching `LowLevelForward`'s refresh condition. The slow LLM
call now happens during the "think" phase between chunks, not on
every dispatch tick. Restores a clean sense → think → act cycle.
* `DispatchAction` catches up via wall-clock: when the trigger fires
after a stall, pop `round(elapsed * hz)` entries and send only the
most recent. Open-loop chunks are timestamped at ctrl_hz; sending
stale joint targets one-by-one would just lag the robot further
behind. The dynamixel smooths to the latest goal anyway.
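The catch-up pop, sketched (queue and clock handling are illustrative):

    import time

    def _pop_latest(queue: list, last_dispatch_t: float, hz: float):
        """After a stall, drop the actions we are behind on and send only the newest."""
        n = min(len(queue), max(1, round((time.monotonic() - last_dispatch_t) * hz)))
        latest = queue[n - 1]
        del queue[:n]
        return latest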
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>