Commit Graph

1547 Commits

Author SHA1 Message Date
Pepijn b2aa372fcf refactor(recipes): fold memory into action_execution, drop interjection, fuse smolvla2 forward
Recipe changes:
* action_execution now bundles the memory update as a second
  assistant target gated on a new ``new_memory`` binding (fires
  only at subtask-boundary frames). No "Completed subtask: X"
  filler — the model emits the new subtask AND the updated
  memory back-to-back in one prefix.
* user_interjection_response sub-recipe removed (current
  datasets don't have interjection / say() annotations).
* Standalone memory_update sub-recipe removed (folded above).
* Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist
  0.075 each (sums to 1.0).

Runtime ``_msgs_for_memory`` updated to match the new
boundary-frame prompt layout.

Modeling:
* SmolVLA2Policy now fuses the flow + text losses into a SINGLE
  backbone forward via ``_compute_fused_loss`` (one
  vlm_with_expert pass with [prefix, suffix] embeds, then both
  lm_head CE on lang slice + action_out_proj MSE on suffix).
  Mirrors pi052's existing ``_compute_all_losses_fused`` —
  saves one backbone pass per training step.

Examples:
* Removed the two training SLURM scaffolds; they were
  out-of-date with the recipe refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:51:09 +02:00
Pepijn 058b8f3958 refactor(recipes): two-flavor design — one fused action_execution + text-only events
Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.

Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------

One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:

  * subtask prediction (text CE on the assistant span via lm_head)
  * action chunks (flow MSE on the action expert via
    stream: low_level, target: true; plus FAST CE on action tokens
    when enable_fast_action_loss=True)

At inference, the *same* prompt structure drives both inference
modes:

  * select_message(user_prompt_only) → LM head generates the next
    subtask. Matches action_execution's training distribution
    exactly (prompt is the user turn, target is the subtask).
  * predict_action_chunk(user_prompt + assistant_subtask) → action
    expert produces the chunk. Matches action_execution's full
    prompt+target.

This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.

Flavor 2 — event-driven text-only recipes
-----------------------------------------

Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.

  * memory_update           (10%)  new memory at subtask boundary
  * user_interjection_response (15%) new plan + say(...) on input
  * ask_vqa_top             (7.5%) front-camera VQA
  * ask_vqa_wrist           (7.5%) wrist-camera VQA

Total weight = 1.0.

Prompt format consistency
-------------------------

User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.
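As a toy sketch of that template (rendered here via stdlib ``string.Template``; the real substitution is done by the recipe engine, and the helper name is invented):

```python
from string import Template

# Same layout as the recipe template: bare task string first,
# then literal "Plan: " / "Memory: " labels. No "Task: " prefix.
PROMPT = Template("${task}\nPlan: ${plan}\nMemory: ${memory}")

def render_user_prompt(task: str, plan: str, memory: str) -> str:
    """Hypothetical helper mirroring the user-prompt layout above."""
    return PROMPT.substitute(task=task, plan=plan, memory=memory)
```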

What changed structurally
-------------------------

  - low_level_execution            DROPPED  (folded into action_execution)
  - high_level_subtask             DROPPED  (subtask supervision moved into action_execution)
  + action_execution               NEW      (the fused main recipe)
    memory_update                  kept, prompt cleaned up
    user_interjection_response     kept, prompt cleaned up
    ask_vqa_top / ask_vqa_wrist    kept

Runtime compatibility
---------------------

No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:35:51 +02:00
Pepijn b873fe454c perf(pi052): full fusion — text + FAST + flow in ONE backbone forward
Previously the forward did 2 backbone passes when all heads were
active: one for flow (via super().forward) and one for the fused
text+FAST helper. This commit reduces it to **one pass** — same
compute as flow-only training.

New ``_compute_all_losses_fused`` builds:

    prefix = [images, language, FAST (when provided)]
    suffix = [noisy_actions]  (action expert via gemma_expert)

and runs a single ``paligemma_with_expert.forward`` with
``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active
in the same call). Captures *both* prefix_out and suffix_out, slices
each for its respective loss:

    flow MSE     ← suffix_out  (existing action_out_proj + MSE path)
    text  CE     ← prefix_out at language positions (lm_head + CE)
    FAST  CE     ← prefix_out at FAST positions (lm_head + CE)
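A toy-tensor sketch of those per-slice heads (shapes, stand-in modules, and variable names are invented; label shifting and loss masking are omitted for brevity):

```python
import torch
import torch.nn.functional as F

B, IMG, LANG, FAST, SUF, D, V, ADIM = 2, 4, 3, 5, 6, 8, 11, 4
lm_head = torch.nn.Linear(D, V)             # stands in for the LM head
action_out_proj = torch.nn.Linear(D, ADIM)  # stands in for the action projection

prefix_out = torch.randn(B, IMG + LANG + FAST, D)  # [images, language, FAST]
suffix_out = torch.randn(B, SUF, D)                # action-expert positions

text_labels = torch.randint(0, V, (B, LANG))
fast_labels = torch.randint(0, V, (B, FAST))
flow_target = torch.randn(B, SUF, ADIM)

# each loss reads its own slice of the single fused forward's outputs
text_ce = F.cross_entropy(
    lm_head(prefix_out[:, -(FAST + LANG):-FAST]).flatten(0, 1),
    text_labels.flatten())
fast_ce = F.cross_entropy(
    lm_head(prefix_out[:, -FAST:]).flatten(0, 1),
    fast_labels.flatten())
flow_mse = F.mse_loss(action_out_proj(suffix_out), flow_target)
```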

Critical attention mask override
--------------------------------

``make_att_2d_masks`` produces a cumulative-block attention mask in
which suffix tokens (highest cumsum) attend to *every* lower-cumsum
position by default, including FAST tokens. If we let that stand the
action expert reads the discrete FAST tokens and trivially decodes
them back to the same continuous actions the flow head is supposed
to predict from noise — the entire training signal collapses to a
copy operation.

The fix is a single line right after make_att_2d_masks:

    att_2d_masks[:, fast_end:, fast_start:fast_end] = False

Explicitly zeros out *suffix → FAST* attention. Everything else
remains correct under the cumsum semantics:

  * prefix images/language stay bidirectional among themselves
  * FAST stays causal within itself, attending bidirectionally
    to images+language
  * FAST cannot see suffix (cumsum < suffix cumsum, default)
  * suffix attends bidirectionally among itself, to images+language,
    and now NOT to FAST (this override)
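A toy reproduction of the override (the all-True starting mask is a simplification standing in for the cumsum-built mask's suffix rows; index ranges are invented):

```python
import torch

IMG_LANG, FAST, SUF = 4, 3, 2
seq = IMG_LANG + FAST + SUF
fast_start, fast_end = IMG_LANG, IMG_LANG + FAST

# [batch, seq, seq] boolean attend-matrix; pretend the cumsum semantics
# let suffix rows attend to every earlier position, including FAST
att_2d_masks = torch.ones(1, seq, seq, dtype=torch.bool)

# the one-line fix: suffix rows may no longer read the FAST columns
att_2d_masks[:, fast_end:, fast_start:fast_end] = False
```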

Bit-equivalent to the previous separated forward path for the
text+FAST losses: the prefix hidden states at language and FAST
positions are identical whether or not the suffix is present,
because the prefix doesn't attend to the suffix. For the flow loss,
masking suffix→FAST is the behaviour we *want*; if anything the
previous separated path was the less correct one for production
use, because its flow forward ran the action expert against a
prefix without the FAST extension, so the joint gradient signal
through the action expert was missing those positions.

Forward routing in ``forward()``
--------------------------------

  * run_flow=True  →  _compute_all_losses_fused (one forward, all
                      three losses)
  * run_flow=False, run_text or run_fast → _compute_text_and_fast_loss
                      (one prefix-only forward, two CE losses, no
                      suffix → cheaper than fusion)
  * neither       →  RuntimeError (explicit; both losses disabled)

Wall-time per step
------------------

  Before this commit:  flow + (text+FAST fused) = 2 forwards
  After this commit:   (flow+text+FAST fused)   = 1 forward

Compute parity with flow-only training when all three heads active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:28:38 +02:00
Pepijn 83d7250a22 fix(recipes): low_level_execution needs if_present:subtask guard too
Same bug we fixed for high_level_subtask, just on the other
subtask-supervised sub-recipe. ``low_level_execution`` targets
``${subtask}`` (the current active span) but had no
``if_present`` guard. When ``active_at(t, style=subtask)`` returned
None at a frame (gaps in the annotation, or the very first/last
frames of an episode if the annotator's spans don't fully tile),
the assistant message rendered with empty content. The chat
tokenizer still included it in ``target_message_indices`` → text CE
supervised whatever the chat-template's empty assistant turn
decoded to (usually a single ``\n``). That trains the LM head's
prior at the first generation position toward ``\n``, the same
collapse we observed with the original ``${next_subtask}`` target.

Fix: ``if_present: subtask`` on the assistant target in
``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and
``pi052_hirobot.yaml``.

Side effect: frames without an active subtask span no longer
contribute to the flow loss either (the only ``low_level`` target
is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))``
becomes False). For a well-annotated dataset where subtask spans
tile the whole episode this is a no-op. For datasets with gaps,
those gap frames lose flow supervision — strictly better than the
degenerate text-CE alternative.

Sub-recipe audit summary (no other changes needed):

  * memory_update                 — all if_present guards present, OK
  * user_interjection_response    — all if_present guards present, OK
  * high_level_subtask            — fixed earlier, OK
  * low_level_execution           — fixed by this commit
  * ask_vqa_top / ask_vqa_wrist   — query+answer both guarded, OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:22:45 +02:00
Pepijn 35f9063a6c perf(pi052): fuse text + FAST loss into a single prefix forward
Previously the forward did three backbone passes per training step
when all heads were active: one for flow (via super().forward), one
for text CE, and one for FAST CE. That's ~3× the compute of
flow-only training.

The text and FAST losses share their prefix forward exactly — both
are CE on the LM head, evaluated at different slices of the same
hidden states. Adding FAST tokens after language in the prefix is
bit-equivalent for the text loss because the mask_ar convention in
``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal
block: language tokens never see FAST, so their hidden states are
unchanged.

New ``_compute_text_and_fast_loss``:

  * embeds [images, language] once
  * optionally appends [FAST] (when run_fast is True)
  * one backbone forward
  * slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for
    language hidden states (or ``vlm_out[:, -lang_len:]`` when no
    FAST) → text CE
  * slices ``vlm_out[:, -fast_len:]`` for FAST hidden states →
    FAST CE
  * returns both losses, either of which can be None when the
    caller doesn't want that head.

forward() now calls this fused helper instead of running the two
separate ``_compute_text_loss`` / ``_compute_fast_action_loss``
methods. Those remain in the file for callers that only want one
head (e.g. ablations).
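The slice arithmetic can be sketched as a tiny helper (hypothetical name, not the real method; ``seq_len`` plays the role of the prefix output's sequence dimension):

```python
def lang_slice(seq_len: int, lang_len: int, fast_len: int) -> slice:
    """Language hidden states sit at the tail, just before any FAST block."""
    if fast_len:
        # vlm_out[:, -(fast_len + lang_len):-fast_len]
        return slice(seq_len - fast_len - lang_len, seq_len - fast_len)
    # vlm_out[:, -lang_len:]
    return slice(seq_len - lang_len, seq_len)
```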

Why flow isn't fused
--------------------

Flow MSE comes from the action-expert (suffix) hidden states, which
attend to the prefix. If we just concat FAST onto the prefix and let
the action expert attend to it, the expert can trivially decode FAST
back to continuous actions — overfitting via shortcut. Preventing
that requires a custom segment-aware attention mask (action expert
can attend to images+language but NOT to subtask/FAST), which is
what pi05_full does in ``compute_layer_complete_knowledge_insulation``.
That's the full-fusion path; deferred as a follow-up since the
text+FAST fusion already recovers most of the compute.

End-to-end forward pass count
-----------------------------

Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards
After:  1 (flow) + 1 (text+FAST fused) = 2 backbone forwards

~33% wall-time reduction per training step when all three heads
are active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:08:34 +02:00
Pepijn 17c0800461 fix(pi052): FAST loss masking + predict_actions gating + smolvla2 review
FAST loss changes
-----------------

1. Gate by ``predict_actions`` (same routing as flow loss). The
   ActionTokenizerProcessorStep tokenises actions for *every*
   sample regardless of which sub-recipe rendered it; for text-only
   recipes (high_level_subtask, memory_update, ...) the action
   tokens are still in the batch but mustn't be supervised. Skip
   the FAST forward+CE entirely when no sample in the batch has
   ``predict_actions=True``.

2. Switch from "multiply-by-mask" masking to ``ignore_index=-100``.
   The old pattern computed per-token CE for all positions, then
   zeroed out invalid ones. Two issues: (a) any out-of-vocab target
   id at a padded position would have crashed cross_entropy before
   the mask got a chance to zero it out, and (b) the pattern is
   needlessly clever. Now ``shift_targets.masked_fill(~mask, -100)``
   followed by ``ignore_index=-100`` cleanly drops invalid positions.
   Matches the smolvla2 text-loss convention.

3. Clean up unused ``bsize`` variable in _compute_fast_action_loss
   and expand the attention-mask docstring with the
   ``make_att_2d_masks`` mask_ar convention spec (causal vs
   bidirectional blocks).
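A toy sketch of the masking pattern from point 2 (tensors invented; note the deliberately out-of-vocab id at a masked position, which the old multiply-by-mask pattern would have crashed on):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 4, 10)               # [batch, positions, vocab=10]
targets = torch.randint(0, 10, (2, 4))
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 0]], dtype=torch.bool)

# junk id at a padded position: cross_entropy never sees it because
# masked_fill turns it into the ignore_index first
targets[0, 2] = 10_000

safe_targets = targets.masked_fill(~mask, -100)
loss = F.cross_entropy(
    logits.flatten(0, 1), safe_targets.flatten(), ignore_index=-100)
```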

smolvla2 audit (reference review, no code change)
-------------------------------------------------

Compared smolvla2/modeling_smolvla2.py against pi052/modeling_pi052.py
to catch parallel bugs. Findings:

* No ``paligemma.language_model`` vs ``paligemma.model.language_model``
  issue — smolvla2 uses SmolVLM (different class, different attribute
  layout) so the bug doesn't apply.

* ``fill_kv_cache=True`` is correctly passed to smolvla's
  ``vlm_with_expert.forward`` — that class *does* accept the kwarg
  (unlike pi05's PaliGemmaWithExpertModel.forward, which is why
  pi052 must omit it).

* Text-loss alignment is correct: ``_compute_text_loss`` computes
  ``lang_start`` / ``lang_end`` from the known prefix layout
  (``[image_blocks..., lang, state]``) and slices ``prefix_out``
  to just the language positions before applying ``lm_head``. The
  parallel bug I fixed in pi052 (lm_head over the full prefix,
  shape-mismatched against text_labels) was *not* present in
  smolvla2.

* Per-sample flow routing via ``predict_actions``: correctly masks
  per-sample by calling the parent ``forward(..., reduction='none')``
  and applying the predict_actions mask before the mean. pi052 only
  has the batch-level any() gate — a parallel improvement for pi052
  would require modifying PI05Pytorch.forward to support per-sample
  reduction, deferred.

* ``reduction="none"`` returns ``total.expand(bsize)``: identical
  scalar-broadcast limitation in both policies. Acknowledged but
  low priority (only RA-BC weighting uses the per-sample path and
  it's documented as a known approximation in smolvla2).

* Chat tokenizer correctly handles batched/unbatched messages,
  pads with -100 for label positions, builds attention masks. No
  bugs found.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:05:37 +02:00
Pepijn c8763e0ad5 fix(pi052): four real bugs in the modeling code + flip defaults
Defaults
--------
* enable_fast_action_loss: False -> True   (match paper §III.B-C Eq.1)
* auto_fit_fast_tokenizer: True -> False   (opt-in; needs base.fit())

Bug fixes
---------

1. Wrong attribute path on PaliGemma. The KI port copied
   pi05_full's ``paligemma.language_model.layers[...]`` literally,
   but the production pi05 wrapper exposes the text model at
   ``paligemma.model.language_model``. With KI enabled, every layer
   would have raised AttributeError on first forward. Fixed all
   references in _compute_layer_ki + _paligemma_forward_ki.

2. ``fill_kv_cache=True`` passed to PaliGemmaWithExpertModel.forward.
   That kwarg is a SmolVLA-only concept; pi05's signature has no
   such argument, so every forward call from pi052 (text loss, FAST
   loss, select_message) would have crashed with TypeError. Dropped
   from all four call sites — pi05's forward already handles the
   cache via past_key_values, and re-forwarding the cumulative
   sequence each step in select_message is fine for our short
   subtask completions.

3. Text-loss shape mismatch. _compute_text_loss applied lm_head to
   the *full* vlm_out (image tokens + language tokens), then tried
   to cross-entropy that against text_labels which only covers the
   language portion — the .view(-1) calls would produce two
   tensors of different lengths and CE would fail. Now slices
   vlm_out to the last text_labels.shape[1] positions before
   running lm_head, matching the [images, language] order
   embed_prefix produces.

4. Dead-code conditional in _paligemma_forward_ki's single-expert
   fallback. The ``if hasattr(...) else self._pi052_orig_forward``
   ternary always took the wrong branch because the attribute is
   always set (we save it in PI052Policy.__init__). Simplified to
   just call self._pi052_orig_forward directly.

After this commit, pi052 should be runnable end-to-end for the
first time with all three loss heads + KI active. Still worth a
100-step smoke test before kicking off a long run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:58:40 +02:00
Pepijn 0f4faddc01 feat(pi052): auto-fit FAST tokenizer per-dataset before training
Per Pertsch et al. 2025 (FAST paper, [64] in π0.5) and π0.5 §III.C,
the recommended practice is to *fit* the FAST action tokenizer on
the specific dataset's action distribution rather than using the
published universal codebook off the shelf. The universal tokenizer
works on any 6-DoF action sequence but produces suboptimal
compression, which slows CE convergence and wastes vocab capacity.

New utility ``lerobot.policies.pi052.fit_fast_tokenizer``:

  * samples N action chunks from the LeRobotDataset (default 1024)
  * loads ``physical-intelligence/fast`` as the base
  * calls ``.fit(actions)`` (the AutoProcessor API the HF model card
    documents) — produces a per-dataset codebook
  * saves to ``{cache_dir}/{sha256(dataset, base, n_samples)[:16]}/``
  * returns the local path, ready to feed
    ``ActionTokenizerProcessorStep(action_tokenizer_name=...)``.

Cache is keyed on (dataset, base tokenizer, sample count) so changing
any of them re-runs the fit. Re-running training on the same dataset
re-uses the cache (one fit per dataset per machine).
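A sketch of the cache-key scheme (function and argument names, and the exact key serialization, are assumptions; only the sha256-and-truncate-to-16 recipe follows the text):

```python
import hashlib
from pathlib import Path

def fast_cache_dir(cache_root: str, dataset: str, base: str, n_samples: int) -> Path:
    """Hypothetical helper: deterministic dir keyed on (dataset, base, n_samples)."""
    key = hashlib.sha256(f"{dataset}|{base}|{n_samples}".encode()).hexdigest()[:16]
    return Path(cache_root) / key
```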

Auto-fit wiring:

  * PI052Config gets ``auto_fit_fast_tokenizer`` (default True),
    ``fast_tokenizer_cache_dir`` (default ~/.cache/lerobot/...),
    ``fast_tokenizer_fit_samples`` (default 1024).
  * make_pi052_pre_post_processors now takes ``dataset_repo_id``;
    when ``enable_fast_action_loss`` and ``auto_fit_fast_tokenizer``
    are both True and a repo_id is provided, the factory calls
    ``fit_fast_tokenizer`` before constructing the processor step
    and points it at the fitted path.
  * ProcessorConfigKwargs gains ``dataset_repo_id``; the global
    factory dispatch threads it through for ``pi052`` policies.
  * lerobot_train.py populates ``processor_kwargs['dataset_repo_id']``
    from ``--dataset.repo_id`` for pi052 runs.

Failure mode: if ``.fit()`` fails (e.g. older transformers without
the method, or no usable action chunks in the dataset), the factory
logs a warning and falls back to the universal base tokenizer. Train
still works; you just lose the compression improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:52:31 +02:00
Pepijn 8dc0af3c28 feat(pi052): FAST action CE loss + knowledge insulation + processor wiring
Three additions ported from ``pi05_full`` on branch ``feat/add-pi05``,
giving pi052 full paper-§III.B-C training capabilities alongside the
recipe-driven text supervision it already had:

* **Config flags** in PI052Config:
    - ``enable_fast_action_loss``  default False
    - ``action_tokenizer_name``    default "physical-intelligence/fast"
    - ``max_action_tokens``        default 256
    - ``fast_skip_tokens``         default 128
    - ``fast_action_loss_weight``  default 1.0
    - ``knowledge_insulation``     default False

* **Processor wiring** (processor_pi052.py): when
  ``enable_fast_action_loss=True``, append an
  ``ActionTokenizerProcessorStep`` after the text tokenizer. It
  tokenises the action tensor with the FAST tokenizer and writes
  ACTION_TOKENS / ACTION_TOKEN_MASK into ``COMPLEMENTARY_DATA`` —
  the existing batch-collation pipeline forwards them as
  ``batch['action.tokens']`` / ``batch['action.token_mask']``.

* **FAST CE loss** (modeling_pi052.py::_compute_fast_action_loss):
  Re-embeds the prefix [images, language], appends the FAST token
  embeddings (using PaliGemma's shared embed_language_tokens),
  forwards through the backbone, slices the trailing
  ``fast_len`` positions, applies the LM head, computes shifted
  next-token CE with the action-mask gating the loss. The loss is
  summed into ``forward()``'s total with ``fast_action_loss_weight``.

* **Knowledge insulation** (modeling_pi052.py::_compute_layer_ki +
  _paligemma_forward_ki): port of pi05_full's per-layer attention
  that detaches VLM K/V on the action-query path so action loss
  gradients cannot flow back into the VLM's K/V projections. Bound
  per-instance via ``types.MethodType`` so it doesn't leak into
  stock ``pi05`` policies that share PaliGemmaWithExpertModel.
  Activated automatically when ``config.knowledge_insulation=True``.
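The core K/V-detach idea can be sketched with a toy attention (not the real per-layer port; names invented):

```python
import torch

def insulated_attention(q_action, k_vlm, v_vlm):
    """Action queries read VLM K/V, but no gradient flows back into them."""
    k, v = k_vlm.detach(), v_vlm.detach()   # block grads into the VLM's K/V path
    att = torch.softmax(
        q_action @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
    return att @ v
```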

Combined with the existing recipe-driven text head, pi052 now
supports the full three-loss objective:

   L = text_w·H(text) + fast_w·H(FAST actions) + flow_w·MSE(flow)

matching Eq. (1) of arxiv:2504.16054 §IV.D (α=10 by default for the
flow term, 1.0 each for text and FAST CE).
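As a direct transcription of that objective (weight names follow the config flags; defaults are the stated values):

```python
def total_loss(text_ce: float, fast_ce: float, flow_mse: float,
               text_w: float = 1.0, fast_w: float = 1.0,
               flow_w: float = 10.0) -> float:
    """L = text_w*H(text) + fast_w*H(FAST) + flow_w*MSE(flow)."""
    return text_w * text_ce + fast_w * fast_ce + flow_w * flow_mse
```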

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:46:21 +02:00
Pepijn 8eba704f15 Revert "chore(training): align pi052_hirobot.slurm with the operator's actual command"
This reverts commit ecbac17196.
2026-05-13 11:03:58 +02:00
Pepijn ecbac17196 chore(training): align pi052_hirobot.slurm with the operator's actual command
Match the working SmolVLA2 launch pattern so the two SLURM scripts
are interchangeable:

  * literal NUM_PROCESSES / BATCH_SIZE / STEPS (no env-var defaults)
  * STEPS=10000 to match the next SmolVLA2 run
  * save_freq=$STEPS so only the final checkpoint is saved
  * dropouts 0.1/0.1/0.1 (mild — matches the operator's iteration)
  * flow_loss_weight / text_loss_weight come from the PI052Config
    defaults (10.0 / 1.0 per Pi 0.5 paper §IV.D), no need to pass
    them explicitly

Job name and policy_repo_id mirror the SmolVLA2 ``_tool-g2`` naming
so the two runs can be compared side-by-side in WandB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:03:09 +02:00
Pepijn 12cce8f2cc fix(smolvla2): align flow_loss_weight default with Pi 0.5 paper's α=10
Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text
CE and flow MSE: actions are the primary output and the flow head
should dominate the gradient signal. SmolVLA2 was defaulting both
weights to 1.0, which inverts that — text CE (~0.5-2.0 nats) ends up
larger than flow MSE (~0.1-1.0), so the action expert gets less
gradient than the LM head despite being the primary task.

Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0.
Same as ``pi052`` (the new full reproduction policy).

Also pin the values explicitly in the SLURM launcher so the choice is
visible and overridable per-run rather than buried in the config
default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:02:17 +02:00
Pepijn ef5879a02a feat(pi052): π0.5 v2 — full reproduction of the π0.5 paper recipe
New ``lerobot.policies.pi052`` (parallel to ``smolvla2``) that adds
text-prediction + hierarchical-inference on top of the existing π0.5
implementation. Mirrors the paper's §IV.D dual-head training:

  L = H(text) + α * ‖ω - a - f_θ_action(...)‖²,  α = 10

Components:

  * ``configuration_pi052.py``     thin PI05Config subclass; adds
                                    recipe_path, text/flow loss weights
                                    (default α=10 per paper), prompt
                                    dropout knobs, ``unfreeze_lm_head``.
  * ``text_processor_pi052.py``    PI052TextTokenizerStep — concatenates
                                    rendered messages as ``Role: ...``
                                    plain text (PaliGemma has no chat
                                    template), tokenises with the
                                    PaliGemma tokenizer, builds a label
                                    mask covering supervised target
                                    spans. Includes Pi 0.7 §V.E
                                    per-component prompt dropout.
  * ``processor_pi052.py``         make_pi052_pre_post_processors —
                                    Rename + Batch + Relative +
                                    Normalize + RenderMessagesStep +
                                    PI052TextTokenizerStep + Device.
                                    Falls back to π0.5's plain pipeline
                                    when recipe_path is unset.
  * ``modeling_pi052.py``          PI052Policy(PI05Policy) — re-enables
                                    PaliGemma ``lm_head``, computes
                                    text_loss via CE on the supervised
                                    span, sums with flow_loss in
                                    forward(), and adds select_message
                                    for AR text generation at inference
                                    (same surface as
                                    SmolVLA2Policy.select_message so
                                    SmolVLA2Runtime drives it unchanged).

Plus the supporting plumbing:

  * recipe ``configs/recipes/pi052_hirobot.yaml`` — same Hi-Robot blend
    as smolvla2_hirobot.yaml, with the same ``${subtask}`` /
    ``if_present`` supervision fix (current span at every frame, not
    ``${next_subtask}``).
  * SLURM ``examples/training/pi052_hirobot.slurm`` — full training
    command matching the SmolVLA2 launcher.
  * factory registration: ``--policy.type=pi052`` resolves to
    PI052Policy with the new processor.

Same multi-rate runtime (``lerobot.policies.smolvla2.inference``)
drives this policy too — both expose ``predict_action_chunk`` for the
action expert and ``select_message`` for the LM head.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:59:26 +02:00
Pepijn 1d24301b67 chore(training): STEPS=15000 default + dropout walked back to 0.30/0.30/0.20
After _tool-good (2000 steps, 0.50/0.50/0.20 dropout) the LM head's
distribution at position 0 shifted from EOS to subtask-vocabulary
tokens but emitted bag-of-words ("cube arm and") rather than well-
formed sentences. That's the expected mid-fine-tuning phase: token-
level supervision has landed, sequence-level grammar hasn't.

Two changes for the next retrain:

  * STEPS=15000 (from 2000) — chat-pretrained backbones need O(10k+)
    steps to walk their pretraining priors down far enough to commit
    to the fine-tuned distribution structurally, not just at the
    token level. _tool-g2's bag-of-words output proves the model is
    on the right path; it just needs more gradient signal.

  * plan/memory dropout 0.50 -> 0.30 — 0.50 was probably too
    aggressive for a small dataset. Half the training samples had
    crucial context missing, which slows down learning the full
    conditional structure. 0.30 still regularises against prompt
    leakage but lets the model learn proper grammar first; the
    higher dropout can be revisited once the head is solid.

Subtask dropout stays at 0.20 since subtask isn't in the high-level
prompt anyway (recipe fix removed the "Current subtask:" message).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:46:19 +02:00
Pepijn 3a20ea337e feat(smolvla2-runtime): --text_min_new_tokens / --text_temperature CLI debug knobs
The recipe fix (target=${subtask} instead of ${next_subtask}) shifted
the LM head's failure mode from "emit newlines" to "emit EOS at
position 0". On the new ``_tool-good`` checkpoint inference produces
exactly one token (``<end_of_utterance>``, id 49279) and decodes to
empty. That's the chat-pretrained backbone's short-turn EOS prior
not yet being overridden by 2000 steps of fine-tuning supervision.

Expose three knobs so the operator can probe whether the head has
real subtask-token probability mass *under* the EOS argmax without
recompiling or retraining:

  --text_min_new_tokens=N    suppress EOS for the first N tokens
  --text_temperature=T       sample at temperature T
  --text_top_p=P             nucleus filtering at top-p

These are explicitly off-policy (training was greedy / no min-tokens),
so they shouldn't ship in production runs — but they let us tell
whether the model has *learned* subtask prediction (just under EOS)
or hasn't yet. If forcing min_new_tokens=3 with temperature=0.5
produces a sensible subtask, the model is fine and just needs more
training steps to walk EOS down. If it produces gibberish, training
hasn't progressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:39:33 +02:00
Pepijn b6fb536460 chore(training): bump plan/memory dropout to 0.50 to force vision-grounding
After the recipe fix (target=${subtask} at every frame) the model
can still reach low text_loss by reading the answer off the plan in
the prompt: at training the prompt contains the 6-step plan, and the
current subtask is one of those steps, so the model just learns
"active step N matches subtask N" and never needs to look at the
image. Symptom at inference: subtask string is set but never updates
because the model isn't really conditioning on the visual progress.

Drop plan and memory with p=0.50 each — half of training frames the
prompt is just "${task}" (constant for this dataset) + visual prefix,
which is the only place the answer can come from. Forces the LM head
to actually use vision.

``subtask_dropout`` stays at 0.20 because subtask isn't in the
high-level prompt anymore (recipe fix removed the "Current subtask:
X" message); the knob still affects other sub-recipes that reference
it as context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:31:00 +02:00
pepijn bfd3bb1791 fix(smolvla2): handle batched sample indices in chat tokenizer
Normalize tensor and sequence sample indices before prompt dropout so distributed batched preprocessing does not try to cast full index tensors to scalars.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 16:56:13 +00:00
Pepijn 4908433f9a chore(training): align smolvla2_hirobot.slurm with what's actually run
Match the operator's current training command for the _tool6 retrain:

  * default DATASET / POLICY_REPO_ID / JOB_NAME point at the tool6
    iteration (super_poulain_full_tool3 → smolvla2_hirobot_super_poulain_tool6)
  * STEPS default 2000 (short enough to iterate; bump to 10k for full)
  * save_freq=$STEPS so the only checkpoint is the final one
  * OUTPUT_DIR includes step count so successive runs don't clobber
  * Drop the wider augmentation envelope I added earlier — back to
    default ColorJitter ranges (brightness ±20% etc) since the
    high_level_subtask recipe fix (current-subtask supervision) is
    expected to fix the LM-head collapse on its own; the augmentation
    is just the standard regulariser, not a load-bearing widener.
  * prompt-dropout fractions stay at the original 0.15 / 0.15 / 0.20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:45:38 +02:00
Pepijn 6ce1f36002 fix(smolvla2): supervise high-level head with *current* subtask at every frame
The high_level_subtask recipe targeted ``nth_next(style=subtask, offset=1)``,
which on the last span of any episode resolves to None. The recipe had no
``if_present`` guard on the target, so the renderer emitted an empty
assistant turn and cross-entropy supervised the model on the chat
template's structural newlines (``\n``). Across the dataset this trained
the LM head's argmax at position 0 to collapse to ``\n`` whenever no
transition was imminent (i.e. most frames). Visible failure mode at
inference: the head emits 40+ newlines + ``<end_of_utterance>`` every
chunk boundary while the action expert keeps working — confirmed by
running the dry-run on dataset frame 0 with the dataset's own image
and seeing the same ``\n × 44`` collapse.

Switch to the Pi 0.5 / Pi 0.7 supervision pattern: at every frame, the
assistant target is the *current* active subtask span text (via
``${subtask}`` → ``active_at(t, style=subtask)``). Always non-empty,
always scene-grounded, ``if_present: subtask`` skips frames with no
active span instead of emitting a degenerate empty turn.

Runtime callsite update: ``_msgs_for_subtask`` no longer feeds a
"Current subtask: X" user message into the prompt (that would be
circular — we'd be telling the model the answer). Transition
detection moves into the runtime — when the predicted subtask differs
from ``state['current_subtask']``, the existing ``set_if_changed``
path fires ``subtask_change`` and downstream memory updates. Same
event surface, supervision target is now always meaningful.

Requires re-annotating the dataset and retraining for the fix to land
in the checkpoint, but the recipe + runtime change is what enables it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:42:59 +02:00
Pepijn 731576be80 chore(smolvla2-runtime): auto-fire one tick at dry-run startup
Previously the dry-run REPL only ticked on user input (empty Enter
just redrew), so the bisection test "does the LM head produce text on
start_frame=0?" required typing something arbitrary to drive a tick.
Just run ``step_once`` at startup — the obs diagnostic *and* the
subtask gen both fire automatically, the diag row populates, and the
operator can read the result before pressing any key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:34:42 +02:00
Pepijn 47fb8318b1 chore(training): widen augmentation envelope after live-robot diagnostic
The tensor-level comparison between dry-run (dataset frame) and live-
robot inference proved the runtime is bug-free — same shape, dtype,
device, channel order, batch dim, and normalization on both paths.
The remaining variable: front-camera mean brightness was 0.26 live vs
0.39 on the dataset frame, ~33% darker. Training augmentation only
covered ±20% brightness, so the live scene sits just outside the
supervised envelope and the LM head collapses to its dominant prior.

Widen the augmentation knobs for the next retrain:

  * brightness    0.8–1.2  → 0.5–1.6   (covers ~30% darker / 60% lighter)
  * contrast      0.8–1.2  → 0.6–1.5
  * saturation    0.5–1.5  → 0.3–1.7
  * hue          ±0.05    → ±0.10
  * affine        ±5°/±5%  → ±15°/±15% (covers cube placement / camera drift)
  * max_num_transforms 3 → 4

And bump prompt-component dropout (subtask 0.20 → 0.30) so the LM
can't lean on stale memorised plan/memory at inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:25:41 +02:00
Pepijn 53172873e3 chore(smolvla2-runtime): probe obs once at dry-run startup
The dry-run REPL only fires a tick when the user types, so the
``_log_obs_tensors_once`` diagnostic never reached stdout (the
provider was never called). Probe the provider once at startup —
the result is discarded; we only care about the obs log it triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:21:58 +02:00
Pepijn fcdae0ce8e chore(smolvla2-runtime): tensor-level obs print for both inference paths
Helper that prints (once per provider lifetime) every
``observation.*`` tensor the policy is about to see, with its shape,
dtype, device, and per-channel min/max/mean/std. Wired into both the
dry-run dataset path and the live-robot path.

Now we can bisect train/inference mismatch *at the tensor level* —
if the same checkpoint produces coherent text on one path's tensors
and ``\n`` on the other's, and the printed tensor stats differ
materially, the bug is in the observation prep, not in the model or
the training distribution.
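A pure-Python stand-in for the per-channel stats the diagnostic prints (the runtime operates on torch tensors; nested lists are used here so the sketch is dependency-free):

```python
import statistics

def channel_stats(img):
    """Per-channel min/max/mean/std for an image given as
    channels-first nested lists (C, H, W)."""
    rows = []
    for c, channel in enumerate(img):
        flat = [v for row in channel for v in row]  # flatten H x W
        rows.append({
            "channel": c,
            "min": min(flat),
            "max": max(flat),
            "mean": statistics.fmean(flat),
            "std": statistics.pstdev(flat),
        })
    return rows
```

Comparing these rows between the dry-run tensors and the live-robot tensors is the bisection step: materially different stats point at observation prep, identical stats point at the model or training distribution.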

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:19:18 +02:00
Pepijn 4852b9f952 feat(smolvla2-runtime): --dataset.augment_at_inference for the bisection test
Apply the training-time torchvision-v2 ColorJitter / SharpnessJitter /
RandomAffine pipeline to dataset frames in dry-run, so we can isolate
whether the LM head's collapse to '\n' on live frames is:

  * pure scene-content OOD (unaugmented dataset frames work, mildly
    augmented ones still work — model has learned the augmentation
    distribution, only fails when the scene content itself diverges)
  * hyper-specific memorisation (dry-run with augmentation also
    collapses to '\n' — head is nailed to the exact unperturbed
    training samples and only the retrain helps)

Usage:

  lerobot-smolvla2-runtime --no_robot --policy.path=... \
    --dataset.repo_id=... --dataset.episode=0 \
    --dataset.start_frame=1000 \
    --dataset.augment_at_inference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:14:57 +02:00
Pepijn 0410705aff chore(smolvla2-runtime): print live state vector once at startup
So the operator can compare live joint values to the dataset's
``observation.state`` mean/std and spot when the robot's home pose is
several σ off the supervised support region. State OOD is the
remaining viable hypothesis for why the live LM head collapses to
``\n`` even though images are pixel-shape-matched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:12:27 +02:00
Pepijn 398a8cf730 chore(smolvla2-runtime): log first-tick resize so train/inference match is verifiable
Print one warning the first time the robot observation provider runs
through, showing live camera resolution and the dataset's training
resolution, plus whether we resized. Lets the operator confirm at a
glance that the visual prefix really is being fed at the same shape
the model saw at training — instead of guessing whether the resize
fired silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:06:00 +02:00
Pepijn ab5c1dc392 fix(smolvla2-runtime): match training visual distribution on robot frames
Root cause for the LM head's empty-completion symptom on the live robot
(while the same checkpoint produced sensible subtask/plan/memory in
``--no_robot`` dry-run on dataset frames): the camera observation was
flowing into the model at its native resolution. A Mac/USB webcam
hands us 1280×720 or 1920×1080; the dataset was recorded at the
feature schema's ``observation.images.*['shape']`` resolution
(typically 480×640). SmolVLA's internal ``resize_with_pad(512, 512)``
*does* fit both — but with very different pad geometry, so visual
tokens at each tile carry different content than at training. Action
expert tolerates this; the tightly-supervised LM head goes OOD and
the head's distribution at position 0 collapses to its dominant mode
(``\n`` ×N then ``<end_of_utterance>`` for this checkpoint).

The fix: in ``_build_robot_observation_provider``, pre-compute the
camera-key → (H, W) target from ``ds_features`` and ``cv2.resize``
each live frame to that shape before tensorising. The downstream
``resize_with_pad`` then sees the same input geometry as training and
the LM head returns to producing readable subtask text under plain
greedy decoding — the same as dry-run.

Also drops the inference-time patches (``min_new_tokens``,
``temperature``, ``top_p`` overrides) on the four high-level callers.
They were band-aids around the visual-distribution shift, not a real
LM problem, and they drift inference off the training distribution.
Greedy argmax is what training matched. The ``select_message``
signature still accepts the knobs for callers that want them.
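The camera-key → (H, W) pre-computation can be sketched like this, assuming a lerobot-style feature schema where each visual feature carries a ``shape`` of either (H, W, C) or (C, H, W) with a size-3 channel dim (helper name illustrative):

```python
def camera_targets(ds_features):
    """Map each observation.images.* feature to its training (H, W)."""
    targets = {}
    for key, feat in ds_features.items():
        if not key.startswith("observation.images."):
            continue
        shape = list(feat["shape"])
        if len(shape) == 3 and shape[0] == 3:   # (C, H, W)
            h, w = shape[1], shape[2]
        else:                                   # (H, W, C)
            h, w = shape[0], shape[1]
        targets[key.split(".")[-1]] = (h, w)
    return targets
```

At runtime each live frame would then go through ``cv2.resize(frame, (w, h))`` — note that ``cv2.resize`` takes (width, height), the reverse of the (H, W) tuple.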

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:59:24 +02:00
Pepijn 1292304c42 fix(smolvla2): suppress all special tokens during min_new_tokens window
Previous attempt only masked the tokenizer's eos_token_id during the
min_new_tokens prefix. The empty-completion symptom persisted because a
memorised SmolVLM head doesn't just want EOS — its top-1 at position 0
is *some* special token, and when EOS is masked the argmax shifts to a
sibling (``<|im_end|>``, ``<image>``, ``<fake_token_around_image>``,
``<row_X_col_Y>``, …). Those tokens survive generation but then get
stripped by ``decode(skip_special_tokens=True)``, so the runtime still
saw ``last_raw='(empty)'`` every chunk boundary.

Mask the full ``tokenizer.all_special_ids`` set instead. Forces the
head to commit to a normal vocabulary token before it can close or
quietly poison the turn.

Also: when decode returns empty but tokens *were* generated, expose
the raw token ids and the special-tokens-included decoded string via
``policy._last_select_message_debug``. The runtime surfaces this in
the scrollback so the operator can see what the head is actually
emitting — distinguishing "head EOS-ing" from "head emitting image
placeholders" from "head emitting chat-template fragments".
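The masking logic reduces to this (sketch over a plain logits list; the real path masks a torch logits row during generation):

```python
NEG_INF = float("-inf")

def masked_argmax(logits, special_ids, tokens_generated, min_new_tokens):
    """Greedy next-token pick with every special id masked to -inf
    until min_new_tokens real tokens have been decoded."""
    if tokens_generated < min_new_tokens:
        logits = [NEG_INF if i in special_ids else v
                  for i, v in enumerate(logits)]
    return max(range(len(logits)), key=logits.__getitem__)
```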

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:49:53 +02:00
Pepijn b95eebff77 fix(smolvla2): force min_new_tokens + sampling so memorised LM emits something
Real-robot run confirmed the LM head is producing 0 tokens at every
chunk boundary (empty:N counter climbing, no exception in scrollback):
the model EOS-es at decode step 0. That's the memorisation collapse —
training reached text_loss=6e-6 by overfitting one trajectory whose
supervised subtask turn ended in EOS, and at inference the head's
argmax for token 0 is EOS regardless of the actual frame.

Two changes in select_message:

  * ``min_new_tokens`` parameter masks the EOS logit to -inf until at
    least N real tokens have been decoded. Without this the head's
    "EOS first" prior produces an empty completion every single time.

  * The runtime callers now pass ``min_new_tokens=5..10`` plus
    ``temperature=0.4..0.5`` + ``top_p=0.9``. Sampling at moderate
    temperature with nucleus filtering also helps break the greedy
    argmax collapse — when the model has memorised one continuation,
    greedy keeps replaying it; nucleus sampling forces it to commit
    to *some* coherent continuation that's well-supported by the
    prefix even when greedy's top-1 is degenerate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:48:08 +02:00
Pepijn fbcac95662 feat(smolvla2-runtime): scrollback in autonomous panel + empty-gen counter
Two improvements for diagnosing why ``last_raw`` stays empty:

1. The autonomous panel-redraw thread calls console.clear() every
   0.5 s, wiping any log lines the runtime printed since the last
   redraw. So warnings from generation (``[warn] subtask gen failed:
   ...``, ``[info] subtask gen rejected (gibberish): ...``) flashed
   for milliseconds and disappeared, leaving the operator blind.

   Capture log_lines from each tick into a bounded scrollback
   (last 12 entries) and render them inside the panel itself, below
   the diag row. They now stick across redraws until rotated out.

2. ``empty`` counter for subtask gen. Persistent empty completions
   are their own failure mode — the LM head EOS-es immediately from
   the chat-template generation prompt, distinct from "generated
   something but filter rejected it". The diag row now reads:

     subtask diag    repeat:0  gibberish:0  empty:14  last_raw: '(empty)'
                                            ^^^^^^^
   plus a periodic log line every 10 empties so the cause is also
   surfaced in the scrollback.
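The bounded scrollback is essentially a ``deque(maxlen=12)`` rendered into the panel (class name illustrative):

```python
from collections import deque

class Scrollback:
    """Keep the last `maxlen` log lines across panel redraws."""
    def __init__(self, maxlen=12):
        self.lines = deque(maxlen=maxlen)

    def extend(self, log_lines):
        self.lines.extend(log_lines)  # oldest entries rotate out

    def render(self):
        return "\n".join(self.lines)
```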

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:42:13 +02:00
Pepijn b9db4d21a2 fix(smolvla2): high-level steps must run before LowLevelForward refills
Both HighLevelSubtaskFwd and LowLevelForward are gated on
'action queue is empty'. With LowLevelForward listed first, it refilled
the queue on the empty-queue tick before HighLevelSubtaskFwd got to
check — so the gate I added in the previous commit made the high-level
step a permanent no-op after the initial bootstrap. Visible symptom:
subtask string never advances past whatever bootstrap seeded, no
subtask_change events, memory stays unset, and the new overfit
diagnostics never appear on the panel because last_subtask_raw is
never written.

Move all high-level steps (subtask, memory, interjection, vqa) ahead
of LowLevelForward. On an empty-queue tick the subtask refreshes
first, the new string flows into the next chunk's prompt, then
LowLevelForward generates the chunk, then DispatchAction drains it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:38:06 +02:00
Pepijn aecb80a9d2 feat(smolvla2-runtime): overfit/memorisation diagnostics on the panel
The autonomous-mode panel now surfaces what the model is *actually*
producing at every chunk boundary, not just what got accepted:

  * last_subtask_raw       most recent generation (accepted or not)
  * subtask_repeat_count   times the same accepted string regenerated
  * subtask_gibberish_count rejections by the gibberish filter
  * memory_gibberish_count / plan_gibberish_count for the other heads

These let the operator see memorisation collapse without scrolling
back through logs:

  subtask diag    repeat:8  gibberish:0  last_raw: '<same string>'
                  ^^^^^^^^^^ → model can't move past current phase

  subtask diag    repeat:0  gibberish:14  last_raw: 'Ass:::'
                  ^^^^^^^^^^^^^^^^^^^^^^ → LM collapsed to template salad

Also silences the per-action ``Relative goal position magnitude had
to be clamped`` warning. The clamp fires every dispatch tick when the
model emits stale joint targets, flooding the panel at ctrl_hz=30.
Replaced the bare ``logging.warning`` call in robots/utils.py with a
module logger so it can be selectively raised to ERROR. Operators
who need the per-tick clamp detail can use ``-v``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:31:04 +02:00
Pepijn c98c695127 feat(smolvla2-runtime): 'rephrase:' prefix to swap task string in place
Adds a third stdin channel alongside 'task:' and bare interjections:

  rephrase: <text>

Swaps state['task'] with the new string while preserving plan/memory/
subtask. Lets the operator probe how robust the model is to wording
variations of the same task — training-time augmentation provided
n_task_rephrasings≈30 task wordings per dataset task, and this is the
direct way to exercise that distribution at inference without
generating a fresh plan via user_interjection_response.
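The three stdin channels route as a simple prefix dispatch (illustrative sketch, not the runtime's actual parser):

```python
def route_stdin(line):
    """Route one stdin line to (channel, payload)."""
    line = line.strip()
    if line.startswith("task:"):
        return ("task", line[len("task:"):].strip())
    if line.startswith("rephrase:"):
        return ("rephrase", line[len("rephrase:"):].strip())
    return ("interjection", line)  # bare text falls through
```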

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:26:59 +02:00
Pepijn d528078aca fix(smolvla2-runtime): allow task switching mid-run via 'task:' prefix
Both stdin handlers (autonomous mode and rich REPL) gated 'task:' to
'only if no task is set yet' — once the initial task existed, typing
'task: <new task>' silently fell through to the interjection branch.
Make 'task:' always override the active task and clear stale
plan/memory/subtask so the next high-level pass regenerates context
from scratch for the new task.

For rephrasings within the same task, the interjection path
(user_interjection_response recipe) is still the right channel — it
refreshes the plan and emits a paired <say> in one trained call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:24:16 +02:00
Pepijn a648da0455 fix(smolvla2): unblock action dispatch when high-level LLM stalls loop
The runtime is single-threaded. `HighLevelSubtaskFwd` at HzTrigger(1.0)
fires every loop iteration on MPS because each `select_message` call
takes ~2 s, longer than its 1/hz period. The whole tick stretches to
~2.5 s, so `DispatchAction` (HzTrigger 30) only pops a single action per
loop iteration — the queue drains at ~0.4 actions/sec instead of 30 and
the robot barely moves between chunk refreshes.

Two changes, both purely about scheduling — no threading:

* Gate `HighLevelSubtaskFwd` to fire only when the action queue is
  empty, matching `LowLevelForward`'s refresh condition. The slow LLM
  call now happens during the "think" phase between chunks, not on
  every dispatch tick. Restores a clean sense → think → act cycle.

* `DispatchAction` catches up via wall-clock: when the trigger fires
  after a stall, pop `round(elapsed * hz)` entries and send only the
  most recent. Open-loop chunks are timestamped at ctrl_hz; sending
  stale joint targets one-by-one would just lag the robot further
  behind. The dynamixel smooths to the latest goal anyway.
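The wall-clock catch-up pop can be sketched as (queue as an oldest-first list; names illustrative):

```python
def catch_up_pop(queue, elapsed_s, ctrl_hz):
    """Pop round(elapsed * hz) stale actions, return only the most recent.

    Always pops at least one so a normal on-time tick still dispatches."""
    if not queue:
        return None
    n = min(len(queue), max(1, round(elapsed_s * ctrl_hz)))
    latest = None
    for _ in range(n):
        latest = queue.pop(0)
    return latest
```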

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:23:09 +02:00
Pepijn d866c2c9fd fix(smolvla2): only regenerate chunk when queue is fully drained
The previous refresh threshold (queue > chunk_size // 2) made each
new chunk *telescope* past the previous one: at queue=25, we kicked
off a new chunk forward from the current observation, but by the
time the new chunk's first action was actually dispatched, the
robot had executed the remaining 25 actions of the previous chunk
— so the new chunk was planned from an observation 25+ steps stale.

Canonical sense → think → act loop: execute the full chunk, then
re-observe and replan. Refresh only when the queue is empty. Every
step of every chunk still gets dispatched to the robot (no
behaviour change there), but each chunk is now planned from an
observation that's at most one chunk's worth of dispatch latency
old, not "previous chunk's worth of stale state on top of that".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:15:02 +02:00
Pepijn 01e2228b24 feat(smolvla2): per-component prompt dropout + augmented training script
Two complementary regularisers to attack the
``text_loss=6e-6 = memorised one dataset`` failure mode that's
making the model collapse on real-robot input:

1. **Per-component prompt dropout** (Pi0.7 §V.E / plan's
   ``feat/pi05-prompt-dropout`` follow-up).
   ``SmolVLA2ChatTokenizerStep`` gains
   ``plan_dropout_prob`` / ``memory_dropout_prob`` /
   ``subtask_dropout_prob`` knobs (default 0.0 — opt-in). At training,
   non-target messages whose rendered content starts with
   ``Plan:`` / ``Memory:`` / ``Current subtask:`` etc. are dropped
   with their respective probability before tokenisation, with a
   deterministic per-sample RNG keyed off the dataset ``index``.
   ``target_message_indices`` is re-mapped so the supervision still
   lands on the right turn. Forces the model to handle missing
   plan/memory/subtask context — directly attacks the real-robot
   collapse where a stale or empty plan field puts the prompt OOD.

   Surfaced on ``SmolVLA2Config`` as three floats so they're
   ``--policy.<knob>=<value>``-controllable from the train CLI;
   plumbed through ``make_smolvla2_pre_post_processors``.

2. **Image augmentation** is already wired in lerobot via
   ``--dataset.image_transforms.enable=true`` (torchvision v2
   ColorJitter + SharpnessJitter + RandomAffine, default 3 of 6
   sampled per frame). No code change needed — just a CLI flag.

``examples/training/smolvla2_hirobot.slurm`` shows the full
training command with both enabled. Drop-in replacement for the
ad-hoc SLURM script Pepijn was using locally; same args, plus the
three dropout probs and the image-transforms flag.
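The per-component dropout above can be sketched as follows (message shape simplified to ``{"content": str}``; the real step operates on rendered chat messages, and the prefix strings follow the commit text):

```python
import random

PREFIXES = {"Plan:": "plan", "Memory:": "memory", "Current subtask:": "subtask"}

def drop_prompt_components(messages, target_indices, sample_index, probs):
    """Drop non-target plan/memory/subtask messages with per-component
    probability, deterministically keyed off the dataset sample index,
    and re-map target indices onto the surviving messages."""
    rng = random.Random(sample_index)
    kept, remap = [], []
    for i, msg in enumerate(messages):
        comp = next((c for p, c in PREFIXES.items()
                     if msg["content"].startswith(p)), None)
        if (comp is not None and i not in target_indices
                and rng.random() < probs.get(comp, 0.0)):
            continue  # dropped before tokenisation
        remap.append(i)
        kept.append(msg)
    new_targets = [remap.index(i) for i in target_indices]
    return kept, new_targets
```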

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:52:32 +02:00
Pepijn c36de3a3e8 fix(smolvla2): enqueue full chunk via predict_action_chunk
``LowLevelForward`` was calling ``select_action()`` once per
``chunk_hz`` tick. SmolVLA's ``select_action`` is a thin queue-pop:
it returns one action per call and only re-runs the expensive
flow-matching forward when its private internal queue empties.
Result: we got one action back per chunk_hz tick (1Hz default),
``DispatchAction`` at ctrl_hz=30 popped it instantly, then queue
sat empty for ~1s waiting for the next tick. Net throughput was
1 dispatched action/sec instead of the 30 we wanted.

Switch to ``predict_action_chunk`` and enqueue every step of the
returned ``(batch, n_action_steps, action_dim)`` chunk. Refresh
only when the queue is below half a chunk so we don't burn one
flow-matching forward per chunk_hz tick — saves ~5x inference cost
on this hot path. At ctrl_hz=30, chunk_size=50, the queue drains
in ~1.7s before the next refresh, giving smooth dispatch at the
control rate the robot was trained on.

Side effect: ``state['last_chunk_size']`` records how many actions
the most recent chunk produced — useful for the panel later if we
want to surface "chunks generated" alongside "dispatched".
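The refill gating reduces to (sketch; ``predict_chunk`` stands in for the expensive ``predict_action_chunk`` forward and returns an (n_action_steps, action_dim)-shaped list of actions):

```python
def maybe_refill(queue, predict_chunk, chunk_size):
    """Enqueue a full predicted chunk, but only when the queue has
    drained below half a chunk — one flow-matching forward per chunk,
    not per chunk_hz tick."""
    if len(queue) > chunk_size // 2:
        return 0
    chunk = predict_chunk()
    queue.extend(chunk)
    return len(chunk)
```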

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:27:23 +02:00
Pepijn cbfaf2c544 feat(smolvla2): action-dispatch counter + tighter gibberish filter
Real-robot run was unreadable for two reasons:

1. The panel surfaced ``queued actions: 0`` (always zero — dispatch
   pops faster than chunk_hz generates) and gave no signal that
   actions were actually reaching the robot. The only sign of life
   was the safety-clamp warning lines scrolling past.

2. The text head consistently collapses to ``the`` / ``Ass``
   fragments on real-camera input (memorisation wall). The old
   gibberish filter caught ``":":":"`` JSON salad but let
   single-token fragments through, and the ``[info] subtask gen
   produced no text this tick`` line flooded the panel every second.

Changes:

  * ``DispatchAction`` bumps ``state["actions_dispatched"]`` each
    tick; panel renders it next to queue depth. Operator can see
    the policy IS issuing actions even when text is broken.
  * ``_looks_like_gibberish`` now also rejects:
    - too few unique alphabetic tokens (``the``, ``the the``, ...)
    - chat-template marker leakage (``Assistant:``, ``Ass\\n::``)
    catching the actual failure mode on real-robot frames.
  * Gibberish rejections log only the first occurrence + every 30th
    after that, with a count, so the panel stays legible.
  * Empty completions no longer log at all (was every tick).
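A sketch of the tightened filter matching the rejection rules above (thresholds and marker list are illustrative, not the runtime's exact values):

```python
import re

TEMPLATE_MARKERS = ("Assistant:", "<end_of_utterance>")

def looks_like_gibberish(text, min_unique_words=2):
    """Reject empty output, chat-template marker leakage,
    ``":":":"``-style JSON salad, and too-few-unique-word fragments."""
    if not text.strip():
        return True
    if any(m in text for m in TEMPLATE_MARKERS):
        return True
    if text.count(":") >= 3 and not re.findall(r"[a-zA-Z]{3,}", text):
        return True  # punctuation salad with no real words
    words = {w.lower() for w in re.findall(r"[a-zA-Z]+", text)}
    return len(words) < min_unique_words
```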

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:22:36 +02:00
Pepijn d0278ea093 feat(smolvla2): render state panel in autonomous mode too
Dry-run REPL had a clean ANSI-clear-+-rich-panel layout via
``_redraw`` showing task / subtask / plan / memory / queued-actions /
pending-tool-calls; autonomous mode just had bare ``> `` plus log
lines scrolling past the user. Same data, two presentations.

Extract ``_make_state_panel_renderer(runtime, mode_label=...)`` and
use it from both ``_run_repl`` (called per user input) and
``_run_autonomous`` (called both on user input *and* on a 0.5s
background timer so subtask / plan / memory refreshes from the
runtime's own loop become visible without the user typing anything).
Title bar shows ``dry-run`` vs ``autonomous`` so it's obvious which
mode you're in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:16:28 +02:00
Pepijn 15f6b08b0e fix(smolvla2): use canonical _strip_lerobot_blocks for inference msgs
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects:

  * ``list[block]`` → keep text blocks, drop images
  * ``None``        → ``[{type:text, text:""}]``
  * ``str`` / other → ``[{type:text, text:str(content)}]``

Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there's no prior
memory, which then renders as no-content / role-marker-only and the
model hallucinates ``Assistant:`` fragments. Subtask gen got further
because its prompt always has at least the task string.

Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.
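The normalisation behaviour described above amounts to (illustrative re-implementation, not the actual ``_strip_lerobot_blocks``):

```python
def strip_blocks(content):
    """Normalise every content variant into the [{type: text, ...}]
    list shape the multimodal chat template expects."""
    if isinstance(content, list):
        return [b for b in content if b.get("type") == "text"]  # drop images
    if content is None:
        return [{"type": "text", "text": ""}]
    return [{"type": "text", "text": str(content)}]
```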

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:07:39 +02:00
Pepijn fc715db4a3 fix(smolvla2): coerce str content to list-of-blocks for chat template
SmolVLM's chat template (and many other multimodal templates) declares
``message['content']`` as a list of typed blocks and iterates it
expecting dicts with a ``'type'`` field:

    {% for line in message['content'] %}
      {% if line['type'] == 'text' %}{{ line['text'] }}
      {% elif line['type'] == 'image' %}{{ '<image>' }}
      {% endif %}
    {% endfor %}

When the caller passes ``content`` as a plain ``str`` (which we did
throughout ``_msgs_for_subtask`` / ``_msgs_for_memory`` etc.), Jinja
silently iterates the string character-by-character. ``'P'['type']``
returns nothing; neither branch fires; *no text tokens get emitted*.
The model receives a prompt containing only role markers
(``User:<end_of_utterance>\nAssistant:``) and predictably continues by
emitting ``Assistant:`` fragments — the gibberish ``subtask: Ass\n::``
on the runtime panel.

Before calling ``apply_chat_template``, walk the messages and rewrite
any string ``content`` into ``[{'type': 'text', 'text': content}]``.
The template's text branch then fires correctly and the model sees
the actual user/assistant text, not just structural tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:01:53 +02:00
Pepijn fe4bd2b6ba fix(smolvla2): pass flat batch dict to preprocessor (no manual wrap)
``PolicyProcessorPipeline.__call__`` already wraps its input via
``to_transition`` (defaulting to ``batch_to_transition``) before
running the steps, and unwraps via ``to_output`` (defaulting to
``transition_to_batch``) afterwards. The input format is therefore a
*flat batch dict* keyed by ``observation.*`` / ``action`` / etc., not
an ``EnvTransition``.

Previous attempt pre-wrapped the observation into a transition with
``TransitionKey.OBSERVATION`` as the key, then handed *that* to the
pipeline — which fed it to ``batch_to_transition``, which looked for
top-level ``observation.*`` entries, found none (they were nested
inside the enum key), and produced an empty observation. Every step
then bailed with ``ObservationProcessorStep requires an observation
in the transition.``

Pass the flat dict from ``build_inference_frame`` straight to the
preprocessor — it does the wrap/unwrap itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:54:48 +02:00
Pepijn 3f7436ff8a fix(smolvla2): use TransitionKey enum (not .value) as transition keys
``EnvTransition`` is declared as a ``TypedDict`` keyed by
``TransitionKey.OBSERVATION.value`` (the string ``'observation'``),
but every concrete ``ProcessorStep`` in the pipeline indexes the
transition with the enum *member* (``transition[TransitionKey.
OBSERVATION]`` / ``transition.get(TransitionKey.OBSERVATION)``).
Those are two different keys in a Python dict — string key vs enum
key — so steps couldn't find the observation we'd placed under the
string variant, and bailed every tick with
``ObservationProcessorStep requires an observation in the
transition``.

Build the transition with the enum members directly. Matches how
``BatchProcessor``, ``RelativeActionProcessor``, ``HilProcessor``,
etc. read the dict.
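The bug in miniature — a plain ``Enum`` member and its ``.value`` hash as two different dict keys:

```python
from enum import Enum

class TransitionKey(Enum):
    OBSERVATION = "observation"

obs = {"observation.state": [0.0]}

# Built with the string key (the bug): enum lookups miss it.
transition = {TransitionKey.OBSERVATION.value: obs}
found_by_enum = transition.get(TransitionKey.OBSERVATION)  # None

# Built with the enum member (the fix): steps find the observation.
transition = {TransitionKey.OBSERVATION: obs}
found_fixed = transition.get(TransitionKey.OBSERVATION)
```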

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:50:22 +02:00
Pepijn 992d13d4e9 fix(smolvla2): use build_inference_frame for raw robot observations
``robot.get_observation()`` on omx_follower (and most lerobot robots)
returns:

  * per-joint scalar floats with ``.pos`` suffix
    (``shoulder_pan.pos: 0.123``, ``shoulder_lift.pos: 0.456``, ...)
  * per-camera ndarrays keyed by the camera config name (``wrist:
    ndarray(H,W,3)``)

But the trained policy expects:

  * single ``observation.state: tensor[N_joints]`` vector
  * image keys prefixed: ``observation.images.<cam_key>:
    tensor[1, 3, H, W]``

``prepare_observation_for_inference`` only handles the tensor /
batch-dim / device step — it crashes on scalar floats with
``expected np.ndarray (got float)``. The right helper is
``build_inference_frame`` which uses the dataset's feature schema
(``ds_meta.features``) to:

  1. extract the right raw keys per dataset feature,
  2. fold ``shoulder_pan.pos`` / ``shoulder_lift.pos`` / ...
     into a single ``observation.state`` ndarray,
  3. prefix camera keys with ``observation.images.``,
  4. delegate to ``prepare_observation_for_inference`` for the
     tensor / batch / device step.

Pass ``ds_meta.features`` into the observation provider and switch
to ``build_inference_frame`` when available; fall back to the bare
``prepare_observation_for_inference`` only when no dataset is
provided (rare — autonomous mode already requires it).
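The folding step can be sketched as (pure-Python; the real ``build_inference_frame`` also handles tensorisation, batch dims, and device placement):

```python
def fold_raw_observation(raw, joint_order):
    """Fold omx-style raw robot observations into policy-shaped keys:
    per-joint ``<name>.pos`` floats become one observation.state vector,
    every other key is treated as a camera and gets the
    observation.images. prefix."""
    obs = {"observation.state": [raw[f"{j}.pos"] for j in joint_order]}
    for key, val in raw.items():
        if not key.endswith(".pos"):
            obs[f"observation.images.{key}"] = val
    return obs
```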

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:47:59 +02:00
Pepijn afe40a016b fix(smolvla2): wrap robot obs in EnvTransition before preprocessor
The policy preprocessor pipeline is transition-shaped — its steps
read ``TransitionKey.OBSERVATION`` off an ``EnvTransition`` dict, not
a flat ``RobotObservation`` dict. Passing the raw observation through
made every step bail with
``ObservationProcessorStep requires an observation in the transition``,
which the runtime swallowed at warning level. ``select_message`` then
got called with no ``observation.images.*`` features and crashed
with ``All image features are missing from the batch``.

Mirror ``lerobot-record``'s preamble:
  1. ``prepare_observation_for_inference`` → numpy → torch, ``CHW``
     image layout, ``[0,1]`` scaling, add batch dim, move to device.
  2. Wrap into an ``EnvTransition`` (``{TransitionKey.OBSERVATION.value:
     ...}`` plus ``COMPLEMENTARY_DATA: {}`` and ``None``s for the rest)
     so transition-aware steps see the keys they expect.
  3. Run preprocessor.
  4. Unwrap the transition's ``OBSERVATION`` slot to get the final
     flat dict the policy's ``select_action`` / ``select_message``
     consume.

Image features now reach the policy; the autonomous loop produces
real actions instead of swallowing warnings every tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:44:24 +02:00
Pepijn 41095e3cc3 fix(smolvla2): instantiate CameraConfig subclasses from JSON dicts
``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but
``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner
value must be the actual ``CameraConfig`` subclass instance for the
chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts
blew up in ``RobotConfig.__post_init__`` with
``AttributeError: 'dict' object has no attribute 'width'`` when it
iterated cameras and tried to read attributes.

Look up the right subclass per-camera by its ``"type"`` field via
``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import
dance we already do for ``RobotConfig``: eagerly walk
``lerobot.cameras``'s submodules so the registry is populated
before lookup). Construct an instance with the rest of the dict's
fields. On an unknown camera type, raise a clean ``ValueError``
listing the available choices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:39:28 +02:00
Pepijn e0fa957569 fix(smolvla2): eagerly import robot submodules before get_choice_class
``RobotConfig._choice_registry`` is populated as a side-effect of
each robot's ``@RobotConfig.register_subclass`` decorator running,
and those decorators only fire when the corresponding
``lerobot.robots.<name>`` module is imported. The package's
``__init__.py`` doesn't import them — instead ``make_robot_from_config``
does it lazily in its big if/elif chain.

``_build_robot`` jumped the gun: it called
``RobotConfig.get_choice_class(robot_type)`` before any robot module
had been imported, so the registry was empty and every
``--robot.type=<X>`` produced ``KeyError: 'X'``
(e.g. ``KeyError: 'omx_follower'``).

Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules``,
importing each with ``importlib.import_module`` before the lookup;
costs ~200ms on the first invocation, negligible for an autonomous
run. On a real
``KeyError`` (typo / unsupported robot), raise a clean ``ValueError``
listing the registry's available choices instead of a bare KeyError.
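The walk itself is only a few lines. This sketch substitutes the stdlib ``json`` package for ``lerobot.robots`` and a plain dict for the real registry, so it runs anywhere:

```python
import importlib
import json
import pkgutil

def import_all_submodules(package):
    # registration decorators only run on import, so force every
    # submodule to load before the first registry lookup
    loaded = []
    for info in pkgutil.iter_modules(package.__path__):
        importlib.import_module(f"{package.__name__}.{info.name}")
        loaded.append(info.name)
    return loaded

submodules = import_all_submodules(json)  # e.g. decoder, encoder, scanner

def get_choice_class(registry, robot_type):
    # guarded lookup: a typo surfaces as a readable ValueError that
    # lists the available choices, not a bare KeyError
    try:
        return registry[robot_type]
    except KeyError:
        raise ValueError(
            f"Unknown robot type {robot_type!r}; "
            f"available: {sorted(registry)}"
        ) from None
```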

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:31:58 +02:00
Pepijn c661d81409 fix(smolvla2): use RobotConfig.max_relative_target, drop --max_action_norm
The hand-rolled action-norm safety clip duplicated what every
``RobotConfig`` already exposes — ``max_relative_target`` — and did
so at the wrong layer (after postprocess but before send_action,
instead of inside the robot driver, where every other lerobot entry
point puts it). The norm clip also rejected entire actions instead of
clipping per-motor relative motion, so a single rogue joint would
kill the whole tick.

Replace with ``--robot.max_relative_target``: a string parsed as
either a bare float (uniform per-motor cap) or a JSON object
mapping motor name → cap. Passed through to
``RobotConfig(max_relative_target=...)`` at robot construction;
the driver's ``send_action`` clips each commanded joint position
relative to the current measured one before issuing it on the bus —
same behaviour ``lerobot-record`` ships.
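The parse-and-clip behaviour can be sketched like this (helper names are illustrative; the real clipping lives inside the robot driver's ``send_action``):

```python
import json

def parse_max_relative_target(raw):
    # bare float -> uniform per-motor cap; JSON object -> per-motor caps
    try:
        return float(raw)
    except ValueError:
        return {k: float(v) for k, v in json.loads(raw).items()}

def clip_relative(goal, present, cap):
    # clip each commanded joint position relative to the measured one,
    # instead of rejecting the whole action on a single rogue joint
    clipped = {}
    for motor, target in goal.items():
        limit = cap if isinstance(cap, float) else cap.get(motor, float("inf"))
        delta = max(-limit, min(limit, target - present[motor]))
        clipped[motor] = present[motor] + delta
    return clipped

safe = clip_relative(
    {"elbow": 90.0, "wrist": 10.0},   # commanded positions
    {"elbow": 80.0, "wrist": 9.5},    # measured positions
    parse_max_relative_target("5"),   # uniform 5-unit cap
)
# elbow's +10 step is clipped to +5 (-> 85.0); wrist's +0.5 passes through
```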

Also bump ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new
chunk per second is what the trained checkpoint can comfortably
keep up with on common hardware and gives smoother motion than
sub-second chunk regenerations (no RTC interpolation between
chunks yet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:41:57 +02:00
Pepijn 33a4b4a5a0 feat(smolvla2): autonomous robot mode in lerobot-smolvla2-runtime
The runtime CLI was deliberately scoped to dry-run only: it
hard-coded ``robot_executor=None`` and printed a "real-robot
integration is a follow-up" warning even when ``--no_robot`` was
omitted. The runtime *engine* was already structured for real-robot
operation (separate ``LowLevelForward`` chunk-rate generation +
``DispatchAction`` ctrl-rate dispatch with a ``robot_executor``
hook); only the wiring was missing.

Add the wiring:

  * ``_load_policy_and_preprocessor`` now also returns the
    postprocessor (action denormaliser).
  * ``--robot.type`` / ``--robot.port`` / ``--robot.id`` /
    ``--robot.cameras`` (JSON) build a ``Robot`` via
    ``make_robot_from_config`` and connect it.
  * ``_build_robot_observation_provider`` reads
    ``robot.get_observation()`` each call, drops the language
    columns (runtime drives messages itself), and runs the policy's
    preprocessor (rename → batch → device → normalise).
  * ``_build_robot_action_executor`` postprocesses the policy's
    action tensor (denormalise), converts to the ``{joint: value}``
    dict via ``make_robot_action(action, ds_meta.features)``, and
    calls ``robot.send_action(...)``. Optional ``--max_action_norm``
    safety clip rejects ticks whose action L2 norm exceeds the
    threshold (kill-switch when bringing up a new robot).
  * ``_run_autonomous`` runs ``runtime.run()`` in a background
    thread (the policy must keep generating chunks at chunk_hz and
    dispatching at ctrl_hz regardless of stdin) and handles user
    interjections / VQA queries from the foreground stdin loop.
    Confirmation prompt before start (skip with ``--auto_start``);
    Ctrl+C stops the thread and disconnects the robot cleanly.
  * Autonomous mode requires ``--dataset.repo_id`` for action stats
    / feature shapes — pass the same dataset the policy was trained
    on. The bootstrap path that pulls canonical task / plan / memory
    runs in both REPL and autonomous modes so the model's first
    prompt matches training distribution.

Dry-run REPL behaviour is unchanged when ``--robot.type`` is not
passed.
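The background-thread shape reduces to something like this toy sketch (a stand-in runtime, not the real one; the foreground would read stdin where this just sleeps):

```python
import threading
import time

class ToyRuntime:
    # stand-in for the real runtime: run() must keep ticking at
    # chunk_hz / ctrl_hz regardless of what the foreground is doing
    def __init__(self):
        self.stop_event = threading.Event()
        self.ticks = 0

    def run(self):
        while not self.stop_event.is_set():
            self.ticks += 1          # generate chunk / dispatch action
            time.sleep(0.01)

runtime = ToyRuntime()
worker = threading.Thread(target=runtime.run, daemon=True)
worker.start()

try:
    time.sleep(0.05)                 # foreground: stdin loop goes here
except KeyboardInterrupt:
    pass                             # Ctrl+C falls through to cleanup
finally:
    runtime.stop_event.set()         # stop the worker thread...
    worker.join()                    # ...and wait before disconnecting
```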

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 18:30:56 +02:00