Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text
CE and flow MSE: actions are the primary output and the flow head
should dominate the gradient signal. SmolVLA2 was defaulting both
weights to 1.0, which inverts that — text CE (~0.5-2.0 nats) ends up
larger than flow MSE (~0.1-1.0), so the action expert gets less
gradient than the LM head despite being the primary task.
Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0.
Same as ``pi052`` (the new full reproduction policy).
Also pin the values explicitly in the SLURM launcher so the choice is
visible and overridable per-run rather than buried in the config
default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New ``lerobot.policies.pi052`` (parallel to ``smolvla2``) that adds
text-prediction + hierarchical-inference on top of the existing π0.5
implementation. Mirrors the paper's §IV.D dual-head training:
L = H(text) + α * ‖(ω - a) - f_θ_action(...)‖², α = 10
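A minimal sketch of how the two terms combine in the dual-head forward,
assuming the config exposes text_loss_weight / flow_loss_weight and
stand-in tensors for the two heads (names are illustrative, not the
exact internals of modeling_pi052.py):

    import torch.nn.functional as F

    def dual_head_loss(text_logits, text_labels, flow_pred, flow_target,
                       text_loss_weight=1.0, flow_loss_weight=10.0):
        # CE over the supervised text span only; label positions outside
        # the span are set to -100 and ignored.
        text_loss = F.cross_entropy(text_logits.flatten(0, 1),
                                    text_labels.flatten(), ignore_index=-100)
        # Flow-matching MSE between predicted and target velocity.
        flow_loss = F.mse_loss(flow_pred, flow_target)
        return text_loss_weight * text_loss + flow_loss_weight * flow_loss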
Components:
* ``configuration_pi052.py`` thin PI05Config subclass; adds
recipe_path, text/flow loss weights
(default α=10 per paper), prompt
dropout knobs, ``unfreeze_lm_head``.
* ``text_processor_pi052.py`` PI052TextTokenizerStep — concatenates
rendered messages as ``Role: ...``
plain text (PaliGemma has no chat
template), tokenises with the
PaliGemma tokenizer, builds a label
mask covering supervised target
spans. Includes Pi 0.7 §V.E
per-component prompt dropout.
* ``processor_pi052.py`` make_pi052_pre_post_processors —
Rename + Batch + Relative +
Normalize + RenderMessagesStep +
PI052TextTokenizerStep + Device.
Falls back to π0.5's plain pipeline
when recipe_path is unset.
* ``modeling_pi052.py`` PI052Policy(PI05Policy) — re-enables
PaliGemma ``lm_head``, computes
text_loss via CE on the supervised
span, sums with flow_loss in
forward(), and adds select_message
for AR text generation at inference
(same surface as
SmolVLA2Policy.select_message so
SmolVLA2Runtime drives it unchanged).
Plus the supporting plumbing:
* recipe ``configs/recipes/pi052_hirobot.yaml`` — same Hi-Robot blend
as smolvla2_hirobot.yaml, with the same ``${subtask}`` /
``if_present`` supervision fix (current span at every frame, not
``${next_subtask}``).
* SLURM ``examples/training/pi052_hirobot.slurm`` — full training
command matching the SmolVLA2 launcher.
* factory registration: ``--policy.type=pi052`` resolves to
PI052Policy with the new processor.
Same multi-rate runtime (``lerobot.policies.smolvla2.inference``)
drives this policy too — both expose ``predict_action_chunk`` for the
action expert and ``select_message`` for the LM head.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After _tool-good (2000 steps, 0.50/0.50/0.20 dropout) the LM head's
distribution at position 0 shifted from EOS to subtask-vocabulary
tokens, but the head emitted bag-of-words output ("cube arm and")
rather than well-formed sentences. That's the expected
mid-fine-tuning phase: token-level supervision has landed,
sequence-level grammar hasn't.
Two changes for the next retrain:
* STEPS=15000 (from 2000) — chat-pretrained backbones need O(10k+)
steps to walk their pretraining priors down far enough to commit
to the fine-tuned distribution structurally, not just at the
token level. _tool-g2's bag-of-words output suggests the model is
on the right path; it just needs more gradient signal.
* plan/memory dropout 0.50 -> 0.30 — 0.50 was probably too
aggressive for a small dataset. Half the training samples had
crucial context missing, which slows down learning the full
conditional structure. 0.30 still regularises against prompt
leakage but lets the model learn proper grammar first; the
higher dropout can be revisited once the head is solid.
Subtask dropout stays at 0.20 since subtask isn't in the high-level
prompt anyway (recipe fix removed the "Current subtask:" message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe fix (target=${subtask} instead of ${next_subtask}) shifted
the LM head's failure mode from "emit newlines" to "emit EOS at
position 0". On the new ``_tool-good`` checkpoint inference produces
exactly one token (``<end_of_utterance>``, id 49279) and decodes to
empty. That's the chat-pretrained backbone's short-turn EOS prior
not yet being overridden by 2000 steps of fine-tuning supervision.
Expose three knobs so the operator can probe whether the head has
real subtask-token probability mass *under* the EOS argmax without
recompiling or retraining:
--text_min_new_tokens=N suppress EOS for the first N tokens
--text_temperature=T sample at temperature T
--text_top_p=P nucleus filtering at top-p
These are explicitly off-policy (training was greedy / no min-tokens),
so they shouldn't ship in production runs — but they let us tell
whether the model has *learned* subtask prediction (just under EOS)
or hasn't yet. If forcing min_new_tokens=3 with temperature=0.5
produces a sensible subtask, the model is fine and just needs more
training steps to walk EOS down. If it produces gibberish, training
hasn't progressed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the recipe fix (target=${subtask} at every frame) the model
can still reach low text_loss by reading the answer off the plan in
the prompt: at training the prompt contains the 6-step plan, and the
current subtask is one of those steps, so the model just learns
"active step N matches subtask N" and never needs to look at the
image. Symptom at inference: subtask string is set but never updates
because the model isn't really conditioning on the visual progress.
Drop plan and memory with p=0.50 each: for half of the training
frames the prompt is just "${task}" (constant for this dataset) plus
the visual prefix, and the visual prefix is the only place the answer
can come from. Forces the LM head to actually use vision.
``subtask_dropout`` stays at 0.20 because subtask isn't in the
high-level prompt anymore (recipe fix removed the "Current subtask:
X" message); the knob still affects other sub-recipes that reference
it as context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Normalize tensor and sequence sample indices before prompt dropout so distributed batched preprocessing does not try to cast full index tensors to scalars.
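A rough sketch of the normalisation, assuming the sample ``index`` can
arrive as a scalar, a tensor, or a sequence under batched preprocessing
(helper name is hypothetical):

    import torch

    def normalize_sample_indices(index):
        # Prompt dropout seeds its RNG per sample, so reduce whatever the
        # dataloader hands us to a plain list of ints, one per sample.
        if torch.is_tensor(index):
            return [int(i) for i in index.reshape(-1).tolist()]
        if isinstance(index, (list, tuple)):
            return [int(i) for i in index]
        return [int(index)]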
Co-authored-by: Cursor <cursoragent@cursor.com>
Match the operator's current training command for the _tool6 retrain:
* default DATASET / POLICY_REPO_ID / JOB_NAME point at the tool6
iteration (super_poulain_full_tool3 → smolvla2_hirobot_super_poulain_tool6)
* STEPS default 2000 (short enough to iterate; bump to 10k for full)
* save_freq=$STEPS so the only checkpoint is the final one
* OUTPUT_DIR includes step count so successive runs don't clobber
* Drop the wider augmentation envelope I added earlier — back to
default ColorJitter ranges (brightness ±20% etc) since the
high_level_subtask recipe fix (current-subtask supervision) is
expected to fix the LM-head collapse on its own; the augmentation
is just the standard regulariser, not a load-bearing widener.
* prompt-dropout fractions stay at the original 0.15 / 0.15 / 0.20.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The high_level_subtask recipe targeted ``nth_next(style=subtask, offset=1)``,
which on the last span of any episode resolves to None. The recipe had no
``if_present`` guard on the target, so the renderer emitted an empty
assistant turn and cross-entropy supervised the model on the chat
template's structural newlines (``\n``). Across the dataset this trained
the LM head's argmax at position 0 to collapse to ``\n`` whenever no
transition was imminent (i.e. most frames). Visible failure mode at
inference: the head emits 40+ newlines + ``<end_of_utterance>`` every
chunk boundary while the action expert keeps working — confirmed by
running the dry-run on dataset frame 0 with the dataset's own image
and seeing the same ``\n × 44`` collapse.
Switch to the Pi 0.5 / Pi 0.7 supervision pattern: at every frame, the
assistant target is the *current* active subtask span text (via
``${subtask}`` → ``active_at(t, style=subtask)``). Always non-empty,
always scene-grounded, ``if_present: subtask`` skips frames with no
active span instead of emitting a degenerate empty turn.
Runtime callsite update: ``_msgs_for_subtask`` no longer feeds a
"Current subtask: X" user message into the prompt (that would be
circular — we'd be telling the model the answer). Transition
detection moves into the runtime — when the predicted subtask differs
from ``state['current_subtask']``, the existing ``set_if_changed``
path fires ``subtask_change`` and downstream memory updates. Same
event surface; the supervision target is now always meaningful.
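A rough sketch of the runtime-side transition detection described above
(helper and state keys are illustrative; the real code goes through the
existing ``set_if_changed`` path):

    def on_subtask_prediction(state: dict, fire_event, predicted: str) -> None:
        # Only update state and fire subtask_change when the predicted
        # subtask actually differs from the one currently tracked.
        if predicted and predicted != state.get("current_subtask"):
            state["current_subtask"] = predicted
            fire_event("subtask_change")  # downstream memory update hangs off this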
Requires re-annotating the dataset and retraining for the fix to land
in the checkpoint, but the recipe + runtime change is what enables it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the dry-run REPL only ticked on user input (empty Enter
just redrew), so the bisection test "does the LM head produce text on
start_frame=0?" required typing something arbitrary to drive a tick.
Just run ``step_once`` at startup — the obs diagnostic *and* the
subtask gen both fire automatically, the diag row populates, and the
operator can read the result before pressing any key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tensor-level comparison between dry-run (dataset frame) and live-
robot inference showed no runtime-level mismatch: same shape, dtype,
device, channel order, batch dim, and normalization on both paths.
The remaining variable: front-camera mean brightness was 0.26 live vs
0.39 on the dataset frame, ~33% darker. Training augmentation only
covered ±20% brightness, so the live scene sits just outside the
supervised envelope and the LM head collapses to its dominant prior.
Widen the augmentation knobs for the next retrain:
* brightness 0.8–1.2 → 0.5–1.6 (covers ~30% darker / 60% lighter)
* contrast 0.8–1.2 → 0.6–1.5
* saturation 0.5–1.5 → 0.3–1.7
* hue ±0.05 → ±0.10
* affine ±5°/±5% → ±15°/±15% (covers cube placement / camera drift)
* max_num_transforms 3 → 4
And bump prompt-component dropout (subtask 0.20 → 0.30) so the LM
can't lean on stale memorised plan/memory at inference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dry-run REPL only fires a tick when the user types, so the
``_log_obs_tensors_once`` diagnostic never reached stdout (the
provider was never called). Probe the provider once at startup —
the result is discarded; we only care about the obs log it triggers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helper that prints (once per provider lifetime) every
``observation.*`` tensor the policy is about to see, with its shape,
dtype, device, and per-channel min/max/mean/std. Wired into both the
dry-run dataset path and the live-robot path.
Now we can bisect train/inference mismatch *at the tensor level* —
if the same checkpoint produces coherent text on one path's tensors
and ``\n`` on the other's, and the printed tensor stats differ
materially, the bug is in the observation prep, not in the model or
the training distribution.
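The helper is roughly this shape (a sketch; the real implementation
lives on the provider and formats differently):

    import torch

    _logged_once = False

    def log_obs_tensors_once(obs: dict) -> None:
        # Print every observation.* tensor's shape/dtype/device plus
        # per-channel min/max/mean/std, once per provider lifetime.
        global _logged_once
        if _logged_once:
            return
        _logged_once = True
        for key, val in sorted(obs.items()):
            if not key.startswith("observation.") or not torch.is_tensor(val):
                continue
            v = val.detach().float()
            chans = v.unbind(dim=1) if v.ndim == 4 else [v]  # dim 1 = channels for images
            stats = " ; ".join(
                f"[{c.min():.3f}, {c.max():.3f}] mean={c.mean():.3f} std={c.std():.3f}"
                for c in chans)
            print(f"{key}: {tuple(val.shape)} {val.dtype} {val.device} | {stats}")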
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the training-time torchvision-v2 ColorJitter / SharpnessJitter /
RandomAffine pipeline to dataset frames in dry-run, so we can isolate
whether the LM head's collapse to '\n' on live frames is:
* pure scene-content OOD (unaugmented dataset frames work, mildly
augmented ones still work — model has learned the augmentation
distribution, only fails when the scene content itself diverges)
* hyper-specific memorisation (dry-run with augmentation also
collapses to '\n' — head is nailed to the exact unperturbed
training samples and only the retrain helps)
Usage:
lerobot-smolvla2-runtime --no_robot --policy.path=... \
--dataset.repo_id=... --dataset.episode=0 \
--dataset.start_frame=1000 \
--dataset.augment_at_inference
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So the operator can compare live joint values to the dataset's
``observation.state`` mean/std and spot when the robot's home pose is
several σ off the supervised support region. State OOD is the
remaining viable hypothesis for why the live LM head collapses to
``\n`` even though images are pixel-shape-matched.
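The comparison is roughly (sketch; assumes per-joint mean/std are
available from the dataset stats):

    import numpy as np

    def state_sigma_report(live_state, mean, std, joint_names):
        # How many dataset standard deviations each live joint sits from the
        # training mean; large values flag a home pose outside supervised support.
        z = (np.asarray(live_state) - np.asarray(mean)) / np.maximum(np.asarray(std), 1e-6)
        for name, zi in zip(joint_names, z):
            flag = "  <-- check" if abs(zi) > 3 else ""
            print(f"{name}: {zi:+.2f} sigma{flag}")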
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Print one warning the first time the robot observation provider runs,
showing the live camera resolution, the dataset's training resolution,
and whether we resized. Lets the operator confirm at a
glance that the visual prefix really is being fed at the same shape
the model saw at training — instead of guessing whether the resize
fired silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for the LM head's empty-completion symptom on the live robot
(while the same checkpoint produced sensible subtask/plan/memory in
``--no_robot`` dry-run on dataset frames): the camera observation was
flowing into the model at its native resolution. A Mac/USB webcam
hands us 1280×720 or 1920×1080; the dataset was recorded at the
feature schema's ``observation.images.*['shape']`` resolution
(typically 480×640). SmolVLA's internal ``resize_with_pad(512, 512)``
*does* fit both — but with very different pad geometry, so visual
tokens at each tile carry different content than at training. Action
expert tolerates this; the tightly-supervised LM head goes OOD and
the head's distribution at position 0 collapses to its dominant mode
(``\n`` ×N then ``<end_of_utterance>`` for this checkpoint).
The fix: in ``_build_robot_observation_provider``, pre-compute the
camera-key → (H, W) target from ``ds_features`` and ``cv2.resize``
each live frame to that shape before tensorising. The downstream
``resize_with_pad`` then sees the same input geometry as training and
the LM head returns to producing readable subtask text under plain
greedy decoding — the same as dry-run.
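The provider-side resize is roughly (sketch; variable names are
illustrative, the real code derives the targets from ``ds_features``):

    import cv2
    import numpy as np

    # Built once from the dataset feature schema, e.g. {"front": (480, 640)}.
    cam_target_hw = {"front": (480, 640)}

    def resize_to_dataset_shape(cam_key: str, frame: np.ndarray) -> np.ndarray:
        target = cam_target_hw.get(cam_key)
        if target is None or frame.shape[:2] == tuple(target):
            return frame
        h, w = target
        # cv2.resize takes (width, height); matching the training geometry means
        # the downstream resize_with_pad sees the same input it saw at training.
        return cv2.resize(frame, (w, h), interpolation=cv2.INTER_AREA)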
Also drops the inference-time patches (``min_new_tokens``,
``temperature``, ``top_p`` overrides) on the four high-level callers.
They were band-aids around the visual-distribution shift, not a real
LM problem, and they drift inference off the training distribution.
Greedy argmax is what training matched. The ``select_message``
signature still accepts the knobs for callers that want them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempt only masked the tokenizer's eos_token_id during the
min_new_tokens prefix. The empty-completion symptom persisted because a
memorised SmolVLM head doesn't just want EOS — its top-1 at position 0
is *some* special token, and when EOS is masked the argmax shifts to a
sibling (``<|im_end|>``, ``<image>``, ``<fake_token_around_image>``,
``<row_X_col_Y>``, …). Those tokens survive generation but then get
stripped by ``decode(skip_special_tokens=True)``, so the runtime still
saw ``last_raw='(empty)'`` every chunk boundary.
Mask the full ``tokenizer.all_special_ids`` set instead. Forces the
head to commit to a normal vocabulary token before it can close or
quietly poison the turn.
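The masking is roughly (sketch of the per-step logit edit; the decode
loop itself is elided):

    import torch

    def mask_special_ids(logits: torch.Tensor, step: int, min_new_tokens: int,
                         special_ids: list[int]) -> torch.Tensor:
        # For the first min_new_tokens decode steps, push every special token
        # (EOS, <image>, chat markers, ...) to -inf so the argmax must land on
        # a normal vocabulary token.
        if step < min_new_tokens:
            logits = logits.clone()
            logits[..., special_ids] = float("-inf")
        return logits

Called per decode step with ``special_ids=tokenizer.all_special_ids``.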
Also: when decode returns empty but tokens *were* generated, expose
the raw token ids and the special-tokens-included decoded string via
``policy._last_select_message_debug``. The runtime surfaces this in
the scrollback so the operator can see what the head is actually
emitting — distinguishing "head EOS-ing" from "head emitting image
placeholders" from "head emitting chat-template fragments".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot run confirmed the LM head is producing 0 tokens at every
chunk boundary (empty:N counter climbing, no exception in scrollback):
the model EOS-es at decode step 0. That's the memorisation collapse —
training reached text_loss=6e-6 by overfitting one trajectory whose
supervised subtask turn ended in EOS, and at inference the head's
argmax for token 0 is EOS regardless of the actual frame.
Two changes in select_message:
* ``min_new_tokens`` parameter masks the EOS logit to -inf until at
least N real tokens have been decoded. Without this the head's
"EOS first" prior produces an empty completion every single time.
* The runtime callers now pass ``min_new_tokens=5..10`` plus
``temperature=0.4..0.5`` + ``top_p=0.9``. Sampling at moderate
temperature with nucleus filtering also helps break the greedy
argmax collapse — when the model has memorised one continuation,
greedy keeps replaying it; nucleus sampling forces it to commit
to *some* coherent continuation that's well-supported by the
prefix even when greedy's top-1 is degenerate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two improvements for diagnosing why ``last_raw`` stays empty:
1. The autonomous panel-redraw thread calls console.clear() every
0.5 s, wiping any log lines the runtime printed since the last
redraw. So warnings from generation (``[warn] subtask gen failed:
...``, ``[info] subtask gen rejected (gibberish): ...``) flashed
for milliseconds and disappeared, leaving the operator blind.
Capture log_lines from each tick into a bounded scrollback
(last 12 entries) and render them inside the panel itself, below
the diag row. They now stick across redraws until rotated out.
2. ``empty`` counter for subtask gen. Persistent empty completions
are their own failure mode — the LM head EOS-es immediately from
the chat-template generation prompt, distinct from "generated
something but filter rejected it". The diag row now reads:
subtask diag repeat:0 gibberish:0 empty:14 last_raw: '(empty)'
^^^^^^^
plus a periodic log line every 10 empties so the cause is also
surfaced in the scrollback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both HighLevelSubtaskFwd and LowLevelForward are gated on
'action queue is empty'. With LowLevelForward listed first, it refilled
the queue on the empty-queue tick before HighLevelSubtaskFwd got to
check — so the gate I added in the previous commit made the high-level
step a permanent no-op after the initial bootstrap. Visible symptom:
subtask string never advances past whatever bootstrap seeded, no
subtask_change events, memory stays unset, and the new overfit
diagnostics never appear on the panel because last_subtask_raw is
never written.
Move all high-level steps (subtask, memory, interjection, vqa) ahead
of LowLevelForward. On an empty-queue tick the subtask refreshes
first, the new string flows into the next chunk's prompt, then
LowLevelForward generates the chunk, then DispatchAction drains it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autonomous-mode panel now surfaces what the model is *actually*
producing at every chunk boundary, not just what got accepted:
* last_subtask_raw most recent generation (accepted or not)
* subtask_repeat_count times the same accepted string regenerated
* subtask_gibberish_count rejections by the gibberish filter
* memory_gibberish_count / plan_gibberish_count for the other heads
These let the operator see memorisation collapse without scrolling
back through logs:
subtask diag repeat:8 gibberish:0 last_raw: '<same string>'
^^^^^^^^^^ → model can't move past current phase
subtask diag repeat:0 gibberish:14 last_raw: 'Ass:::'
^^^^^^^^^^^^^^^^^^^^^^ → LM collapsed to template salad
Also silences the per-action ``Relative goal position magnitude had
to be clamped`` warning. The clamp fires every dispatch tick when the
model emits stale joint targets, flooding the panel at ctrl_hz=30.
Replaced the bare ``logging.warning`` call in robots/utils.py with a
module logger so it can be selectively raised to ERROR. Operators
who need the per-tick clamp detail can use ``-v``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third stdin channel alongside 'task:' and bare interjections:
rephrase: <text>
Swaps state['task'] with the new string while preserving plan/memory/
subtask. Lets the operator probe how robust the model is to wording
variations of the same task — the trained augmentation provided
n_task_rephrasings≈30 task wordings per dataset task, and this is the
direct way to exercise that distribution at inference without
generating a fresh plan via user_interjection_response.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both stdin handlers (autonomous mode and rich REPL) gated 'task:' to
'only if no task is set yet' — once the initial task existed, typing
'task: <new task>' silently fell through to the interjection branch.
Make 'task:' always override the active task and clear stale
plan/memory/subtask so the next high-level pass regenerates context
from scratch for the new task.
For rephrasings within the same task, the interjection path
(user_interjection_response recipe) is still the right channel — it
refreshes the plan and emits a paired <say> in one trained call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime is single-threaded. `HighLevelSubtaskFwd` at HzTrigger(1.0)
fires every loop iteration on MPS because each `select_message` call
takes ~2 s, longer than its 1/hz period. The whole tick stretches to
~2.5 s, so `DispatchAction` (HzTrigger 30) only pops a single action per
loop iteration — the queue drains at ~0.4 actions/sec instead of 30 and
the robot barely moves between chunk refreshes.
Two changes, both purely about scheduling — no threading:
* Gate `HighLevelSubtaskFwd` to fire only when the action queue is
empty, matching `LowLevelForward`'s refresh condition. The slow LLM
call now happens during the "think" phase between chunks, not on
every dispatch tick. Restores a clean sense → think → act cycle.
* `DispatchAction` catches up via wall-clock: when the trigger fires
after a stall, pop `round(elapsed * hz)` entries and send only the
most recent. Open-loop chunks are timestamped at ctrl_hz; sending
stale joint targets one-by-one would just lag the robot further
behind. The dynamixel smooths to the latest goal anyway.
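The catch-up arithmetic from the second bullet, roughly (a sketch,
assuming a deque-backed action queue and a stand-in send_action
callable):

    import time

    def dispatch_catch_up(queue, last_dispatch_t: float, hz: float, send_action) -> float:
        # After a stall, pop as many entries as wall-clock says should already
        # have gone out and send only the freshest one; the servo smooths to
        # that goal.
        now = time.perf_counter()
        n = max(1, round((now - last_dispatch_t) * hz))
        action = None
        for _ in range(min(n, len(queue))):
            action = queue.popleft()
        if action is not None:
            send_action(action)
        return now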
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous refresh threshold (queue > chunk_size // 2) made each
new chunk *telescope* past the previous one: at queue=25, we kicked
off a new chunk forward from the current observation, but by the
time the new chunk's first action was actually dispatched, the
robot had executed the remaining 25 actions of the previous chunk
— so the new chunk was planned from an observation 25+ steps stale.
Canonical sense → think → act loop: execute the full chunk, then
re-observe and replan. Refresh only when the queue is empty. Every
step of every chunk still gets dispatched to the robot (no
behaviour change there), but each chunk is now planned from an
observation that's at most one chunk's worth of dispatch latency
old, not "previous chunk's worth of stale state on top of that".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary regularisers to attack the
``text_loss=6e-6 = memorised one dataset`` failure mode that's
making the model collapse on real-robot input:
1. **Per-component prompt dropout** (Pi0.7 §V.E / plan's
``feat/pi05-prompt-dropout`` follow-up).
``SmolVLA2ChatTokenizerStep`` gains
``plan_dropout_prob`` / ``memory_dropout_prob`` /
``subtask_dropout_prob`` knobs (default 0.0 — opt-in). At training,
non-target messages whose rendered content starts with
``Plan:`` / ``Memory:`` / ``Current subtask:`` etc. are dropped
with their respective probability before tokenisation, with a
deterministic per-sample RNG keyed off the dataset ``index``.
``target_message_indices`` is re-mapped so the supervision still
lands on the right turn. Forces the model to handle missing
plan/memory/subtask context — directly attacks the real-robot
collapse where a stale or empty plan field puts the prompt OOD.
Surfaced on ``SmolVLA2Config`` as three floats so they're
``--policy.<knob>=<value>``-controllable from the train CLI;
plumbed through ``make_smolvla2_pre_post_processors``. A minimal
sketch of the dropout step is included below.
2. **Image augmentation** is already wired in lerobot via
``--dataset.image_transforms.enable=true`` (torchvision v2
ColorJitter + SharpnessJitter + RandomAffine, default 3 of 6
sampled per frame). No code change needed — just a CLI flag.
``examples/training/smolvla2_hirobot.slurm`` shows the full
training command with both enabled. Drop-in replacement for the
ad-hoc SLURM script Pepijn was using locally; same args, plus the
three dropout probs and the image-transforms flag.
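The sketch of item 1's dropout step, assuming each message carries a
rendered string content and the sample's dataset ``index`` is in hand
(the real step also handles block-shaped content):

    import random

    PREFIX_TO_KNOB = {"Plan:": "plan_dropout_prob",
                      "Memory:": "memory_dropout_prob",
                      "Current subtask:": "subtask_dropout_prob"}

    def drop_prompt_components(messages, target_indices, probs, sample_index):
        # Deterministic per-sample RNG: the same frame drops the same
        # components every epoch and on every worker.
        rng = random.Random(sample_index)
        kept, old_to_new = [], {}
        for i, msg in enumerate(messages):
            content = msg.get("content")
            text = content if isinstance(content, str) else ""
            drop = False
            if i not in target_indices:           # never drop the supervised turn
                for prefix, knob in PREFIX_TO_KNOB.items():
                    if text.startswith(prefix) and rng.random() < probs.get(knob, 0.0):
                        drop = True
                        break
            if not drop:
                old_to_new[i] = len(kept)
                kept.append(msg)
        # Re-map supervision onto the surviving turn indices.
        new_targets = [old_to_new[i] for i in target_indices if i in old_to_new]
        return kept, new_targets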
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``LowLevelForward`` was calling ``select_action()`` once per
``chunk_hz`` tick. SmolVLA's ``select_action`` is a thin queue-pop:
it returns one action per call and only re-runs the expensive
flow-matching forward when its private internal queue empties.
Result: we got one action back per chunk_hz tick (1Hz default),
``DispatchAction`` at ctrl_hz=30 popped it instantly, then the queue
sat empty for ~1 s waiting for the next tick. Net throughput was
1 dispatched action/sec instead of the 30 we wanted.
Switch to ``predict_action_chunk`` and enqueue every step of the
returned ``(batch, n_action_steps, action_dim)`` chunk. Refresh
only when the queue is below half a chunk so we don't burn one
flow-matching forward per chunk_hz tick — saves ~5x inference cost
on this hot path. At ctrl_hz=30, chunk_size=50, the queue drains
in ~1.7s before the next refresh, giving smooth dispatch at the
control rate the robot was trained on.
Side effect: ``state['last_chunk_size']`` records how many actions
the most recent chunk produced — useful for the panel later if we
want to surface "chunks generated" alongside "dispatched".
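The refresh logic is roughly (a sketch; the real step is
LowLevelForward in the runtime):

    def maybe_refresh_chunk(policy, obs, queue, chunk_size: int, state: dict) -> None:
        # Only pay for a flow-matching forward when the queue has drained
        # below half a chunk; then enqueue every step of the returned chunk.
        if len(queue) > chunk_size // 2:
            return
        chunk = policy.predict_action_chunk(obs)   # (batch, n_action_steps, action_dim)
        actions = chunk[0]                          # batch of 1 at inference
        for action in actions:
            queue.append(action)
        state["last_chunk_size"] = len(actions)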
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot run was unreadable for two reasons:
1. The panel surfaced ``queued actions: 0`` (always zero — dispatch
pops faster than chunk_hz generates) and gave no signal that
actions were actually reaching the robot. The only sign of life
was the safety-clamp warning lines scrolling past.
2. The text head consistently collapses to ``the`` / ``Ass``
fragments on real-camera input (memorisation wall). The old
gibberish filter caught ``":":":"`` JSON salad but let
single-token fragments through, and the ``[info] subtask gen
produced no text this tick`` line flooded the panel every second.
Changes:
* ``DispatchAction`` bumps ``state["actions_dispatched"]`` each
tick; panel renders it next to queue depth. Operator can see
the policy IS issuing actions even when text is broken.
* ``_looks_like_gibberish`` now also rejects:
- too few unique alphabetic tokens (``the``, ``the the``, ...)
- chat-template marker leakage (``Assistant:``, ``Ass\\n::``)
catching the actual failure mode on real-robot frames.
* Gibberish rejections log only the first occurrence + every 30th
after that, with a count, so the panel stays legible.
* Empty completions no longer log at all (was every tick).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dry-run REPL had a clean ANSI-clear-+-rich-panel layout via
``_redraw`` showing task / subtask / plan / memory / queued-actions /
pending-tool-calls; autonomous mode just had bare ``> `` plus log
lines scrolling past the user. Same data, two presentations.
Extract ``_make_state_panel_renderer(runtime, mode_label=...)`` and
use it from both ``_run_repl`` (called per user input) and
``_run_autonomous`` (called both on user input *and* on a 0.5s
background timer so subtask / plan / memory refreshes from the
runtime's own loop become visible without the user typing anything).
Title bar shows ``dry-run`` vs ``autonomous`` so it's obvious which
mode you're in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects:
* ``list[block]`` → keep text blocks, drop images
* ``None`` → ``[{type:text, text:""}]``
* ``str`` / other → ``[{type:text, text:str(content)}]``
Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there's no prior
memory, which then renders as no-content / role-marker-only and the
model hallucinates ``Assistant:`` fragments. Subtask gen got further
because its prompt always has at least the task string.
Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.
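The normalisation boils down to roughly this (a sketch of the
behaviour described above, not the exact helper):

    def normalize_content(content):
        # -> the [{type: text, text: ...}] block list SmolVLM's chat template expects.
        if isinstance(content, list):
            # keep text blocks, drop image blocks (images travel separately)
            return [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
        if content is None:
            return [{"type": "text", "text": ""}]
        return [{"type": "text", "text": str(content)}]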
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLM's chat template (and many other multimodal templates) declares
``message['content']`` as a list of typed blocks and iterates it
expecting dicts with a ``'type'`` field:
{% for line in message['content'] %}
{% if line['type'] == 'text' %}{{ line['text'] }}
{% elif line['type'] == 'image' %}{{ '<image>' }}
{% endif %}
{% endfor %}
When the caller passes ``content`` as a plain ``str`` (which we did
throughout ``_msgs_for_subtask`` / ``_msgs_for_memory`` etc.), Jinja
silently iterates the string character-by-character. ``'P'['type']``
returns nothing; neither branch fires; *no text tokens get emitted*.
The model receives a prompt containing only role markers
(``User:<end_of_utterance>\nAssistant:``) and predictably continues by
emitting ``Assistant:`` fragments — the gibberish ``subtask: Ass\n::``
on the runtime panel.
Before calling ``apply_chat_template``, walk the messages and rewrite
any string ``content`` into ``[{'type': 'text', 'text': content}]``.
The template's text branch then fires correctly and the model sees
the actual user/assistant text, not just structural tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``PolicyProcessorPipeline.__call__`` already wraps its input via
``to_transition`` (defaulting to ``batch_to_transition``) before
running the steps, and unwraps via ``to_output`` (defaulting to
``transition_to_batch``) afterwards. The input format is therefore a
*flat batch dict* keyed by ``observation.*`` / ``action`` / etc., not
an ``EnvTransition``.
Previous attempt pre-wrapped the observation into a transition with
``TransitionKey.OBSERVATION`` as the key, then handed *that* to the
pipeline — which fed it to ``batch_to_transition``, which looked for
top-level ``observation.*`` entries, found none (they were nested
inside the enum key), and produced an empty observation. Every step
then bailed with ``ObservationProcessorStep requires an observation
in the transition.``
Pass the flat dict from ``build_inference_frame`` straight to the
preprocessor — it does the wrap/unwrap itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``EnvTransition`` is declared as a ``TypedDict`` keyed by
``TransitionKey.OBSERVATION.value`` (the string ``'observation'``),
but every concrete ``ProcessorStep`` in the pipeline indexes the
transition with the enum *member* (``transition[TransitionKey.
OBSERVATION]`` / ``transition.get(TransitionKey.OBSERVATION)``).
Those are two different keys in a Python dict — string key vs enum
key — so steps couldn't find the observation we'd placed under the
string variant, and bailed every tick with
``ObservationProcessorStep requires an observation in the
transition``.
Build the transition with the enum members directly. Matches how
``BatchProcessor``, ``RelativeActionProcessor``, ``HilProcessor``,
etc. read the dict.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``robot.get_observation()`` on omx_follower (and most lerobot robots)
returns:
* per-joint scalar floats with ``.pos`` suffix
(``shoulder_pan.pos: 0.123``, ``shoulder_lift.pos: 0.456``, ...)
* per-camera ndarrays keyed by the camera config name (``wrist:
ndarray(H,W,3)``)
But the trained policy expects:
* single ``observation.state: tensor[N_joints]`` vector
* image keys prefixed: ``observation.images.<cam_key>:
tensor[1, 3, H, W]``
``prepare_observation_for_inference`` only handles the tensor /
batch-dim / device step — it crashes on scalar floats with
``expected np.ndarray (got float)``. The right helper is
``build_inference_frame`` which uses the dataset's feature schema
(``ds_meta.features``) to:
1. extract the right raw keys per dataset feature,
2. fold ``shoulder_pan.pos`` / ``shoulder_lift.pos`` / ...
into a single ``observation.state`` ndarray,
3. prefix camera keys with ``observation.images.``,
4. delegate to ``prepare_observation_for_inference`` for the
tensor / batch / device step.
Pass ``ds_meta.features`` into the observation provider and switch
to ``build_inference_frame`` when available; fall back to the bare
``prepare_observation_for_inference`` only when no dataset is
provided (rare — autonomous mode already requires it).
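The folding step is roughly (a sketch; the real ``build_inference_frame``
reads joint and camera names from ``ds_meta.features`` and then delegates
to ``prepare_observation_for_inference``):

    import numpy as np

    def fold_robot_observation(raw_obs: dict, state_joint_keys: list[str],
                               camera_keys: list[str]) -> dict:
        # shoulder_pan.pos, shoulder_lift.pos, ... -> one observation.state
        # vector; camera frames get the observation.images. prefix the
        # policy expects.
        frame = {"observation.state": np.array([raw_obs[k] for k in state_joint_keys],
                                               dtype=np.float32)}
        for cam in camera_keys:
            frame[f"observation.images.{cam}"] = raw_obs[cam]
        return frame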
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The policy preprocessor pipeline is transition-shaped — its steps
read ``TransitionKey.OBSERVATION`` off an ``EnvTransition`` dict, not
a flat ``RobotObservation`` dict. Passing the raw observation through
made every step bail with
``ObservationProcessorStep requires an observation in the transition``,
which the runtime swallowed at warning level. ``select_message`` then
got called with no ``observation.images.*`` features and crashed
with ``All image features are missing from the batch``.
Mirror ``lerobot-record``'s preamble:
1. ``prepare_observation_for_inference`` → numpy → torch, ``CHW``
image layout, ``[0,1]`` scaling, add batch dim, move to device.
2. Wrap into an ``EnvTransition`` (``{TransitionKey.OBSERVATION.value:
...}`` plus ``COMPLEMENTARY_DATA: {}`` and ``None``s for the rest)
so transition-aware steps see the keys they expect.
3. Run preprocessor.
4. Unwrap the transition's ``OBSERVATION`` slot to get the final
flat dict the policy's ``select_action`` / ``select_message``
consume.
Image features now reach the policy; the autonomous loop produces
real actions instead of swallowing warnings every tick.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but
``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner
value must be the actual ``CameraConfig`` subclass instance for the
chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts
blew up in ``RobotConfig.__post_init__`` with
``AttributeError: 'dict' object has no attribute 'width'`` when it
iterated cameras and tried to read attributes.
Look up the right subclass per-camera by its ``"type"`` field via
``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import
dance we already do for ``RobotConfig``: eagerly walk
``lerobot.cameras``'s submodules so the registry is populated
before lookup). Construct an instance with the rest of the dict's
fields. On an unknown camera type, raise a clean ``ValueError``
listing the available choices.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``RobotConfig._choice_registry`` is populated as a side-effect of
each robot's ``@RobotConfig.register_subclass`` decorator running,
and those decorators only fire when the corresponding
``lerobot.robots.<name>`` module is imported. The package's
``__init__.py`` doesn't import them — instead ``make_robot_from_config``
does it lazily in its big if/elif chain.
``_build_robot`` jumped the gun: called ``RobotConfig.get_choice_class
(robot_type)`` before any robot module had been imported, so the
registry was empty and every ``--robot.type=<X>`` produced
``KeyError: 'X'`` (e.g. ``KeyError: 'omx_follower'``).
Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules`` and
``importlib.import_module`` each one before the lookup. ~200ms on the
first invocation, negligible for an autonomous run. On a real
``KeyError`` (typo / unsupported robot), raise a clean ``ValueError``
listing the registry's available choices instead of a bare KeyError.
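The eager walk is roughly (a sketch; the real code also wraps the
subsequent lookup's KeyError):

    import importlib
    import pkgutil

    import lerobot.robots

    def populate_robot_registry() -> None:
        # Import every lerobot.robots submodule so each
        # @RobotConfig.register_subclass decorator runs and the choice
        # registry is filled before lookup.
        for mod in pkgutil.iter_modules(lerobot.robots.__path__):
            importlib.import_module(f"lerobot.robots.{mod.name}")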
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled action-norm safety clip duplicated what every
``RobotConfig`` already exposes — ``max_relative_target`` — and at
the wrong layer (after postprocess but before send_action, instead
of inside the robot driver where every other lerobot entry point
puts it). The norm clip also rejected entire actions instead of
clipping per-motor relative motion, so a single rogue joint would
kill the whole tick.
Replace with ``--robot.max_relative_target``: a string parsed as
either a bare float (uniform per-motor cap) or a JSON object
mapping motor name → cap. Passed through to
``RobotConfig(max_relative_target=...)`` at robot construction;
the driver's ``send_action`` clips each commanded joint position
relative to the current measured one before issuing it on the bus —
same behaviour ``lerobot-record`` ships.
Also bump ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new
chunk per second is what the trained checkpoint can comfortably
keep up with on common hardware and gives smoother motion than
sub-second chunk regenerations (no RTC interpolation between
chunks yet).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime CLI was deliberately scoped to dry-run only: it
hard-coded ``robot_executor=None`` and printed a "real-robot
integration is a follow-up" warning even when ``--no_robot`` was
omitted. The runtime *engine* was already structured for real-robot
operation (separate ``LowLevelForward`` chunk-rate generation +
``DispatchAction`` ctrl-rate dispatch with a ``robot_executor``
hook); only the wiring was missing.
Add the wiring:
* ``_load_policy_and_preprocessor`` now also returns the
postprocessor (action denormaliser).
* ``--robot.type`` / ``--robot.port`` / ``--robot.id`` /
``--robot.cameras`` (JSON) build a ``Robot`` via
``make_robot_from_config`` and connect it.
* ``_build_robot_observation_provider`` reads
``robot.get_observation()`` each call, drops the language
columns (runtime drives messages itself), and runs the policy's
preprocessor (rename → batch → device → normalise).
* ``_build_robot_action_executor`` postprocesses the policy's
action tensor (denormalise), converts to the ``{joint: value}``
dict via ``make_robot_action(action, ds_meta.features)``, and
calls ``robot.send_action(...)``. Optional ``--max_action_norm``
safety clip rejects ticks whose action L2 norm exceeds the
threshold (kill-switch when bringing up a new robot).
* ``_run_autonomous`` runs ``runtime.run()`` in a background
thread (the policy must keep generating chunks at chunk_hz and
dispatching at ctrl_hz regardless of stdin) and handles user
interjections / VQA queries from the foreground stdin loop.
Confirmation prompt before start (skip with ``--auto_start``);
Ctrl+C stops the thread and disconnects the robot cleanly.
* Autonomous mode requires ``--dataset.repo_id`` for action stats
/ feature shapes — pass the same dataset the policy was trained
on. The bootstrap path that pulls canonical task / plan / memory
runs in both REPL and autonomous modes so the model's first
prompt matches training distribution.
Dry-run REPL behaviour is unchanged when ``--robot.type`` is not
passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs combining to make the brand-new ``_tool3`` dataset
unloadable:
1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated
dataset folder but never creates a codebase-version tag, so
``api/datasets/<repo>/refs`` returns ``"tags": []``. Then
``LeRobotDatasetMetadata`` → ``get_safe_version`` →
``get_repo_versions`` returns empty and the loader raises
``RevisionNotFoundError``.
2. ``RevisionNotFoundError`` itself was unconstructible: its
``HfHubHTTPError.__init__`` indexes ``response.headers``
unconditionally on current ``huggingface_hub`` versions, so
constructing it without a real ``Response`` blew up with
``AttributeError: 'NoneType' object has no attribute 'headers'``,
masking the real "no tag" message.
Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and
``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset',
exist_ok=True)`` so the dataset is loadable straight from the Hub on
the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree
``CODEBASE_VERSION`` if info.json is missing/malformed; on tag
creation failure, prints the manual one-liner the user needs.
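The tag-creation flow is roughly (a sketch; error handling and the
fallback constant simplified):

    import json
    from huggingface_hub import HfApi

    def tag_codebase_version(repo_id: str, local_dir: str, fallback: str) -> None:
        # Read the dataset's codebase_version from meta/info.json and create
        # the matching tag so LeRobotDataset(repo_id) can resolve a revision.
        try:
            with open(f"{local_dir}/meta/info.json") as f:
                tag = json.load(f)["codebase_version"]
        except (OSError, KeyError, json.JSONDecodeError):
            tag = fallback   # in-tree CODEBASE_VERSION in the real code
        HfApi().create_tag(repo_id, tag=tag, repo_type="dataset", exist_ok=True)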
Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which
inherits HfHubHTTPError) for what is really a config issue, not an
HTTP failure. Raise plain ``RuntimeError`` with the same message —
the caller actually sees what's wrong instead of an upstream
attribute error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``RevisionNotFoundError`` inherits from
``huggingface_hub.HfHubHTTPError`` which made ``response`` a required
keyword-only argument on recent versions. Constructing it with just a
message string blew up with
``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only
argument: 'response'`` instead of surfacing the actual problem (the
dataset/checkpoint repo doesn't exist on the Hub yet).
Pass ``response=None`` explicitly. Fall back to the bare-message form
for older ``huggingface_hub`` versions that don't accept the kwarg.
Also clarify the message to call out the most common cause: typing a
hub repo id that hasn't been pushed yet (instead of just "needs a
version tag").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and
``executor.episode_parallelism=32``. With 2 cameras per dataset that
produced ~12× the original VQA call volume, all submitted concurrently.
Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's
KV cache pegged at 94% with 800+ in-flight requests, and the
multimodal cache corrupted with ``AssertionError: Expected a cached
item for mm_hash='...'`` (a known vLLM bug under image-heavy
concurrency). Module 1 and 2 ran fine; Module 3 was the bottleneck.
Pull back the multipliers to land in a sustainable spot:
* module_3.K: 3 (kept) — three diverse questions per emission,
where the diversity actually helps the LM head.
* module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original
emission rate. Net VQA volume is now ~3× original (K alone) on
a single camera, ~6× across both cameras — manageable.
* module_2.max_interjections_per_episode: 9 → 6 — still 2× the
default, fewer than the prior 3× to keep total request volume
in check.
* vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the
multimodal request path so the mm_cache doesn't desync.
* executor.episode_parallelism: 32 → 16 — half the episodes
in flight at once, so peak vLLM load is ~half.
n_task_rephrasings stays at 30 (text-only, doesn't load the image
path) and vlm.temperature stays at 0.7. The diversity gains are
preserved; only the throughput knobs come down.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following Pi0.7 §V (prompt expansion / diverse context conditioning),
push more atom variants per episode and higher VLM sampling
temperature so the training distribution has enough wording diversity
that the LM head is forced to use its parameters rather than memorise
specific (prompt, target) pairs.
Changes vs prior annotation pass:
* vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call
now produces diverse phrasings; same prompt yields different
completions across emissions.
* module_1.n_task_rephrasings: 10 → 30 — three times as many
``task_aug`` rows in language_persistent. ``${task}`` already
rotates through them deterministically per sample_idx (see
``_resolve_task`` in language_render.py).
* module_2.max_interjections_per_episode: 3 (default) → 9 — more
``user_interjection_response`` training samples + more plan
refresh events.
* module_3.K: 1 → 3 — three VQA pairs per emission tick instead of
one. Combined with the hz bump below, ~6× more VQA samples.
* module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission
rate within each subtask span.
Pushes to a new hub repo (``_tool3``) so the working ``_tool2``
dataset stays intact for comparison. ``${task}`` already wired to
rotate through ``task_aug`` rows, so no renderer change needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Memorised models can collapse to dominant-mode outputs (the
JSON-token salad ``":":":":...`` from VQA training) when the prompt
drifts even slightly from training distribution. Without a guard,
that gibberish lands in ``current_subtask`` / ``current_plan`` /
``current_memory``, which feeds the next tick's prompt and cascades
into worse outputs. The user observed exactly this: a clean run
followed by a tick that wrote ``" " "`` into plan and memory, then
slow recovery several ticks later.
Add ``_looks_like_gibberish`` heuristic (alpha density, repeating
chars, JSON-prefix sniff) and apply it before mutating state in
``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``.
Bad generations are logged inline (``[info] subtask gen rejected
(gibberish): "":":":..."``) so the user can see what was dropped, but
the state stays at its last-known-good value (typically the dataset
bootstrap) instead of being polluted.
VQA path is intentionally exempt — its training targets *are*
JSON-shaped, so the heuristic would false-positive on them.
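The heuristic is roughly (a sketch; thresholds illustrative):

    import re

    def looks_like_gibberish(text: str) -> bool:
        t = text.strip()
        if not t:
            return False                      # empty completions are handled separately
        if t[0] in "{[\"":                    # JSON-prefix sniff (VQA-style leakage)
            return True
        alpha = sum(c.isalpha() for c in t)
        if alpha / len(t) < 0.5:              # low alphabetic density (":":":" salad)
            return True
        if re.search(r"(.)\1{4,}", t):        # same character repeated 5+ times
            return True
        return False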
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user-typed task and the dataset's canonical task differ in
wording (capitalisation, ``green box`` vs ``green bin``, etc.). With
``text_loss`` driven down to ~6e-6 across 78 epochs the model is
memorised on the *exact* rendered training prompts: any wording drift
puts the prompt out of distribution and the model collapses to its
dominant training mode (VQA JSON output).
When ``--dataset.repo_id`` is set, automatically:
* read the canonical task string from the chosen episode (and use
it as ``--task`` when the user didn't pass one);
* pull the active ``plan`` / ``memory`` / ``subtask`` rows from the
persistent slice (latest row whose timestamp ≤ start frame's
timestamp — same semantics as the renderer's ``active_at``) and
seed them into the runtime state.
The first prompt the runtime builds at REPL start now mirrors what
the recipe rendered during training (task + active plan + active
memory + optional current subtask). The user can still override any
of these by typing.
Memorisation itself is upstream (training mix collapsed to too few
unique high-level targets); this commit only fixes the inference-side
prompt mismatch that was making the memorisation surface as gibberish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The four high-level steps shared one generic
``_control_context_messages`` that jammed task + plan + memory +
completed_subtask into a single user message. The recipes in
``smolvla2_hirobot.yaml`` each have a *specific* multi-message layout
(``memory_update``: ``user(task) → assistant(prev memory) →
user(completed subtask)``; ``high_level_subtask``: ``user(task+plan+
memory) → user(current subtask)``; ``user_interjection_response``:
``user(task) → assistant(prev plan) → user(interjection)``). After
``apply_chat_template`` those layouts produce different prompts than
the runtime's flattened single-user-turn version, and the model fell
back to its dominant training mode (VQA JSON output) — generating
``":":":":":":...`` repetition.
Add four per-recipe prompt builders (``_msgs_for_subtask``,
``_msgs_for_memory``, ``_msgs_for_interjection``, ``_msgs_for_vqa``),
each mirroring its sub-recipe's exact message structure including
the ``if_present`` skips. Wire each high-level step to its matching
builder. Inference prompts now line up with what the model saw in
training, so generation should produce coherent text instead of
repeated tokens.
Generic ``_control_context_messages`` is kept (still used by tests
and the no-recipe fallback path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous rewrite drove generation through ``vlm.generate()`` (the
standard SmolVLM path), which ignores SmolVLA's custom ``embed_prefix``
that interleaves images + lang + state. Result: the model received a
prompt format it had never been trained on at inference and emitted
JSON-fragment gibberish (``" " " ,",","`` ``cube lift {"...``).
Revert to the cumulative-buffer AR loop driven through
``vlm_with_expert.forward`` — the *same* forward call ``_compute_text_loss``
makes during training (``inputs_embeds=[prefix_embs, None],
use_cache=False, fill_kv_cache=True``). With ``fill_kv_cache=True``,
every layer routes through ``forward_attn_layer``, which gracefully
skips ``None`` expert inputs (``if hidden_states is None or layer is
None: continue``); cross-attention layers — which would otherwise hard-
require a non-None expert input — are bypassed entirely.
Inference now sees the same prefix structure as training: images +
lang + state, with new tokens appended to the lang region. The text
distribution matches what the model was trained to produce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's image preprocessor sizes frames to whatever the action
expert was trained on, but SmolVLM's standard vision tower expects
its own default tile grid (e.g. 384/14 → 27×27 patches). The
mismatch surfaces deep in the post-vision reshape as
``RuntimeError: shape '[2, 34, 34, 768]' is invalid for input of
size 1843200`` — the model has 1200 patches but expects 34×34=1156.
Drop ``pixel_values`` from ``vlm.generate(...)`` so SmolVLM runs as
a text-only LM at REPL time. The high-level branches (subtask /
plan / memory) are dominated by their text context anyway, so this
is acceptable for dry-run inference. VQA loses its image grounding
— that will be marked as expected for the dry-run path until a
follow-up either re-processes images through SmolVLM's own
``ImageProcessor`` to match its tile grid, or gives
``vlm_with_expert`` a real AR text decode mode that handles state
and image embeddings the way training does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled AR loop in ``select_message`` was fighting the
underlying ``vlm_with_expert.forward`` design, which assumes the
"prefix-once + suffix-always-via-expert" pattern that ``denoise_step``
uses for action chunks. Cross-attn layers (every other layer with
``attention_mode='cross_attn'`` + ``self_attn_every_n_layers=2``)
hard-require an expert input on every call: passing
``inputs_embeds=[current_embs, None]`` crashed at
``expert_layer.input_layernorm(None)`` with ``'NoneType' object has
no attribute 'dtype'``. Earlier KV-cache attempts ran into the
matching ``[15, 139] vs [15, 1]`` shape mismatch because the cache
gets *overwritten*, not appended, on each ``fill_kv_cache=True`` call
— there's just no AR-text-decode mode in this forward.
Stop fighting it: drive AR text generation through the underlying
SmolVLM via ``vlm.generate(input_ids=..., attention_mask=...,
pixel_values=...)``. KV caching, sampling/greedy, EOS handling all
come from HF's standard implementation. Trade-off: ``state`` drops
out of the prefix at inference (no slot for it on the standard
SmolVLM path), so high-level generations may drift from training
distribution slightly. That's acceptable for the dry-run REPL — the
high-level branches (subtask / plan / memory / vqa) are mostly
vision+language conditioned anyway, and the action expert (where
state actually matters) goes through the unchanged ``select_action``
path.
Image features the runtime merged in (``observation.images.*``) are
stacked into the ``[B, num_images, C, H, W]`` shape SmolVLM expects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's ``vlm_with_expert.forward`` doesn't actually support
incremental KV cache growth — its only ``fill_kv_cache=True`` mode
*overwrites* the cache with the latest call's key/value states, and
its only ``fill_kv_cache=False`` mode concatenates ``cache + new``
into a local ``key_states`` for one matmul without ever updating the
cache itself. The original ``select_message`` decode loop tried to
use ``fill_kv_cache=True`` per step, which clobbered the cache to
1 token after the first decode and threw
``Expected size for first two dimensions of batch2 tensor to be:
[15, 139] but got: [15, 1]`` — the attention mask still expected
139 keys but the cached + new key_states only had 1.
Match the pattern ``denoise_step`` already uses successfully:
maintain a cumulative ``(embs, pad, att)`` buffer that starts as the
prefix and grows by one bool/embedding row per step. Each step
forwards the *full* sequence with ``use_cache=False,
fill_kv_cache=False, past_key_values=None`` so the matmul shapes
always line up. Generated-token rows are tagged ``pad=1, att=1``
which makes them fully causal among themselves while still able to
attend back to the entire prefix (per ``make_att_2d_masks``
semantics: a token can attend to any earlier token whose cumulative
``att`` count is ≤ its own).
Image encoding is still done once via the initial ``embed_prefix``
call — the expensive part doesn't repeat. The remaining cost is
O(n²) text-only transformer forwards, which is fine for the dry-run
REPL's 50–100 token responses.
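The loop is roughly the following (a sketch with stand-in callables for
the full-sequence forward, token embedding, and LM head; the real code
reuses ``embed_prefix`` / ``make_att_2d_masks`` and the policy's own
weights):

    import torch

    def greedy_decode(forward_full, embed_token, lm_head,
                      prefix_embs, prefix_pad, prefix_att,
                      eos_id: int, max_new_tokens: int = 100):
        # Cumulative (embs, pad, att) buffer: start from the prefix, append
        # one row per generated token, and re-forward the whole sequence each
        # step (use_cache=False / fill_kv_cache=False in the real call).
        embs, pad, att = prefix_embs, prefix_pad, prefix_att
        out = []
        for _ in range(max_new_tokens):
            hidden = forward_full(embs, pad, att)            # O(n^2) text-only forward
            next_id = int(lm_head(hidden[:, -1]).argmax(dim=-1))
            if next_id == eos_id:
                break
            out.append(next_id)
            embs = torch.cat([embs, embed_token(next_id)], dim=1)   # [B, 1, D] row
            pad = torch.cat([pad, pad.new_ones(pad.shape[0], 1)], dim=1)
            att = torch.cat([att, att.new_ones(att.shape[0], 1)], dim=1)
        return out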
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>