lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-17 23:11:45 +00:00

Author	SHA1	Message	Date
Pepijn	34269a5d78	fix(pi052): register PaliGemma <loc> tokens so they tokenize as single ids THE bug behind the <loc>-salad. PaliGemma's vocab reserves ids [256000, 257023] for <locDDDD> detection / pointing tokens, but the stock AutoTokenizer does NOT match them on raw text — it BPE-splits <loc0162> into SEVEN pieces (<, loc, 0, 1, 6, 2, >). So a VQA target like "<loc0162><loc0759> green box<eos>" tokenized to 16 pieces, not 5, and training the LM head supervised those generic BPE pieces instead of one detection-vocab id. The piece logits got pumped up across ~25% of supervised positions; at inference they dominated every turn — even subtask prompts produced <loc>-salad followed by the actual answer. Register the 1024 <locDDDD> tokens via tokenizer.add_tokens once on load, in every path the policy uses: PI052TextTokenizerStep (training encode), _build_text_batch_pi052 (runtime encode), and select_message's default tokenizer (runtime decode). Verified empirically with the real PaliGemma tokenizer: VQA target now tokenizes to 5 ids matching the loc-vocab range (256162, 256759, ...) with correct offset_mapping. This unlocks PaliGemma's actual detection prior; <loc>-salad cannot recur because each <locDDDD> is a single class on the LM head, not a character sequence the head accidentally learns to extend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 11:41:41 +02:00
Pepijn	75507491bf	fix(pi052): VQA <loc> conversion treats coords as 0-1000 normalized Confirmed empirically on the published dataset: VQA bbox/keypoint coordinates are Qwen2.5-VL's 0–1000 normalized grounding output, NOT pixels. Scanning 8207 samples showed x and y both spanning 0..1000 with ~30% of values exceeding the camera's pixel dimensions (which is impossible if they were pixels). _vqa_answer_to_loc was dividing by the observation image's H/W, so e.g. point [742, 158] on a 640x480 wrist cam clamped x to <loc1023> (the far-right edge) instead of mapping to <loc0760> (~74% across). Fix: divide by 1000 — the actual Qwen scale. The conversion is now camera-resolution-independent, so _camera_image_shapes and the image_shapes plumbing through __call__ / _encode_messages / _messages_vqa_to_loc are dropped. Tests updated to the new signature and the 0–1000 round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:21:28 +02:00
Pepijn	88519cb14c	fix(pi052): quantile-normalize actions before FAST tokenizer fit base.fit() rejected the data with "Vocab size 1024 is too small for the range of tokens 9339": the FAST tokenizer was fit on raw motor-unit actions, whose DCT-token range vastly exceeds the 1024 codebook. Two problems, one fix. (1) Raw actions blow up the token range. (2) At training time ActionTokenizerProcessorStep runs after the QUANTILES NormalizerProcessorStep, so it encodes normalized actions — fitting on raw actions mismatches that space. Replicate QUANTILES normalization (per-dim [q01,q99] -> [-1,1], clipped) before base.fit() so the fit and the training-time encode see the same distribution and the token range fits the codebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:02:20 +02:00
Pepijn	bc0c993b25	fix(pi052): FAST tokenizer fit read actions from column, not ds[i] fit_fast_tokenizer collected action chunks via ds[i]["action"], which builds a full training item — delta-timestamp expansion, video decode, image transforms. A single video-decode failure threw, was swallowed at debug level, and silently starved the fit of every chunk → "FAST fit collected zero action chunks", falling back to the universal tokenizer. Read the ``action`` column straight from the HF dataset instead: it carries no video, so it is immune to decode errors and far faster. Also fail fast with a clear message when the dataset has no ``action`` feature or all episodes are shorter than chunk_size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:51:53 +02:00
Pepijn	ddf4bc2063	fix(pi052): knowledge insulation crashed on wrong _gated_residual import _compute_layer_ki called modeling_gemma._gated_residual, but that adaRMSNorm gated-residual helper is a lerobot helper in pi_gemma, not part of HF transformers — so enabling knowledge_insulation crashed with AttributeError on the first training step. Import _gated_residual from pi_gemma, matching pi05's own layer code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:48:02 +02:00
Pepijn	b7317b6c29	test(pi052): round-trip coverage for VQA <loc> conversion Pins JSON pixel coords -> PaliGemma <loc> -> runtime parse back: the conversion preserves coordinate order (JSON x-first, <loc> y-first) and per-axis normalization, losing only <loc>-grid quantization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:24:24 +02:00
Pepijn	c026aed8f8	feat(pi052): train VQA spatial answers in PaliGemma <loc> format Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate JSON, which fights PaliGemma's detection prior and leaks <loc>-token salad at inference. Convert them to PaliGemma's native <locNNNN> vocabulary instead so the LM head reuses that prior. Training side (text_processor_pi052.py): a target turn whose content parses as a bbox/keypoint answer is rewritten to <loc> text, using the camera frame's native (H, W) from the observation and the preceding image block. Non-spatial answers, subtask/memory targets and SmolVLA2 keep their JSON form — the dataset stays backbone-agnostic. Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects <loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized [0,1] coords with a normalized flag; draw_vqa_overlay denormalizes against the chosen camera frame's pixel size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 20:23:46 +02:00
pepijn	e425dfd624	fix(processor): fallback to task message when recipe misses Keep action-only samples trainable by rendering the task as a low-level user message when no recipe branch matches. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 15:32:09 +00:00
Pepijn	15f79b5e5e	fix(pi052): supervise an EOS token at the end of each text target PI052TextTokenizerStep masked text_labels over the assistant turn's content only — the trailing newline was excluded and no EOS token was ever a supervised label. So the LM head was never given a stop signal: at inference select_message decoded to max_new_tokens, producing the runaway subtask paragraphs and the "}"}"}-style VQA tails. _format_messages now appends the tokenizer's EOS to each supervised target turn and extends that turn's span to cover it, so the EOS lands in text_labels. _shifted_ce then trains "<last content token> -> EOS" and the model learns to terminate; select_message stops on it. Inference callers (the runtime's _build_text_batch_pi052) pass no target_indices / eos_token, so no EOS is baked into the prompt — the model generates it. Verified end-to-end with the PaliGemma tokenizer: the supervised span is `<content><eos>` and the trailing newline stays unsupervised. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:22:22 +02:00
Pepijn	725ac95b0d	feat(runtime): make the interactive runtime drive PI052 too The runtime's text path was hard-wired to SmolVLA2: _build_text_batch read policy.config.vlm_model_name (which PI052Config doesn't have) and built a SmolVLM2 chat-template prompt. PI052/PaliGemma is not chat-pretrained and trains on a flat `User: ... \nAssistant: ...` prompt, so the runtime crashed or fed an out-of-distribution prefix. - _build_text_batch now dispatches on policy.config.type: smolvla2 -> chat template (renamed _build_text_batch_chat); pi052 -> flat role-prefixed text via PI052TextTokenizerStep's own _format_messages / _strip_blocks / _flatten_say_tool_calls, so the inference prefix matches PI052 training exactly. - Add a lerobot-pi052-runtime entry point (alias of the same main; the policy type is read from the checkpoint) so the command name isn't misleading. argparse prog now defaults to the invoked command name. PI052's select_message / predict_action_chunk already work with the runtime; this was the one SmolVLA2-only coupling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:28:55 +02:00
Pepijn	7b64e5498d	revert(annotate): move memory + speech prompts to base PR (#3471) The first-person memory narrative, task-rephrasing and initial-speech prompt tweaks belong in the annotation pipeline itself. Applied to feat/language-annotation-pipeline (#3471); reverting them here to the merge-base so they drop out of this PR's diff. general_vqa.py keeps its docstring fix since it references a recipe this PR introduces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:17:52 +02:00
Pepijn	182f10184f	revert(annotate): move pipeline changes to base PR (#3471) The deterministic-plan rewrite, single-frame VQA (K 3->1), dataset version tagging, telegraphic-subtask prompt and shorter interjection prompt belong in the annotation pipeline itself, not in the SmolVLA training PR. They have been applied to feat/language-annotation-pipeline (#3471). Reverting these six files here to the merge-base so they drop out of this PR's diff; #3491 will inherit the canonical versions when it next rebases on its base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:07:23 +02:00
pepijn	bb31988915	fix(pi052): pass 4d masks to prefix-only forwards Convert PI052 prefix-only attention masks before calling PaliGemma so text-only batches and generation use the same mask shape as fused training. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 21:07:13 +00:00
pepijn	2629175d2d	fix(pi05): use fused AdamW by default Route full PI05/PI052 fine-tuning through PyTorch's fused AdamW path to avoid the single-tensor Adam denominator allocation near GPU memory limits. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 19:23:17 +00:00
pepijn	2b4c5f49e3	fix(pi05): disable foreach AdamW by default Avoid the multi-tensor AdamW temporary that can OOM full PI05/PI052 fine-tuning near GPU memory limits. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 18:58:17 +00:00
pepijn	22c9c4905e	fix(pi052): avoid dense CE over padded tokens Select only supervised text and FAST action-code positions before cross-entropy to avoid full-vocabulary loss tensors over padded sequences. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 18:40:34 +00:00
pepijn	7960cc14ec	fix(pi052): call policy preprocessing helpers Use PI05Policy helpers for action padding and image preprocessing in PI052 fused losses instead of looking them up on the inner PI05Pytorch module. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 17:52:47 +00:00
pepijn	1750a87104	fix(pi052): handle batched rendered messages Tokenize batched recipe outputs in PI052 so training batches with nested message lists do not crash before model forward. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 17:41:58 +00:00
pepijn	0e2dc1b76f	fix(pi052): supervise only FAST action-code tokens Mask the FAST auxiliary loss to discrete action-code tokens so wrapper formatting tokens do not affect action co-training. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 17:38:34 +00:00
Pepijn	474c5478d9	tune(annotations): VQA emission anchors a single frame (K 3 -> 1) Module 3 anchored each VQA emission tick to K=3 consecutive frames (~0.1s at 30fps). The VLM grounds the answer — bbox/keypoint coordinates especially — against the first frame's image, so copying it onto frames 2-3 smears a stale label over a moving scene. Default K=1: a VQA pair lands on exactly its emission frame, no temporal smear. VQA frames get sparser; the WeightedEpisodeAwareSampler (vqa_target_fraction) is the knob to compensate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:24:36 +02:00
Pepijn	0f5f0e4091	refactor(recipes): rename recipes, drop pi05_hirobot - hirobot.yaml -> subtasks_vqa.yaml - hirobot_memory.yaml -> subtask_mem_vqa_speech.yaml - pi05_hirobot.yaml -> deleted (stale: uses plan, top-camera names; superseded by the two recipes above) - smolvla2_hirobot.yaml -> deleted (was untracked stale junk) Updated the smolvla2 / pi052 `recipe_path` config defaults, all docstring / comment references, the annotation-pipeline + recipe docs, and the three tests that loaded pi05_hirobot.yaml (repointed to the renamed recipes; the low-level-branch and pipeline-render assertions now accept a flow-only `low_level` stream as valid supervision, since the new recipes' low_level_execution has no text-CE target). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:15 +02:00
Pepijn	426d48dbbf	fix(pi052): port the smolvla2 text-head fixes to pi052 pi052 had the same text-CE collapse bug smolvla2 had — PaliGemma's embed_prefix flags the language block att=0, so make_att_2d_masks makes it fully bidirectional and the text cross-entropy degenerates into a copy task. Ported the three model-specific fixes: - _mark_target_span_causal: set att=1 on supervised target language positions so the text-CE is genuine causal next-token prediction. Applied in both _compute_all_losses_fused and _compute_text_and_fast_loss. - flow_loss_weight 10.0 -> 5.0: the paper's a=10 swamps the LM head once the flow-only low_level recipe fires often (matches SmolVLA2Config). - _flatten_say_tool_calls in the text tokenizer: serialize `say` tool calls into a <say>...</say> marker so the spoken reply is tokenized and supervised (PaliGemma's flat prompt has no structured calls, so they were dropped entirely). select_message needed no change: pi052's prefix is [images, language] with no trailing state token, so it already decodes from the last language token. Regression tests mirror the smolvla2 attention-masking + tool-call suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:42:19 +02:00
Pepijn	fbcb9225f5	feat: oversample sparse VQA annotations (recipe consumption + weighted sampler) VQA annotations are sparse, so VQA was badly underrepresented in training: its effective share was weight x density, and blend draws that picked an ask_vqa* sub-recipe for a non-VQA frame were wasted entirely. Two pieces: 1. Recipe-side consumption (language_render.py): render_sample now routes any frame that carries a VQA annotation to a matching ask_vqa* sub-recipe, regardless of the weighted blend draw. No VQA annotation is wasted and no draw lands on a non-renderable VQA recipe — VQA's recipe-side share now equals the VQA-annotation density. 2. Dataset-side oversampling (WeightedEpisodeAwareSampler + vqa_target_fraction): a new weighted, episode-aware sampler draws frames with replacement by per-frame weight. When TrainPipelineConfig.vqa_target_fraction is set, the train script scans language_events, weights VQA frames so they make up ~that fraction of the training stream, and uses the weighted sampler. This is what actually lets VQA exceed its natural density. Default None keeps uniform episode-aware sampling unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:30:00 +02:00
Pepijn	b319ccf688	fix(smolvla2): only prompt for a camera when a VQA overlay is drawn The VLM already sees every camera, so the operator never needs to name one to ask a question. Move the camera prompt to after generation and only fire it when the answer actually carries a bounding box / point (whose pixel coordinates are camera-specific and need a target frame). Non-spatial answers (count / attribute / spatial / plain text) now skip the prompt entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:50:19 +02:00
Pepijn	3174e14bc0	fix(smolvla2): feed all cameras to VQA generation, not just the chosen one handle_vqa_query filtered the observation down to the single chosen camera before calling the VLM. But training feeds every camera: the ask_vqa_* recipes' image blocks are stripped before tokenization and the frames reach the model via OBS_IMAGES_*, where embed_prefix consumes all config.image_features regardless of the per-camera recipe tag. Filtering to one camera changed the image-token count in the prefix (the dropped camera zero-padded with mask=0) — a prefix shape the model never saw at training. Now the full observation is passed to select_message; the chosen camera is used only to pick which frame the bbox/point overlay is drawn on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:38 +02:00
Pepijn	dc530e10fe	feat(smolvla2): VQA example prompts in the panel; drop quotes from hints Command arguments never needed quotes (`_strip_quotes` only strips a matching pair if present) — `/question point to the yellow cube` works. The hints wrongly implied `""` were required; all hints/help now show `/action <task>` / `/question <text>`. Also adds a reference line to the state panel showing the two overlay-producing VQA prompt shapes: /question point to the yellow cube -> point overlay /question detect the blue cube -> bounding-box overlay plus the same examples in /help. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:42:32 +02:00
Pepijn	e7c5613a39	refactor(smolvla2): command-driven runtime, no startup prompts Replace the startup mode prompt + task picker with a single command-driven prompt. The runtime now comes up immediately at the command line in `paused` mode (robot idle) and the operator drives it: /action "task" run the robot on a task (bare = resume, number = timed burst) /pause stop the action loop; robot holds position /question "..." pause and answer one VQA question (camera prompt + overlay) /help / stop - Removed _select_mode_interactively / _select_task_interactively / _dataset_task_strings (the interactive pickers). - mode value renamed "question" -> "paused"; --mode choices are now action|paused (default paused). - /question takes the question inline and runs it via _handle_slash_command (pauses first, so the policy isn't used concurrently). - The ENTER-to-start gate only fires when starting in action mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:37:51 +02:00
Pepijn	516ffc7687	feat(smolvla2): --mode flag, skip task picker with --task, timed /action Lets the operator skip the interactive startup entirely and go straight to the command line: - New --mode {action,question} arg; when given, the startup mode prompt is skipped. - When --task is passed explicitly on the CLI, the startup task picker is skipped (the dataset-bootstrap task still shows the picker so you can override it). Also adds a timed action burst: /action <seconds> runs the robot for N seconds, then the autonomous loop auto-reverts to question mode and clears the action queue. Plain /action stays unlimited. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:26:12 +02:00
Pepijn	7a68bf13d9	feat(recipes): add hirobot_memory — hirobot + memory + spoken tool-call replies New recipe alongside hirobot.yaml (kept as the lean baseline). Superset that adds two text-supervised sub-recipes: - memory_update: compress progress into a memory note. - user_interjection_response: reply to a user interjection with a `say` tool call only (no plan/subtask text). The SmolVLA2 chat tokenizer flattens the call to a `<say>...</say>` marker the runtime parses back. Plan is intentionally omitted; memory is the only persistent high-level state. Weights: low_level 0.40, subtask 0.25, memory 0.10, interjection 0.10, vqa 0.075 x2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:21:41 +02:00
Pepijn	15229468d0	feat(smolvla2): startup mode prompt; rename /vlm mode to /question Add a mode prompt at startup, shown before the task picker, so the operator chooses action (run the robot) vs question (VQA only) up front instead of having to discover /vlm mid-run. Also rename the VQA mode from "vlm" to the clearer "question": - state["mode"] value is now "action" | "question" - the command is /question (/vlm and /vqa kept as aliases) - panels, hints and help text updated to match handle_vqa_query now reports via both push_log and direct stdout, so VQA answers / overlay paths are visible in autonomous question mode where the panel redraw is suspended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:17:03 +02:00
Pepijn	a9cea3e8dd	fix(smolvla2): make the autonomous REPL usable for slash commands / VQA The autonomous panel redraw cleared the screen every 0.5s, so the "> " prompt and the one-shot command hint vanished — the operator could not see what to type or what they were typing, making /vlm unreachable. - Suspend the timer redraw entirely while in /vlm mode (the action loop is paused, nothing changes in the background) so the VQA question and camera prompt stay on a stable screen. - Re-print the "> " prompt after each redraw so it is always visible. - Show an always-on command hint in the panel (/vlm, /help, /action) instead of relying on the startup line that scrolls away. - Redraw immediately after a slash command so the mode flip is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:10:13 +02:00
Pepijn	89d4846590	fix(smolvla2): always show the startup task picker on a TTY The picker was skipped whenever a task was already resolved — which is always the case with --dataset.repo_id, since the dataset's canonical task is auto-filled. The operator never got to choose. Now the picker always runs on an interactive terminal: the resolved task is shown as "(current)" and selected by an empty Enter, so the dataset-canonical default still works while letting the operator pick another task or type a custom one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:04:53 +02:00
Pepijn	26cb38a7d0	feat(smolvla2): startup task picker, /vlm mode toggle, interactive VQA overlay Three additions to the SmolVLA2 interactive runtime: 1. Startup task picker — when no --task is given, the runtime lists the dataset's task strings as a numbered menu (plus a custom-task option) instead of silently waiting for the first stdin line. 2. Mode toggle — /action and /vlm slash commands flip a persistent run mode. /vlm pauses the whole action loop (HighLevelSubtaskFwd, LowLevelForward and DispatchAction gate on state["mode"]) and clears the action queue so the robot holds position; /action resumes it. The mode is shown in the state panel. 3. Interactive VQA — in /vlm mode a typed line is a VQA question. The new inference/vqa.py module asks which camera to ground on, runs the VLM on that single camera, and when the answer is a bbox/keypoint it draws the overlay, saves a PNG to ./vqa_overlays/ and auto-opens it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:20:57 +02:00
Pepijn	bfb8cfb432	fix(smolvla2): flatten say tool_calls into <say> marker before tokenizing The chat tokenizer passed assistant `tool_calls` straight to `apply_chat_template`, which renders them as a structured JSON `<tool_call>` block — so the LM head was trained to emit JSON. But the inference parser `_split_plan_and_say` looks for a `<say>...</say>` marker, which the model never saw in training, so the `say` tool never fired at inference. `_flatten_say_tool_calls` is the missing training-time serializer (the one `_split_plan_and_say`'s docstring already assumed existed): it rewrites a `say` tool call into a `<say>...</say>` marker inside the content text before the chat template runs, so the template only tokenizes plain text and the supervised target span trains the model to emit exactly the marker the runtime parses back (Pi 0.5-style flat tool-call serialization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:47:31 +02:00
Pepijn	5e3b9ba82c	tune(smolvla2): override optimizer_lr to 2.5e-5 for pretrained-LM fine-tuning SmolVLA's 1e-4 is safe only because it freezes the language head. SmolVLA2 unfreezes lm_head + the last text layer and fine-tunes the pretrained SmolVLM2 language weights; 1e-4 is too aggressive there and destabilises generation into degenerate repetition. Match pi05's 2.5e-5 peak LR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:41:13 +02:00
Pepijn	083d3cd419	tune(smolvla2): soften flow:text loss split from 10:1 to 5:1 The Pi 0.5 α=10 split assumed text is a rare auxiliary task. With the flow-only `low_level` recipe (~40% of the blend) now rendering, the flow term fires often and at 10x weight dominates the shared VLM backbone, starving the text head into degenerate repetition decoding. A 5:1 split keeps actions primary while leaving the language head enough gradient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 16:00:08 +02:00
Pepijn	bf996c7938	fix(datasets): render flow-only low_level recipes instead of dropping them A recipe whose only supervision is the action-expert flow loss (e.g. `low_level_execution`: `user(${subtask})` with `stream: low_level` and no `target` turn) was rejected at render time by `_render_message_recipe` and `_validate_rendered`, both of which required at least one target turn. The result: every blend draw of the flow-only recipe rendered to `None`, `predict_actions` was never set, `run_flow` never fired, and the action expert received no flow loss — leaving it at random init. Both gates now also accept a `low_level`-stream turn as valid supervision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 13:20:39 +02:00
Pepijn	0d88eaf8eb	test(smolvla2): attention masking of the language target span Regression coverage for the text-CE collapse bug fixed in `3cd348ff`. Pure-function tests over ``_mark_target_span_causal`` / ``_locate_lang_range`` / ``make_att_2d_masks`` — no model load, fast. Pins: * the target span flips to att=1, prompt/images stay att=0; * target tokens attend causally among themselves (no peeking at future targets) — genuine next-token prediction; * targets still attend bidirectionally to images + the user prompt; * the action-expert (state) token still attends to every target; * a no-target subtask (low_level_execution user turn, labels all -100) leaves the mask bidirectional; * an explicit test documenting the bug: the raw embed_prefix mask lets the first target token see the last — the copy-task collapse. Skips cleanly when transformers isn't installed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 18:28:44 +02:00
Pepijn	3cd348ffe2	fix(smolvla2): causal mask on the text-CE target span (THE collapse bug) Root cause of every collapsed inference run. ``embed_prefix`` flags all language tokens ``att=0``; ``make_att_2d_masks`` turns that into a single fully BIDIRECTIONAL block. So during the text-loss forward, a supervised subtask token's hidden state attends to the very tokens it is trained to predict. The cross-entropy degenerates into a copy task: ``text_loss → ~3e-5`` not because the model learned to predict subtasks but because it can see the answer. At inference ``select_message`` decodes autoregressively (causally): each token must be predicted WITHOUT seeing it, a task the model was never actually trained on. Hence the universal collapse: a coherent first token or two ("grasp the yellow cube"), then a loop ("cover cover cover", "icatorsicators", "the the the"). Fix: ``_mark_target_span_causal`` sets ``att=1`` on the language positions that are supervised targets (``text_labels != -100``). With make_att_2d_masks's cumulative-block rule each target token then attends to images + the user prompt bidirectionally and to EARLIER target tokens only: genuine causal next-token prediction, matching select_message. Applied in both ``_compute_text_loss`` and ``_compute_fused_loss``. Per-sample correct: high_level_subtask targets become causal; low_level_execution subtasks (a user turn, labels all -100) stay bidirectional so the action expert reads them as bidirectional context. The action expert is otherwise unaffected: the suffix has a strictly higher cumsum and still attends to the whole prefix. Requires retraining: this changes the training objective. Existing checkpoints were all trained on the degenerate copy task and cannot generate text. Expect ``text_loss`` to settle MUCH higher than 3e-5 after this. That is correct; it is now a real prediction task. NOTE: pi052's text path (PaliGemma prefix-LM) has the same bidirectional-block structure and needs the analogous fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 18:24:44 +02:00
Pepijn	db03fc6dc4	fix(smolvla2): select_message must decode from the language position ``embed_prefix`` lays the prefix out as ``[images, lang, state]`` with the state token LAST. Training supervises the text head on the language positions (``_compute_text_loss`` / ``_compute_fused_loss`` slice ``prefix_out[lang_start:lang_end]`` and run lm_head there). But ``select_message`` started AR generation from the full prefix and read ``prefix_out[:, -1:]`` — the state token — to decode the first subtask token. The state token's hidden state exists for the action expert to read; the lm_head was never trained to produce subtask text from it. So inference decoded the high-level head from a position entirely outside the training distribution: the text head collapses (``the arm the arm``, ``grasp the surface population``, ``_333 absburg…``) no matter how cleanly ``text_loss`` converged. Fix: truncate the state token off the prefix before the AR loop, so ``prefix_out[:, -1:]`` is the last language token (right after the ``Assistant:`` generation prompt) — exactly where training supervised. Inference-only change — no retraining needed; existing checkpoints benefit immediately. The action path (``predict_action_chunk``) is untouched: state belongs in the action expert's prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 15:05:16 +02:00
Pepijn	56068d37ea	fix(smolvla2): default load_vlm_weights=True — don't train from scratch SmolVLAConfig defaults ``load_vlm_weights=False``. With that and no ``--policy.path``, ``SmolVLMWithExpert.__init__`` builds the VLM via ``SmolVLMForConditionalGeneration(config=...)`` — i.e. a fully random-initialised 500M backbone, including a random ``lm_head``. For plain SmolVLA that's a deliberate "pre-train the expert" mode. For SmolVLA2 it's a footgun: the high-level text head is the SmolVLM2 ``lm_head``. Training subtask prediction from a random language model can only memorise — which is exactly the repetition collapse seen on the real robot ("the arm the arm the arm …"). SmolVLA2 now defaults ``load_vlm_weights=True`` so every run fine-tunes the pretrained ``HuggingFaceTB/SmolVLM2-500M-Video-Instruct`` backbone (vision tower + language model + lm_head). The action expert still trains from scratch on the robot data (standard SmolVLA fine-tuning); start it from pretrained too by fine-tuning a full ``lerobot/smolvla_base`` checkpoint via ``--policy.path``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:44:00 +02:00
Pepijn	e727688052	annotate: telegraphic subtasks — ≤4 words, verb+object, consistent nouns Tighten the subtask prompt further per real-data feedback. The old ≤5-word cap still produced things like "release the yellow block into the green bin" (8 words, articles, destination, and "block" where the task said "cube"). New rules: * Hard cap ≤ 4 words, ideally 2-3. Form: VERB + (color) + OBJECT. * No articles, no destinations, no adverbs, no "robot/arm/gripper". * Must reuse the exact object nouns from the task — no block/cube, bin/box/container drift across the episode. * Concrete good/bad examples anchored on the cube task. Shorter, templated, consistent targets are far more robust for the autoregressive LM head — fewer tokens to drift on, fewer dominant n-grams to repetition-collapse into. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 14:14:42 +02:00
Pepijn	f1a0a663cc	fix(inference): gibberish detector catches long repetition collapse The ``_looks_like_gibberish`` low-unique-token check was gated on ``len(stripped) < 80``, so an LM head that loops an n-gram for the whole 256-token budget — "the arm the arm … the the the the" — sailed straight through (``gibberish:0`` in the panel) and the garbage subtask got accepted and fed to the action expert. Added a length-independent check: ``>= 8 tokens`` but unique-token count ``<= max(3, tokens // 10)`` ⇒ repetition collapse. Now the runtime rejects the looped output and keeps the previous (real) subtask instead of propagating nonsense. This is a guard, not a cure — the underlying issue is the LM head on the current checkpoint being undertrained / collapsed; re-annotate with the short prompts and train longer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:52:26 +02:00
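The length-independent check described in f1a0a663cc can be sketched as below. This is a minimal illustration, not the repository's ``_looks_like_gibberish``: the function names are hypothetical, and splitting on whitespace is an assumption about what the commit counts as a "token".

```python
from typing import Optional


def looks_like_repetition_collapse(text: str) -> bool:
    """Flag outputs with at least 8 tokens but very few distinct ones.

    Assumption: tokens are whitespace-separated words. The thresholds
    (>= 8 tokens, unique count <= max(3, tokens // 10)) come from the
    commit message; everything else here is illustrative.
    """
    tokens = text.split()
    if len(tokens) < 8:
        return False
    return len(set(tokens)) <= max(3, len(tokens) // 10)


def accept_subtask(candidate: str, previous: Optional[str]) -> Optional[str]:
    """Keep the previous subtask when the candidate looks collapsed."""
    if looks_like_repetition_collapse(candidate):
        return previous
    return candidate
```

Because the rule compares the unique-token count against a fraction of the total, it does not depend on the string length, which is what the old ``len(stripped) < 80`` gate missed.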
Pepijn	6e64c20cf1	runtime: stop seeding plan/memory from the dataset (unused) The current recipe trains neither plan nor memory, and no inference step consumes them — ``_msgs_for_subtask`` renders the bare task and ``LowLevelForward`` conditions on the subtask. Bootstrapping ``current_plan`` / ``current_memory`` from the dataset's ``language_persistent`` annotations therefore only placed a stale, do-nothing plan in the status panel. Keep seeding ``current_subtask`` — it's a useful first-frame fallback for ``LowLevelForward`` before ``HighLevelSubtaskFwd`` produces its first subtask. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:47:33 +02:00
Pepijn	b29cccb37e	runtime: restore the subtask hierarchy — generated subtask drives actions Reverts the previous "condition actions on the task" shortcut. The action expert is conditioned on the SUBTASK again: * ``low_level_execution`` recipe back to ``user(${subtask})``. * ``LowLevelForward`` conditions on ``current_subtask`` (falls back to the task only on the first frame, before the high-level loop has produced a subtask). * ``HighLevelSubtaskFwd`` re-added to the runtime pipeline so the subtask is actually generated each high-level tick and written to ``current_subtask`` before ``LowLevelForward`` consumes it. * ``_msgs_for_subtask`` now renders just ``${task}`` (no ``Plan: ``/``Memory: `` lines) to match the current ``high_level_subtask`` recipe, whose user turn is the bare task. So the loop is: task → HighLevelSubtaskFwd (LM head) → subtask → LowLevelForward → action chunk conditioned on that subtask. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:43:04 +02:00
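The control flow restored in b29cccb37e (task → high-level subtask → low-level conditioning, with a first-frame fallback to the task) might look roughly like the sketch below. ``RuntimeState``, ``high_level_tick`` and ``low_level_prompt`` are hypothetical stand-ins for the real ``HighLevelSubtaskFwd`` / ``LowLevelForward`` steps, and the generator callback is a placeholder for the LM head.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RuntimeState:
    """Hypothetical runtime state: the episode task plus the latest subtask."""
    task: str
    current_subtask: Optional[str] = None


def high_level_tick(state: RuntimeState, generate_subtask: Callable[[str], str]) -> None:
    """Generate a subtask from the bare task and store it on the state."""
    state.current_subtask = generate_subtask(state.task)


def low_level_prompt(state: RuntimeState) -> str:
    """Condition on the subtask; fall back to the task before the first tick."""
    return state.current_subtask or state.task
```

The ``or`` fallback only matters on the first frame, before any high-level tick has written ``current_subtask``.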
Pepijn	f161e27e96	recipe+runtime: condition the action expert on the task, not the subtask Real-robot runs shook and failed the task despite a low flow loss. Root cause: train/inference conditioning mismatch — not a flow-loss bug (``_compute_fused_loss``'s flow path is byte-identical to ``SmolVLAModel.forward``). At training, ``low_level_execution`` conditioned the action expert on ``${subtask}``, and every frame's subtask was the correct one for that frame. At inference the runtime has no high-level subtask generator (VQA-only pipeline), so ``current_subtask`` was frozen — the action expert got "move towards the blue cube" for the entire episode. Once the arm reached the cube, that (image, subtask) pair never occurred in training → OOD conditioning → incoherent flow output → shaking. Fix: ``low_level_execution`` now renders ``user(${task})``. The task is stable for the whole episode and always available, so the action expert's conditioning is identical at train and inference with no high-level loop required. ``LowLevelForward`` updated to build the same ``[user(task)]`` prompt. ``high_level_subtask`` still trains the text head to predict subtasks (kept for when a reliable subtask loop is reintroduced) — it's just no longer on the action expert's critical path. Requires re-training for the recipe change to take effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 13:40:15 +02:00
Pepijn	d5f293a1c9	recipe+runtime: VQA + subtask only — drop plan & memory Scope reduction while the core subtask + action loop is validated: Recipe (hirobot.yaml) * Removed ``plan_generation`` sub-recipe entirely. * Removed the memory tail from ``high_level_subtask`` (the ``new_memory`` binding + the second assistant turn). * ``high_level_subtask`` user turn is now just ``${task}`` — no ``Plan: …\nMemory: …`` context. * Weights rebalanced over the four remaining sub-recipes: high_level_subtask 0.40, low_level_execution 0.40, ask_vqa_top/wrist 0.10 each. Runtime (inference/runtime.py) * Pipeline trimmed to VQA + the action loop: AskVQAFwd → LowLevelForward → DispatchAction → DispatchToolCalls. * Dropped HighLevelSubtaskFwd / MemoryUpdateFwd / UserInterjectionFwd from the default pipeline. They remain importable from ``inference.steps`` for when plan/memory/subtask generation is brought back. The action expert conditions on the task string directly via LowLevelForward's ``current_subtask or task`` fallback. This commit lands on top of a rollback of the previous two commits (repetition_penalty / no_repeat_ngram_size knobs, and the deterministic plan-walker) — both were bandaids for the LM-head repetition collapse that the reduced-scope recipe sidesteps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 08:02:06 +02:00
Pepijn	95033733fc	deps: add sentencepiece to the pi extra (FAST action tokenizer) PI052 and PI0_FAST both load ``physical-intelligence/fast`` as their action tokenizer. That tokenizer's HF backend requires ``sentencepiece`` to instantiate (or ``tiktoken``); without it ``AutoProcessor.from_pretrained`` raises: ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece or tiktoken installed [...] It wasn't listed in pyproject so fresh installs missed it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:52:55 +02:00
Pepijn	c3503b774f	fix(debug): dumper now shows real stream + target flags The dumper was printing ``stream=None target=None`` for every message because it read those fields off the message dicts, but the recipe renderer keeps them in parallel arrays (``message_streams`` / ``target_message_indices`` in COMPLEMENTARY_DATA) so the chat template doesn't see unknown keys. Zip them back into the dump-time dicts so the printed metadata is accurate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:43:51 +02:00
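The fix in c3503b774f amounts to merging parallel arrays back into per-message dicts at dump time. A minimal sketch follows, assuming ``message_streams`` is a list aligned one-to-one with the messages and ``target_message_indices`` is a list of integer positions; both shapes are assumptions, and the helper name is hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence


def annotate_messages_for_dump(
    messages: Sequence[Dict[str, Any]],
    message_streams: Sequence[Optional[str]],
    target_message_indices: Sequence[int],
) -> List[Dict[str, Any]]:
    """Return copies of the messages with ``stream`` and ``target`` attached.

    The input dicts are left untouched so the chat template still sees
    only the keys it knows about.
    """
    if len(messages) != len(message_streams):
        raise ValueError("messages and message_streams must be the same length")
    targets = set(target_message_indices)
    return [
        {**msg, "stream": stream, "target": idx in targets}
        for idx, (msg, stream) in enumerate(zip(messages, message_streams))
    ]
```

Building new dicts rather than mutating the originals keeps the debug metadata out of whatever is later passed to the chat template.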
Pepijn	99ebee4d16	annotate: tighter subtask + memory prompts (≤5 / ≤10 words) Both feed into the high-level prompt and the plan rendering, so keeping them short directly reduces the rendered ``${task}\nPlan: …\nMemory: …`` prefix the model has to chew through at inference. Subtasks * Hard cap: ≤ 5 words. Verb + object only, drop articles/adverbs. * Concrete good/bad examples to anchor the VLM. Memory * Hard cap: ≤ 10 words. Telegraphic noun→location fragments ("bowl in box, lid open"), no past-tense verbs, drop attributes that don't matter for downstream subtasks. * Allow empty string when no material change occurred — keeps the rendered memory line literally blank instead of forcing a no-op sentence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:28:09 +02:00


1608 Commits