Use PI05Policy helpers for action padding and image preprocessing in PI052 fused losses instead of looking them up on the inner PI05Pytorch module.
Co-authored-by: Cursor <cursoragent@cursor.com>
Tokenize batched recipe outputs in PI052 so training batches with nested message lists do not crash before model forward.
Co-authored-by: Cursor <cursoragent@cursor.com>
Mask the FAST auxiliary loss to discrete action-code tokens so wrapper formatting tokens do not affect action co-training.
Co-authored-by: Cursor <cursoragent@cursor.com>
Module 3 anchored each VQA emission tick to K=3 consecutive frames
(~0.1s at 30fps). The VLM grounds the answer — bbox/keypoint
coordinates especially — against the first frame's image, so copying it
onto frames 2-3 smears a stale label over a moving scene.
Default K=1: a VQA pair lands on exactly its emission frame, no
temporal smear. VQA frames get sparser; the WeightedEpisodeAwareSampler
(vqa_target_fraction) is the knob to compensate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- hirobot.yaml -> subtasks_vqa.yaml
- hirobot_memory.yaml -> subtask_mem_vqa_speech.yaml
- pi05_hirobot.yaml -> deleted (stale: uses plan, top-camera names;
superseded by the two recipes above)
- smolvla2_hirobot.yaml -> deleted (was untracked stale junk)
Updated the smolvla2 / pi052 `recipe_path` config defaults, all
docstring / comment references, the annotation-pipeline + recipe docs,
and the three tests that loaded pi05_hirobot.yaml (repointed to the
renamed recipes; the low-level-branch and pipeline-render assertions
now accept a flow-only `low_level` stream as valid supervision, since
the new recipes' low_level_execution has no text-CE target).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pi052 had the same text-CE collapse bug smolvla2 had — PaliGemma's
embed_prefix flags the language block att=0, so make_att_2d_masks makes
it fully bidirectional and the text cross-entropy degenerates into a
copy task. Ported the three model-specific fixes:
- _mark_target_span_causal: set att=1 on supervised target language
positions so the text-CE is genuine causal next-token prediction.
Applied in both _compute_all_losses_fused and _compute_text_and_fast_loss.
- flow_loss_weight 10.0 -> 5.0: the paper's α=10 swamps the LM head once
the flow-only low_level recipe fires often (matches SmolVLA2Config).
- _flatten_say_tool_calls in the text tokenizer: serialize `say` tool
calls into a <say>...</say> marker so the spoken reply is tokenized
and supervised (PaliGemma's flat prompt has no structured calls, so
they were dropped entirely).
select_message needed no change: pi052's prefix is [images, language]
with no trailing state token, so it already decodes from the last
language token.
Regression tests mirror the smolvla2 attention-masking + tool-call suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VQA annotations are sparse, so VQA was badly underrepresented in training:
its effective share was weight x density, and blend draws that picked an
ask_vqa* sub-recipe for a non-VQA frame were wasted entirely.
Two pieces:
1. Recipe-side consumption (language_render.py): render_sample now routes
any frame that carries a VQA annotation to a matching ask_vqa* sub-recipe,
regardless of the weighted blend draw. No VQA annotation is wasted and no
draw lands on a non-renderable VQA recipe — VQA's recipe-side share now
equals the VQA-annotation density.
2. Dataset-side oversampling (WeightedEpisodeAwareSampler + vqa_target_fraction):
a new weighted, episode-aware sampler draws frames with replacement by
per-frame weight. When TrainPipelineConfig.vqa_target_fraction is set, the
train script scans language_events, weights VQA frames so they make up
~that fraction of the training stream, and uses the weighted sampler. This
is what actually lets VQA exceed its natural density. Default None keeps
uniform episode-aware sampling unchanged.
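
A minimal sketch of the weighting arithmetic behind vqa_target_fraction, not the actual sampler code. The helper name `vqa_frame_weights` and the boolean `is_vqa` list are hypothetical; the only thing taken from the description above is "weight VQA frames so they make up ~that fraction of a with-replacement stream".

```python
import random


def vqa_frame_weights(is_vqa, target_fraction):
    """Per-frame weights so VQA frames make up ~target_fraction of weighted draws.

    Hypothetical helper: non-VQA frames keep weight 1.0, VQA frames share one weight w.
    """
    n_vqa = sum(is_vqa)
    n_other = len(is_vqa) - n_vqa
    if n_vqa == 0 or n_other == 0 or not 0.0 < target_fraction < 1.0:
        return [1.0] * len(is_vqa)  # nothing to rebalance: fall back to uniform
    # Solve  w * n_vqa / (w * n_vqa + n_other) = target_fraction  for w.
    w = target_fraction * n_other / ((1.0 - target_fraction) * n_vqa)
    return [w if flag else 1.0 for flag in is_vqa]


# 5 VQA frames out of 100 (natural density 5%), target share 30%.
is_vqa = [i % 20 == 0 for i in range(100)]
weights = vqa_frame_weights(is_vqa, target_fraction=0.3)
expected_share = sum(w for w, f in zip(weights, is_vqa) if f) / sum(weights)

# Draw with replacement by weight, as the commit describes for the sampler.
rng = random.Random(0)
draws = rng.choices(range(len(is_vqa)), weights=weights, k=10_000)
empirical_share = sum(is_vqa[i] for i in draws) / len(draws)
```

Episode awareness (restricting draws to valid episode ranges) is left out here; the sketch only shows why a single per-frame weight is enough to push VQA above its natural density.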
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The VLM already sees every camera, so the operator never needs to name
one to ask a question. Move the camera prompt to after generation and
only fire it when the answer actually carries a bounding box / point
(whose pixel coordinates are camera-specific and need a target frame).
Answers without pixel coordinates (count / attribute / spatial / plain text)
now skip the prompt entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handle_vqa_query filtered the observation down to the single chosen
camera before calling the VLM. But training feeds every camera: the
ask_vqa_* recipes' image blocks are stripped before tokenization and
the frames reach the model via OBS_IMAGES_*, where embed_prefix
consumes all config.image_features regardless of the per-camera recipe
tag. Filtering to one camera changed the image-token count in the
prefix (the dropped camera zero-padded with mask=0) — a prefix shape
the model never saw at training.
Now the full observation is passed to select_message; the chosen
camera is used only to pick which frame the bbox/point overlay is
drawn on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Command arguments never needed quotes (`_strip_quotes` only strips a
matching pair if present) — `/question point to the yellow cube` works.
The hints wrongly implied `""` were required; all hints/help now show
`/action <task>` / `/question <text>`.
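
Sketch of the quote handling described above, assuming a helper shaped like `_strip_quotes`. The real function body is not shown in this log; accepting both single and double quotes is an assumption.

```python
def strip_quotes(arg: str) -> str:
    """Strip one matching pair of surrounding quotes, if present.

    Assumption: both ' and " count as quotes; unbalanced quotes are left alone.
    """
    arg = arg.strip()
    if len(arg) >= 2 and arg[0] == arg[-1] and arg[0] in ("'", '"'):
        return arg[1:-1]
    return arg


bare = strip_quotes("point to the yellow cube")
quoted = strip_quotes('"point to the yellow cube"')
unbalanced = strip_quotes('"point to the yellow cube')
```

Because only a matching pair is removed, the quoted and unquoted forms of a command end up as the same argument string.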
Also adds a reference line to the state panel showing the two
overlay-producing VQA prompt shapes:
/question point to the yellow cube -> point overlay
/question detect the blue cube -> bounding-box overlay
plus the same examples in /help.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the startup mode prompt + task picker with a single
command-driven prompt. The runtime now comes up immediately at the
command line in `paused` mode (robot idle) and the operator drives it:
/action "task" run the robot on a task (bare = resume, number = timed burst)
/pause stop the action loop — robot holds position
/question "..." pause and answer one VQA question (camera prompt + overlay)
/help / stop
- Removed _select_mode_interactively / _select_task_interactively /
_dataset_task_strings (the interactive pickers).
- mode value renamed "question" -> "paused"; --mode choices are now
action|paused (default paused).
- /question takes the question inline and runs it via _handle_slash_command
(pauses first, so the policy isn't used concurrently).
- The ENTER-to-start gate only fires when starting in action mode.
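
A rough sketch of how the three `/action` forms listed above (bare, number, task text) could be told apart. `parse_action_arg` is a hypothetical name, not the runtime's actual dispatcher, and quote stripping is omitted.

```python
import math


def parse_action_arg(arg: str):
    """Classify the text after /action (hypothetical helper).

    ""        -> ("resume", None)     bare /action resumes
    "10"      -> ("burst", 10.0)      a positive number is a timed burst in seconds
    otherwise -> ("task", text)       anything else is treated as a task string
    """
    arg = arg.strip()
    if not arg:
        return ("resume", None)
    try:
        seconds = float(arg)
    except ValueError:
        return ("task", arg)
    # Assumption: non-finite or non-positive numbers are not valid burst lengths.
    if math.isfinite(seconds) and seconds > 0:
        return ("burst", seconds)
    return ("task", arg)


resume = parse_action_arg("")
burst = parse_action_arg("10")
task = parse_action_arg("pick up the yellow cube")
```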
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the operator skip the interactive startup entirely and go straight
to the command line:
- New --mode {action,question} arg; when given, the startup mode prompt
is skipped.
- When --task is passed explicitly on the CLI, the startup task picker
is skipped (the dataset-bootstrap task still shows the picker so you
can override it).
Also adds a timed action burst: /action <seconds> runs the robot for N
seconds, then the autonomous loop auto-reverts to question mode and
clears the action queue. Plain /action stays unlimited.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New recipe alongside hirobot.yaml (kept as the lean baseline). Superset
that adds two text-supervised sub-recipes:
- memory_update: compress progress into a memory note.
- user_interjection_response: reply to a user interjection with a `say`
tool call only (no plan/subtask text). The SmolVLA2 chat tokenizer
flattens the call to a `<say>...</say>` marker the runtime parses back.
Plan is intentionally omitted; memory is the only persistent high-level
state. Weights: low_level 0.40, subtask 0.25, memory 0.10, interjection
0.10, vqa 0.075 x2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a mode prompt at startup, shown before the task picker, so the
operator chooses action (run the robot) vs question (VQA only) up front
instead of having to discover /vlm mid-run.
Also rename the VQA mode from "vlm" to the clearer "question":
- state["mode"] value is now "action" | "question"
- the command is /question (/vlm and /vqa kept as aliases)
- panels, hints and help text updated to match
handle_vqa_query now reports via both push_log and direct stdout, so
VQA answers / overlay paths are visible in autonomous question mode
where the panel redraw is suspended.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autonomous panel redraw cleared the screen every 0.5s, so the "> "
prompt and the one-shot command hint vanished — the operator could not
see what to type or what they were typing, making /vlm unreachable.
- Suspend the timer redraw entirely while in /vlm mode (the action loop
is paused, nothing changes in the background) so the VQA question and
camera prompt stay on a stable screen.
- Re-print the "> " prompt after each redraw so it is always visible.
- Show an always-on command hint in the panel (/vlm, /help, /action)
instead of relying on the startup line that scrolls away.
- Redraw immediately after a slash command so the mode flip is visible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The startup task picker was skipped whenever a task was already resolved, which is
always the case with --dataset.repo_id, since the dataset's canonical
task is auto-filled. The operator never got to choose. Now the picker
always runs on an interactive terminal: the resolved task is shown as
"(current)" and selected by an empty Enter, so the dataset-canonical
default still works while letting the operator pick another task or
type a custom one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions to the SmolVLA2 interactive runtime:
1. Startup task picker — when no --task is given, the runtime lists the
dataset's task strings as a numbered menu (plus a custom-task option)
instead of silently waiting for the first stdin line.
2. Mode toggle — /action and /vlm slash commands flip a persistent run
mode. /vlm pauses the whole action loop (HighLevelSubtaskFwd,
LowLevelForward and DispatchAction gate on state["mode"]) and clears
the action queue so the robot holds position; /action resumes it.
The mode is shown in the state panel.
3. Interactive VQA — in /vlm mode a typed line is a VQA question. The
new inference/vqa.py module asks which camera to ground on, runs the
VLM on that single camera, and when the answer is a bbox/keypoint it
draws the overlay, saves a PNG to ./vqa_overlays/ and auto-opens it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat tokenizer passed assistant `tool_calls` straight to
`apply_chat_template`, which renders them as a structured JSON
`<tool_call>` block — so the LM head was trained to emit JSON. But the
inference parser `_split_plan_and_say` looks for a `<say>...</say>`
marker, which the model never saw in training, so the `say` tool never
fired at inference.
`_flatten_say_tool_calls` is the missing training-time serializer (the
one `_split_plan_and_say`'s docstring already assumed existed): it
rewrites a `say` tool call into a `<say>...</say>` marker inside the
content text before the chat template runs, so the template only
tokenizes plain text and the supervised target span trains the model to
emit exactly the marker the runtime parses back (Pi 0.5-style flat
tool-call serialization).
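
Illustrative sketch of the round trip described above: flatten a `say` tool call into a `<say>...</say>` marker at training time, then parse it back at inference. The message dict shape (`function.name`, `function.arguments`) and the `text` argument key are assumptions; the real `_flatten_say_tool_calls` / `_split_plan_and_say` may differ.

```python
import re

SAY_RE = re.compile(r"<say>(.*?)</say>", re.DOTALL)


def flatten_say_tool_calls(message: dict) -> dict:
    """Return a copy of an assistant message with `say` calls inlined as markers."""
    parts = [message.get("content") or ""]
    kept = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        if fn.get("name") == "say":
            # Assumed argument key "text"; hypothetical, not confirmed by the source.
            parts.append(f"<say>{fn.get('arguments', {}).get('text', '')}</say>")
        else:
            kept.append(call)  # leave any other tool call structured
    flat = {k: v for k, v in message.items() if k != "tool_calls"}
    flat["content"] = "".join(parts)
    if kept:
        flat["tool_calls"] = kept
    return flat


def split_text_and_say(generated: str):
    """Inference side: split decoded text into (plain text, spoken replies)."""
    spoken = SAY_RE.findall(generated)
    plain = SAY_RE.sub("", generated).strip()
    return plain, spoken


msg = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"function": {"name": "say", "arguments": {"text": "On it."}}}],
}
flat = flatten_say_tool_calls(msg)
plain, spoken = split_text_and_say(flat["content"])
```

The point of the design is that the chat template only ever sees plain text, so the supervised target span and the inference parser agree on one string format.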
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's 1e-4 is safe only because it freezes the language head. SmolVLA2
unfreezes lm_head + the last text layer and fine-tunes the pretrained
SmolVLM2 language weights; 1e-4 is too aggressive there and destabilises
generation into degenerate repetition. Match pi05's 2.5e-5 peak LR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Pi 0.5 α=10 split assumed text is a rare auxiliary task. With the
flow-only `low_level` recipe (~40% of the blend) now rendering, the flow
term fires often and at 10x weight dominates the shared VLM backbone,
starving the text head into degenerate repetition decoding. A 5:1 split
keeps actions primary while leaving the language head enough gradient.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A recipe whose only supervision is the action-expert flow loss (e.g.
`low_level_execution`: `user(${subtask})` with `stream: low_level` and no
`target` turn) was rejected at render time by `_render_message_recipe` and
`_validate_rendered`, both of which required at least one target turn.
The result: every blend draw of the flow-only recipe rendered to `None`,
`predict_actions` was never set, `run_flow` never fired, and the action
expert received no flow loss — leaving it at random init. Both gates now
also accept a `low_level`-stream turn as valid supervision.
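
The relaxed gate amounts to the check below. This is a sketch with a hypothetical helper name and plain dict messages, not the actual `_render_message_recipe` / `_validate_rendered` code.

```python
def has_supervision(messages) -> bool:
    """True if any turn is a text target or belongs to the low_level stream."""
    return any(
        bool(m.get("target")) or m.get("stream") == "low_level" for m in messages
    )


# Flow-only recipe: a single user turn on the low_level stream, no target turn.
flow_only = [{"role": "user", "content": "${subtask}", "stream": "low_level"}]
# Text recipe: assistant turn supervised by text cross-entropy.
text_only = [
    {"role": "user", "content": "${task}", "stream": "high_level"},
    {"role": "assistant", "content": "${subtask}", "stream": "high_level", "target": True},
]
# Neither a target nor a low_level turn: still rejected.
unsupervised = [{"role": "user", "content": "${task}", "stream": "high_level"}]
```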
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the text-CE collapse bug fixed in 3cd348ff.
Pure-function tests over ``_mark_target_span_causal`` /
``_locate_lang_range`` / ``make_att_2d_masks`` — no model load, fast.
Pins:
* the target span flips to att=1, prompt/images stay att=0;
* target tokens attend causally among themselves (no peeking at
future targets) — genuine next-token prediction;
* targets still attend bidirectionally to images + the user prompt;
* the action-expert (state) token still attends to every target;
* a no-target subtask (low_level_execution user turn, labels all
-100) leaves the mask bidirectional;
* an explicit test documenting the bug: the raw embed_prefix mask
lets the first target token see the last — the copy-task collapse.
Skips cleanly when transformers isn't installed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of every collapsed inference run. ``embed_prefix`` flags
all language tokens ``att=0``; ``make_att_2d_masks`` turns that into
a single fully BIDIRECTIONAL block. So during the text-loss forward,
a supervised subtask token's hidden state attends to the very tokens
it is trained to predict. The cross-entropy degenerates into a copy
task — ``text_loss → ~3e-5`` not because the model learned to
predict subtasks but because it can see the answer.
At inference ``select_message`` decodes autoregressively (causally):
each token must be predicted WITHOUT seeing it — a task the model
was never actually trained on. Hence the universal collapse: a
coherent first token or two ("grasp the yellow cube"), then a loop
("cover cover cover", "icatorsicators", "the the the").
Fix: ``_mark_target_span_causal`` sets ``att=1`` on the language
positions that are supervised targets (``text_labels != -100``).
With make_att_2d_masks's cumulative-block rule each target token
then attends to images + the user prompt bidirectionally and to
EARLIER target tokens only — genuine causal next-token prediction,
matching select_message. Applied in both ``_compute_text_loss`` and
``_compute_fused_loss``. Per-sample correct: high_level_subtask
targets become causal; low_level_execution subtasks (a user turn,
labels all -100) stay bidirectional so the action expert reads them
as bidirectional context. The action expert is otherwise unaffected
— the suffix has a strictly higher cumsum and still attends to the
whole prefix.
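
A pure-Python sketch of the cumulative-block rule and the fix, for a single unpadded sequence. The real ``make_att_2d_masks`` works on batched tensors and also applies the padding mask; the function names and the toy layout below are illustrative only.

```python
from itertools import accumulate

IGNORE = -100  # label value for unsupervised positions


def make_att_2d(att):
    """mask[i][j] is True when query i may attend to key j (cumulative-block rule)."""
    cum = list(accumulate(att))
    n = len(att)
    return [[cum[j] <= cum[i] for j in range(n)] for i in range(n)]


def mark_target_span_causal(att, labels):
    """Set att=1 on supervised target positions; leave everything else untouched."""
    return [1 if lab != IGNORE else a for a, lab in zip(att, labels)]


# Toy prefix: 2 image tokens, 2 user-prompt tokens, 3 supervised target tokens.
labels = [IGNORE] * 4 + [11, 12, 13]
raw_att = [0] * 7  # what embed_prefix produces: one bidirectional block

raw = make_att_2d(raw_att)
fixed = make_att_2d(mark_target_span_causal(raw_att, labels))

first_target_sees_last_before = raw[4][6]    # the copy-task leak
first_target_sees_last_after = fixed[4][6]   # closed by the fix
```

With the fix, row 4 (first target) covers columns 0..4, row 5 covers 0..5, and row 6 covers 0..6, which is the causal pattern autoregressive decoding uses.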
Requires retraining: this changes the training objective. Existing
checkpoints were all trained on the degenerate copy task and cannot
generate text. Expect ``text_loss`` to settle MUCH higher than 3e-5
after this — that is correct; it is now a real prediction task.
NOTE: pi052's text path (PaliGemma prefix-LM) has the same
bidirectional-block structure and needs the analogous fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``embed_prefix`` lays the prefix out as ``[images, lang, state]`` with
the state token LAST. Training supervises the text head on the
*language* positions (``_compute_text_loss`` / ``_compute_fused_loss``
slice ``prefix_out[lang_start:lang_end]`` and run lm_head there).
But ``select_message`` started AR generation from the full prefix and
read ``prefix_out[:, -1:]`` — the **state token** — to decode the
first subtask token. The state token's hidden state exists for the
action expert to read; the lm_head was never trained to produce
subtask text from it. So inference decoded the high-level head from a
position entirely outside the training distribution: the text head
collapses (``the arm the arm``, ``grasp the surface population``,
``_333 absburg…``) no matter how cleanly ``text_loss`` converged.
Fix: truncate the state token off the prefix before the AR loop, so
``prefix_out[:, -1:]`` is the last language token (right after the
``Assistant:`` generation prompt) — exactly where training supervised.
Inference-only change — no retraining needed; existing checkpoints
benefit immediately. The action path (``predict_action_chunk``) is
untouched: state belongs in the action expert's prefix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLAConfig defaults ``load_vlm_weights=False``. With that and no
``--policy.path``, ``SmolVLMWithExpert.__init__`` builds the VLM via
``SmolVLMForConditionalGeneration(config=...)`` — i.e. a fully
**random-initialised** 500M backbone, including a random ``lm_head``.
For plain SmolVLA that's a deliberate "pre-train the expert" mode.
For SmolVLA2 it's a footgun: the high-level text head *is* the
SmolVLM2 ``lm_head``. Training subtask prediction from a random
language model can only memorise — which is exactly the repetition
collapse seen on the real robot ("the arm the arm the arm …").
SmolVLA2 now defaults ``load_vlm_weights=True`` so every run
fine-tunes the pretrained ``HuggingFaceTB/SmolVLM2-500M-Video-Instruct``
backbone (vision tower + language model + lm_head). The action
expert still trains from scratch on the robot data (standard SmolVLA
fine-tuning); start it from pretrained too by fine-tuning a full
``lerobot/smolvla_base`` checkpoint via ``--policy.path``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tighten the subtask prompt further per real-data feedback. The old
≤5-word cap still produced things like "release the yellow block
into the green bin" (8 words, articles, destination, and "block"
where the task said "cube").
New rules:
* Hard cap ≤ 4 words, ideally 2-3. Form: VERB + (color) + OBJECT.
* No articles, no destinations, no adverbs, no "robot/arm/gripper".
* Must reuse the exact object nouns from the task — no block/cube,
bin/box/container drift across the episode.
* Concrete good/bad examples anchored on the cube task.
Shorter, templated, consistent targets are far more robust for the
autoregressive LM head — fewer tokens to drift on, fewer dominant
n-grams to repetition-collapse into.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``_looks_like_gibberish`` low-unique-token check was gated on
``len(stripped) < 80``, so an LM head that loops an n-gram for the
whole 256-token budget — "the arm the arm … the the the the" —
sailed straight through (``gibberish:0`` in the panel) and the
garbage subtask got accepted and fed to the action expert.
Added a length-independent check: ``>= 8 tokens`` but unique-token
count ``<= max(3, tokens // 10)`` ⇒ repetition collapse. Now the
runtime rejects the looped output and keeps the previous (real)
subtask instead of propagating nonsense.
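
The length-independent rule above, sketched as a standalone function. Whitespace tokenization is an assumption; the real ``_looks_like_gibberish`` may count tokens differently and has other checks that are not shown.

```python
def looks_like_repetition_collapse(text: str) -> bool:
    """>= 8 tokens but unique-token count <= max(3, tokens // 10)."""
    tokens = text.split()  # assumption: whitespace-delimited tokens
    if len(tokens) < 8:
        return False
    return len(set(tokens)) <= max(3, len(tokens) // 10)


looped = looks_like_repetition_collapse("the arm " * 40)
short = looks_like_repetition_collapse("grasp yellow cube")
varied = looks_like_repetition_collapse(
    "move the gripper above the red cube then close it slowly"
)
```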
This is a guard, not a cure: the underlying issue is the LM head
on the current checkpoint being undertrained / collapsed; re-annotate
with the short prompts and train longer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The current recipe trains neither plan nor memory, and no inference
step consumes them — ``_msgs_for_subtask`` renders the bare task and
``LowLevelForward`` conditions on the subtask. Bootstrapping
``current_plan`` / ``current_memory`` from the dataset's
``language_persistent`` annotations therefore only placed a stale,
do-nothing plan in the status panel.
Keep seeding ``current_subtask`` — it's a useful first-frame
fallback for ``LowLevelForward`` before ``HighLevelSubtaskFwd``
produces its first subtask.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the previous "condition actions on the task" shortcut.
The action expert is conditioned on the SUBTASK again:
* ``low_level_execution`` recipe back to ``user(${subtask})``.
* ``LowLevelForward`` conditions on ``current_subtask`` (falls back
to the task only on the first frame, before the high-level loop
has produced a subtask).
* ``HighLevelSubtaskFwd`` re-added to the runtime pipeline so the
subtask is actually generated each high-level tick and written to
``current_subtask`` before ``LowLevelForward`` consumes it.
* ``_msgs_for_subtask`` now renders just ``${task}`` (no
``Plan: ``/``Memory: `` lines) to match the current
``high_level_subtask`` recipe, whose user turn is the bare task.
So the loop is: task → HighLevelSubtaskFwd (LM head) → subtask →
LowLevelForward → action chunk conditioned on that subtask.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot runs shook and failed the task despite a low flow loss.
Root cause: train/inference conditioning mismatch — not a flow-loss
bug (``_compute_fused_loss``'s flow path is byte-identical to
``SmolVLAModel.forward``).
At training, ``low_level_execution`` conditioned the action expert
on ``${subtask}``, and every frame's subtask was the correct one
for that frame. At inference the runtime has no high-level subtask
generator (VQA-only pipeline), so ``current_subtask`` was frozen —
the action expert got "move towards the blue cube" for the entire
episode. Once the arm reached the cube, that (image, subtask) pair
never occurred in training → OOD conditioning → incoherent flow
output → shaking.
Fix: ``low_level_execution`` now renders ``user(${task})``. The
task is stable for the whole episode and always available, so the
action expert's conditioning is identical at train and inference
with no high-level loop required. ``LowLevelForward`` updated to
build the same ``[user(task)]`` prompt.
``high_level_subtask`` still trains the text head to predict
subtasks (kept for when a reliable subtask loop is reintroduced) —
it's just no longer on the action expert's critical path.
Requires re-training for the recipe change to take effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scope reduction while the core subtask + action loop is validated:
Recipe (hirobot.yaml)
* Removed ``plan_generation`` sub-recipe entirely.
* Removed the memory tail from ``high_level_subtask`` (the
``new_memory`` binding + the second assistant turn).
* ``high_level_subtask`` user turn is now just ``${task}`` — no
``Plan: …\nMemory: …`` context.
* Weights rebalanced over the four remaining sub-recipes:
high_level_subtask 0.40, low_level_execution 0.40,
ask_vqa_top/wrist 0.10 each.
Runtime (inference/runtime.py)
* Pipeline trimmed to VQA + the action loop:
AskVQAFwd → LowLevelForward → DispatchAction → DispatchToolCalls.
* Dropped HighLevelSubtaskFwd / MemoryUpdateFwd / UserInterjectionFwd
from the default pipeline. They remain importable from
``inference.steps`` for when plan/memory/subtask generation is
brought back. The action expert conditions on the task string
directly via LowLevelForward's ``current_subtask or task``
fallback.
This commit lands on top of a rollback of the previous two commits
(repetition_penalty / no_repeat_ngram_size knobs, and the
deterministic plan-walker) — both were bandaids for the LM-head
repetition collapse that the reduced-scope recipe sidesteps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PI052 and PI0_FAST both load ``physical-intelligence/fast`` as
their action tokenizer. That tokenizer's HF backend requires
``sentencepiece`` to instantiate (or ``tiktoken``); without it
``AutoProcessor.from_pretrained`` raises:
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece or tiktoken installed [...]
``sentencepiece`` wasn't listed in pyproject, so fresh installs missed it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dumper was printing ``stream=None target=None`` for every
message because it read those fields off the message dicts, but
the recipe renderer keeps them in parallel arrays
(``message_streams`` / ``target_message_indices`` in
COMPLEMENTARY_DATA) so the chat template doesn't see unknown
keys. Zip them back into the dump-time dicts so the printed
metadata is accurate.
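
Sketch of the zip-back step, assuming ``message_streams`` is one stream name per message and ``target_message_indices`` is a list of integer positions into the message list. Both shapes are assumptions; the helper name is hypothetical.

```python
def annotate_for_dump(messages, message_streams, target_message_indices):
    """Build dump-time copies of the messages with stream/target merged back in.

    The originals are not mutated, so the chat template never sees the extra keys.
    """
    targets = set(target_message_indices)
    return [
        {**msg, "stream": stream, "target": i in targets}
        for i, (msg, stream) in enumerate(zip(messages, message_streams))
    ]


messages = [
    {"role": "user", "content": "stack the cubes"},
    {"role": "assistant", "content": "grasp yellow cube"},
]
dumped = annotate_for_dump(messages, ["high_level", "high_level"], [1])
```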
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Subtask and memory annotations both feed into the high-level prompt
and the plan rendering, so keeping them short directly shortens the
rendered ``${task}\nPlan: …\nMemory: …`` prefix the model has to chew
through at inference.
Subtasks
* Hard cap: ≤ 5 words. Verb + object only, drop articles/adverbs.
* Concrete good/bad examples to anchor the VLM.
Memory
* Hard cap: ≤ 10 words. Telegraphic noun→location fragments
("bowl in box, lid open"), no past-tense verbs, drop attributes
that don't matter for downstream subtasks.
* Allow empty string when no material change occurred — keeps the
rendered memory line literally blank instead of forcing a no-op
sentence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously a plan row was emitted only at t=0 and on interjections, so
the active plan rendered into training still listed completed subtasks
until the next interjection. With the new "plan = remaining subtasks"
summariser this meant the plan was stale between boundaries.
Emit a fresh plan row at every subtask start. ``active_at(t)`` then
returns a plan that contains exactly the subtasks whose start ≥
the current span's start — completed subtasks fall off the plan
the moment the next subtask begins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The plan was being generated by a separate VLM call (one per
episode + one per interjection refresh) with a prompt that asked
the model to "compress the subtasks into a compact hierarchical
plan". In practice the plans came out longer than necessary and
sometimes drifted from the actual subtask sequence the runtime
would execute.
Replaced ``_generate_plan`` with a deterministic numbered list
of the upcoming subtasks. At a refresh time the list shrinks to
subtasks whose start ≥ refresh_t — the plan describes what's
*left* to do, so it gets shorter as work progresses.
Saves the per-episode + per-interjection VLM round-trip in the
annotation pipeline and keeps train-time plan text bit-aligned
with the subtask annotations the rest of Module 1 emits.
Removed the now-unused ``prompts/module_1_plan.txt``.
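
A minimal sketch of the deterministic plan, assuming subtasks are available as ``(start_time, text)`` pairs. The exact ``N. text`` line format and the helper name are assumptions; only the "numbered list of subtasks whose start ≥ refresh_t" rule comes from the description above.

```python
def remaining_plan(subtasks, refresh_t):
    """Numbered list of the subtasks that start at or after refresh_t."""
    upcoming = [text for start, text in sorted(subtasks) if start >= refresh_t]
    return "\n".join(f"{i}. {text}" for i, text in enumerate(upcoming, start=1))


subtasks = [(0.0, "grasp cube"), (2.5, "lift cube"), (4.0, "place cube")]
plan_at_start = remaining_plan(subtasks, 0.0)
plan_mid_episode = remaining_plan(subtasks, 2.5)
plan_after_end = remaining_plan(subtasks, 5.0)
```

Renumbering from 1 at every refresh is a choice made for this sketch; it keeps the plan text short as work progresses.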
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two regressions surfaced by the first training run:
1. ``--policy.type=pi052`` failed with ``invalid choice``. PI052Config
wasn't imported in ``policies/__init__.py``, so its
``@register_subclass("pi052")`` decorator never ran and draccus
didn't see it as a valid policy type. Mirror PI05Config /
SmolVLA2Config in the top-level imports + __all__.
2. ``low_level_execution`` (user-only ``${subtask}`` recipe used for
π0.5-style flow conditioning) tripped
``ValueError: Message recipes must contain at least one target
turn.`` The validator was too strict — a recipe with only a
``stream: low_level`` turn still drives meaningful supervision
(flow MSE on the action expert via ``predict_actions=True``).
Allow either ``target: true`` OR ``stream: low_level`` to satisfy
the "supervises something" requirement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe renders ``"${task}\nPlan: ${plan}\nMemory: ${memory}"``
unconditionally — when a binding resolves to None,
``language_render._substitute`` substitutes an empty string, so the
training-time user turn always contains the literal ``Plan: `` /
``Memory: `` prefixes even with empty values.
The inference message builders were skipping those lines entirely
when ``state['current_plan']`` / ``state['current_memory']`` was
empty, producing a different prompt shape on early frames (before
the plan-generation step runs) and on datasets without plan/memory
annotations.
Factored a shared ``_hirobot_user_head`` helper used by
``_msgs_for_subtask``, ``_msgs_for_memory``, and the legacy
``_control_context_messages`` so they all match training byte-for-
byte regardless of which bindings are populated.
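
Roughly what the shared helper has to guarantee, as a sketch. The signature is an assumption; the key property is that empty bindings still leave the literal ``Plan: `` / ``Memory: `` prefixes in place, as the training-time renderer does.

```python
def hirobot_user_head(task, plan=None, memory=None):
    """Render the user turn with the same shape whether or not plan/memory are set."""
    # None or "" both render as an empty value after the literal prefix.
    return f"{task}\nPlan: {plan or ''}\nMemory: {memory or ''}"


early_frame = hirobot_user_head("stack the cubes")
later_frame = hirobot_user_head(
    "stack the cubes", plan="1. grasp cube", memory="cube on table"
)
```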
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a one-shot debug dumper to both chat processors. When the env
var ``LEROBOT_DUMP_RECIPE_SAMPLES`` is set to a positive integer N,
the next N samples processed (rank-0 only) get pretty-printed:
* the recipe-rendered messages (role / stream / target / content),
* the full tokenized prompt (decoded back),
* inline ``[TGT]...[/TGT]`` markers over the spans the LM head is
supervised on,
* token count + target-token count,
* ``predict_actions`` flag.
Usage:
LEROBOT_DUMP_RECIPE_SAMPLES=5 sbatch train_smolvla2.slurm
After N dumps the helper becomes a no-op; training continues
unaffected. Works for both smolvla2 (chat-template renderer) and
pi052 (plain ``Role: content`` concat renderer); each processor has
its own copy to avoid cross-package imports.
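
A rough sketch of the two pieces the dumper needs: a countdown read from ``LEROBOT_DUMP_RECIPE_SAMPLES`` and inline ``[TGT]...[/TGT]`` markers over supervised spans. Class and function names are hypothetical, and the real dumper decodes token ids rather than joining strings.

```python
import os


class DumpBudget:
    """Counts down from the env var's value; rank != 0 never dumps."""

    def __init__(self, environ=os.environ, rank=0):
        raw = environ.get("LEROBOT_DUMP_RECIPE_SAMPLES", "").strip()
        self.remaining = int(raw) if raw.isdigit() and rank == 0 else 0

    def take(self) -> bool:
        if self.remaining <= 0:
            return False  # budget spent: the helper is a no-op from here on
        self.remaining -= 1
        return True


def mark_targets(tokens, labels, ignore=-100):
    """Wrap each contiguous run of supervised tokens in [TGT] ... [/TGT]."""
    out, inside = [], False
    for tok, lab in zip(tokens, labels):
        is_target = lab != ignore
        if is_target and not inside:
            out.append("[TGT]")
        elif not is_target and inside:
            out.append("[/TGT]")
        out.append(tok)
        inside = is_target
    if inside:
        out.append("[/TGT]")
    return " ".join(out)


budget = DumpBudget(environ={"LEROBOT_DUMP_RECIPE_SAMPLES": "2"})
taken = [budget.take() for _ in range(3)]
rendered = mark_targets(
    ["User:", "stack", "Assistant:", "grasp", "cube"],
    [-100, -100, -100, 1, 2],
)
```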
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The smolvla2 and pi052 recipe blends had drifted to identical content
twice in a row; collapse them to a single ``recipes/hirobot.yaml``
both policies point at. Each backbone's text tokenizer (chat-template
for SmolVLA2, plain ``Role: content`` for PI052) handles the
rendering differences downstream — the recipe spec is shared.
Audit fixes folded into the same commit:
* **Train/inference prefix mismatch on the action expert**
``_build_text_batch`` always passed ``add_generation_prompt=True``,
appending ``<|im_start|>assistant\n`` tokens that the action
expert never saw at training (the chat tokenizer renders with
``add_generation_prompt=False``). Parameterized the helper and
pass ``False`` from ``LowLevelForward``; ``select_message`` paths
still default to ``True`` for AR text generation.
* **PI052 fallthrough could silently train flow on text-only frames**
When ``text_loss_weight=0`` AND every sample was high-level
(``predict_actions.any()==False``), the previous heuristic
delegated to ``PI05Policy.forward``, which ignores
``predict_actions`` and runs flow on every sample. Reverted to
delegating only on fully unannotated batches.
* **SmolVLA2 silent zero-loss training**
``forward`` returned ``loss=0`` (no error) when neither flow nor
text path fired. Now raises ``RuntimeError`` with the weights and
routing flags — fails loud like PI052 already does.
* **PI052 dropout-seed key**
Was reading ``complementary["dataset_index"]`` (only set by
``MultiDataset`` and means "which sub-dataset", not row index)
with fallback to ``frame_index`` (never set) — every sample got
seed=0, so per-component dropout was deterministic across the
epoch. Switched to ``complementary["index"]`` to match SmolVLA2
and the canonical ``BatchProcessor`` convention.
* **Dead ``DEFAULT_TOOLS`` import**
Removed from ``chat_processor_smolvla2.py`` — unused since the
default-tools list was switched to ``[]`` in the prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL (smolvla2) — the SmolVLM2 chat template was rendering the
``say`` tool's JSON schema as a system message on every training
sample because ``DEFAULT_TOOLS`` was the default in
``SmolVLA2ChatTokenizerStep``. That schema was only relevant to
the now-removed ``user_interjection_response`` recipe; with it
gone the schema is dead weight that polluted every action-expert
prefix AND created a train/inference mismatch (the inference
``_build_text_batch`` doesn't pass ``tools=``). Default is now
``[]``; callers needing tools can still set them via
``with_tools(meta.tools)``.
LIKELY-BUG — ``low_level_execution`` had ``target: true`` on its
assistant turn, so text-CE trained the LM head to predict the
same subtask string the user just stated (trivial "copy previous
turn" supervision that diluted LM head capacity). Dropped the
assistant turn entirely; ``high_level_subtask`` (w=0.50) already
owns subtask prediction from real context.
The chat-tokenizer's ``predict_actions`` detection used to scan
target streams only. With the new no-target low_level recipe it
would wrongly evaluate to False and skip the flow loss. Switched both
``chat_processor_smolvla2.py`` and ``text_processor_pi052.py`` to
scan all message streams — any ``stream: low_level`` on the
sample is enough to trigger flow loss.
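A minimal sketch of the detection change described above, assuming each rendered message is a dict with optional ``stream`` and ``target`` keys. The message shape and both helper names are hypothetical, not the actual processor code:

```python
def predict_actions_targets_only(messages):
    """Old behaviour: only supervised (target) messages are inspected."""
    return any(
        m.get("target") and m.get("stream") == "low_level" for m in messages
    )


def predict_actions_all_streams(messages):
    """New behaviour: any low_level message on the sample triggers flow loss."""
    return any(m.get("stream") == "low_level" for m in messages)


# Assumed shape of the new low_level_execution sample:
# a single user turn on the low_level stream, with no text-CE target.
sample = [{"role": "user", "content": "pick up the cup", "stream": "low_level"}]

old_flag = predict_actions_targets_only(sample)
new_flag = predict_actions_all_streams(sample)
```

With the target-only scan the sample above is treated as text-only; scanning every stream restores the flow loss for it.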
Inference: the low-level loop sends only ``[user(subtask)]`` now,
matching the new recipe shape.
PI052 — hardened the forward fallthrough so a degenerate batch
where every sample's recipe is text-only AND text supervision is
disabled (text_loss_weight<=0 or text_labels missing) cleanly
delegates to ``PI05Policy.forward`` instead of raising
"nothing to train".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously ``action_execution`` rendered ``task + plan + memory +
subtask`` into one prefix and ran the flow loss on it. That meant
the action expert was conditioned on the full hierarchical context
(closer to π0.7 §V.A), not just the subtask.
The π0.5 paper's hierarchical inference has the action expert see
only the *subtask* (plus images and state). Split the recipe to
match:
high_level_subtask (0.50)
user(task + plan + memory) → assistant(subtask)
[+ assistant(new_memory) at boundary frames]
All ``stream: high_level`` → text-CE only, no flow loss.
low_level_execution (0.30)
user(subtask) → assistant(subtask)
Both ``stream: low_level`` → flow loss fires; text CE on the
subtask is a small redundant extra signal. Prefix the action
expert sees: [images, subtask, state].
plan_generation (0.10) — unchanged.
ask_vqa_{top,wrist} (0.05 each) — unchanged.
Runtime: the low-level loop in ``smolvla2/inference/steps.py``
now sends ``[user(subtask), assistant(subtask)]`` to
``predict_action_chunk`` instead of the full task+plan+memory
context. Falls back to ``state['task']`` when no subtask has been
generated yet so the first frame still has something to condition
on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CRITICAL (smolvla2) — text-CE was applied to the wrong prefix slice.
``num_state`` was being read from ``state.shape[1]`` (the raw
max_state_dim, ~14-32) instead of the *number of state tokens*
(always 1). Compounded by the trailing-padding issue (state is
not at the end of the padded prefix when ``seq_len < prefix_length``),
the lang slice was landing on image / padding hidden states.
New ``_locate_lang_range`` finds the state position via
``att_masks.nonzero()`` (the only ``1`` in the mask), making the
slice robust to both bugs. Used by ``_compute_text_loss`` and
``_compute_fused_loss``.
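A dependency-free sketch of the idea behind ``_locate_lang_range``, assuming a prefix laid out as [images, language, state, padding] in which the state token is the only position flagged ``1``. The function signature, the ``lang_len`` argument, and the list-based mask are illustrative assumptions; the real helper operates on tensors:

```python
def locate_lang_range(att_mask, lang_len):
    """Return (start, end) of the language tokens in a padded prefix.

    Assumes the state token is the single position flagged 1 and that the
    language block sits immediately before it.
    """
    ones = [i for i, flag in enumerate(att_mask) if flag == 1]
    if len(ones) != 1:
        raise ValueError(f"expected exactly one state token, found {len(ones)}")
    state_pos = ones[0]
    return state_pos - lang_len, state_pos


# 3 image tokens, 4 language tokens, 1 state token, 2 trailing pad positions.
att_mask = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
lang_start, lang_end = locate_lang_range(att_mask, lang_len=4)
```

Anchoring on the state position rather than counting back from the end of the sequence is what keeps the slice correct when trailing padding is present.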
LIKELY-BUG (smolvla2) — ``_unfreeze_lm_head`` only re-enabled
``lm_head`` and ``text_model.model.norm.weight``. SmolVLA's parent
ALSO freezes the last 1-2 transformer layers, so text-loss
gradients died in a frozen final block. Now mirrors the parent's
freeze targets and unfreezes the matching ``layers.{N-1}`` (and
``N-2`` when num_vlm % num_expert == 0).
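The layer-selection rule above can be restated as a small sketch. It assumes ``N`` is the number of VLM transformer layers and encodes only the condition quoted in this message; the function name is hypothetical and the parent's actual freeze logic may have further cases:

```python
def layers_to_unfreeze(num_vlm_layers, num_expert_layers):
    """Indices of the VLM layers to re-enable for text-loss gradients."""
    last = num_vlm_layers - 1
    layers = [last]
    # Per the rule above, the second-to-last layer is also frozen by the
    # parent when the layer counts divide evenly.
    if num_vlm_layers % num_expert_layers == 0:
        layers.append(last - 1)
    return sorted(layers)
```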
CRITICAL (pi052) — flow and FAST CE were not per-sample masked
under per-sample-routing. Text-only recipe samples
(``plan_generation``, ``ask_vqa_*``) contributed to flow/FAST
loss with prompts that deliberately omit the subtask, corrupting
the signal. Threaded ``predict_actions_t`` through both
``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``;
flow uses ``(per_sample * mask).sum() / mask.sum()``, FAST uses
``shift_valid & sample_mask`` before ``masked_fill(-100)``.
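A pure-Python sketch of the two masking rules above, using lists in place of tensors. The function names and the ``None`` return for a batch with no action samples are assumptions made for illustration:

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def masked_flow_loss(per_sample, predict_actions):
    """Mean flow loss over samples whose recipe asks for actions."""
    mask = [1.0 if flag else 0.0 for flag in predict_actions]
    denom = sum(mask)
    if denom == 0:
        return None  # assumption: caller skips flow loss for this batch
    return sum(loss * m for loss, m in zip(per_sample, mask)) / denom


def mask_fast_labels(labels, shift_valid, predict_actions):
    """Keep a FAST label only if the position is valid AND the sample is routed to actions."""
    return [
        [lab if (valid and keep) else IGNORE_INDEX
         for lab, valid in zip(row_labels, row_valid)]
        for row_labels, row_valid, keep in zip(labels, shift_valid, predict_actions)
    ]


# The middle sample is text-only, so its large loss must not leak in.
flow = masked_flow_loss([2.0, 100.0, 4.0], [True, False, True])
fast = mask_fast_labels(
    labels=[[5, 6], [7, 8]],
    shift_valid=[[True, False], [True, True]],
    predict_actions=[True, False],
)
```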
OTHER
* PI052Policy.forward now falls through to PI05Policy.forward on
unannotated batches (no text_labels, no predict_actions, no FAST).
* fit_fast_tokenizer cache key now includes ``chunk_size`` — changing
the chunk size no longer silently loads a wrongly-fit tokenizer.
* Removed dead ``_compute_text_loss`` / ``_compute_fast_action_loss``
in pi052 (superseded by the fused helpers).
* Fixed stale "no-op stub" docstring on ``knowledge_insulation`` —
it's been fully wired since the per-layer KI forward port.
* Stripped unused ``copy`` / ``resize_with_pad`` imports.
* Extracted ``_shifted_ce`` / ``_mask_per_sample`` / ``_fast_ce``
helpers shared between fused and prefix-only paths.
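To illustrate the cache-key item in the list above, here is a hedged sketch of a key that changes whenever ``chunk_size`` changes. The function name and every field other than ``chunk_size`` are hypothetical; the real ``fit_fast_tokenizer`` key may be built differently:

```python
import hashlib
import json


def fast_tokenizer_cache_key(dataset_id, action_dim, chunk_size):
    """Stable short hash over every setting that affects the fitted tokenizer."""
    payload = json.dumps(
        {"dataset": dataset_id, "action_dim": action_dim, "chunk_size": chunk_size},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


key_50 = fast_tokenizer_cache_key("my_dataset", 7, chunk_size=50)
key_10 = fast_tokenizer_cache_key("my_dataset", 7, chunk_size=10)
```

Any parameter left out of the key can silently alias two different fits to the same cache entry, which is the failure mode the fix addresses.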
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New text-only sub-recipe at 0.10 weight on both blends:
user : ${task}
assistant : ${current_plan} (high_level target)
Bound to ``active_at(t, style=plan)`` so it supervises the
currently-active plan on every frame, gated by ``if_present`` to
skip frames without a plan annotation.
Weights rebalanced: action_execution 0.85 → 0.75, plan_generation
0.10, VQA top/wrist 0.075 each (sums to 1.0).
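A quick check of the rebalanced blend, written as a small validator. The helper name and tolerance are illustrative and not part of the recipe loader:

```python
import math


def check_blend_weights(weights, tol=1e-9):
    """Return the total weight, raising if the blend does not sum to 1.0."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"blend weights sum to {total}, expected 1.0")
    return total


new_blend = {
    "action_execution": 0.75,
    "plan_generation": 0.10,
    "ask_vqa_top": 0.075,
    "ask_vqa_wrist": 0.075,
}
total = check_blend_weights(new_blend)
```

Adding ``plan_generation`` at 0.10 without lowering ``action_execution`` from 0.85 would have pushed the total to 1.10, which the check rejects.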
Added matching runtime builder ``_msgs_for_plan`` in
``smolvla2/inference/steps.py`` so the high-level loop can call
``select_message`` with the bare-task prompt at episode start /
replanning events.
Closes a gap vs. Pi 0.7 §V — without this recipe the model could
read ``${plan}`` from the prompt but never had to produce one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a
short header + a one-line note on the boundary-frame memory tail.
Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about
a since-merged experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recipe changes:
* action_execution now bundles the memory update as a second
assistant target gated on a new ``new_memory`` binding (fires
only at subtask-boundary frames). No "Completed subtask: X"
filler — the model emits the new subtask AND the updated
memory back-to-back in one prefix.
* user_interjection_response sub-recipe removed (current
datasets don't have interjection / say() annotations).
* Standalone memory_update sub-recipe removed (folded above).
* Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist
0.075 each (sums to 1.0).
Runtime ``_msgs_for_memory`` updated to match the new
boundary-frame prompt layout.
Modeling:
* SmolVLA2Policy now fuses the flow + text losses into a SINGLE
backbone forward via ``_compute_fused_loss``: one
vlm_with_expert pass with [prefix, suffix] embeds, followed by
lm_head CE on the lang slice and action_out_proj MSE on the
suffix. Mirrors pi052's existing ``_compute_all_losses_fused``
and saves one backbone pass per training step.
Examples:
* Removed the two training SLURM scaffolds; they were
out-of-date with the recipe refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.
Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------
One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:
* subtask prediction (text CE on the assistant span via lm_head)
* action chunks (flow MSE on the action expert via
stream: low_level, target: true; plus FAST CE on action tokens
when enable_fast_action_loss=True)
At inference, the *same* prompt structure drives both inference
modes:
* select_message(user_prompt_only) → LM head generates the next
subtask. Matches action_execution's training distribution
exactly (prompt is the user turn, target is the subtask).
* predict_action_chunk(user_prompt + assistant_subtask) → action
expert produces the chunk. Matches action_execution's full
prompt+target.
This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.
Flavor 2 — event-driven text-only recipes
-----------------------------------------
Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.
* memory_update (10%) new memory at subtask boundary
* user_interjection_response (15%) new plan + say(...) on input
* ask_vqa_top (7.5%) front-camera VQA
* ask_vqa_wrist (7.5%) wrist-camera VQA
Total weight = 1.0.
Prompt format consistency
-------------------------
User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.
What changed structurally
-------------------------
- low_level_execution DROPPED (folded into action_execution)
- high_level_subtask DROPPED (subtask supervision moved into action_execution)
+ action_execution NEW (the fused main recipe)
memory_update kept, prompt cleaned up
user_interjection_response kept, prompt cleaned up
ask_vqa_top / ask_vqa_wrist kept
Runtime compatibility
---------------------
No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did 2 backbone passes when all heads were
active: one for flow (via super().forward) and one for the fused
text+FAST helper. This commit reduces it to **one pass** — same
compute as flow-only training.
New ``_compute_all_losses_fused`` builds:
prefix = [images, language, FAST (when provided)]
suffix = [noisy_actions] (action expert via gemma_expert)
and runs a single ``paligemma_with_expert.forward`` with
``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active
in the same call). Captures *both* prefix_out and suffix_out, slices
each for its respective loss:
flow MSE ← suffix_out (existing action_out_proj + MSE path)
text CE ← prefix_out at language positions (lm_head + CE)
FAST CE ← prefix_out at FAST positions (lm_head + CE)
Critical attention mask override
--------------------------------
``make_att_2d_masks`` produces a cumulative-block attention mask in
which suffix tokens (highest cumsum) attend to *every* lower-cumsum
position by default, including FAST tokens. If we let that stand the
action expert reads the discrete FAST tokens and trivially decodes
them back to the same continuous actions the flow head is supposed
to predict from noise — the entire training signal collapses to a
copy operation.
The fix is a single line right after make_att_2d_masks:
att_2d_masks[:, fast_end:, fast_start:fast_end] = False
Explicitly zeros out *suffix → FAST* attention. Everything else
remains correct under the cumsum semantics:
* prefix images/language stay bidirectional among themselves
* FAST stays causal within itself and attends to all of
images+language (which do not attend back to FAST)
* FAST cannot see suffix (cumsum < suffix cumsum, default)
* suffix attends bidirectionally among itself, attends to all of
images+language, and now NOT to FAST (this override)
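The properties listed above can be checked on a toy sequence. This is a pure-Python sketch of the cumulative-block rule as described in this message (query i may attend to key j when cumsum[j] <= cumsum[i]); it ignores padding masks, and the token layout and sizes are made up for illustration:

```python
def make_att_2d_mask(att_masks):
    """Cumulative-block mask: query i may attend to key j iff cumsum[j] <= cumsum[i]."""
    cumsum, running = [], 0
    for flag in att_masks:
        running += flag
        cumsum.append(running)
    return [[cj <= ci for cj in cumsum] for ci in cumsum]


# Toy layout: 3 image/language tokens, 2 FAST tokens, 2 suffix (action) tokens.
att_masks = [0, 0, 0,  # images + language: one bidirectional block
             1, 1,     # FAST: each token opens a new block, hence causal
             1, 0]     # suffix: one new bidirectional block
fast_start, fast_end = 3, 5

mask = make_att_2d_mask(att_masks)
suffix_saw_fast = mask[5][3]  # True before the override: the shortcut exists

# List-based equivalent of: att_2d_masks[:, fast_end:, fast_start:fast_end] = False
for i in range(fast_end, len(mask)):
    for j in range(fast_start, fast_end):
        mask[i][j] = False
```

After the override, rows 5 and 6 (the suffix) still see columns 0 to 2 and each other, but no longer see columns 3 and 4 (FAST).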
Bit-equivalent to the previous separated forward path for text+FAST
losses (the prefix hidden states at language and FAST positions are
unchanged whether suffix is present or not — the prefix doesn't
attend to suffix). For flow loss, suffix→FAST being masked is the
correct behaviour we *want* — if anything the previous separated
path was less correct for production use because the joint
gradient signal through the action expert was missing the prefix
extension.
Forward routing in ``forward()``
--------------------------------
* run_flow=True → _compute_all_losses_fused (one forward, all
three losses)
* run_flow=False, run_text or run_fast → _compute_text_and_fast_loss
(one prefix-only forward, two CE losses, no
suffix → cheaper than fusion)
* neither → RuntimeError (explicit; all losses disabled)
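The routing above reduces to a small decision function. This sketch returns the name of the helper that would run; the function name ``route_forward`` and the error text are illustrative, not the actual ``forward()`` body:

```python
def route_forward(run_flow, run_text, run_fast):
    """Pick the loss helper for a batch based on which heads are active."""
    if run_flow:
        # One fused forward covers flow plus whichever CE heads are enabled.
        return "_compute_all_losses_fused"
    if run_text or run_fast:
        # Prefix-only forward: no suffix, so cheaper than full fusion.
        return "_compute_text_and_fast_loss"
    raise RuntimeError(
        "nothing to train: "
        f"run_flow={run_flow}, run_text={run_text}, run_fast={run_fast}"
    )
```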
Wall-time per step
------------------
Before this commit: flow + (text+FAST fused) = 2 forwards
After this commit: (flow+text+FAST fused) = 1 forward
Compute parity with flow-only training when all three heads are active.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same bug we fixed for high_level_subtask, just on the other
subtask-supervised sub-recipe. ``low_level_execution`` targets
``${subtask}`` (the current active span) but had no
``if_present`` guard. When ``active_at(t, style=subtask)`` returned
None at a frame (gaps in the annotation, or the very first/last
frames of an episode if the annotator's spans don't fully tile),
the assistant message rendered with empty content. The chat
tokenizer still included it in ``target_message_indices`` → text CE
supervised whatever the chat-template's empty assistant turn
decoded to (usually a single ``\n``). That trains the LM head's
prior at the first generation position toward ``\n``, the same
collapse we observed with the original ``${next_subtask}`` target.
Fix: ``if_present: subtask`` on the assistant target in
``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and
``pi052_hirobot.yaml``.
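A toy renderer showing the effect of the guard, assuming messages are dicts with a ``${...}`` content template and an optional ``if_present`` key naming a binding. The renderer and message shape are hypothetical stand-ins for the real recipe pipeline:

```python
from string import Template


def render_messages(template, bindings):
    """Render a recipe, skipping messages whose if_present binding is None."""
    values = {k: "" if v is None else v for k, v in bindings.items()}
    rendered = []
    for msg in template:
        guard = msg.get("if_present")
        if guard is not None and bindings.get(guard) is None:
            continue  # drop the message instead of rendering it empty
        content = Template(msg["content"]).safe_substitute(values)
        rendered.append({**msg, "content": content})
    return rendered


unguarded = [{"role": "assistant", "content": "${subtask}", "target": True}]
guarded = [{"role": "assistant", "content": "${subtask}", "target": True,
            "if_present": "subtask"}]
gap_frame = {"subtask": None}  # no active subtask span at this frame

empty_target = render_messages(unguarded, gap_frame)
skipped = render_messages(guarded, gap_frame)
```

Without the guard the gap frame yields an assistant target with empty content, which is the degenerate supervision described above; with the guard the target is dropped entirely.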
Side effect: frames without an active subtask span no longer
contribute to the flow loss either (the only ``low_level`` target
is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))``
becomes False). For a well-annotated dataset where subtask spans
tile the whole episode this is a no-op. For datasets with gaps,
those gap frames lose flow supervision — strictly better than the
degenerate text-CE alternative.
Sub-recipe audit summary (no other changes needed):
* memory_update — all if_present guards present, OK
* user_interjection_response — all if_present guards present, OK
* high_level_subtask — fixed earlier, OK
* low_level_execution — fixed by this commit
* ask_vqa_top / ask_vqa_wrist — query+answer both guarded, OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did three backbone passes per training step
when all heads were active: one for flow (via super().forward), one
for text CE, and one for FAST CE. That's ~3× the compute of
flow-only training.
The text and FAST losses share their prefix forward exactly — both
are CE on the LM head, evaluated at different slices of the same
hidden states. Adding FAST tokens after language in the prefix is
bit-equivalent for the text loss because the mask_ar convention in
``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal
block: language tokens never see FAST, so their hidden states are
unchanged.
New ``_compute_text_and_fast_loss``:
* embeds [images, language] once
* optionally appends [FAST] (when run_fast is True)
* one backbone forward
* slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for
language hidden states (or ``vlm_out[:, -lang_len:]`` when no
FAST) → text CE
* slices ``vlm_out[:, -fast_len:]`` for FAST hidden states →
FAST CE
* returns both losses, either of which can be None when the
caller doesn't want that head.
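The slicing above can be sketched on a plain list standing in for one sample's hidden states, assuming the sequence ends exactly at the last FAST (or language) token with no trailing padding. The function name is hypothetical:

```python
def split_lang_and_fast(hidden, lang_len, fast_len):
    """Split a [images, language, FAST] sequence into language and FAST parts."""
    if fast_len == 0:
        # hidden[-lang_len:-0] would be an empty slice, hence the separate branch.
        return hidden[-lang_len:], []
    return hidden[-(fast_len + lang_len):-fast_len], hidden[-fast_len:]


tokens = ["img0", "img1", "lang0", "lang1", "lang2", "fast0", "fast1"]
lang, fast = split_lang_and_fast(tokens, lang_len=3, fast_len=2)
lang_only, no_fast = split_lang_and_fast(tokens[:5], lang_len=3, fast_len=0)
```

The ``fast_len == 0`` branch corresponds to the ``vlm_out[:, -lang_len:]`` case in the list: a negative-zero upper bound cannot express "up to the end".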
forward() now calls this fused helper instead of running the two
separate ``_compute_text_loss`` / ``_compute_fast_action_loss``
methods. Those remain in the file for callers that only want one
head (e.g. ablations).
Why flow isn't fused
--------------------
Flow MSE comes from the action-expert (suffix) hidden states, which
attend to the prefix. If we just concat FAST onto the prefix and let
the action expert attend to it, the expert can trivially decode FAST
back to continuous actions — overfitting via shortcut. Preventing
that requires a custom segment-aware attention mask (action expert
can attend to images+language but NOT to subtask/FAST), which is
what pi05_full does in ``compute_layer_complete_knowledge_insulation``.
That's the full-fusion path; deferred as a follow-up since the
text+FAST fusion already recovers most of the compute.
End-to-end forward pass count
-----------------------------
Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards
After: 1 (flow) + 1 (text+FAST fused) = 2 backbone forwards
~33% fewer backbone forwards per training step when all three heads
are active (a comparable wall-time reduction if the passes cost
about the same).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>