Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a
short header + a one-line note on the boundary-frame memory tail.
Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about
a since-merged experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recipe changes:
* action_execution now bundles the memory update as a second
assistant target gated on a new ``new_memory`` binding (fires
only at subtask-boundary frames). No "Completed subtask: X"
filler — the model emits the new subtask AND the updated
memory back-to-back in one prefix.
* user_interjection_response sub-recipe removed (current
datasets don't have interjection / say() annotations).
* Standalone memory_update sub-recipe removed (folded above).
* Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist
0.075 each (sums to 1.0).
Runtime ``_msgs_for_memory`` updated to match the new
boundary-frame prompt layout.
Modeling:
* SmolVLA2Policy now fuses the flow + text losses into a SINGLE
backbone forward via ``_compute_fused_loss`` (one
vlm_with_expert pass with [prefix, suffix] embeds, then both
lm_head CE on lang slice + action_out_proj MSE on suffix).
Mirrors pi052's existing ``_compute_all_losses_fused`` —
saves one backbone pass per training step.
Examples:
* Removed the two training SLURM scaffolds; they were out of date
after the recipe refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.
Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------
One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:
* subtask prediction (text CE on the assistant span via lm_head)
* action chunks (flow MSE on the action expert via
stream: low_level, target: true; plus FAST CE on action tokens
when enable_fast_action_loss=True)
At inference, the *same* prompt structure drives both inference
modes:
* select_message(user_prompt_only) → LM head generates the next
subtask. Matches action_execution's training distribution
exactly (prompt is the user turn, target is the subtask).
* predict_action_chunk(user_prompt + assistant_subtask) → action
expert produces the chunk. Matches action_execution's full
prompt+target.
This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.
Flavor 2 — event-driven text-only recipes
-----------------------------------------
Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.
* memory_update (10%): new memory at subtask boundary
* user_interjection_response (15%): new plan + say(...) on user input
* ask_vqa_top (7.5%): front-camera VQA
* ask_vqa_wrist (7.5%): wrist-camera VQA
Total weight = 1.0.
Prompt format consistency
-------------------------
User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.
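The template can be exercised with a plain f-string stand-in (``render_user_prompt`` is a hypothetical helper, not a function in the codebase; it only mirrors the layout described above):

```python
def render_user_prompt(task: str, plan: str, memory: str) -> str:
    # Bare task string first (no "Task: " prefix), then literal labels.
    return f"{task}\nPlan: {plan}\nMemory: {memory}"

prompt = render_user_prompt("put the cube in the bin", "1. reach 2. grasp", "none")
assert prompt.startswith("put the cube")              # no "Task: " prefix
assert "\nPlan: 1. reach" in prompt and "\nMemory: none" in prompt
```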
What changed structurally
-------------------------
- low_level_execution DROPPED (folded into action_execution)
- high_level_subtask DROPPED (subtask supervision moved into action_execution)
+ action_execution NEW (the fused main recipe)
memory_update kept, prompt cleaned up
user_interjection_response kept, prompt cleaned up
ask_vqa_top / ask_vqa_wrist kept
Runtime compatibility
---------------------
No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did 2 backbone passes when all heads were
active: one for flow (via super().forward) and one for the fused
text+FAST helper. This commit reduces it to **one pass** — same
compute as flow-only training.
New ``_compute_all_losses_fused`` builds:
prefix = [images, language, FAST (when provided)]
suffix = [noisy_actions] (action expert via gemma_expert)
and runs a single ``paligemma_with_expert.forward`` with
``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active
in the same call). Captures *both* prefix_out and suffix_out, slices
each for its respective loss:
flow MSE ← suffix_out (existing action_out_proj + MSE path)
text CE ← prefix_out at language positions (lm_head + CE)
FAST CE ← prefix_out at FAST positions (lm_head + CE)
Critical attention mask override
--------------------------------
``make_att_2d_masks`` produces a cumulative-block attention mask in
which suffix tokens (highest cumsum) attend to *every* lower-cumsum
position by default, including FAST tokens. If we let that stand the
action expert reads the discrete FAST tokens and trivially decodes
them back to the same continuous actions the flow head is supposed
to predict from noise — the entire training signal collapses to a
copy operation.
The fix is a single line right after make_att_2d_masks:
att_2d_masks[:, fast_end:, fast_start:fast_end] = False
Explicitly zeros out *suffix → FAST* attention. Everything else
remains correct under the cumsum semantics:
* prefix images/language stay bidirectional among themselves
* FAST stays causal within itself, attending bidirectionally
to images+language
* FAST cannot see suffix (cumsum < suffix cumsum, default)
* suffix attends bidirectionally among itself, to images+language,
and now NOT to FAST (this override)
Bit-equivalent to the previous separated forward path for text+FAST
losses (the prefix hidden states at language and FAST positions are
unchanged whether suffix is present or not — the prefix doesn't
attend to suffix). For flow loss, suffix→FAST being masked is the
correct behaviour we *want* — if anything the previous separated
path was less correct for production use because the joint
gradient signal through the action expert was missing the prefix
extension.
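The cumsum semantics and the one-line override can be checked on a toy sequence (single-sequence sketch; the real ``make_att_2d_masks`` is batched, and ``block_attn_mask`` here is an illustrative stand-in for it):

```python
import torch

def block_attn_mask(mask_ar: torch.Tensor) -> torch.Tensor:
    # Cumulative-block semantics: token i may attend to token j
    # iff cumsum(mask_ar)[j] <= cumsum(mask_ar)[i].
    c = torch.cumsum(mask_ar, dim=0)
    return c[None, :] <= c[:, None]

# Toy layout: 2 prefix tokens (one bidirectional block), 2 FAST tokens
# (causal: each starts a new block), 2 suffix tokens (one block).
mask_ar = torch.tensor([0, 0, 1, 1, 1, 0])
att = block_attn_mask(mask_ar)

fast_start, fast_end = 2, 4
att[fast_end:, fast_start:fast_end] = False   # the override: suffix cannot see FAST

assert bool(att[5, 0]) and bool(att[5, 1])          # suffix still sees the prefix
assert not bool(att[4, 2]) and not bool(att[5, 3])  # suffix -> FAST blocked
assert bool(att[3, 2]) and not bool(att[2, 3])      # FAST stays causal internally
assert bool(att[4, 5]) and bool(att[5, 4])          # suffix bidirectional among itself
```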
Forward routing in ``forward()``
--------------------------------
* run_flow=True → _compute_all_losses_fused (one forward, all
three losses)
* run_flow=False, run_text or run_fast → _compute_text_and_fast_loss
(one prefix-only forward, two CE losses, no
suffix → cheaper than fusion)
* neither → RuntimeError (explicit; both losses disabled)
Wall-time per step
------------------
Before this commit: flow + (text+FAST fused) = 2 forwards
After this commit: (flow+text+FAST fused) = 1 forward
Compute parity with flow-only training when all three heads active.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same bug we fixed for high_level_subtask, just on the other
subtask-supervised sub-recipe. ``low_level_execution`` targets
``${subtask}`` (the current active span) but had no
``if_present`` guard. When ``active_at(t, style=subtask)`` returned
None at a frame (gaps in the annotation, or the very first/last
frames of an episode if the annotator's spans don't fully tile),
the assistant message rendered with empty content. The chat
tokenizer still included it in ``target_message_indices`` → text CE
supervised whatever the chat-template's empty assistant turn
decoded to (usually a single ``\n``). That trains the LM head's
prior at the first generation position toward ``\n``, the same
collapse we observed with the original ``${next_subtask}`` target.
Fix: ``if_present: subtask`` on the assistant target in
``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and
``pi052_hirobot.yaml``.
Side effect: frames without an active subtask span no longer
contribute to the flow loss either (the only ``low_level`` target
is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))``
becomes False). For a well-annotated dataset where subtask spans
tile the whole episode this is a no-op. For datasets with gaps,
those gap frames lose flow supervision — strictly better than the
degenerate text-CE alternative.
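A minimal illustration of that gating side effect (``frame_predicts_actions`` is an illustrative wrapper around the expression quoted above, not a function in the codebase):

```python
def frame_predicts_actions(targets_by_stream: dict) -> bool:
    # With the only low_level target skipped by if_present, the frame
    # no longer routes to the flow loss.
    return bool(targets_by_stream.get("low_level"))

assert frame_predicts_actions({"low_level": [{"target": "subtask"}]})
assert not frame_predicts_actions({"high_level": [{"target": "memory"}]})
assert not frame_predicts_actions({"low_level": []})   # target skipped -> no flow
```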
Sub-recipe audit summary (no other changes needed):
* memory_update — all if_present guards present, OK
* user_interjection_response — all if_present guards present, OK
* high_level_subtask — fixed earlier, OK
* low_level_execution — fixed by this commit
* ask_vqa_top / ask_vqa_wrist — query+answer both guarded, OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the forward did three backbone passes per training step
when all heads were active: one for flow (via super().forward), one
for text CE, and one for FAST CE. That's ~3× the compute of
flow-only training.
The text and FAST losses share their prefix forward exactly — both
are CE on the LM head, evaluated at different slices of the same
hidden states. Adding FAST tokens after language in the prefix is
bit-equivalent for the text loss because the mask_ar convention in
``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal
block: language tokens never see FAST, so their hidden states are
unchanged.
New ``_compute_text_and_fast_loss``:
* embeds [images, language] once
* optionally appends [FAST] (when run_fast is True)
* one backbone forward
* slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for
language hidden states (or ``vlm_out[:, -lang_len:]`` when no
FAST) → text CE
* slices ``vlm_out[:, -fast_len:]`` for FAST hidden states →
FAST CE
* returns both losses, either of which can be None when the
caller doesn't want that head.
forward() now calls this fused helper instead of running the two
separate ``_compute_text_loss`` / ``_compute_fast_action_loss``
methods. Those remain in the file for callers that only want one
head (e.g. ablations).
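The slicing layout can be sanity-checked on toy shapes (lengths here are arbitrary; only the slice expressions come from the helper described above):

```python
import torch

img_len, lang_len, fast_len = 3, 4, 2
hidden = 8
total = img_len + lang_len + fast_len
vlm_out = torch.arange(total * hidden, dtype=torch.float32).view(1, total, hidden)

# Prefix layout is [images | language | FAST]; the two CE heads slice
# the same hidden states at different positions:
lang_hidden = vlm_out[:, -(fast_len + lang_len):-fast_len]  # text CE slice
fast_hidden = vlm_out[:, -fast_len:]                        # FAST CE slice

assert lang_hidden.shape == (1, lang_len, hidden)
assert fast_hidden.shape == (1, fast_len, hidden)
# When no FAST block is appended, the language slice is just the tail:
no_fast = vlm_out[:, :img_len + lang_len]
assert torch.equal(no_fast[:, -lang_len:], lang_hidden)
```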
Why flow isn't fused
--------------------
Flow MSE comes from the action-expert (suffix) hidden states, which
attend to the prefix. If we just concat FAST onto the prefix and let
the action expert attend to it, the expert can trivially decode FAST
back to continuous actions — overfitting via shortcut. Preventing
that requires a custom segment-aware attention mask (action expert
can attend to images+language but NOT to subtask/FAST), which is
what pi05_full does in ``compute_layer_complete_knowledge_insulation``.
That's the full-fusion path; deferred as a follow-up since the
text+FAST fusion already recovers most of the compute.
End-to-end forward pass count
-----------------------------
Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards
After: 1 (flow) + 1 (text+FAST fused) = 2 backbone forwards
~33% wall-time reduction per training step when all three heads
are active.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FAST loss changes
-----------------
1. Gate by ``predict_actions`` (same routing as flow loss). The
ActionTokenizerProcessorStep tokenises actions for *every*
sample regardless of which sub-recipe rendered it; for text-only
recipes (high_level_subtask, memory_update, ...) the action
tokens are still in the batch but mustn't be supervised. Skip
the FAST forward+CE entirely when no sample in the batch has
``predict_actions=True``.
2. Switch from "multiply-by-mask" masking to ``ignore_index=-100``.
The old pattern computed per-token CE for all positions, then
zeroed out invalid ones. Two issues: (a) any out-of-vocab target
id at a padded position would have crashed cross_entropy before
the mask got a chance to zero it out, and (b) the pattern is
needlessly clever. Now ``shift_targets.masked_fill(~mask, -100)``
followed by ``ignore_index=-100`` cleanly drops invalid positions.
Matches the smolvla2 text-loss convention.
3. Clean up unused ``bsize`` variable in _compute_fast_action_loss
and expand the attention-mask docstring with the
``make_att_2d_masks`` mask_ar convention spec (causal vs
bidirectional blocks).
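The ``ignore_index`` switch in point 2, sketched on toy tensors (next-token shift omitted for brevity):

```python
import torch
import torch.nn.functional as F

vocab = 8
logits = torch.randn(1, 5, vocab)
# Padded positions hold out-of-vocab ids: the old multiply-by-mask pattern
# would evaluate CE on them first and crash; masked_fill(-100) + ignore_index
# drops them before the loss is computed.
targets = torch.tensor([[3, 1, 4, 999, 999]])
mask = torch.tensor([[True, True, True, False, False]])

safe = targets.masked_fill(~mask, -100)
loss = F.cross_entropy(logits.view(-1, vocab), safe.view(-1), ignore_index=-100)
assert torch.isfinite(loss)
```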
smolvla2 audit (reference review, no code change)
-------------------------------------------------
Compared smolvla2/modeling_smolvla2.py against pi052/modeling_pi052.py
to catch parallel bugs. Findings:
* No ``paligemma.language_model`` vs ``paligemma.model.language_model``
issue — smolvla2 uses SmolVLM (different class, different attribute
layout) so the bug doesn't apply.
* ``fill_kv_cache=True`` is correctly passed to smolvla's
``vlm_with_expert.forward`` — that class *does* accept the kwarg
(unlike pi05's PaliGemmaWithExpertModel.forward, which is why
pi052 must omit it).
* Text-loss alignment is correct: ``_compute_text_loss`` computes
``lang_start`` / ``lang_end`` from the known prefix layout
(``[image_blocks..., lang, state]``) and slices ``prefix_out``
to just the language positions before applying ``lm_head``. The
parallel bug I fixed in pi052 (lm_head over the full prefix,
shape-mismatched against text_labels) was *not* present in
smolvla2.
* Per-sample flow routing via ``predict_actions``: correctly masks
per-sample by calling the parent ``forward(..., reduction='none')``
and applying the predict_actions mask before the mean. pi052 only
has the batch-level any() gate — a parallel improvement for pi052
would require modifying PI05Pytorch.forward to support per-sample
reduction, deferred.
* ``reduction="none"`` returns ``total.expand(bsize)``: identical
scalar-broadcast limitation in both policies. Acknowledged but
low priority (only RA-BC weighting uses the per-sample path and
it's documented as a known approximation in smolvla2).
* Chat tokenizer correctly handles batched/unbatched messages,
pads with -100 for label positions, builds attention masks. No
bugs found.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defaults
--------
* enable_fast_action_loss: False -> True (match paper §III.B-C Eq.1)
* auto_fit_fast_tokenizer: True -> False (opt-in; needs base.fit())
Bug fixes
---------
1. Wrong attribute path on PaliGemma. The KI port copied
pi05_full's ``paligemma.language_model.layers[...]`` literally,
but the production pi05 wrapper exposes the text model at
``paligemma.model.language_model``. With KI enabled, every layer
would have raised AttributeError on first forward. Fixed all
references in _compute_layer_ki + _paligemma_forward_ki.
2. ``fill_kv_cache=True`` passed to PaliGemmaWithExpertModel.forward.
That kwarg is a SmolVLA-only concept; pi05's signature has no
such argument, so every forward call from pi052 (text loss, FAST
loss, select_message) would have crashed with TypeError. Dropped
from all four call sites — pi05's forward already handles the
cache via past_key_values, and re-forwarding the cumulative
sequence each step in select_message is fine for our short
subtask completions.
3. Text-loss shape mismatch. _compute_text_loss applied lm_head to
the *full* vlm_out (image tokens + language tokens), then tried
to cross-entropy that against text_labels which only covers the
language portion — the .view(-1) calls would produce two
tensors of different lengths and CE would fail. Now slices
vlm_out to the last text_labels.shape[1] positions before
running lm_head, matching the [images, language] order
embed_prefix produces.
4. Dead-code conditional in _paligemma_forward_ki's single-expert
fallback. The ``if hasattr(...) else self._pi052_orig_forward``
ternary always took the wrong branch because the attribute is
always set (we save it in PI052Policy.__init__). Simplified to
just call self._pi052_orig_forward directly.
After this commit, pi052 should be runnable end-to-end for the
first time with all three loss heads + KI active. Still worth a
100-step smoke test before kicking off a long run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Pertsch et al. 2025 (FAST paper, [64] in π0.5) and π0.5 §III.C,
the recommended practice is to *fit* the FAST action tokenizer on
the specific dataset's action distribution rather than using the
published universal codebook off the shelf. The universal tokenizer
works on any 6-DoF action sequence but produces suboptimal
compression, which slows CE convergence and wastes vocab capacity.
New utility ``lerobot.policies.pi052.fit_fast_tokenizer``:
* samples N action chunks from the LeRobotDataset (default 1024)
* loads ``physical-intelligence/fast`` as the base
* calls ``.fit(actions)`` (the AutoProcessor API the HF model card
documents) — produces a per-dataset codebook
* saves to ``{cache_dir}/{sha256(dataset, base, n_samples)[:16]}/``
* returns the local path, ready to feed
``ActionTokenizerProcessorStep(action_tokenizer_name=...)``.
Cache is keyed on (dataset, base tokenizer, sample count) so changing
any of them re-runs the fit. Re-running training on the same dataset
re-uses the cache (one fit per dataset per machine).
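A sketch of the cache-key layout, assuming a simple string concatenation before hashing (the exact key derivation inside ``fit_fast_tokenizer`` may differ):

```python
import hashlib
from pathlib import Path

def tokenizer_cache_dir(cache_root: str, dataset: str, base: str, n_samples: int) -> Path:
    # {cache_dir}/{sha256(dataset, base, n_samples)[:16]}/ -- any key
    # component change re-runs the fit; same inputs hit the cache.
    key = hashlib.sha256(f"{dataset}|{base}|{n_samples}".encode()).hexdigest()[:16]
    return Path(cache_root) / key

a = tokenizer_cache_dir("~/.cache/lerobot/fast", "user/ds", "physical-intelligence/fast", 1024)
b = tokenizer_cache_dir("~/.cache/lerobot/fast", "user/ds", "physical-intelligence/fast", 2048)
assert a != b             # changing sample count re-keys the cache
assert len(a.name) == 16  # short hash dir name
```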
Auto-fit wiring:
* PI052Config gets ``auto_fit_fast_tokenizer`` (default True),
``fast_tokenizer_cache_dir`` (default ~/.cache/lerobot/...),
``fast_tokenizer_fit_samples`` (default 1024).
* make_pi052_pre_post_processors now takes ``dataset_repo_id``;
when ``enable_fast_action_loss`` and ``auto_fit_fast_tokenizer``
are both True and a repo_id is provided, the factory calls
``fit_fast_tokenizer`` before constructing the processor step
and points it at the fitted path.
* ProcessorConfigKwargs gains ``dataset_repo_id``; the global
factory dispatch threads it through for ``pi052`` policies.
* lerobot_train.py populates ``processor_kwargs['dataset_repo_id']``
from ``--dataset.repo_id`` for pi052 runs.
Failure mode: if ``.fit()`` fails (e.g. older transformers without
the method, or no usable action chunks in the dataset), the factory
logs a warning and falls back to the universal base tokenizer.
Training still works; you just lose the compression improvement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions ported from ``pi05_full`` on branch ``feat/add-pi05``,
giving pi052 full paper-§III.B-C training capabilities alongside the
recipe-driven text supervision it already had:
* **Config flags** in PI052Config:
- ``enable_fast_action_loss`` default False
- ``action_tokenizer_name`` default "physical-intelligence/fast"
- ``max_action_tokens`` default 256
- ``fast_skip_tokens`` default 128
- ``fast_action_loss_weight`` default 1.0
- ``knowledge_insulation`` default False
* **Processor wiring** (processor_pi052.py): when
``enable_fast_action_loss=True``, append an
``ActionTokenizerProcessorStep`` after the text tokenizer. It
tokenises the action tensor with the FAST tokenizer and writes
ACTION_TOKENS / ACTION_TOKEN_MASK into ``COMPLEMENTARY_DATA`` —
the existing batch-collation pipeline forwards them as
``batch['action.tokens']`` / ``batch['action.token_mask']``.
* **FAST CE loss** (modeling_pi052.py::_compute_fast_action_loss):
Re-embeds the prefix [images, language], appends the FAST token
embeddings (using PaliGemma's shared embed_language_tokens),
forwards through the backbone, slices the trailing
``fast_len`` positions, applies the LM head, computes shifted
next-token CE with the action-mask gating the loss. The loss is
summed into ``forward()``'s total with ``fast_action_loss_weight``.
* **Knowledge insulation** (modeling_pi052.py::_compute_layer_ki +
_paligemma_forward_ki): port of pi05_full's per-layer attention
that detaches VLM K/V on the action-query path so action loss
gradients cannot flow back into the VLM's K/V projections. Bound
per-instance via ``types.MethodType`` so it doesn't leak into
stock ``pi05`` policies that share PaliGemmaWithExpertModel.
Activated automatically when ``config.knowledge_insulation=True``.
Combined with the existing recipe-driven text head, pi052 now
supports the full three-loss objective:
L = text_w·H(text) + fast_w·H(FAST actions) + flow_w·MSE(flow)
matching Eq. (1) of arxiv:2504.16054 §IV.D (α=10 by default for the
flow term, 1.0 each for text and FAST CE).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the working SmolVLA2 launch pattern so the two SLURM scripts
are interchangeable:
* literal NUM_PROCESSES / BATCH_SIZE / STEPS (no env-var defaults)
* STEPS=10000 to match the next SmolVLA2 run
* save_freq=$STEPS so only the final checkpoint is saved
* dropouts 0.1/0.1/0.1 (mild — matches the operator's iteration)
* flow_loss_weight / text_loss_weight come from the PI052Config
defaults (10.0 / 1.0 per Pi 0.5 paper §IV.D), no need to pass
them explicitly
Job name and policy_repo_id mirror the SmolVLA2 ``_tool-g2`` naming
so the two runs can be compared side-by-side in WandB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text
CE and flow MSE: actions are the primary output and the flow head
should dominate the gradient signal. SmolVLA2 was defaulting both
weights to 1.0, which inverts that — text CE (~0.5-2.0 nats) ends up
larger than flow MSE (~0.1-1.0), so the action expert gets less
gradient than the LM head despite being the primary task.
Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0.
Same as ``pi052`` (the new full reproduction policy).
Also pin the values explicitly in the SLURM launcher so the choice is
visible and overridable per-run rather than buried in the config
default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New ``lerobot.policies.pi052`` (parallel to ``smolvla2``) that adds
text-prediction + hierarchical-inference on top of the existing π0.5
implementation. Mirrors the paper's §IV.D dual-head training:
L = H(text) + α * ‖ω - a - f_θ_action(...)‖², α = 10
Components:
* ``configuration_pi052.py`` thin PI05Config subclass; adds
recipe_path, text/flow loss weights
(default α=10 per paper), prompt
dropout knobs, ``unfreeze_lm_head``.
* ``text_processor_pi052.py`` PI052TextTokenizerStep — concatenates
rendered messages as ``Role: ...``
plain text (PaliGemma has no chat
template), tokenises with the
PaliGemma tokenizer, builds a label
mask covering supervised target
spans. Includes Pi 0.7 §V.E
per-component prompt dropout.
* ``processor_pi052.py`` make_pi052_pre_post_processors —
Rename + Batch + Relative +
Normalize + RenderMessagesStep +
PI052TextTokenizerStep + Device.
Falls back to π0.5's plain pipeline
when recipe_path is unset.
* ``modeling_pi052.py`` PI052Policy(PI05Policy) — re-enables
PaliGemma ``lm_head``, computes
text_loss via CE on the supervised
span, sums with flow_loss in
forward(), and adds select_message
for AR text generation at inference
(same surface as
SmolVLA2Policy.select_message so
SmolVLA2Runtime drives it unchanged).
Plus the supporting plumbing:
* recipe ``configs/recipes/pi052_hirobot.yaml`` — same Hi-Robot blend
as smolvla2_hirobot.yaml, with the same ``${subtask}`` /
``if_present`` supervision fix (current span at every frame, not
``${next_subtask}``).
* SLURM ``examples/training/pi052_hirobot.slurm`` — full training
command matching the SmolVLA2 launcher.
* factory registration: ``--policy.type=pi052`` resolves to
PI052Policy with the new processor.
Same multi-rate runtime (``lerobot.policies.smolvla2.inference``)
drives this policy too — both expose ``predict_action_chunk`` for the
action expert and ``select_message`` for the LM head.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After _tool-good (2000 steps, 0.50/0.50/0.20 dropout) the LM head's
distribution at position 0 shifted from EOS to subtask-vocabulary
tokens but emitted bag-of-words ("cube arm and") rather than well-
formed sentences. That's the expected mid-fine-tuning phase: token-
level supervision has landed, sequence-level grammar hasn't.
Two changes for the next retrain:
* STEPS=15000 (from 2000) — chat-pretrained backbones need O(10k+)
steps to walk their pretraining priors down far enough to commit
to the fine-tuned distribution structurally, not just at the
token level. _tool-g2's bag-of-words output proves the model is
on the right path; it just needs more gradient signal.
* plan/memory dropout 0.50 -> 0.30 — 0.50 was probably too
aggressive for a small dataset. Half the training samples had
crucial context missing, which slows down learning the full
conditional structure. 0.30 still regularises against prompt
leakage but lets the model learn proper grammar first; the
higher dropout can be revisited once the head is solid.
Subtask dropout stays at 0.20 since subtask isn't in the high-level
prompt anyway (recipe fix removed the "Current subtask:" message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe fix (target=${subtask} instead of ${next_subtask}) shifted
the LM head's failure mode from "emit newlines" to "emit EOS at
position 0". On the new ``_tool-good`` checkpoint inference produces
exactly one token (``<end_of_utterance>``, id 49279) and decodes to
empty. That's the chat-pretrained backbone's short-turn EOS prior
not yet being overridden by 2000 steps of fine-tuning supervision.
Expose three knobs so the operator can probe whether the head has
real subtask-token probability mass *under* the EOS argmax without
recompiling or retraining:
--text_min_new_tokens=N suppress EOS for the first N tokens
--text_temperature=T sample at temperature T
--text_top_p=P nucleus filtering at top-p
These are explicitly off-policy (training was greedy / no min-tokens),
so they shouldn't ship in production runs — but they let us tell
whether the model has *learned* subtask prediction (just under EOS)
or hasn't yet. If forcing min_new_tokens=3 with temperature=0.5
produces a sensible subtask, the model is fine and just needs more
training steps to walk EOS down. If it produces gibberish, training
hasn't progressed.
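A standalone re-creation of the three knobs' semantics (this is not the runtime's actual decode loop; ``sample_next`` is a hypothetical single-step sampler):

```python
import torch

def sample_next(logits, step, eos_id, min_new_tokens=0, temperature=1.0, top_p=1.0):
    logits = logits.clone().float()
    if step < min_new_tokens:
        logits[eos_id] = float("-inf")          # suppress EOS early
    probs = torch.softmax(logits / temperature, dim=-1)
    if top_p < 1.0:
        sorted_p, idx = probs.sort(descending=True)
        # Nucleus: keep tokens until cumulative mass reaches top_p.
        keep = sorted_p.cumsum(-1) - sorted_p < top_p
        probs = torch.zeros_like(probs).scatter(0, idx[keep], sorted_p[keep])
        probs /= probs.sum()
    return int(torch.multinomial(probs, 1))

logits = torch.tensor([0.1, 5.0, 0.2, 0.3])     # token 1 = EOS, dominant prior
tok = sample_next(logits, step=0, eos_id=1, min_new_tokens=3, temperature=0.5, top_p=0.9)
assert tok != 1   # EOS argmax is overridden while under min_new_tokens
```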
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the recipe fix (target=${subtask} at every frame) the model
can still reach low text_loss by reading the answer off the plan in
the prompt: at training the prompt contains the 6-step plan, and the
current subtask is one of those steps, so the model just learns
"active step N matches subtask N" and never needs to look at the
image. Symptom at inference: subtask string is set but never updates
because the model isn't really conditioning on the visual progress.
Drop plan and memory with p=0.50 each — half of training frames the
prompt is just "${task}" (constant for this dataset) + visual prefix,
which is the only place the answer can come from. Forces the LM head
to actually use vision.
``subtask_dropout`` stays at 0.20 because subtask isn't in the
high-level prompt anymore (recipe fix removed the "Current subtask:
X" message); the knob still affects other sub-recipes that reference
it as context.
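The component dropout can be sketched as follows (``apply_prompt_dropout`` is a hypothetical helper name; the actual implementation lives inside the text-tokenizer step):

```python
import random

def apply_prompt_dropout(components: dict, p_plan: float = 0.5,
                         p_memory: float = 0.5, rng=random) -> dict:
    # With probability p each, drop plan / memory so the prompt collapses
    # to just the task string (plus the visual prefix).
    out = dict(components)
    if rng.random() < p_plan:
        out.pop("plan", None)
    if rng.random() < p_memory:
        out.pop("memory", None)
    return out

full = {"task": "sort the cubes", "plan": "reach; grasp", "memory": "none"}
assert apply_prompt_dropout(full, 1.0, 1.0) == {"task": "sort the cubes"}
assert apply_prompt_dropout(full, 0.0, 0.0) == full
```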
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Normalize tensor and sequence sample indices before prompt dropout so
distributed batched preprocessing does not try to cast full index
tensors to scalars.
Co-authored-by: Cursor <cursoragent@cursor.com>
Match the operator's current training command for the _tool6 retrain:
* default DATASET / POLICY_REPO_ID / JOB_NAME point at the tool6
iteration (super_poulain_full_tool3 → smolvla2_hirobot_super_poulain_tool6)
* STEPS default 2000 (short enough to iterate; bump to 10k for full)
* save_freq=$STEPS so the only checkpoint is the final one
* OUTPUT_DIR includes step count so successive runs don't clobber
* Drop the wider augmentation envelope I added earlier — back to
default ColorJitter ranges (brightness ±20% etc.) since the
high_level_subtask recipe fix (current-subtask supervision) is
expected to fix the LM-head collapse on its own; the augmentation
is just the standard regulariser, not a load-bearing widener.
* prompt-dropout fractions stay at the original 0.15 / 0.15 / 0.20.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The high_level_subtask recipe targeted ``nth_next(style=subtask, offset=1)``,
which on the last span of any episode resolves to None. The recipe had no
``if_present`` guard on the target, so the renderer emitted an empty
assistant turn and cross-entropy supervised the model on the chat
template's structural newlines (``\n``). Across the dataset this trained
the LM head's argmax at position 0 to collapse to ``\n`` whenever no
transition was imminent (i.e. most frames). Visible failure mode at
inference: the head emits 40+ newlines + ``<end_of_utterance>`` every
chunk boundary while the action expert keeps working — confirmed by
running the dry-run on dataset frame 0 with the dataset's own image
and seeing the same ``\n × 44`` collapse.
Switch to the Pi 0.5 / Pi 0.7 supervision pattern: at every frame, the
assistant target is the *current* active subtask span text (via
``${subtask}`` → ``active_at(t, style=subtask)``). Always non-empty,
always scene-grounded, ``if_present: subtask`` skips frames with no
active span instead of emitting a degenerate empty turn.
Runtime callsite update: ``_msgs_for_subtask`` no longer feeds a
"Current subtask: X" user message into the prompt (that would be
circular — we'd be telling the model the answer). Transition
detection moves into the runtime — when the predicted subtask differs
from ``state['current_subtask']``, the existing ``set_if_changed``
path fires ``subtask_change`` and downstream memory updates. Same
event surface, supervision target is now always meaningful.
Requires re-annotating the dataset and retraining for the fix to land
in the checkpoint, but the recipe + runtime change is what enables it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the dry-run REPL only ticked on user input (empty Enter
just redrew), so the bisection test "does the LM head produce text on
start_frame=0?" required typing something arbitrary to drive a tick.
Just run ``step_once`` at startup — the obs diagnostic *and* the
subtask gen both fire automatically, the diag row populates, and the
operator can read the result before pressing any key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tensor-level comparison between dry-run (dataset frame) and live-
robot inference proved the runtime is bug-free — same shape, dtype,
device, channel order, batch dim, and normalization on both paths.
The remaining variable: front-camera mean brightness was 0.26 live vs
0.39 on the dataset frame, ~33% darker. Training augmentation only
covered ±20% brightness, so the live scene sits just outside the
supervised envelope and the LM head collapses to its dominant prior.
Widen the augmentation knobs for the next retrain:
* brightness 0.8–1.2 → 0.5–1.6 (covers ~30% darker / 60% lighter)
* contrast 0.8–1.2 → 0.6–1.5
* saturation 0.5–1.5 → 0.3–1.7
* hue ±0.05 → ±0.10
* affine ±5°/±5% → ±15°/±15% (covers cube placement / camera drift)
* max_num_transforms 3 → 4
And bump prompt-component dropout (subtask 0.20 → 0.30) so the LM
can't lean on stale memorised plan/memory at inference.
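The widened envelope, collected as a plain config mapping (parameter spellings follow torchvision v2 conventions; whether the training config uses exactly these keys is an assumption):

```python
augmentation = {
    "ColorJitter": {"brightness": (0.5, 1.6), "contrast": (0.6, 1.5),
                    "saturation": (0.3, 1.7), "hue": (-0.10, 0.10)},
    "RandomAffine": {"degrees": (-15, 15), "translate": (0.15, 0.15)},
    "max_num_transforms": 4,
}
# The brightness floor of 0.5 comfortably covers the ~33%-darker live scene.
assert augmentation["ColorJitter"]["brightness"][0] <= 0.67
```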
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dry-run REPL only fires a tick when the user types, so the
``_log_obs_tensors_once`` diagnostic never reached stdout (the
provider was never called). Probe the provider once at startup —
the result is discarded; we only care about the obs log it triggers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helper that prints (once per provider lifetime) every
``observation.*`` tensor the policy is about to see, with its shape,
dtype, device, and per-channel min/max/mean/std. Wired into both the
dry-run dataset path and the live-robot path.
Now we can bisect train/inference mismatch *at the tensor level* —
if the same checkpoint produces coherent text on one path's tensors
and ``\n`` on the other's, and the printed tensor stats differ
materially, the bug is in the observation prep, not in the model or
the training distribution.
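The shape of the helper, reduced to a sketch (class name and stat set are illustrative; the real helper also prints shape, dtype, and device, and operates on torch tensors rather than nested lists):

```python
import math

class ObsLogger:
    """Print per-key stats for every observation.* entry, once per
    provider lifetime (stand-in for the runtime's obs diagnostic)."""

    def __init__(self):
        self._done = False

    def maybe_log(self, obs):
        if self._done:
            return []
        self._done = True
        lines = []
        for key, val in sorted(obs.items()):
            if not key.startswith("observation."):
                continue
            flat = list(self._flatten(val))
            mean = sum(flat) / len(flat)
            std = math.sqrt(sum((x - mean) ** 2 for x in flat) / len(flat))
            lines.append(f"{key}: n={len(flat)} min={min(flat):.3f} "
                         f"max={max(flat):.3f} mean={mean:.3f} std={std:.3f}")
        return lines

    @staticmethod
    def _flatten(val):
        if isinstance(val, (int, float)):
            yield float(val)
        else:
            for v in val:
                yield from ObsLogger._flatten(v)
```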
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the training-time torchvision-v2 ColorJitter / SharpnessJitter /
RandomAffine pipeline to dataset frames in dry-run, so we can isolate
whether the LM head's collapse to '\n' on live frames is:
* pure scene-content OOD (unaugmented dataset frames work, mildly
augmented ones still work — model has learned the augmentation
distribution, only fails when the scene content itself diverges)
* hyper-specific memorisation (dry-run with augmentation also
collapses to '\n' — head is nailed to the exact unperturbed
training samples and only the retrain helps)
Usage:
lerobot-smolvla2-runtime --no_robot --policy.path=... \
--dataset.repo_id=... --dataset.episode=0 \
--dataset.start_frame=1000 \
--dataset.augment_at_inference
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So the operator can compare live joint values to the dataset's
``observation.state`` mean/std and spot when the robot's home pose is
several σ off the supervised support region. State OOD is the
remaining viable hypothesis for why the live LM head collapses to
``\n`` even though images are pixel-shape-matched.
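The comparison the operator runs is a per-joint z-score against the dataset stats; a sketch (function names and the 3σ threshold are illustrative):

```python
def sigma_offsets(live, mean, std):
    """Per-joint z-scores of the live pose against the dataset's
    observation.state mean/std."""
    return [(l - m) / s for l, m, s in zip(live, mean, std)]

def out_of_support(live, mean, std, k=3.0):
    """Indices of joints sitting more than k sigma from the dataset mean."""
    return [i for i, z in enumerate(sigma_offsets(live, mean, std))
            if abs(z) > k]
```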
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Print one warning the first time the robot observation provider runs
through, showing live camera resolution and the dataset's training
resolution, plus whether we resized. Lets the operator confirm at a
glance that the visual prefix really is being fed at the same shape
the model saw at training — instead of guessing whether the resize
fired silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for the LM head's empty-completion symptom on the live robot
(while the same checkpoint produced sensible subtask/plan/memory in
``--no_robot`` dry-run on dataset frames): the camera observation was
flowing into the model at its native resolution. A Mac/USB webcam
hands us 1280×720 or 1920×1080; the dataset was recorded at the
feature schema's ``observation.images.*['shape']`` resolution
(typically 480×640). SmolVLA's internal ``resize_with_pad(512, 512)``
*does* fit both — but with very different pad geometry, so visual
tokens at each tile carry different content than at training. Action
expert tolerates this; the tightly-supervised LM head goes OOD and
the head's distribution at position 0 collapses to its dominant mode
(``\n`` ×N then ``<end_of_utterance>`` for this checkpoint).
The fix: in ``_build_robot_observation_provider``, pre-compute the
camera-key → (H, W) target from ``ds_features`` and ``cv2.resize``
each live frame to that shape before tensorising. The downstream
``resize_with_pad`` then sees the same input geometry as training and
the LM head returns to producing readable subtask text under plain
greedy decoding — the same as dry-run.
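The target-shape precompute can be sketched like this (assumes the HWC shape layout lerobot datasets typically record; check ``ds_meta.features[...]['names']`` on a real dataset before trusting the unpack order):

```python
def camera_targets(ds_features):
    """Map camera key -> (H, W) training resolution from the dataset
    feature schema (sketch of the precompute described above)."""
    targets = {}
    for key, feat in ds_features.items():
        if key.startswith("observation.images."):
            h, w, _c = feat["shape"]  # assumed HWC
            targets[key.removeprefix("observation.images.")] = (h, w)
    return targets

# Each live frame is then resized to its target before tensorising,
# e.g. cv2.resize(frame, (w, h)) -- note cv2 takes (W, H) order.
```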
Also drops the inference-time patches (``min_new_tokens``,
``temperature``, ``top_p`` overrides) on the four high-level callers.
They were band-aids around the visual-distribution shift, not a real
LM problem, and they drift inference off the training distribution.
Greedy argmax is what training matched. The ``select_message``
signature still accepts the knobs for callers that want them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempt only masked the tokenizer's eos_token_id during the
min_new_tokens prefix. The empty-completion symptom persisted because a
memorised SmolVLM head doesn't just want EOS — its top-1 at position 0
is *some* special token, and when EOS is masked the argmax shifts to a
sibling (``<|im_end|>``, ``<image>``, ``<fake_token_around_image>``,
``<row_X_col_Y>``, …). Those tokens survive generation but then get
stripped by ``decode(skip_special_tokens=True)``, so the runtime still
saw ``last_raw='(empty)'`` every chunk boundary.
Mask the full ``tokenizer.all_special_ids`` set instead. Forces the
head to commit to a normal vocabulary token before it can close or
quietly poison the turn.
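The masking logic in miniature (the real code operates on torch logit tensors inside ``select_message``; list-based ``greedy``/``mask_special_ids`` here are illustrative):

```python
def mask_special_ids(logits, special_ids, step, min_new_tokens):
    """While fewer than min_new_tokens have been decoded, drive every
    special-token logit to -inf so the head must commit to a normal
    vocabulary token first."""
    if step < min_new_tokens:
        logits = list(logits)
        for tid in special_ids:
            logits[tid] = float("-inf")
    return logits

def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])
```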
Also: when decode returns empty but tokens *were* generated, expose
the raw token ids and the special-tokens-included decoded string via
``policy._last_select_message_debug``. The runtime surfaces this in
the scrollback so the operator can see what the head is actually
emitting — distinguishing "head EOS-ing" from "head emitting image
placeholders" from "head emitting chat-template fragments".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot run confirmed the LM head is producing 0 tokens at every
chunk boundary (empty:N counter climbing, no exception in scrollback):
the model EOS-es at decode step 0. That's the memorisation collapse —
training reached text_loss=6e-6 by overfitting one trajectory whose
supervised subtask turn ended in EOS, and at inference the head's
argmax for token 0 is EOS regardless of the actual frame.
Two changes in select_message:
* ``min_new_tokens`` parameter masks the EOS logit to -inf until at
least N real tokens have been decoded. Without this the head's
"EOS first" prior produces an empty completion every single time.
* The runtime callers now pass ``min_new_tokens=5..10`` plus
``temperature=0.4..0.5`` + ``top_p=0.9``. Sampling at moderate
temperature with nucleus filtering also helps break the greedy
argmax collapse — when the model has memorised one continuation,
greedy keeps replaying it; nucleus sampling forces it to commit
to *some* coherent continuation that's well-supported by the
prefix even when greedy's top-1 is degenerate.
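For reference, the nucleus filtering the callers enable is the standard construction; a sketch (assumes ``probs`` is already a normalised distribution):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest prefix of tokens (sorted by probability) whose
    cumulative mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```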
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two improvements for diagnosing why ``last_raw`` stays empty:
1. The autonomous panel-redraw thread calls console.clear() every
0.5 s, wiping any log lines the runtime printed since the last
redraw. So warnings from generation (``[warn] subtask gen failed:
...``, ``[info] subtask gen rejected (gibberish): ...``) flashed
for milliseconds and disappeared, leaving the operator blind.
Capture log_lines from each tick into a bounded scrollback
(last 12 entries) and render them inside the panel itself, below
the diag row. They now stick across redraws until rotated out.
2. ``empty`` counter for subtask gen. Persistent empty completions
are their own failure mode — the LM head EOS-es immediately from
the chat-template generation prompt, distinct from "generated
something but filter rejected it". The diag row now reads:
subtask diag repeat:0 gibberish:0 empty:14 last_raw: '(empty)'
                                  ^^^^^^^^
plus a periodic log line every 10 empties so the cause is also
surfaced in the scrollback.
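The bounded scrollback is just a fixed-length deque rendered under the diag row; a sketch (class name is illustrative, maxlen matches the 12-entry window above):

```python
from collections import deque

class Scrollback:
    """Bounded log window that survives the 0.5 s console.clear()
    redraws: old entries rotate out instead of being wiped."""

    def __init__(self, maxlen=12):
        self.lines = deque(maxlen=maxlen)

    def extend(self, log_lines):
        self.lines.extend(log_lines)

    def render(self):
        return list(self.lines)
```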
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both HighLevelSubtaskFwd and LowLevelForward are gated on
'action queue is empty'. With LowLevelForward listed first, it refilled
the queue on the empty-queue tick before HighLevelSubtaskFwd got to
check — so the gate I added in the previous commit made the high-level
step a permanent no-op after the initial bootstrap. Visible symptom:
subtask string never advances past whatever bootstrap seeded, no
subtask_change events, memory stays unset, and the new overfit
diagnostics never appear on the panel because last_subtask_raw is
never written.
Move all high-level steps (subtask, memory, interjection, vqa) ahead
of LowLevelForward. On an empty-queue tick the subtask refreshes
first, the new string flows into the next chunk's prompt, then
LowLevelForward generates the chunk, then DispatchAction drains it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autonomous-mode panel now surfaces what the model is *actually*
producing at every chunk boundary, not just what got accepted:
* last_subtask_raw most recent generation (accepted or not)
* subtask_repeat_count times the same accepted string regenerated
* subtask_gibberish_count rejections by the gibberish filter
* memory_gibberish_count / plan_gibberish_count for the other heads
These let the operator see memorisation collapse without scrolling
back through logs:
subtask diag repeat:8 gibberish:0 last_raw: '<same string>'
             ^^^^^^^^ → model can't move past current phase
subtask diag repeat:0 gibberish:14 last_raw: 'Ass:::'
                      ^^^^^^^^^^^^ → LM collapsed to template salad
Also silences the per-action ``Relative goal position magnitude had
to be clamped`` warning. The clamp fires every dispatch tick when the
model emits stale joint targets, flooding the panel at ctrl_hz=30.
Replaced the bare ``logging.warning`` call in robots/utils.py with a
module logger so it can be selectively raised to ERROR. Operators
who need the per-tick clamp detail can use ``-v``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third stdin channel alongside 'task:' and bare interjections:
rephrase: <text>
Swaps state['task'] with the new string while preserving plan/memory/
subtask. Lets the operator probe how robust the model is to wording
variations of the same task — the trained augmentation provided
n_task_rephrasings≈30 task wordings per dataset task, and this is the
direct way to exercise that distribution at inference without
generating a fresh plan via user_interjection_response.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both stdin handlers (autonomous mode and rich REPL) gated 'task:' to
'only if no task is set yet' — once the initial task existed, typing
'task: <new task>' silently fell through to the interjection branch.
Make 'task:' always override the active task and clear stale
plan/memory/subtask so the next high-level pass regenerates context
from scratch for the new task.
For rephrasings within the same task, the interjection path
(user_interjection_response recipe) is still the right channel — it
refreshes the plan and emits a paired <say> in one trained call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime is single-threaded. `HighLevelSubtaskFwd` at HzTrigger(1.0)
fires every loop iteration on MPS because each `select_message` call
takes ~2 s, longer than its 1/hz period. The whole tick stretches to
~2.5 s, so `DispatchAction` (HzTrigger 30) only pops a single action per
loop iteration — the queue drains at ~0.4 actions/sec instead of 30 and
the robot barely moves between chunk refreshes.
Two changes, both purely about scheduling — no threading:
* Gate `HighLevelSubtaskFwd` to fire only when the action queue is
empty, matching `LowLevelForward`'s refresh condition. The slow LLM
call now happens during the "think" phase between chunks, not on
every dispatch tick. Restores a clean sense → think → act cycle.
* `DispatchAction` catches up via wall-clock: when the trigger fires
after a stall, pop `round(elapsed * hz)` entries and send only the
most recent. Open-loop chunks are timestamped at ctrl_hz; sending
stale joint targets one-by-one would just lag the robot further
behind. The dynamixel smooths to the latest goal anyway.
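The catch-up pop, reduced to a sketch (a plain list stands in for the runtime's action queue):

```python
def catch_up_pop(queue, elapsed, hz):
    """After a stall of `elapsed` seconds, discard the round(elapsed * hz)
    stale timestamped entries and return only the newest goal; the servo
    smooths to that single latest target."""
    n = min(len(queue), max(1, round(elapsed * hz)))
    latest = None
    for _ in range(n):
        latest = queue.pop(0)
    return latest
```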
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous refresh threshold (queue > chunk_size // 2) made each
new chunk *telescope* past the previous one: at queue=25, we kicked
off a new chunk forward from the current observation, but by the
time the new chunk's first action was actually dispatched, the
robot had executed the remaining 25 actions of the previous chunk
— so the new chunk was planned from an observation 25+ steps stale.
Canonical sense → think → act loop: execute the full chunk, then
re-observe and replan. Refresh only when the queue is empty. Every
step of every chunk still gets dispatched to the robot (no
behaviour change there), but each chunk is now planned from an
observation that's at most one chunk's worth of dispatch latency
old, not "previous chunk's worth of stale state on top of that".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary regularisers to attack the
``text_loss=6e-6 = memorised one dataset`` failure mode that's
making the model collapse on real-robot input:
1. **Per-component prompt dropout** (Pi0.7 §V.E / plan's
``feat/pi05-prompt-dropout`` follow-up).
``SmolVLA2ChatTokenizerStep`` gains
``plan_dropout_prob`` / ``memory_dropout_prob`` /
``subtask_dropout_prob`` knobs (default 0.0 — opt-in). At training,
non-target messages whose rendered content starts with
``Plan:`` / ``Memory:`` / ``Current subtask:`` etc. are dropped
with their respective probability before tokenisation, with a
deterministic per-sample RNG keyed off the dataset ``index``.
``target_message_indices`` is re-mapped so the supervision still
lands on the right turn. Forces the model to handle missing
plan/memory/subtask context — directly attacks the real-robot
collapse where a stale or empty plan field puts the prompt OOD.
Surfaced on ``SmolVLA2Config`` as three floats so they're
``--policy.<knob>=<value>``-controllable from the train CLI;
plumbed through ``make_smolvla2_pre_post_processors``.
2. **Image augmentation** is already wired in lerobot via
``--dataset.image_transforms.enable=true`` (torchvision v2
ColorJitter + SharpnessJitter + RandomAffine, default 3 of 6
sampled per frame). No code change needed — just a CLI flag.
``examples/training/smolvla2_hirobot.slurm`` shows the full
training command with both enabled. Drop-in replacement for the
ad-hoc SLURM script Pepijn was using locally; same args, plus the
three dropout probs and the image-transforms flag.
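The dropout mechanics can be sketched as follows (messages reduced to plain strings; the real step works on rendered message dicts, and the prefix-to-knob mapping here is illustrative):

```python
import random

PREFIXES = ("Plan:", "Memory:", "Current subtask:")

def drop_components(messages, target_idx, probs, sample_index):
    """Per-component prompt dropout with a deterministic per-sample RNG
    keyed off the dataset index. Returns the surviving messages and the
    re-mapped index of the supervision target."""
    rng = random.Random(sample_index)
    kept, new_target = [], None
    for i, msg in enumerate(messages):
        p = 0.0
        if i != target_idx:  # never drop the supervised turn
            for prefix in PREFIXES:
                if msg.startswith(prefix):
                    p = probs.get(prefix, 0.0)
        if rng.random() < p:
            continue
        if i == target_idx:
            new_target = len(kept)
        kept.append(msg)
    return kept, new_target
```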
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``LowLevelForward`` was calling ``select_action()`` once per
``chunk_hz`` tick. SmolVLA's ``select_action`` is a thin queue-pop:
it returns one action per call and only re-runs the expensive
flow-matching forward when its private internal queue empties.
Result: we got one action back per chunk_hz tick (1Hz default),
``DispatchAction`` at ctrl_hz=30 popped it instantly, then queue
sat empty for ~1s waiting for the next tick. Net throughput was
1 dispatched action/sec instead of the 30 we wanted.
Switch to ``predict_action_chunk`` and enqueue every step of the
returned ``(batch, n_action_steps, action_dim)`` chunk. Refresh
only when the queue is below half a chunk so we don't burn one
flow-matching forward per chunk_hz tick — saves ~5x inference cost
on this hot path. At ctrl_hz=30, chunk_size=50, the queue drains
in ~1.7s before the next refresh, giving smooth dispatch at the
control rate the robot was trained on.
Side effect: ``state['last_chunk_size']`` records how many actions
the most recent chunk produced — useful for the panel later if we
want to surface "chunks generated" alongside "dispatched".
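The refill scheduling in miniature (``predict_chunk`` is a stand-in for the policy's ``predict_action_chunk``; class name is illustrative):

```python
class ChunkQueue:
    """Refill-below-half-chunk scheduling: run the expensive chunk
    forward only when the queue drops under chunk_size // 2, then
    enqueue every step of the returned chunk."""

    def __init__(self, predict_chunk, chunk_size):
        self.predict_chunk = predict_chunk
        self.chunk_size = chunk_size
        self.queue = []
        self.last_chunk_size = 0

    def maybe_refill(self, obs):
        if len(self.queue) < self.chunk_size // 2:
            chunk = self.predict_chunk(obs)  # list of action steps
            self.queue.extend(chunk)
            self.last_chunk_size = len(chunk)

    def pop(self):
        return self.queue.pop(0) if self.queue else None
```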
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-robot run was unreadable for two reasons:
1. The panel surfaced ``queued actions: 0`` (always zero — dispatch
pops faster than chunk_hz generates) and gave no signal that
actions were actually reaching the robot. The only sign of life
was the safety-clamp warning lines scrolling past.
2. The text head consistently collapses to ``the`` / ``Ass``
fragments on real-camera input (memorisation wall). The old
gibberish filter caught ``":":":"`` JSON salad but let
single-token fragments through, and the ``[info] subtask gen
produced no text this tick`` line flooded the panel every second.
Changes:
* ``DispatchAction`` bumps ``state["actions_dispatched"]`` each
tick; panel renders it next to queue depth. Operator can see
the policy IS issuing actions even when text is broken.
* ``_looks_like_gibberish`` now also rejects:
- too few unique alphabetic tokens (``the``, ``the the``, ...)
- chat-template marker leakage (``Assistant:``, ``Ass\\n::``)
catching the actual failure mode on real-robot frames.
* Gibberish rejections log only the first occurrence + every 30th
after that, with a count, so the panel stays legible.
* Empty completions no longer log at all (was every tick).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dry-run REPL had a clean ANSI-clear-+-rich-panel layout via
``_redraw`` showing task / subtask / plan / memory / queued-actions /
pending-tool-calls; autonomous mode just had bare ``> `` plus log
lines scrolling past the user. Same data, two presentations.
Extract ``_make_state_panel_renderer(runtime, mode_label=...)`` and
use it from both ``_run_repl`` (called per user input) and
``_run_autonomous`` (called both on user input *and* on a 0.5s
background timer so subtask / plan / memory refreshes from the
runtime's own loop become visible without the user typing anything).
Title bar shows ``dry-run`` vs ``autonomous`` so it's obvious which
mode you're in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects:
* ``list[block]`` → keep text blocks, drop images
* ``None`` → ``[{type:text, text:""}]``
* ``str`` / other → ``[{type:text, text:str(content)}]``
Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there's no prior
memory, which then renders as no-content / role-marker-only and the
model hallucinates ``Assistant:`` fragments. Subtask gen got further
because its prompt always has at least the task string.
Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.
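The normalisation, sketched as a standalone function mirroring the three-case mapping above (the real helper lives in ``chat_processor_smolvla2.py``):

```python
def strip_lerobot_blocks(content):
    """Normalise every message-content variant into the
    [{type: text, text: ...}] list shape the chat template expects:
    lists keep only text blocks, None becomes an empty text block,
    anything else is stringified."""
    if isinstance(content, list):
        return [b for b in content
                if isinstance(b, dict) and b.get("type") == "text"]
    if content is None:
        return [{"type": "text", "text": ""}]
    return [{"type": "text", "text": str(content)}]
```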
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLM's chat template (and many other multimodal templates) declares
``message['content']`` as a list of typed blocks and iterates it
expecting dicts with a ``'type'`` field:
{% for line in message['content'] %}
{% if line['type'] == 'text' %}{{ line['text'] }}
{% elif line['type'] == 'image' %}{{ '<image>' }}
{% endif %}
{% endfor %}
When the caller passes ``content`` as a plain ``str`` (which we did
throughout ``_msgs_for_subtask`` / ``_msgs_for_memory`` etc.), Jinja
silently iterates the string character-by-character. ``'P'['type']``
returns nothing; neither branch fires; *no text tokens get emitted*.
The model receives a prompt containing only role markers
(``User:<end_of_utterance>\nAssistant:``) and predictably continues by
emitting ``Assistant:`` fragments — the gibberish ``subtask: Ass\n::``
on the runtime panel.
Before calling ``apply_chat_template``, walk the messages and rewrite
any string ``content`` into ``[{'type': 'text', 'text': content}]``.
The template's text branch then fires correctly and the model sees
the actual user/assistant text, not just structural tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``PolicyProcessorPipeline.__call__`` already wraps its input via
``to_transition`` (defaulting to ``batch_to_transition``) before
running the steps, and unwraps via ``to_output`` (defaulting to
``transition_to_batch``) afterwards. The input format is therefore a
*flat batch dict* keyed by ``observation.*`` / ``action`` / etc., not
an ``EnvTransition``.
Previous attempt pre-wrapped the observation into a transition with
``TransitionKey.OBSERVATION`` as the key, then handed *that* to the
pipeline — which fed it to ``batch_to_transition``, which looked for
top-level ``observation.*`` entries, found none (they were nested
inside the enum key), and produced an empty observation. Every step
then bailed with ``ObservationProcessorStep requires an observation
in the transition.``
Pass the flat dict from ``build_inference_frame`` straight to the
preprocessor — it does the wrap/unwrap itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``EnvTransition`` is declared as a ``TypedDict`` keyed by
``TransitionKey.OBSERVATION.value`` (the string ``'observation'``),
but every concrete ``ProcessorStep`` in the pipeline indexes the
transition with the enum *member* (``transition[TransitionKey.
OBSERVATION]`` / ``transition.get(TransitionKey.OBSERVATION)``).
Those are two different keys in a Python dict — string key vs enum
key — so steps couldn't find the observation we'd placed under the
string variant, and bailed every tick with
``ObservationProcessorStep requires an observation in the
transition``.
Build the transition with the enum members directly. Matches how
``BatchProcessor``, ``RelativeActionProcessor``, ``HilProcessor``,
etc. read the dict.
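The mismatch in miniature (``TransitionKey`` reduced to the one member that matters here):

```python
from enum import Enum

class TransitionKey(Enum):
    OBSERVATION = "observation"

# A plain Enum member and its .value are distinct dict keys: the member
# hashes as itself and does not compare equal to the string "observation".
by_string = {TransitionKey.OBSERVATION.value: {"observation.state": [0.0]}}
by_member = {TransitionKey.OBSERVATION: {"observation.state": [0.0]}}
```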
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``robot.get_observation()`` on omx_follower (and most lerobot robots)
returns:
* per-joint scalar floats with ``.pos`` suffix
(``shoulder_pan.pos: 0.123``, ``shoulder_lift.pos: 0.456``, ...)
* per-camera ndarrays keyed by the camera config name (``wrist:
ndarray(H,W,3)``)
But the trained policy expects:
* single ``observation.state: tensor[N_joints]`` vector
* image keys prefixed: ``observation.images.<cam_key>:
tensor[1, 3, H, W]``
``prepare_observation_for_inference`` only handles the tensor /
batch-dim / device step — it crashes on scalar floats with
``expected np.ndarray (got float)``. The right helper is
``build_inference_frame`` which uses the dataset's feature schema
(``ds_meta.features``) to:
1. extract the right raw keys per dataset feature,
2. fold ``shoulder_pan.pos`` / ``shoulder_lift.pos`` / ...
into a single ``observation.state`` ndarray,
3. prefix camera keys with ``observation.images.``,
4. delegate to ``prepare_observation_for_inference`` for the
tensor / batch / device step.
Pass ``ds_meta.features`` into the observation provider and switch
to ``build_inference_frame`` when available; fall back to the bare
``prepare_observation_for_inference`` only when no dataset is
provided (rare — autonomous mode already requires it).
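The folding step can be sketched like this (``motor_order`` and ``camera_keys`` would be derived from ``ds_meta.features``; names here are illustrative, and the tensor/batch/device step is left to the downstream helper):

```python
def fold_observation(raw, motor_order, camera_keys):
    """Collect '<motor>.pos' scalars into one observation.state vector
    and prefix camera arrays with observation.images. (sketch of the
    build_inference_frame behaviour described above)."""
    frame = {"observation.state": [raw[f"{m}.pos"] for m in motor_order]}
    for cam in camera_keys:
        frame[f"observation.images.{cam}"] = raw[cam]
    return frame
```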
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The policy preprocessor pipeline is transition-shaped — its steps
read ``TransitionKey.OBSERVATION`` off an ``EnvTransition`` dict, not
a flat ``RobotObservation`` dict. Passing the raw observation through
made every step bail with
``ObservationProcessorStep requires an observation in the transition``,
which the runtime swallowed at warning level. ``select_message`` then
got called with no ``observation.images.*`` features and crashed
with ``All image features are missing from the batch``.
Mirror ``lerobot-record``'s preamble:
1. ``prepare_observation_for_inference`` → numpy → torch, ``CHW``
image layout, ``[0,1]`` scaling, add batch dim, move to device.
2. Wrap into an ``EnvTransition`` (``{TransitionKey.OBSERVATION.value:
...}`` plus ``COMPLEMENTARY_DATA: {}`` and ``None``s for the rest)
so transition-aware steps see the keys they expect.
3. Run preprocessor.
4. Unwrap the transition's ``OBSERVATION`` slot to get the final
flat dict the policy's ``select_action`` / ``select_message``
consume.
Image features now reach the policy; the autonomous loop produces
real actions instead of swallowing warnings every tick.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but
``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner
value must be the actual ``CameraConfig`` subclass instance for the
chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts
blew up in ``RobotConfig.__post_init__`` with
``AttributeError: 'dict' object has no attribute 'width'`` when it
iterated cameras and tried to read attributes.
Look up the right subclass per-camera by its ``"type"`` field via
``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import
dance we already do for ``RobotConfig``: eagerly walk
``lerobot.cameras``'s submodules so the registry is populated
before lookup). Construct an instance with the rest of the dict's
fields. On an unknown camera type, raise a clean ``ValueError``
listing the available choices.
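The per-camera construction, sketched generically (``registry`` is a stand-in for ``CameraConfig``'s choice registry; the real lookup goes through ``CameraConfig.get_choice_class``):

```python
def build_camera_config(registry, cam_spec):
    """Look up the config class by the spec's "type" field and construct
    it from the remaining keys; unknown types get a clean ValueError
    listing the available choices."""
    spec = dict(cam_spec)
    cam_type = spec.pop("type", None)
    try:
        cls = registry[cam_type]
    except KeyError:
        raise ValueError(
            f"unknown camera type {cam_type!r}; available: {sorted(registry)}"
        ) from None
    return cls(**spec)
```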
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``RobotConfig._choice_registry`` is populated as a side-effect of
each robot's ``@RobotConfig.register_subclass`` decorator running,
and those decorators only fire when the corresponding
``lerobot.robots.<name>`` module is imported. The package's
``__init__.py`` doesn't import them — instead ``make_robot_from_config``
does it lazily in its big if/elif chain.
``_build_robot`` jumped the gun: called ``RobotConfig.get_choice_class
(robot_type)`` before any robot module had been imported, so the
registry was empty and every ``--robot.type=<X>`` produced
``KeyError: 'X'`` (e.g. ``KeyError: 'omx_follower'``).
Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules`` and
``importlib.import_module`` each one before the lookup. ~200ms on the
first invocation, negligible for an autonomous run. On a real
``KeyError`` (typo / unsupported robot), raise a clean ``ValueError``
listing the registry's available choices instead of a bare KeyError.
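The eager walk is the standard ``pkgutil`` pattern; a self-contained sketch (applied to ``lerobot.robots`` and ``lerobot.cameras`` in the runtime):

```python
import importlib
import pkgutil

def import_submodules(package_name):
    """Eagerly import every submodule of a package so registration
    decorators (e.g. @RobotConfig.register_subclass) run before any
    registry lookup."""
    package = importlib.import_module(package_name)
    for info in pkgutil.iter_modules(package.__path__):
        importlib.import_module(f"{package_name}.{info.name}")
```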
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled action-norm safety clip duplicated what every
``RobotConfig`` already exposes — ``max_relative_target`` — and at
the wrong layer (after postprocess but before send_action, instead
of inside the robot driver where every other lerobot entry point
puts it). The norm clip also rejected entire actions instead of
clipping per-motor relative motion, so a single rogue joint would
kill the whole tick.
Replace with ``--robot.max_relative_target``: a string parsed as
either a bare float (uniform per-motor cap) or a JSON object
mapping motor name → cap. Passed through to
``RobotConfig(max_relative_target=...)`` at robot construction;
the driver's ``send_action`` clips each commanded joint position
relative to the current measured one before issuing it on the bus —
same behaviour ``lerobot-record`` ships.
Also bump ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new
chunk per second is what the trained checkpoint can comfortably
keep up with on common hardware and gives smoother motion than
sub-second chunk regenerations (no RTC interpolation between
chunks yet).
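The string parsing, sketched (function name is illustrative; the parsed value is what gets passed to ``RobotConfig(max_relative_target=...)``):

```python
import json

def parse_max_relative_target(raw):
    """Parse the --robot.max_relative_target string: a bare float is a
    uniform per-motor cap; a JSON object maps motor name -> cap."""
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        caps = json.loads(raw)
        if not isinstance(caps, dict):
            raise ValueError(f"expected float or JSON object, got: {raw!r}")
        return {motor: float(cap) for motor, cap in caps.items()}
```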
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>