refactor(recipes): two-flavor design — one fused action_execution + text-only events

Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.

Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------

One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:

  * subtask prediction (text CE on the assistant span via lm_head)
  * action chunks (flow MSE on the action expert via
    stream: low_level, target: true; plus FAST CE on action tokens
    when enable_fast_action_loss=True)

At inference, the *same* prompt structure drives both inference
modes:

  * select_message(user_prompt_only) → LM head generates the next
    subtask. Matches action_execution's training distribution
    exactly (prompt is the user turn, target is the subtask).
  * predict_action_chunk(user_prompt + assistant_subtask) → action
    expert produces the chunk. Matches action_execution's full
    prompt+target.

This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.

Flavor 2 — event-driven text-only recipes
-----------------------------------------

Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.

  * memory_update           (10%)  new memory at subtask boundary
  * user_interjection_response (15%) new plan + say(...) on input
  * ask_vqa_top             (7.5%) front-camera VQA
  * ask_vqa_wrist           (7.5%) wrist-camera VQA

Total weight = 1.0.

Prompt format consistency
-------------------------

User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.

What changed structurally
-------------------------

  - low_level_execution            DROPPED  (folded into action_execution)
  - high_level_subtask             DROPPED  (subtask supervision moved into action_execution)
  + action_execution               NEW      (the fused main recipe)
    memory_update                  kept, prompt cleaned up
    user_interjection_response     kept, prompt cleaned up
    ask_vqa_top / ask_vqa_wrist    kept

Runtime compatibility
---------------------

No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Pepijn
2026-05-13 12:35:51 +02:00
parent b873fe454c
commit 058b8f3958
2 changed files with 105 additions and 110 deletions
+43 -41
View File
@@ -1,26 +1,49 @@
# π0.5 v2 — Hi-Robot / MEM / ECoT blend, reproducing the paper's
# hierarchical inference recipe on lerobot.
# π0.5 v2 (pi052) — Hi-Robot / MEM / ECoT blend.
#
# Architecturally identical blend to ``smolvla2_hirobot.yaml`` — same
# five sub-recipes (memory_update, user_interjection_response,
# high_level_subtask, low_level_execution, ask_vqa_*) with the same
# message layouts. The only difference is which backbone the renderer's
# output is fed into:
# Architecturally mirrors ``smolvla2_hirobot.yaml`` — same two
# flavors, same sub-recipes — but the rendered messages are fed
# to PaliGemma (PaliGemma is not chat-pretrained, so the
# ``PI052TextTokenizerStep`` concatenates them as ``Role: content``
# plain text rather than calling ``apply_chat_template``).
#
# * SmolVLA2 calls SmolVLM's chat-template tokenizer
# (``apply_chat_template`` with chat-pretrained role markers).
# * π0.5 v2 concatenates the rendered messages as ``Role: content``
# plain text, since PaliGemma is not chat-pretrained. See
# ``PI052TextTokenizerStep`` in ``policies/pi052/text_processor_pi052.py``.
# Two flavors
# -----------
#
# Same supervision target convention as ``smolvla2_hirobot.yaml``: the
# ``high_level_subtask`` recipe targets ``${subtask}`` (the *current*
# active span at every frame) rather than ``${next_subtask}`` (which
# is empty on stable phases and used to train the model to emit
# newlines).
# Flavor 1 — ``action_execution`` (~60% weight)
# The main always-on recipe. Fuses all available context
# (task + plan + memory) into a unified user prompt, and
# uses the current subtask as the assistant target. This
# single recipe supervises *both*:
# * subtask prediction (text CE on the assistant span,
# lm_head), and
# * action chunks (flow MSE on the action expert via
# ``stream: low_level, target: true``, plus the FAST
# CE on the action tokens when enabled).
# Pi 0.7 §V.A — subtask in the prompt + flow on actions.
#
# Flavor 2 — event-driven text-only recipes
# ``memory_update``, ``user_interjection_response``,
# ``ask_vqa_*``. Each handles a specific high-level event
# with a TEXT output. ``if_present`` guards keep them from
# firing on frames without the relevant annotation.
blend:
# ----------------------------------------------------------
# FLAVOR 1: action_execution (main path)
# ----------------------------------------------------------
action_execution:
weight: 0.60
messages:
- role: user
stream: high_level
content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
- {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
# ----------------------------------------------------------
# FLAVOR 2: event-driven text-only paths
# ----------------------------------------------------------
memory_update:
weight: 0.10
bindings:
@@ -34,7 +57,7 @@ blend:
- {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
user_interjection_response:
weight: 0.16
weight: 0.15
bindings:
prior_plan: "nth_prev(style=plan, offset=1)"
current_plan: "emitted_at(t, style=plan)"
@@ -46,29 +69,8 @@ blend:
- {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
- {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}
# Pi 0.5 / Pi 0.7 supervision: predict the *current* active subtask
# at every frame from task + plan + memory + visual prefix.
# ``if_present: subtask`` skips frames with no active span instead of
# supervising an empty target (the failure mode that produces newline
# collapse).
high_level_subtask:
weight: 0.15
messages:
- {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
# Same ``if_present: subtask`` guard as high_level_subtask above —
# see smolvla2_hirobot.yaml for the full rationale. Skips the
# action-loss supervision on frames without an active subtask span
# rather than emitting a degenerate empty target.
low_level_execution:
weight: 0.35
messages:
- {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
ask_vqa_top:
weight: 0.10
weight: 0.075
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
@@ -82,7 +84,7 @@ blend:
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_wrist:
weight: 0.10
weight: 0.075
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
@@ -1,21 +1,68 @@
# SmolVLA2 canonical training recipe — Hi Robot / MEM / ECoT blend.
#
# Same blend shape as pi05_hirobot.yaml. SmolVLA2 differs from Pi0.5 in
# how the renderer's output is consumed:
# Inspired by Pi 0.7 §V (Diversifying the Prompt) and Pi 0.5's
# hierarchical subtask training. The blend has **two flavors**:
#
# - SmolVLA2 calls SmolVLM's tokenizer.apply_chat_template(messages,
# tools=DEFAULT_TOOLS) on the rendered messages, since SmolVLM is a
# chat-pretrained backbone.
# - The processor builds a `text_labels` tensor that masks every token
# except those belonging to messages whose index is in
# `target_message_indices`. Cross-entropy on those positions trains
# the LM head.
# - `predict_actions = bool(targets_by_stream.get("low_level"))` —
# same convention as Pi0.5. ``low_level_execution`` is the only
# branch that runs the action expert / flow head.
# Flavor 1 — ``action_execution`` (~60% weight)
# The main always-on recipe. Fuses all available context
# (task + plan + memory) into a unified user prompt, and
# uses the current subtask as the assistant target. This
# single recipe supervises *both*:
# * subtask prediction (text CE on the assistant span,
# lm_head), and
# * action chunks (flow MSE on the action expert via
# ``stream: low_level, target: true``, plus the FAST
# CE on the action tokens when enabled).
# At inference, the same prompt structure is used:
# * the high-level loop calls ``select_message`` with the
# user prompt only → generates the next subtask.
# * the low-level loop calls ``predict_action_chunk`` with
# the user prompt + the generated subtask as the
# assistant turn → generates the action chunk.
# Replaces what used to be three separate recipes
# (``high_level_subtask`` + ``low_level_execution`` + the
# implicit subtask-in-prompt context) in earlier drafts.
# Pi 0.7's §V.A "Subtask instructions" pattern.
#
# Flavor 2 — event-driven text-only recipes
# Each handles a specific high-level event with a TEXT
# output (no action supervision). They fire when the
# binding for the event resolves to non-None:
# * ``memory_update``: at subtask boundary, predict new
# memory from task + prior memory + completed subtask.
# * ``user_interjection_response``: on user input, predict
# new plan + paired ``say()`` tool call.
# * ``ask_vqa_top`` / ``ask_vqa_wrist``: answer a
# camera-grounded visual question.
# All use ``stream: high_level`` (no flow loss) and rely on
# ``if_present`` guards so they only fire on frames where
# the relevant event annotation is present.
#
# How the chat tokenizer interprets the flavor split
# ---------------------------------------------------
# * predict_actions = bool(targets_by_stream.get("low_level"))
# → True only for Flavor 1 (action_execution).
# * text_labels supervises whatever assistant turns are marked
# target=true. For action_execution, this is the subtask
# string. For Flavor 2, it's the corresponding text output.
blend:
# ----------------------------------------------------------
# FLAVOR 1: action_execution (main path)
# ----------------------------------------------------------
action_execution:
weight: 0.60
messages:
- role: user
stream: high_level
content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
- {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
# ----------------------------------------------------------
# FLAVOR 2: event-driven text-only paths
# ----------------------------------------------------------
memory_update:
weight: 0.10
bindings:
@@ -29,7 +76,7 @@ blend:
- {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
user_interjection_response:
weight: 0.16
weight: 0.15
bindings:
prior_plan: "nth_prev(style=plan, offset=1)"
current_plan: "emitted_at(t, style=plan)"
@@ -41,62 +88,8 @@ blend:
- {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
- {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}
# PR3 Hi-Robot v2: supervise the high-level head with the *current*
# active subtask, not the *next*. Pi 0.5 / Pi 0.7 both do this: at every
# frame the assistant target is "what is the robot doing right now"
# grounded in the current image + state + context, so the supervision
# target is always a non-empty span string.
#
# The original target was ``nth_next(style=subtask, offset=1)`` — at
# most frames within a single span this resolves to the next-span
# string (fine), but on the LAST span of an episode it resolves to
# empty/None. The recipe had no ``if_present`` guard on the target,
# so the renderer emitted an empty assistant turn and cross-entropy
# ended up supervising the chat-template's structural newlines.
# Across a dataset annotated this way, the LM head's argmax at
# position 0 collapses to ``\n`` whenever no transition is happening
# (which is most of the time). At inference: head silently emits
# newlines every chunk boundary while the action expert keeps working.
#
# With ``${subtask}`` (binds to ``active_at(t, style=subtask)``) the
# target is the current span's text — always non-empty, scene-
# grounded. The runtime detects subtask transitions by comparing the
# predicted subtask string to the last known one, the same way Pi 0.5
# does. No information loss.
high_level_subtask:
weight: 0.15
messages:
- {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
# PR3 fix: same ``if_present: subtask`` guard as high_level_subtask
# above. Without it, frames where ``active_at(t, style=subtask)``
# returns None render the assistant turn with empty content, which
# the chat tokenizer still includes in target_message_indices →
# text-CE supervises predicting ``\n`` (the chat template's
# structural newline) and the LM head collapses to that prior.
# The same bug we fixed for high_level_subtask, just on a
# different sub-recipe.
#
# Trade-off of adding the guard: frames without an active subtask
# span no longer contribute to the flow loss either (because
# ``predict_actions = bool(targets_by_stream.get("low_level"))``
# and the only low_level target message is now skipped). For a
# well-annotated dataset where subtask spans tile the whole
# episode this is a no-op. For datasets with gaps, those gap
# frames lose flow supervision — which is strictly better than
# the degenerate alternative.
low_level_execution:
weight: 0.35
messages:
- {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
# Per-camera VQA sub-recipes (PR 1's view-dependent style routing).
# Adjust the camera keys (and add more sub-recipes) to match the
# cameras present on your dataset.
ask_vqa_top:
weight: 0.10
weight: 0.075
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
@@ -110,7 +103,7 @@ blend:
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_wrist:
weight: 0.10
weight: 0.075
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"