From 058b8f39585100bc84631bcd80b8cb788f3e67e5 Mon Sep 17 00:00:00 2001
From: Pepijn <pepijn@huggingface.co>
Date: Wed, 13 May 2026 12:35:51 +0200
Subject: [PATCH] =?UTF-8?q?refactor(recipes):=20two-flavor=20design=20?=
 =?UTF-8?q?=E2=80=94=20one=20fused=20action=5Fexecution=20+=20text-only=20?=
 =?UTF-8?q?events?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.

Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------

One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:

  * subtask prediction (text CE on the assistant span via lm_head)
  * action chunks (flow MSE on the action expert via
    stream: low_level, target: true; plus FAST CE on action tokens
    when enable_fast_action_loss=True)

At inference, the *same* prompt structure drives both inference
modes:

  * select_message(user_prompt_only) → LM head generates the next
    subtask. Matches action_execution's training distribution
    exactly (prompt is the user turn, target is the subtask).
  * predict_action_chunk(user_prompt + assistant_subtask) → action
    expert produces the chunk. Matches action_execution's full
    prompt+target.

This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.

Flavor 2 — event-driven text-only recipes
-----------------------------------------

Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.

  * memory_update           (10%)  new memory at subtask boundary
  * user_interjection_response (15%) new plan + say(...) on input
  * ask_vqa_top             (7.5%) front-camera VQA
  * ask_vqa_wrist           (7.5%) wrist-camera VQA

Total weight = 1.0.

Prompt format consistency
-------------------------

User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.

What changed structurally
-------------------------

  - low_level_execution            DROPPED  (folded into action_execution)
  - high_level_subtask             DROPPED  (subtask supervision moved into action_execution)
  + action_execution               NEW      (the fused main recipe)
    memory_update                  kept, prompt cleaned up
    user_interjection_response     kept, prompt cleaned up
    ask_vqa_top / ask_vqa_wrist    kept

Runtime compatibility
---------------------

No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../configs/recipes/pi052_hirobot.yaml        |  84 +++++------
 .../configs/recipes/smolvla2_hirobot.yaml     | 131 +++++++++---------
 2 files changed, 105 insertions(+), 110 deletions(-)

diff --git a/src/lerobot/configs/recipes/pi052_hirobot.yaml b/src/lerobot/configs/recipes/pi052_hirobot.yaml
index b5a410712..4968ee80a 100644
--- a/src/lerobot/configs/recipes/pi052_hirobot.yaml
+++ b/src/lerobot/configs/recipes/pi052_hirobot.yaml
@@ -1,26 +1,49 @@
-# π0.5 v2 — Hi-Robot / MEM / ECoT blend, reproducing the paper's
-# hierarchical inference recipe on lerobot.
+# π0.5 v2 (pi052) — Hi-Robot / MEM / ECoT blend.
 #
-# Architecturally identical blend to ``smolvla2_hirobot.yaml`` — same
-# five sub-recipes (memory_update, user_interjection_response,
-# high_level_subtask, low_level_execution, ask_vqa_*) with the same
-# message layouts. The only difference is which backbone the renderer's
-# output is fed into:
+# Architecturally mirrors ``smolvla2_hirobot.yaml`` — same two
+# flavors, same sub-recipes — but the rendered messages are fed
+# to PaliGemma (PaliGemma is not chat-pretrained, so the
+# ``PI052TextTokenizerStep`` concatenates them as ``Role: content``
+# plain text rather than calling ``apply_chat_template``).
 #
-#   * SmolVLA2 calls SmolVLM's chat-template tokenizer
-#     (``apply_chat_template`` with chat-pretrained role markers).
-#   * π0.5 v2 concatenates the rendered messages as ``Role: content``
-#     plain text, since PaliGemma is not chat-pretrained. See
-#     ``PI052TextTokenizerStep`` in ``policies/pi052/text_processor_pi052.py``.
+# Two flavors
+# -----------
 #
-# Same supervision target convention as ``smolvla2_hirobot.yaml``: the
-# ``high_level_subtask`` recipe targets ``${subtask}`` (the *current*
-# active span at every frame) rather than ``${next_subtask}`` (which
-# is empty on stable phases and used to train the model to emit
-# newlines).
+#   Flavor 1 — ``action_execution`` (~60% weight)
+#     The main always-on recipe. Fuses all available context
+#     (task + plan + memory) into a unified user prompt, and
+#     uses the current subtask as the assistant target. This
+#     single recipe supervises *both*:
+#       * subtask prediction (text CE on the assistant span,
+#         lm_head), and
+#       * action chunks (flow MSE on the action expert via
+#         ``stream: low_level, target: true``, plus the FAST
+#         CE on the action tokens when enabled).
+#     Pi 0.7 §V.A — subtask in the prompt + flow on actions.
+#
+#   Flavor 2 — event-driven text-only recipes
+#     ``memory_update``, ``user_interjection_response``,
+#     ``ask_vqa_*``. Each handles a specific high-level event
+#     with a TEXT output. ``if_present`` guards keep them from
+#     firing on frames without the relevant annotation.
 
 blend:
 
+  # ----------------------------------------------------------
+  # FLAVOR 1: action_execution (main path)
+  # ----------------------------------------------------------
+  action_execution:
+    weight: 0.60
+    messages:
+      - role: user
+        stream: high_level
+        content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
+      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
+
+  # ----------------------------------------------------------
+  # FLAVOR 2: event-driven text-only paths
+  # ----------------------------------------------------------
+
   memory_update:
     weight: 0.10
     bindings:
@@ -34,7 +57,7 @@ blend:
       - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
 
   user_interjection_response:
-    weight: 0.16
+    weight: 0.15
     bindings:
       prior_plan: "nth_prev(style=plan, offset=1)"
       current_plan: "emitted_at(t, style=plan)"
@@ -46,29 +69,8 @@ blend:
       - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
       - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}
 
-  # Pi 0.5 / Pi 0.7 supervision: predict the *current* active subtask
-  # at every frame from task + plan + memory + visual prefix.
-  # ``if_present: subtask`` skips frames with no active span instead of
-  # supervising an empty target (the failure mode that produces newline
-  # collapse).
-  high_level_subtask:
-    weight: 0.15
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  # Same ``if_present: subtask`` guard as high_level_subtask above —
-  # see smolvla2_hirobot.yaml for the full rationale. Skips the
-  # action-loss supervision on frames without an active subtask span
-  # rather than emitting a degenerate empty target.
-  low_level_execution:
-    weight: 0.35
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-
   ask_vqa_top:
-    weight: 0.10
+    weight: 0.075
     bindings:
       vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
       vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
@@ -82,7 +84,7 @@ blend:
       - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
 
   ask_vqa_wrist:
-    weight: 0.10
+    weight: 0.075
     bindings:
       vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
       vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
diff --git a/src/lerobot/configs/recipes/smolvla2_hirobot.yaml b/src/lerobot/configs/recipes/smolvla2_hirobot.yaml
index ad7eef465..97c786fee 100644
--- a/src/lerobot/configs/recipes/smolvla2_hirobot.yaml
+++ b/src/lerobot/configs/recipes/smolvla2_hirobot.yaml
@@ -1,21 +1,68 @@
 # SmolVLA2 canonical training recipe — Hi Robot / MEM / ECoT blend.
 #
-# Same blend shape as pi05_hirobot.yaml. SmolVLA2 differs from Pi0.5 in
-# how the renderer's output is consumed:
+# Inspired by Pi 0.7 §V (Diversifying the Prompt) and Pi 0.5's
+# hierarchical subtask training. The blend has **two flavors**:
 #
-#   - SmolVLA2 calls SmolVLM's tokenizer.apply_chat_template(messages,
-#     tools=DEFAULT_TOOLS) on the rendered messages, since SmolVLM is a
-#     chat-pretrained backbone.
-#   - The processor builds a `text_labels` tensor that masks every token
-#     except those belonging to messages whose index is in
-#     `target_message_indices`. Cross-entropy on those positions trains
-#     the LM head.
-#   - `predict_actions = bool(targets_by_stream.get("low_level"))` —
-#     same convention as Pi0.5. ``low_level_execution`` is the only
-#     branch that runs the action expert / flow head.
+#   Flavor 1 — ``action_execution`` (~60% weight)
+#     The main always-on recipe. Fuses all available context
+#     (task + plan + memory) into a unified user prompt, and
+#     uses the current subtask as the assistant target. This
+#     single recipe supervises *both*:
+#       * subtask prediction (text CE on the assistant span,
+#         lm_head), and
+#       * action chunks (flow MSE on the action expert via
+#         ``stream: low_level, target: true``, plus the FAST
+#         CE on the action tokens when enabled).
+#     At inference, the same prompt structure is used:
+#       * the high-level loop calls ``select_message`` with the
+#         user prompt only → generates the next subtask.
+#       * the low-level loop calls ``predict_action_chunk`` with
+#         the user prompt + the generated subtask as the
+#         assistant turn → generates the action chunk.
+#     Replaces what used to be three separate recipes
+#     (``high_level_subtask`` + ``low_level_execution`` + the
+#     implicit subtask-in-prompt context) in earlier drafts.
+#     Pi 0.7's §V.A "Subtask instructions" pattern.
+#
+#   Flavor 2 — event-driven text-only recipes
+#     Each handles a specific high-level event with a TEXT
+#     output (no action supervision). They fire when the
+#     binding for the event resolves to non-None:
+#       * ``memory_update``: at subtask boundary, predict new
+#         memory from task + prior memory + completed subtask.
+#       * ``user_interjection_response``: on user input, predict
+#         new plan + paired ``say()`` tool call.
+#       * ``ask_vqa_top`` / ``ask_vqa_wrist``: answer a
+#         camera-grounded visual question.
+#     All use ``stream: high_level`` (no flow loss) and rely on
+#     ``if_present`` guards so they only fire on frames where
+#     the relevant event annotation is present.
+#
+# How the chat tokenizer interprets the flavor split
+# ---------------------------------------------------
+#   * predict_actions = bool(targets_by_stream.get("low_level"))
+#     → True only for Flavor 1 (action_execution).
+#   * text_labels supervises whatever assistant turns are marked
+#     target=true. For action_execution, this is the subtask
+#     string. For Flavor 2, it's the corresponding text output.
 
 blend:
 
+  # ----------------------------------------------------------
+  # FLAVOR 1: action_execution (main path)
+  # ----------------------------------------------------------
+  action_execution:
+    weight: 0.60
+    messages:
+      - role: user
+        stream: high_level
+        content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
+      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
+
+  # ----------------------------------------------------------
+  # FLAVOR 2: event-driven text-only paths
+  # ----------------------------------------------------------
+
   memory_update:
     weight: 0.10
     bindings:
@@ -29,7 +76,7 @@ blend:
       - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
 
   user_interjection_response:
-    weight: 0.16
+    weight: 0.15
     bindings:
       prior_plan: "nth_prev(style=plan, offset=1)"
       current_plan: "emitted_at(t, style=plan)"
@@ -41,62 +88,8 @@ blend:
       - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
       - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}
 
-  # PR3 Hi-Robot v2: supervise the high-level head with the *current*
-  # active subtask, not the *next*. Pi 0.5 / Pi 0.7 both do this: at every
-  # frame the assistant target is "what is the robot doing right now"
-  # grounded in the current image + state + context, so the supervision
-  # target is always a non-empty span string.
-  #
-  # The original target was ``nth_next(style=subtask, offset=1)`` — at
-  # most frames within a single span this resolves to the next-span
-  # string (fine), but on the LAST span of an episode it resolves to
-  # empty/None. The recipe had no ``if_present`` guard on the target,
-  # so the renderer emitted an empty assistant turn and cross-entropy
-  # ended up supervising the chat-template's structural newlines.
-  # Across a dataset annotated this way, the LM head's argmax at
-  # position 0 collapses to ``\n`` whenever no transition is happening
-  # (which is most of the time). At inference: head silently emits
-  # newlines every chunk boundary while the action expert keeps working.
-  #
-  # With ``${subtask}`` (binds to ``active_at(t, style=subtask)``) the
-  # target is the current span's text — always non-empty, scene-
-  # grounded. The runtime detects subtask transitions by comparing the
-  # predicted subtask string to the last known one, the same way Pi 0.5
-  # does. No information loss.
-  high_level_subtask:
-    weight: 0.15
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  # PR3 fix: same ``if_present: subtask`` guard as high_level_subtask
-  # above. Without it, frames where ``active_at(t, style=subtask)``
-  # returns None render the assistant turn with empty content, which
-  # the chat tokenizer still includes in target_message_indices →
-  # text-CE supervises predicting ``\n`` (the chat template's
-  # structural newline) and the LM head collapses to that prior.
-  # The same bug we fixed for high_level_subtask, just on a
-  # different sub-recipe.
-  #
-  # Trade-off of adding the guard: frames without an active subtask
-  # span no longer contribute to the flow loss either (because
-  # ``predict_actions = bool(targets_by_stream.get("low_level"))``
-  # and the only low_level target message is now skipped). For a
-  # well-annotated dataset where subtask spans tile the whole
-  # episode this is a no-op. For datasets with gaps, those gap
-  # frames lose flow supervision — which is strictly better than
-  # the degenerate alternative.
-  low_level_execution:
-    weight: 0.35
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-
-  # Per-camera VQA sub-recipes (PR 1's view-dependent style routing).
-  # Adjust the camera keys (and add more sub-recipes) to match the
-  # cameras present on your dataset.
   ask_vqa_top:
-    weight: 0.10
+    weight: 0.075
     bindings:
       vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
       vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
@@ -110,7 +103,7 @@ blend:
       - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
 
   ask_vqa_wrist:
-    weight: 0.10
+    weight: 0.075
     bindings:
       vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
       vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"