refactor(recipes): two-flavor design — one fused action_execution + text-only events

Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions) and the hierarchical inference pattern from Pi 0.5 §IV.D. Flavor 1 — action_execution (60% weight, "main path") ----------------------------------------------------- One always-on recipe that fuses **all** available context (task, plan, memory) into a single user prompt and uses the current subtask as the supervised assistant target. This single recipe supervises *both* objectives: * subtask prediction (text CE on the assistant span via lm_head) * action chunks (flow MSE on the action expert via stream: low_level, target: true; plus FAST CE on action tokens when enable_fast_action_loss=True) At inference, the *same* prompt structure drives both inference modes: * select_message(user_prompt_only) → LM head generates the next subtask. Matches action_execution's training distribution exactly (prompt is the user turn, target is the subtask). * predict_action_chunk(user_prompt + assistant_subtask) → action expert produces the chunk. Matches action_execution's full prompt+target. This replaces what used to be a separate high_level_subtask recipe plus a low_level_execution recipe; both were supervising the same subtask text, so collapsing them into one is correct and removes the redundant text-CE gradient. Flavor 2 — event-driven text-only recipes ----------------------------------------- Each of these supervises the LM head to predict a specific kind of text given a specific event-triggered context. ``stream: high_level`` on all targets so they never trigger predict_actions / flow loss. ``if_present`` guards ensure they only fire on frames where the event annotation is present. * memory_update (10%) new memory at subtask boundary * user_interjection_response (15%) new plan + say(...) on input * ask_vqa_top (7.5%) front-camera VQA * ask_vqa_wrist (7.5%) wrist-camera VQA Total weight = 1.0. Prompt format consistency ------------------------- User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}`` matches what ``inference/steps.py::_msgs_for_subtask`` and ``_control_context_messages`` already emit at inference time. No "Task: " prefix — the bare task string is used as the leading content with literal "Plan: " / "Memory: " labels for the subsequent components. What changed structurally ------------------------- - low_level_execution DROPPED (folded into action_execution) - high_level_subtask DROPPED (subtask supervision moved into action_execution) + action_execution NEW (the fused main recipe) memory_update kept, prompt cleaned up user_interjection_response kept, prompt cleaned up ask_vqa_top / ask_vqa_wrist kept Runtime compatibility --------------------- No runtime change needed — ``SmolVLA2Runtime`` and the inference helpers already build their high-level prompt as just the user turn (task + plan + memory) and append a ``current_subtask`` assistant turn for the low-level call. Both match the new ``action_execution`` prompt shape exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-18 16:57:12 +00:00 · 2026-05-13 12:35:51 +02:00
parent b873fe454c
commit 058b8f3958
2 changed files with 105 additions and 110 deletions
@@ -1,26 +1,49 @@
-# π0.5 v2 — Hi-Robot / MEM / ECoT blend, reproducing the paper's
-# hierarchical inference recipe on lerobot.
+# π0.5 v2 (pi052) — Hi-Robot / MEM / ECoT blend.
 #
-# Architecturally identical blend to ``smolvla2_hirobot.yaml`` — same
-# five sub-recipes (memory_update, user_interjection_response,
-# high_level_subtask, low_level_execution, ask_vqa_*) with the same
-# message layouts. The only difference is which backbone the renderer's
-# output is fed into:
+# Architecturally mirrors ``smolvla2_hirobot.yaml`` — same two
+# flavors, same sub-recipes — but the rendered messages are fed
+# to PaliGemma (PaliGemma is not chat-pretrained, so the
+# ``PI052TextTokenizerStep`` concatenates them as ``Role: content``
+# plain text rather than calling ``apply_chat_template``).
 #
-#   * SmolVLA2 calls SmolVLM's chat-template tokenizer
-#     (``apply_chat_template`` with chat-pretrained role markers).
-#   * π0.5 v2 concatenates the rendered messages as ``Role: content``
-#     plain text, since PaliGemma is not chat-pretrained. See
-#     ``PI052TextTokenizerStep`` in ``policies/pi052/text_processor_pi052.py``.
+# Two flavors
+# -----------
 #
-# Same supervision target convention as ``smolvla2_hirobot.yaml``: the
-# ``high_level_subtask`` recipe targets ``${subtask}`` (the *current*
-# active span at every frame) rather than ``${next_subtask}`` (which
-# is empty on stable phases and used to train the model to emit
-# newlines).
+#   Flavor 1 — ``action_execution`` (~60% weight)
+#     The main always-on recipe. Fuses all available context
+#     (task + plan + memory) into a unified user prompt, and
+#     uses the current subtask as the assistant target. This
+#     single recipe supervises *both*:
+#       * subtask prediction (text CE on the assistant span,
+#         lm_head), and
+#       * action chunks (flow MSE on the action expert via
+#         ``stream: low_level, target: true``, plus the FAST
+#         CE on the action tokens when enabled).
+#     Pi 0.7 §V.A — subtask in the prompt + flow on actions.
+#
+#   Flavor 2 — event-driven text-only recipes
+#     ``memory_update``, ``user_interjection_response``,
+#     ``ask_vqa_*``. Each handles a specific high-level event
+#     with a TEXT output. ``if_present`` guards keep them from
+#     firing on frames without the relevant annotation.

 blend:

+  # ----------------------------------------------------------
+  # FLAVOR 1: action_execution (main path)
+  # ----------------------------------------------------------
+  action_execution:
+    weight: 0.60
+    messages:
+      - role: user
+        stream: high_level
+        content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
+      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
+
+  # ----------------------------------------------------------
+  # FLAVOR 2: event-driven text-only paths
+  # ----------------------------------------------------------
+
  memory_update:
    weight: 0.10
    bindings:
@@ -34,7 +57,7 @@ blend:
      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}

  user_interjection_response:
-    weight: 0.16
+    weight: 0.15
    bindings:
      prior_plan: "nth_prev(style=plan, offset=1)"
      current_plan: "emitted_at(t, style=plan)"
@@ -46,29 +69,8 @@ blend:
      - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
      - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}

-  # Pi 0.5 / Pi 0.7 supervision: predict the *current* active subtask
-  # at every frame from task + plan + memory + visual prefix.
-  # ``if_present: subtask`` skips frames with no active span instead of
-  # supervising an empty target (the failure mode that produces newline
-  # collapse).
-  high_level_subtask:
-    weight: 0.15
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  # Same ``if_present: subtask`` guard as high_level_subtask above —
-  # see smolvla2_hirobot.yaml for the full rationale. Skips the
-  # action-loss supervision on frames without an active subtask span
-  # rather than emitting a degenerate empty target.
-  low_level_execution:
-    weight: 0.35
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-
  ask_vqa_top:
-    weight: 0.10
+    weight: 0.075
    bindings:
      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
@@ -82,7 +84,7 @@ blend:
      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}

  ask_vqa_wrist:
-    weight: 0.10
+    weight: 0.075
    bindings:
      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
@@ -1,21 +1,68 @@
 # SmolVLA2 canonical training recipe — Hi Robot / MEM / ECoT blend.
 #
-# Same blend shape as pi05_hirobot.yaml. SmolVLA2 differs from Pi0.5 in
-# how the renderer's output is consumed:
+# Inspired by Pi 0.7 §V (Diversifying the Prompt) and Pi 0.5's
+# hierarchical subtask training. The blend has **two flavors**:
 #
-#   - SmolVLA2 calls SmolVLM's tokenizer.apply_chat_template(messages,
-#     tools=DEFAULT_TOOLS) on the rendered messages, since SmolVLM is a
-#     chat-pretrained backbone.
-#   - The processor builds a `text_labels` tensor that masks every token
-#     except those belonging to messages whose index is in
-#     `target_message_indices`. Cross-entropy on those positions trains
-#     the LM head.
-#   - `predict_actions = bool(targets_by_stream.get("low_level"))` —
-#     same convention as Pi0.5. ``low_level_execution`` is the only
-#     branch that runs the action expert / flow head.
+#   Flavor 1 — ``action_execution`` (~60% weight)
+#     The main always-on recipe. Fuses all available context
+#     (task + plan + memory) into a unified user prompt, and
+#     uses the current subtask as the assistant target. This
+#     single recipe supervises *both*:
+#       * subtask prediction (text CE on the assistant span,
+#         lm_head), and
+#       * action chunks (flow MSE on the action expert via
+#         ``stream: low_level, target: true``, plus the FAST
+#         CE on the action tokens when enabled).
+#     At inference, the same prompt structure is used:
+#       * the high-level loop calls ``select_message`` with the
+#         user prompt only → generates the next subtask.
+#       * the low-level loop calls ``predict_action_chunk`` with
+#         the user prompt + the generated subtask as the
+#         assistant turn → generates the action chunk.
+#     Replaces what used to be three separate recipes
+#     (``high_level_subtask`` + ``low_level_execution`` + the
+#     implicit subtask-in-prompt context) in earlier drafts.
+#     Pi 0.7's §V.A "Subtask instructions" pattern.
+#
+#   Flavor 2 — event-driven text-only recipes
+#     Each handles a specific high-level event with a TEXT
+#     output (no action supervision). They fire when the
+#     binding for the event resolves to non-None:
+#       * ``memory_update``: at subtask boundary, predict new
+#         memory from task + prior memory + completed subtask.
+#       * ``user_interjection_response``: on user input, predict
+#         new plan + paired ``say()`` tool call.
+#       * ``ask_vqa_top`` / ``ask_vqa_wrist``: answer a
+#         camera-grounded visual question.
+#     All use ``stream: high_level`` (no flow loss) and rely on
+#     ``if_present`` guards so they only fire on frames where
+#     the relevant event annotation is present.
+#
+# How the chat tokenizer interprets the flavor split
+# ---------------------------------------------------
+#   * predict_actions = bool(targets_by_stream.get("low_level"))
+#     → True only for Flavor 1 (action_execution).
+#   * text_labels supervises whatever assistant turns are marked
+#     target=true. For action_execution, this is the subtask
+#     string. For Flavor 2, it's the corresponding text output.

 blend:

+  # ----------------------------------------------------------
+  # FLAVOR 1: action_execution (main path)
+  # ----------------------------------------------------------
+  action_execution:
+    weight: 0.60
+    messages:
+      - role: user
+        stream: high_level
+        content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
+      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
+
+  # ----------------------------------------------------------
+  # FLAVOR 2: event-driven text-only paths
+  # ----------------------------------------------------------
+
  memory_update:
    weight: 0.10
    bindings:
@@ -29,7 +76,7 @@ blend:
      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}

  user_interjection_response:
-    weight: 0.16
+    weight: 0.15
    bindings:
      prior_plan: "nth_prev(style=plan, offset=1)"
      current_plan: "emitted_at(t, style=plan)"
@@ -41,62 +88,8 @@ blend:
      - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
      - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}

-  # PR3 Hi-Robot v2: supervise the high-level head with the *current*
-  # active subtask, not the *next*. Pi 0.5 / Pi 0.7 both do this: at every
-  # frame the assistant target is "what is the robot doing right now"
-  # grounded in the current image + state + context, so the supervision
-  # target is always a non-empty span string.
-  #
-  # The original target was ``nth_next(style=subtask, offset=1)`` — at
-  # most frames within a single span this resolves to the next-span
-  # string (fine), but on the LAST span of an episode it resolves to
-  # empty/None. The recipe had no ``if_present`` guard on the target,
-  # so the renderer emitted an empty assistant turn and cross-entropy
-  # ended up supervising the chat-template's structural newlines.
-  # Across a dataset annotated this way, the LM head's argmax at
-  # position 0 collapses to ``\n`` whenever no transition is happening
-  # (which is most of the time). At inference: head silently emits
-  # newlines every chunk boundary while the action expert keeps working.
-  #
-  # With ``${subtask}`` (binds to ``active_at(t, style=subtask)``) the
-  # target is the current span's text — always non-empty, scene-
-  # grounded. The runtime detects subtask transitions by comparing the
-  # predicted subtask string to the last known one, the same way Pi 0.5
-  # does. No information loss.
-  high_level_subtask:
-    weight: 0.15
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
-
-  # PR3 fix: same ``if_present: subtask`` guard as high_level_subtask
-  # above. Without it, frames where ``active_at(t, style=subtask)``
-  # returns None render the assistant turn with empty content, which
-  # the chat tokenizer still includes in target_message_indices →
-  # text-CE supervises predicting ``\n`` (the chat template's
-  # structural newline) and the LM head collapses to that prior.
-  # The same bug we fixed for high_level_subtask, just on a
-  # different sub-recipe.
-  #
-  # Trade-off of adding the guard: frames without an active subtask
-  # span no longer contribute to the flow loss either (because
-  # ``predict_actions = bool(targets_by_stream.get("low_level"))``
-  # and the only low_level target message is now skipped). For a
-  # well-annotated dataset where subtask spans tile the whole
-  # episode this is a no-op. For datasets with gaps, those gap
-  # frames lose flow supervision — which is strictly better than
-  # the degenerate alternative.
-  low_level_execution:
-    weight: 0.35
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-
-  # Per-camera VQA sub-recipes (PR 1's view-dependent style routing).
-  # Adjust the camera keys (and add more sub-recipes) to match the
-  # cameras present on your dataset.
  ask_vqa_top:
-    weight: 0.10
+    weight: 0.075
    bindings:
      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
@@ -110,7 +103,7 @@ blend:
      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}

  ask_vqa_wrist:
-    weight: 0.10
+    weight: 0.075
    bindings:
      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"