From 058b8f39585100bc84631bcd80b8cb788f3e67e5 Mon Sep 17 00:00:00 2001 From: Pepijn Date: Wed, 13 May 2026 12:35:51 +0200 Subject: [PATCH] =?UTF-8?q?refactor(recipes):=20two-flavor=20design=20?= =?UTF-8?q?=E2=80=94=20one=20fused=20action=5Fexecution=20+=20text-only=20?= =?UTF-8?q?events?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions) and the hierarchical inference pattern from Pi 0.5 §IV.D. Flavor 1 — action_execution (60% weight, "main path") ----------------------------------------------------- One always-on recipe that fuses **all** available context (task, plan, memory) into a single user prompt and uses the current subtask as the supervised assistant target. This single recipe supervises *both* objectives: * subtask prediction (text CE on the assistant span via lm_head) * action chunks (flow MSE on the action expert via stream: low_level, target: true; plus FAST CE on action tokens when enable_fast_action_loss=True) At inference, the *same* prompt structure drives both inference modes: * select_message(user_prompt_only) → LM head generates the next subtask. Matches action_execution's training distribution exactly (prompt is the user turn, target is the subtask). * predict_action_chunk(user_prompt + assistant_subtask) → action expert produces the chunk. Matches action_execution's full prompt+target. This replaces what used to be a separate high_level_subtask recipe plus a low_level_execution recipe; both were supervising the same subtask text, so collapsing them into one is correct and removes the redundant text-CE gradient. Flavor 2 — event-driven text-only recipes ----------------------------------------- Each of these supervises the LM head to predict a specific kind of text given a specific event-triggered context. ``stream: high_level`` on all targets so they never trigger predict_actions / flow loss. ``if_present`` guards ensure they only fire on frames where the event annotation is present. * memory_update (10%) new memory at subtask boundary * user_interjection_response (15%) new plan + say(...) on input * ask_vqa_top (7.5%) front-camera VQA * ask_vqa_wrist (7.5%) wrist-camera VQA Total weight = 1.0. Prompt format consistency ------------------------- User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}`` matches what ``inference/steps.py::_msgs_for_subtask`` and ``_control_context_messages`` already emit at inference time. No "Task: " prefix — the bare task string is used as the leading content with literal "Plan: " / "Memory: " labels for the subsequent components. What changed structurally ------------------------- - low_level_execution DROPPED (folded into action_execution) - high_level_subtask DROPPED (subtask supervision moved into action_execution) + action_execution NEW (the fused main recipe) memory_update kept, prompt cleaned up user_interjection_response kept, prompt cleaned up ask_vqa_top / ask_vqa_wrist kept Runtime compatibility --------------------- No runtime change needed — ``SmolVLA2Runtime`` and the inference helpers already build their high-level prompt as just the user turn (task + plan + memory) and append a ``current_subtask`` assistant turn for the low-level call. Both match the new ``action_execution`` prompt shape exactly. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../configs/recipes/pi052_hirobot.yaml | 84 +++++------ .../configs/recipes/smolvla2_hirobot.yaml | 131 +++++++++--------- 2 files changed, 105 insertions(+), 110 deletions(-) diff --git a/src/lerobot/configs/recipes/pi052_hirobot.yaml b/src/lerobot/configs/recipes/pi052_hirobot.yaml index b5a410712..4968ee80a 100644 --- a/src/lerobot/configs/recipes/pi052_hirobot.yaml +++ b/src/lerobot/configs/recipes/pi052_hirobot.yaml @@ -1,26 +1,49 @@ -# π0.5 v2 — Hi-Robot / MEM / ECoT blend, reproducing the paper's -# hierarchical inference recipe on lerobot. +# π0.5 v2 (pi052) — Hi-Robot / MEM / ECoT blend. # -# Architecturally identical blend to ``smolvla2_hirobot.yaml`` — same -# five sub-recipes (memory_update, user_interjection_response, -# high_level_subtask, low_level_execution, ask_vqa_*) with the same -# message layouts. The only difference is which backbone the renderer's -# output is fed into: +# Architecturally mirrors ``smolvla2_hirobot.yaml`` — same two +# flavors, same sub-recipes — but the rendered messages are fed +# to PaliGemma (PaliGemma is not chat-pretrained, so the +# ``PI052TextTokenizerStep`` concatenates them as ``Role: content`` +# plain text rather than calling ``apply_chat_template``). # -# * SmolVLA2 calls SmolVLM's chat-template tokenizer -# (``apply_chat_template`` with chat-pretrained role markers). -# * π0.5 v2 concatenates the rendered messages as ``Role: content`` -# plain text, since PaliGemma is not chat-pretrained. See -# ``PI052TextTokenizerStep`` in ``policies/pi052/text_processor_pi052.py``. +# Two flavors +# ----------- # -# Same supervision target convention as ``smolvla2_hirobot.yaml``: the -# ``high_level_subtask`` recipe targets ``${subtask}`` (the *current* -# active span at every frame) rather than ``${next_subtask}`` (which -# is empty on stable phases and used to train the model to emit -# newlines). +# Flavor 1 — ``action_execution`` (~60% weight) +# The main always-on recipe. Fuses all available context +# (task + plan + memory) into a unified user prompt, and +# uses the current subtask as the assistant target. This +# single recipe supervises *both*: +# * subtask prediction (text CE on the assistant span, +# lm_head), and +# * action chunks (flow MSE on the action expert via +# ``stream: low_level, target: true``, plus the FAST +# CE on the action tokens when enabled). +# Pi 0.7 §V.A — subtask in the prompt + flow on actions. +# +# Flavor 2 — event-driven text-only recipes +# ``memory_update``, ``user_interjection_response``, +# ``ask_vqa_*``. Each handles a specific high-level event +# with a TEXT output. ``if_present`` guards keep them from +# firing on frames without the relevant annotation. blend: + # ---------------------------------------------------------- + # FLAVOR 1: action_execution (main path) + # ---------------------------------------------------------- + action_execution: + weight: 0.60 + messages: + - role: user + stream: high_level + content: "${task}\nPlan: ${plan}\nMemory: ${memory}" + - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask} + + # ---------------------------------------------------------- + # FLAVOR 2: event-driven text-only paths + # ---------------------------------------------------------- + memory_update: weight: 0.10 bindings: @@ -34,7 +57,7 @@ blend: - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory} user_interjection_response: - weight: 0.16 + weight: 0.15 bindings: prior_plan: "nth_prev(style=plan, offset=1)" current_plan: "emitted_at(t, style=plan)" @@ -46,29 +69,8 @@ blend: - {role: user, content: "${interjection}", stream: high_level, if_present: interjection} - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech} - # Pi 0.5 / Pi 0.7 supervision: predict the *current* active subtask - # at every frame from task + plan + memory + visual prefix. - # ``if_present: subtask`` skips frames with no active span instead of - # supervising an empty target (the failure mode that produces newline - # collapse). - high_level_subtask: - weight: 0.15 - messages: - - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level} - - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask} - - # Same ``if_present: subtask`` guard as high_level_subtask above — - # see smolvla2_hirobot.yaml for the full rationale. Skips the - # action-loss supervision on frames without an active subtask span - # rather than emitting a degenerate empty target. - low_level_execution: - weight: 0.35 - messages: - - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level} - - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask} - ask_vqa_top: - weight: 0.10 + weight: 0.075 bindings: vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)" vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)" @@ -82,7 +84,7 @@ blend: - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa} ask_vqa_wrist: - weight: 0.10 + weight: 0.075 bindings: vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)" vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)" diff --git a/src/lerobot/configs/recipes/smolvla2_hirobot.yaml b/src/lerobot/configs/recipes/smolvla2_hirobot.yaml index ad7eef465..97c786fee 100644 --- a/src/lerobot/configs/recipes/smolvla2_hirobot.yaml +++ b/src/lerobot/configs/recipes/smolvla2_hirobot.yaml @@ -1,21 +1,68 @@ # SmolVLA2 canonical training recipe — Hi Robot / MEM / ECoT blend. # -# Same blend shape as pi05_hirobot.yaml. SmolVLA2 differs from Pi0.5 in -# how the renderer's output is consumed: +# Inspired by Pi 0.7 §V (Diversifying the Prompt) and Pi 0.5's +# hierarchical subtask training. The blend has **two flavors**: # -# - SmolVLA2 calls SmolVLM's tokenizer.apply_chat_template(messages, -# tools=DEFAULT_TOOLS) on the rendered messages, since SmolVLM is a -# chat-pretrained backbone. -# - The processor builds a `text_labels` tensor that masks every token -# except those belonging to messages whose index is in -# `target_message_indices`. Cross-entropy on those positions trains -# the LM head. -# - `predict_actions = bool(targets_by_stream.get("low_level"))` — -# same convention as Pi0.5. ``low_level_execution`` is the only -# branch that runs the action expert / flow head. +# Flavor 1 — ``action_execution`` (~60% weight) +# The main always-on recipe. Fuses all available context +# (task + plan + memory) into a unified user prompt, and +# uses the current subtask as the assistant target. This +# single recipe supervises *both*: +# * subtask prediction (text CE on the assistant span, +# lm_head), and +# * action chunks (flow MSE on the action expert via +# ``stream: low_level, target: true``, plus the FAST +# CE on the action tokens when enabled). +# At inference, the same prompt structure is used: +# * the high-level loop calls ``select_message`` with the +# user prompt only → generates the next subtask. +# * the low-level loop calls ``predict_action_chunk`` with +# the user prompt + the generated subtask as the +# assistant turn → generates the action chunk. +# Replaces what used to be three separate recipes +# (``high_level_subtask`` + ``low_level_execution`` + the +# implicit subtask-in-prompt context) in earlier drafts. +# Pi 0.7's §V.A "Subtask instructions" pattern. +# +# Flavor 2 — event-driven text-only recipes +# Each handles a specific high-level event with a TEXT +# output (no action supervision). They fire when the +# binding for the event resolves to non-None: +# * ``memory_update``: at subtask boundary, predict new +# memory from task + prior memory + completed subtask. +# * ``user_interjection_response``: on user input, predict +# new plan + paired ``say()`` tool call. +# * ``ask_vqa_top`` / ``ask_vqa_wrist``: answer a +# camera-grounded visual question. +# All use ``stream: high_level`` (no flow loss) and rely on +# ``if_present`` guards so they only fire on frames where +# the relevant event annotation is present. +# +# How the chat tokenizer interprets the flavor split +# --------------------------------------------------- +# * predict_actions = bool(targets_by_stream.get("low_level")) +# → True only for Flavor 1 (action_execution). +# * text_labels supervises whatever assistant turns are marked +# target=true. For action_execution, this is the subtask +# string. For Flavor 2, it's the corresponding text output. blend: + # ---------------------------------------------------------- + # FLAVOR 1: action_execution (main path) + # ---------------------------------------------------------- + action_execution: + weight: 0.60 + messages: + - role: user + stream: high_level + content: "${task}\nPlan: ${plan}\nMemory: ${memory}" + - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask} + + # ---------------------------------------------------------- + # FLAVOR 2: event-driven text-only paths + # ---------------------------------------------------------- + memory_update: weight: 0.10 bindings: @@ -29,7 +76,7 @@ blend: - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory} user_interjection_response: - weight: 0.16 + weight: 0.15 bindings: prior_plan: "nth_prev(style=plan, offset=1)" current_plan: "emitted_at(t, style=plan)" @@ -41,62 +88,8 @@ blend: - {role: user, content: "${interjection}", stream: high_level, if_present: interjection} - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech} - # PR3 Hi-Robot v2: supervise the high-level head with the *current* - # active subtask, not the *next*. Pi 0.5 / Pi 0.7 both do this: at every - # frame the assistant target is "what is the robot doing right now" - # grounded in the current image + state + context, so the supervision - # target is always a non-empty span string. - # - # The original target was ``nth_next(style=subtask, offset=1)`` — at - # most frames within a single span this resolves to the next-span - # string (fine), but on the LAST span of an episode it resolves to - # empty/None. The recipe had no ``if_present`` guard on the target, - # so the renderer emitted an empty assistant turn and cross-entropy - # ended up supervising the chat-template's structural newlines. - # Across a dataset annotated this way, the LM head's argmax at - # position 0 collapses to ``\n`` whenever no transition is happening - # (which is most of the time). At inference: head silently emits - # newlines every chunk boundary while the action expert keeps working. - # - # With ``${subtask}`` (binds to ``active_at(t, style=subtask)``) the - # target is the current span's text — always non-empty, scene- - # grounded. The runtime detects subtask transitions by comparing the - # predicted subtask string to the last known one, the same way Pi 0.5 - # does. No information loss. - high_level_subtask: - weight: 0.15 - messages: - - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level} - - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask} - - # PR3 fix: same ``if_present: subtask`` guard as high_level_subtask - # above. Without it, frames where ``active_at(t, style=subtask)`` - # returns None render the assistant turn with empty content, which - # the chat tokenizer still includes in target_message_indices → - # text-CE supervises predicting ``\n`` (the chat template's - # structural newline) and the LM head collapses to that prior. - # The same bug we fixed for high_level_subtask, just on a - # different sub-recipe. - # - # Trade-off of adding the guard: frames without an active subtask - # span no longer contribute to the flow loss either (because - # ``predict_actions = bool(targets_by_stream.get("low_level"))`` - # and the only low_level target message is now skipped). For a - # well-annotated dataset where subtask spans tile the whole - # episode this is a no-op. For datasets with gaps, those gap - # frames lose flow supervision — which is strictly better than - # the degenerate alternative. - low_level_execution: - weight: 0.35 - messages: - - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level} - - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask} - - # Per-camera VQA sub-recipes (PR 1's view-dependent style routing). - # Adjust the camera keys (and add more sub-recipes) to match the - # cameras present on your dataset. ask_vqa_top: - weight: 0.10 + weight: 0.075 bindings: vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)" vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)" @@ -110,7 +103,7 @@ blend: - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa} ask_vqa_wrist: - weight: 0.10 + weight: 0.075 bindings: vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)" vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"