refactor(recipes): rename recipes, drop pi05_hirobot

- hirobot.yaml            -> subtasks_vqa.yaml
- hirobot_memory.yaml     -> subtask_mem_vqa_speech.yaml
- pi05_hirobot.yaml       -> deleted (stale: uses plan, top-camera names;
  superseded by the two recipes above)
- smolvla2_hirobot.yaml   -> deleted (was untracked stale junk)

Updated the smolvla2 / pi052 `recipe_path` config defaults, all
docstring / comment references, the annotation-pipeline + recipe docs,
and the three tests that loaded pi05_hirobot.yaml. The tests now load
the renamed recipes. The low-level-branch and pipeline-render
assertions accept a flow-only `low_level` stream as valid supervision,
since the new recipes' low_level_execution has no text-CE target.
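
The relaxed supervision check can be sketched as a small predicate. This
is only an illustration: the helper name `supervises_something` is
hypothetical and not part of the codebase, and it assumes the rendered
sample is a dict carrying the `target_message_indices` and
`message_streams` keys that the updated tests read.

```python
def supervises_something(result: dict) -> bool:
    """Return True if a rendered sample carries any training signal.

    Hypothetical helper: a render counts as supervised when it has a
    text-CE target turn, or when at least one turn sits on the
    ``low_level`` stream (flow-only action loss).
    """
    has_text_target = bool(result.get("target_message_indices"))
    has_flow_turn = "low_level" in result.get("message_streams", [])
    return has_text_target or has_flow_turn
```

Under that assumption, a `low_level_execution` render with an empty
`target_message_indices` list still passes, while a render with neither
a target turn nor a `low_level` turn is rejected.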

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pepijn
2026-05-18 16:02:15 +02:00
parent 426d48dbbf
commit 0f5f0e4091
15 changed files with 46 additions and 103 deletions
+1 -1
@@ -72,7 +72,7 @@ The executor picks `LocalPipelineExecutor` for small datasets and
 ## Style-to-recipe consumer mapping
 The pipeline produces exactly the styles consumed by
-`src/lerobot/configs/recipes/pi05_hirobot.yaml`:
+`src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml`:
 - `low_level_execution`, `high_level_subtask`, `memory_update` consume
   `subtask`/`plan`/`memory` from `language_persistent`.
+1 -1
@@ -101,7 +101,7 @@ The renderer does not apply a tokenizer chat template. Policy processors decide
 ## Blends
 Blend recipes select one weighted sub-recipe deterministically from the sample index.
-The canonical `recipes/pi05_hirobot.yaml` combines memory updates, interjection responses, high-level subtask prediction, low-level execution, and VQA.
+`recipes/subtasks_vqa.yaml` trains the core blend — high-level subtask prediction, low-level execution, and VQA. `recipes/subtask_mem_vqa_speech.yaml` is the fuller variant that also adds memory updates and spoken interjection responses.
 ## Graceful absence
@@ -21,7 +21,7 @@ one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
 generated against that camera's frame and stamped with the matching
 ``camera`` field on the emitted rows. The resolver disambiguates via
 ``camera=...``; recipes that consume VQA do so through one sub-recipe
-per camera (see ``recipes/pi05_hirobot.yaml``).
+per camera (see ``recipes/subtasks_vqa.yaml``).
 Within a single (frame, camera) we still emit at most one ``(vqa, user)``
 and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
@@ -1,74 +0,0 @@
-blend:
-  memory_update:
-    weight: 0.10
-    bindings:
-      prior_memory: "nth_prev(style=memory, offset=1)"
-      current_memory: "emitted_at(t, style=memory)"
-      completed_subtask: "nth_prev(style=subtask, offset=1)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
-      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
-      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
-  user_interjection_response:
-    weight: 0.16
-    bindings:
-      prior_plan: "nth_prev(style=plan, offset=1)"
-      current_plan: "emitted_at(t, style=plan)"
-      interjection: "emitted_at(t, style=interjection)"
-      speech: "emitted_at(t, role=assistant, tool_name=say)"
-    messages:
-      - {role: user, content: "${task}", stream: high_level}
-      - {role: assistant, content: "Previous plan:\n${prior_plan}", stream: high_level, if_present: prior_plan}
-      - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
-      - {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}
-  high_level_subtask:
-    weight: 0.15
-    bindings:
-      next_subtask: "nth_next(style=subtask, offset=1)"
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: user, content: "Current subtask: ${subtask}", stream: high_level, if_present: subtask}
-      - {role: assistant, content: "${next_subtask}", stream: high_level, target: true}
-  low_level_execution:
-    weight: 0.35
-    messages:
-      - {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
-      - {role: assistant, content: "${subtask}", stream: low_level, target: true}
-  # VQA is view-dependent: bbox / keypoint / count answers only make sense for
-  # the camera they were grounded against. Each camera gets its own sub-recipe
-  # so the resolver can disambiguate via `camera=...` and the user-turn carries
-  # the matching image block. Adjust the camera keys (and add more sub-recipes)
-  # to match the cameras present on your dataset.
-  ask_vqa_top:
-    weight: 0.10
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.top}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
-  ask_vqa_wrist:
-    weight: 0.10
-    bindings:
-      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
-      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
-    messages:
-      - role: user
-        stream: high_level
-        if_present: vqa_query
-        content:
-          - {type: image, feature: observation.images.wrist}
-          - {type: text, text: "${vqa_query}"}
-      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
@@ -1,6 +1,6 @@
-# Hi-Robot blend + memory + tool-call (spoken) responses.
+# subtask_mem_vqa_speech — Hi-Robot blend + memory + spoken responses.
 #
-# Superset of hirobot.yaml. Keeps the core subtask + action + VQA
+# Superset of subtasks_vqa.yaml. Keeps the core subtask + action + VQA
 # training, and adds two text-supervised tasks:
 #
 # high_level_subtask — predict the subtask from the task.
@@ -73,7 +73,7 @@ blend:
   # VQA is view-dependent — each camera gets its own sub-recipe so the
   # resolver disambiguates via `camera=...`. Camera keys match
-  # hirobot.yaml (`front` + `wrist`); adjust to your dataset.
+  # subtasks_vqa.yaml (`front` + `wrist`); adjust to your dataset.
   ask_vqa_top:
     weight: 0.075
     bindings:
@@ -1,10 +1,10 @@
-# Hi-Robot blend shared between SmolVLA2 (SmolVLM2 backbone) and
-# PI052 (PaliGemma backbone).
+# subtasks_vqa — Hi-Robot blend, shared between SmolVLA2 (SmolVLM2
+# backbone) and PI052 (PaliGemma backbone).
 #
 # Trains two things only: subtasks and VQA. Plan and memory are
-# intentionally left out for now — keeps the prompt short and the
-# training surface small while the core subtask + action loop is
-# validated.
+# intentionally left out — keeps the prompt short and the training
+# surface small. The fuller blend with memory + spoken replies is
+# ``subtask_mem_vqa_speech.yaml``.
 #
 # high_level_subtask — predict the subtask from the task.
 # low_level_execution — flow loss with [images, subtask, state].
+1 -1
@@ -24,7 +24,7 @@ Extends :class:`lerobot.policies.pi05.PI05Policy` with:
 * per-component prompt dropout (Pi 0.7 §V.E) for regularising the
   text head against missing context at inference.
-See ``src/lerobot/configs/recipes/hirobot.yaml`` for the
+See ``src/lerobot/configs/recipes/subtasks_vqa.yaml`` for the
 canonical training recipe and
 ``examples/training/pi052_hirobot.slurm`` for the launcher.
 """
@@ -55,7 +55,7 @@ class PI052Config(PI05Config):
     """
     # Recipe / language stack ---------------------------------------------
-    recipe_path: str | None = "recipes/hirobot.yaml"
+    recipe_path: str | None = "recipes/subtasks_vqa.yaml"
     """Path (absolute or relative to ``src/lerobot/configs/``) to a
     ``TrainingRecipe`` YAML. Defaults to the canonical Hi-Robot blend
     shipped alongside this policy. Set to ``None`` to disable recipe
@@ -405,7 +405,7 @@ class SmolVLA2ChatTokenizerStep(ProcessorStep):
         """Probabilistically drop non-target context messages.
         Heuristic content sniffing matches the prefix strings that
-        ``hirobot.yaml``'s recipes use when injecting plan /
+        ``subtask_mem_vqa_speech.yaml``'s recipes use when injecting plan /
         memory / subtask / interjection content. Anything else is
         kept unchanged. Target messages are never dropped (we still
         need their tokens for supervision).
@@ -56,7 +56,7 @@ class SmolVLA2Config(SmolVLAConfig):
     """
     # Recipe / language stack ---------------------------------------------
-    recipe_path: str | None = "recipes/hirobot.yaml"
+    recipe_path: str | None = "recipes/subtasks_vqa.yaml"
     """Path (absolute or relative to ``src/lerobot/configs/``) to a
     ``TrainingRecipe`` YAML. The default points at the canonical Hi Robot
     blend shipped alongside SmolVLA2. Set to ``None`` to disable recipe
@@ -17,7 +17,7 @@ Each step is a tiny class with a ``trigger`` and an ``__call__(state)``;
 the runtime applies them in order each tick. When a step's trigger
 doesn't fire, the step is a no-op and the runtime moves on.
-Stream-to-step mapping mirrors the ``hirobot.yaml`` recipe:
+Stream-to-step mapping mirrors the ``subtasks_vqa.yaml`` recipe:
 * ``LowLevelForward`` calls ``policy.select_action`` for the
   action chunk; trained by
@@ -721,7 +721,7 @@ def _control_context_messages(
 ) -> list[dict[str, Any]]:
     """Build a chat-template-ready prompt from current runtime state.
-    Mirrors what ``hirobot.yaml`` renders into ``${task}\nPlan:
+    Mirrors what ``subtasks_vqa.yaml`` renders into ``${task}\nPlan:
     ${plan}\nMemory: ${memory}`` for the high-level branches.
     """
     # Always emit ``Plan: `` / ``Memory: `` labels — even with empty
@@ -741,7 +741,7 @@ def _control_context_messages(
 # ---------------------------------------------------------------------------
 # Per-recipe prompt builders. Each one mirrors a single sub-recipe's
-# message layout in ``hirobot.yaml`` so the chat-templated
+# message layout in ``subtasks_vqa.yaml`` so the chat-templated
 # prompt at inference matches what the model saw during training.
 # Generic ``_control_context_messages`` is kept around as a fallback
 # for ad-hoc callers but the four high-level steps now use these.
@@ -121,7 +121,7 @@ def _load_recipe(path_str: str) -> TrainingRecipe:
     Accepts an absolute path or a path relative to
     ``src/lerobot/configs/`` so recipe authors can write
-    ``--policy.recipe_path=recipes/hirobot.yaml``.
+    ``--policy.recipe_path=recipes/subtasks_vqa.yaml``.
     """
     p = Path(path_str)
     if not p.is_absolute() and not p.exists():
@@ -41,7 +41,12 @@ from lerobot.datasets.language_render import render_sample
 from ._helpers import make_canned_responder
 _RECIPE_PATH = (
-    Path(__file__).resolve().parents[2] / "src" / "lerobot" / "configs" / "recipes" / "pi05_hirobot.yaml"
+    Path(__file__).resolve().parents[2]
+    / "src"
+    / "lerobot"
+    / "configs"
+    / "recipes"
+    / "subtask_mem_vqa_speech.yaml"
 )
@@ -105,22 +110,29 @@ def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
     recipe = TrainingRecipe(**loaded)
     rendered_any = False
-    for ts, persistent, events in zip(timestamps, persistent_lists, events_lists, strict=True):
+    for sample_idx, (ts, persistent, events) in enumerate(
+        zip(timestamps, persistent_lists, events_lists, strict=True)
+    ):
         result = render_sample(
             recipe=recipe,
             persistent=persistent,
             events=events,
             t=float(ts),
-            sample_idx=0,
+            sample_idx=sample_idx,
             dataset_ctx={"task": "Pour water from the bottle into the cup."},
         )
         if result is None:
             continue
         if result["messages"]:
             rendered_any = True
-            assert result["target_message_indices"]
+            # A valid render supervises something: a text-CE target turn
+            # OR a flow-only ``low_level``-stream turn (action loss).
+            assert (
+                result["target_message_indices"]
+                or "low_level" in result["message_streams"]
+            )
             break
-    assert rendered_any, "PR 1 recipe rendered no messages from pipeline output"
+    assert rendered_any, "recipe rendered no messages from pipeline output"
     # Sanity: speech atom appears in events column intact
     flat_events = [r for ev in events_lists for r in ev]
+4 -2
@@ -18,7 +18,9 @@ def test_message_recipe_validates_unknown_binding():
 def test_canonical_recipe_loads():
-    recipe = TrainingRecipe.from_yaml(Path("src/lerobot/configs/recipes/pi05_hirobot.yaml"))
+    recipe = TrainingRecipe.from_yaml(
+        Path("src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml")
+    )
     assert recipe.blend is not None
     assert set(recipe.blend) == {
@@ -29,4 +31,4 @@ def test_canonical_recipe_loads():
         "ask_vqa_top",
         "ask_vqa_wrist",
     }
-    assert sum(component.weight for component in recipe.blend.values()) == pytest.approx(0.96)
+    assert sum(component.weight for component in recipe.blend.values()) == pytest.approx(1.0)
+6 -3
@@ -449,7 +449,10 @@ def test_vqa_frame_is_consumed_over_the_weighted_blend():
 def test_canonical_recipe_can_render_low_level_branch():
-    recipe = TrainingRecipe.from_yaml(Path("src/lerobot/configs/recipes/pi05_hirobot.yaml"))
+    """The shipped ``subtasks_vqa.yaml`` recipe's ``low_level_execution``
+    branch renders a flow-only ``user(${subtask})`` turn (no text-CE
+    target; its supervision is the action-expert flow loss)."""
+    recipe = TrainingRecipe.from_yaml(Path("src/lerobot/configs/recipes/subtasks_vqa.yaml"))
     low_level = TrainingRecipe(blend={"low": recipe.blend["low_level_execution"]})
     rendered = render_sample(
@@ -461,6 +464,6 @@ def test_canonical_recipe_can_render_low_level_branch():
         task="clean kitchen",
     )
-    assert rendered["messages"][-1] == {"role": "assistant", "content": "subtask 0"}
+    assert rendered["messages"][-1] == {"role": "user", "content": "subtask 0"}
     assert rendered["message_streams"][-1] == "low_level"
-    assert rendered["target_message_indices"] == [1]
+    assert rendered["target_message_indices"] == []