Mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-22 03:59:42 +00:00)
recipe+runtime: condition the action expert on the task, not the subtask

Real-robot runs shook and failed the task despite a low flow loss.
Root cause: a train/inference conditioning mismatch, not a flow-loss
bug (the flow path of ``_compute_fused_loss`` is byte-identical to
``SmolVLAModel.forward``).

At training, ``low_level_execution`` conditioned the action expert
on ``${subtask}``, and every frame's subtask was the correct one
for that frame. At inference the runtime has no high-level subtask
generator (VQA-only pipeline), so ``current_subtask`` stayed frozen:
the action expert got "move towards the blue cube" for the entire
episode. Once the arm reached the cube, that (image, subtask) pair
had never occurred in training → out-of-distribution (OOD)
conditioning → incoherent flow output → shaking.
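
The mismatch described above can be illustrated with a toy model. This is a minimal sketch, not the project's code: the phase names, the "grasp the blue cube" subtask, the task string, and every helper name are hypothetical. It only compares the set of (phase, prompt) pairs seen at training with the set produced at inference.

```python
# Toy illustration only: phase names, the "grasp" subtask string, the task
# string, and all helper names are hypothetical.

PHASES = ["approach", "grasp"]

# Training labels: each phase carries the subtask that is correct for it.
SUBTASK_FOR_PHASE = {
    "approach": "move towards the blue cube",
    "grasp": "grasp the blue cube",
}
TASK = "put the blue cube in the box"


def pairs(prompt_for_phase):
    """Return the set of (phase, prompt) pairs produced by a prompt policy."""
    return {(phase, prompt_for_phase(phase)) for phase in PHASES}


# Subtask conditioning: correct per frame at training, frozen at inference.
train_subtask = pairs(lambda phase: SUBTASK_FOR_PHASE[phase])
infer_subtask = pairs(lambda phase: "move towards the blue cube")
unseen_with_subtask = infer_subtask - train_subtask

# Task conditioning: the same string on every frame, on both sides.
train_task = pairs(lambda phase: TASK)
infer_task = pairs(lambda phase: TASK)
unseen_with_task = infer_task - train_task

print("unseen with subtask conditioning:", sorted(unseen_with_subtask))
print("unseen with task conditioning:", sorted(unseen_with_task))
```

With a frozen subtask, the "grasp" phase is paired with a prompt it never had at training. With the task as the prompt, the two sets coincide.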

Fix: ``low_level_execution`` now renders ``user(${task})``. The
task is stable for the whole episode and always available, so the
action expert's conditioning is identical at training and inference
with no high-level loop required. ``LowLevelForward`` is updated to
build the same ``[user(task)]`` prompt.

``high_level_subtask`` still trains the text head to predict
subtasks (kept for when a reliable subtask loop is reintroduced);
it is just no longer on the action expert's critical path.

Requires re-training for the recipe change to take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -3,11 +3,11 @@
 #
 # Trains two things only: subtasks and VQA. Plan and memory are
 # intentionally left out for now — keeps the prompt short and the
-# training surface small while the core subtask + action loop is
-# validated.
+# training surface small while the core action loop is validated.
 #
-# high_level_subtask — predict the subtask from the task.
-# low_level_execution — flow loss with [images, subtask, state].
+# high_level_subtask — predict the subtask from the task (text
+# head only; not on the inference path yet).
+# low_level_execution — flow loss with [images, task, state].
 # ask_vqa_{top,wrist} — camera-grounded VQA.
 #
 # Each backbone's text tokenizer renders these messages differently
@@ -25,12 +25,15 @@ blend:
   low_level_execution:
     weight: 0.40
     messages:
-      # π0.5-style action conditioning. The action expert sees only
-      # [images, this user turn (= bare subtask), state]. No text-CE
-      # target — subtask prediction is owned by ``high_level_subtask``.
+      # The action expert is conditioned on the TASK (not the subtask).
+      # The task is always available at inference with no high-level
+      # generation loop, so this removes the train/inference mismatch
+      # that a subtask-conditioned action head would have while there
+      # is no reliable runtime subtask source. ``high_level_subtask``
+      # still trains the text head to predict subtasks for later use.
       # ``stream: low_level`` flips ``predict_actions=True`` so the
-      # flow loss fires.
-      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
+      # flow loss fires; no text-CE target here.
+      - {role: user, content: "${task}", stream: low_level}
 
   ask_vqa_top:
     weight: 0.10
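
To make the effect of the recipe change concrete, here is a minimal sketch of how the old and new message entries might render for a training sample. It is not the project's recipe loader: ``render_message`` is a hypothetical helper, the sample values are made up, and the reading of ``if_present`` as "skip the message when that field is missing or empty" is an assumption.

```python
from string import Template

# The two message entries from the hunk above, written as Python dicts.
OLD_MESSAGE = {
    "role": "user",
    "content": "${subtask}",
    "stream": "low_level",
    "if_present": "subtask",
}
NEW_MESSAGE = {"role": "user", "content": "${task}", "stream": "low_level"}


def render_message(message, sample):
    """Hypothetical renderer: fill ${...} placeholders from ``sample``.

    ASSUMPTION: ``if_present`` means "skip this message unless the sample
    has a non-empty value for that field". The real loader may differ.
    """
    required = message.get("if_present")
    if required is not None and not sample.get(required):
        return None
    content = Template(message["content"]).substitute(sample)
    return {"role": message["role"], "content": content}


# A sample with only a task, as at inference with no subtask generator.
task_only = {"task": "put the blue cube in the box"}
# A sample with a per-frame subtask annotation, as in the training data.
annotated = {"task": "put the blue cube in the box",
             "subtask": "move towards the blue cube"}

print(render_message(OLD_MESSAGE, task_only))   # skipped under the assumption
print(render_message(OLD_MESSAGE, annotated))
print(render_message(NEW_MESSAGE, task_only))
print(render_message(NEW_MESSAGE, annotated))
```

Under these assumptions the new entry renders the same user turn whether or not a subtask annotation exists, which is the property the commit relies on.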
@@ -111,15 +111,12 @@ class LowLevelForward(InferenceStep):
         if observation is None:
             return None
 
-        # π0.5-style: the action expert is conditioned on just the
-        # subtask (+ images + state). No task / plan / memory in the
-        # low-level prompt — those are only used by the high-level
-        # loop to *generate* the subtask. Matches the training-time
-        # ``low_level_execution`` recipe shape (single user turn,
-        # no assistant target since text-CE is owned by the
-        # high-level recipe).
-        subtask = state.get("current_subtask") or state.get("task") or ""
-        ctx = [{"role": "user", "content": subtask}]
+        # The action expert is conditioned on the TASK string — the
+        # ``low_level_execution`` recipe renders ``user(${task})``.
+        # The task is stable for the whole episode and always present,
+        # so there is no train/inference mismatch and no dependency on
+        # a (currently unreliable) high-level subtask generator.
+        ctx = [{"role": "user", "content": state.get("task") or ""}]
         # ``add_generation_prompt=False`` to match the training-time
         # prefix shape: at training the action expert sees the rendered
         # user turn ending at ``<|im_end|>`` (no trailing
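
The behavioural difference in this hunk can be checked in isolation. The sketch below lifts the two context-building expressions out of the diff into standalone functions. The function names ``build_ctx_old`` and ``build_ctx_new`` and the example state are hypothetical; the real code lives inside ``LowLevelForward``.

```python
def build_ctx_old(state):
    """Pre-change behaviour: prefer the runtime subtask, fall back to task."""
    subtask = state.get("current_subtask") or state.get("task") or ""
    return [{"role": "user", "content": subtask}]


def build_ctx_new(state):
    """Post-change behaviour: always condition on the task string."""
    return [{"role": "user", "content": state.get("task") or ""}]


# Hypothetical runtime state late in an episode: the subtask was set once
# and never refreshed because no high-level generator is running.
stale_state = {
    "task": "put the blue cube in the box",
    "current_subtask": "move towards the blue cube",
}

print(build_ctx_old(stale_state))  # carries the stale subtask
print(build_ctx_new(stale_state))  # carries the task, regardless of subtask
print(build_ctx_new({}))           # missing task falls back to ""
```

Because the new builder ignores ``current_subtask`` entirely, a stale or missing subtask can no longer change the low-level prompt.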