mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-18 16:57:12 +00:00
fix(annotate): stop action records + augmentation from corrupting RoboCasa labels

Three compounding bugs made RoboCasa annotation produce off-task
subtasks ('move stove to stove with left arm') and drifting
augmentations ('wander around the kitchen' for 'Navigate to the stove').

1. action_records.replace_subtask_text now defaults to False.
   Overwriting the VLM's subtask text with a reconstruction of
   hallucinated {verb,object,arm,grasp,dest} fields is high-risk:
   navigation / non-manipulation tasks don't fit the schema and render
   to nonsense. Records are now additive by default (emit_record_row)
   and never silently replace subtask text. Flip replace_subtask_text on
   only for manipulation datasets verified to render cleanly.

2. _render_action_record_to_subtask_text drops a degenerate
   destination that just echoes the object (verb=move object=stove
   destination=stove -> 'move stove' instead of 'move stove to stove').
   It also routes 'navigate' through the 'to <dest>' preposition family.

3. module_1_task_aug_axes.txt hardened: variants MUST preserve the
   goal/destination. Explicitly forbids 'Navigate to the stove' ->
   'wander around the kitchen'. Only wording / arm / orientation /
   grasp may vary; verb meaning, object, and destination are fixed.

examples/annotations/run_hf_job.py, corrected for RoboCasa:

* derive_task_from_video=off (was =always). The dataset task string
  is authoritative and is what eval conditions on; =always threw it
  away, re-derived a hallucinated task from the video, and poisoned
  every downstream subtask/plan row. THIS was the dominant cause.
* n_task_rephrasings=0 + task_aug_axes left off: RoboCasa eval uses
  exact task strings, so augmentation is unused at best and harmful
  when it drifts.
* action_records left off: the manipulation schema doesn't fit atomic /
  navigation tasks.
* plan_max_steps=6 to keep atomic-task decomposition tight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -53,13 +53,30 @@ CMD = (
     "--executor.episode_parallelism=16 "
     "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
     "--vlm.camera_key=observation.images.robot0_agentview_right "
-    # Phase 1 — plan module (subtasks + plan + memory + task_aug).
+    # Phase 1 — plan module (subtasks + plan + memory).
     "--plan.frames_per_second=1.0 "
     "--plan.use_video_url=true "
     "--plan.use_video_url_fps=1.0 "
-    "--plan.derive_task_from_video=always "
-    "--plan.task_aug_axes.enabled=true "
-    "--plan.action_records.enabled=true "
+    # IMPORTANT for RoboCasa: the dataset's task string ("Navigate to the
+    # stove", "Pick the mug...") is authoritative and is what eval uses.
+    # ``derive_task_from_video=off`` keeps that canonical task driving
+    # subtask generation. Do NOT use ``always`` here — it throws the real
+    # task away, asks the VLM "what is this video about?" with no hint,
+    # and the hallucinated task then poisons every subtask + plan row.
+    "--plan.derive_task_from_video=off "
+    # NO task augmentation for RoboCasa: eval conditions on the exact task
+    # strings, so synthetic rephrasings are unused at best and (when they
+    # drift, e.g. "wander around the kitchen") harmful. 0 rephrasings +
+    # axes disabled = the policy only ever sees the canonical task.
+    "--plan.n_task_rephrasings=0 "
+    # action_records OFF: the structured {verb,object,arm,grasp,dest}
+    # schema is a manipulation schema; RoboCasa navigation / atomic tasks
+    # don't fit it and the VLM hallucinates (e.g. "move stove to stove").
+    # Leave off unless annotating long composite manipulation tasks you've
+    # verified render cleanly (and even then replace_subtask_text stays
+    # off by default so records are additive, never overwriting subtasks).
+    # Keep subtask decomposition tight for atomic tasks:
+    "--plan.plan_max_steps=6 "
     # Phase 2 — interjections + speech.
     "--interjections.max_interjections_per_episode=6 "
     # Phase 4 — general VQA.
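The RoboCasa settings in the hunk above can be restated as a small pre-flight check. This is only a sketch: the flag names and values are copied from the diff, but `ROBOCASA_PLAN_FLAGS`, `build_plan_flags`, and `check_robocasa_flags` are hypothetical helpers that do not exist in the repository, and treating a missing `enabled` flag as "off" is an assumption based on the commit message's "left off" wording.

```python
import shlex

# Flag names and values copied from the hunk above.
ROBOCASA_PLAN_FLAGS = {
    "plan.frames_per_second": "1.0",
    "plan.use_video_url": "true",
    "plan.use_video_url_fps": "1.0",
    "plan.derive_task_from_video": "off",
    "plan.n_task_rephrasings": "0",
    "plan.plan_max_steps": "6",
}


def build_plan_flags(flags: dict[str, str]) -> str:
    """Render a flag mapping as a ``--key=value`` string (hypothetical helper)."""
    return " ".join(f"--{key}={shlex.quote(value)}" for key, value in flags.items())


def check_robocasa_flags(flags: dict[str, str]) -> list[str]:
    """Return the settings that this commit describes as harmful for RoboCasa."""
    problems = []
    if flags.get("plan.derive_task_from_video") != "off":
        problems.append("derive_task_from_video must be 'off'")
    if flags.get("plan.n_task_rephrasings") != "0":
        problems.append("n_task_rephrasings must be set to 0")
    # Assumption: an absent 'enabled' flag means the feature stays off.
    for key in ("plan.task_aug_axes.enabled", "plan.action_records.enabled"):
        if flags.get(key, "false") == "true":
            problems.append(f"{key} must stay off")
    return problems


if __name__ == "__main__":
    print(build_plan_flags(ROBOCASA_PLAN_FLAGS))
    print(check_robocasa_flags(ROBOCASA_PLAN_FLAGS))
```

Running the same check against the pre-commit flags (`derive_task_from_video=always`, both `enabled=true` flags set, no `n_task_rephrasings`) would report four problems.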
@@ -94,12 +94,18 @@ class ActionRecordsConfig:

     A deterministic Python template then renders the record back to
     canonical subtask text (e.g. ``pick blue cube with left arm using
-    pinch grip``). When ``replace_subtask_text=True`` (default), the
-    rendered text REPLACES the VLM's free-form subtask text — eliminating
-    cross-episode phrasing drift. When ``emit_record_row=True``
-    (default), the structured record is also emitted as a row with
-    ``style="action_record"`` so downstream consumers can train on the
-    typed schema directly.
+    pinch grip``). When ``replace_subtask_text=True``, the rendered text
+    REPLACES the VLM's free-form subtask text. This is OFF by default:
+    the structured fields are easy for the VLM to hallucinate on tasks
+    that don't fit the manipulation schema (e.g. navigation tasks yield
+    nonsense like ``move stove to stove``), and silently overwriting the
+    subtask text with a reconstruction is high-risk. Leave it off to keep
+    the original VLM subtask text and treat the record as additive
+    metadata; only flip it on for datasets you've verified render
+    cleanly. When ``emit_record_row=True`` (default), the structured
+    record is also emitted as a row with ``style="action_record"`` so
+    downstream consumers can train on the typed schema directly —
+    without touching the subtask text.

     Cost: one extra VLM call per subtask. For an 8-subtask episode this
     means ~8x more VLM calls in the plan module — still cheap relative
@@ -110,9 +116,11 @@ class ActionRecordsConfig:

     # When True, replace the VLM-generated subtask text with the
     # deterministic template's rendering of the structured record.
-    # Strongly recommended — it's the whole point of the structured
-    # intermediate. Set False to keep both representations side by side.
-    replace_subtask_text: bool = True
+    # OFF by default — see class docstring. Overwriting good subtask
+    # text with a reconstruction of hallucinated structured fields is
+    # high-risk (navigation / non-manipulation tasks render to
+    # nonsense). Keep records additive (``emit_record_row``) instead.
+    replace_subtask_text: bool = False

     # When True, emit a separate row with ``style="action_record"`` and
     # ``content=json.dumps(record)`` at the subtask's start timestamp.
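To make the "additive by default" behaviour concrete, here is a minimal sketch of how the two options could interact. Only the field names `replace_subtask_text` and `emit_record_row`, their defaults, `style="action_record"`, and `content=json.dumps(record)` come from the diff. The class `ActionRecordsSketch`, the function `build_rows`, the row dictionary layout, and the `"subtask"` style value are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ActionRecordsSketch:
    # Defaults mirror the new values shown in the diff.
    replace_subtask_text: bool = False
    emit_record_row: bool = True


def build_rows(vlm_text: str, record: dict, rendered: str,
               timestamp: float, cfg: ActionRecordsSketch) -> list[dict]:
    """Hypothetical row builder: the record is added next to the subtask text."""
    # Overwrite the VLM text only when explicitly requested and the
    # template actually produced something.
    use_rendered = cfg.replace_subtask_text and bool(rendered)
    rows = [{
        "style": "subtask",  # assumed style name
        "content": rendered if use_rendered else vlm_text,
        "timestamp": timestamp,
    }]
    if cfg.emit_record_row:
        rows.append({
            "style": "action_record",
            "content": json.dumps(record),
            "timestamp": timestamp,
        })
    return rows


if __name__ == "__main__":
    record = {"verb": "move", "object": "stove", "destination": "stove"}
    for row in build_rows("walk to the stove", record, "move stove", 0.0,
                          ActionRecordsSketch()):
        print(row)
```

With the defaults, the VLM's "walk to the stove" survives untouched and the questionable record only appears as a separate `action_record` row, which downstream consumers can filter out.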
@@ -424,6 +424,13 @@ class PlanSubtasksMemoryModule:
         if not verb:
             return ""

+        # Drop a degenerate destination that just echoes the object — the
+        # VLM sometimes fills both with the same noun (e.g. navigation:
+        # ``verb=move object=stove destination=stove`` → "move stove to
+        # stove"). Treat that as "no meaningful destination".
+        if dest and obj and dest.strip().lower() == obj.strip().lower():
+            dest = ""
+
         parts: list[str] = [verb]
         if obj:
             parts.append(obj)
@@ -431,7 +438,7 @@ class PlanSubtasksMemoryModule:
             # Pick a sensible preposition per verb family.
             if verb in {"place", "put", "drop", "insert", "pour", "dump"}:
                 parts.append(f"in {dest}")
-            elif verb in {"move", "transport", "reach"}:
+            elif verb in {"move", "transport", "reach", "navigate"}:
                 parts.append(f"to {dest}")
             else:
                 parts.append(f"at {dest}")
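Taken together, the two hunks above amount to the following rendering logic. This is a standalone sketch, not the repository's `_render_action_record_to_subtask_text`: the function name `render_subtask_text` and the record keys `"object"` and `"destination"` are assumptions, input normalisation (strip and lowercase of the verb) is added here for self-containment, and the arm / grasp parts of the real template are omitted because the diff does not show them.

```python
IN_VERBS = {"place", "put", "drop", "insert", "pour", "dump"}
TO_VERBS = {"move", "transport", "reach", "navigate"}


def render_subtask_text(record: dict[str, str]) -> str:
    """Render verb / object / destination to text (arm and grasp omitted)."""
    verb = (record.get("verb") or "").strip().lower()
    obj = (record.get("object") or "").strip()
    dest = (record.get("destination") or "").strip()
    if not verb:
        return ""

    # A destination that merely echoes the object carries no information.
    if dest and obj and dest.lower() == obj.lower():
        dest = ""

    parts = [verb]
    if obj:
        parts.append(obj)
    if dest:
        # Pick a preposition per verb family, as in the hunk above.
        if verb in IN_VERBS:
            parts.append(f"in {dest}")
        elif verb in TO_VERBS:
            parts.append(f"to {dest}")
        else:
            parts.append(f"at {dest}")
    return " ".join(parts)


if __name__ == "__main__":
    print(render_subtask_text({"verb": "move", "object": "stove", "destination": "stove"}))
    print(render_subtask_text({"verb": "navigate", "destination": "stove"}))
    print(render_subtask_text({"verb": "place", "object": "mug", "destination": "sink"}))
```

The first call prints `move stove` rather than `move stove to stove`, and the second prints `navigate to stove` because `navigate` now belongs to the "to" family instead of falling through to "at".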
@@ -37,9 +37,16 @@ Axes and target counts:
 orientation, grasp_method) appear in the original task.

 Hard rules:
-- Each variant MUST preserve the core action and the target object.
-  Do not change which object is involved, the destination, or the
-  high-level action.
+- Each variant MUST preserve the core action, the target object, AND
+  the goal / destination. Do not change which object is involved, where
+  it goes, or the high-level action. "Navigate to the stove" may become
+  "go to the stove" or "head over to the stove" — it must NEVER become
+  "wander around the kitchen", "explore the room", or anything that
+  drops or generalises the stove destination. If you cannot vary the
+  wording without changing the goal, emit fewer variants.
+- Only the FIVE listed elements (wording, arm, orientation, grasp
+  method, or a combination) may be varied or omitted. The verb's
+  meaning, the object, and the destination are fixed.
 - Each variant is plain prose, no markdown, no quotes, no list numbers.
 - Each variant must be DISTINCT from every other variant in the entire
   output, both within and across axes. Near-duplicates are not allowed.
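The hardened prompt rules could, in principle, be backed by a cheap post-hoc filter on the VLM output. The sketch below is not part of this commit: `filter_variants` and its `required_terms` argument are hypothetical, and the check is purely lexical (it requires the destination noun to appear and drops duplicates after normalisation), so it cannot detect a changed verb meaning.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_variants(variants: list[str], required_terms: list[str]) -> list[str]:
    """Keep variants that mention every required term and are not duplicates."""
    required = {term.lower() for term in required_terms}
    kept: list[str] = []
    seen: set[str] = set()
    for variant in variants:
        tokens = _tokens(variant)
        if not required <= set(tokens):
            continue  # goal / destination was dropped or generalised
        key = " ".join(tokens)
        if key in seen:
            continue  # duplicate once case and punctuation are ignored
        seen.add(key)
        kept.append(variant)
    return kept


if __name__ == "__main__":
    candidates = [
        "go to the stove",
        "wander around the kitchen",
        "Go to the stove.",
        "head over to the stove",
    ]
    print(filter_variants(candidates, ["stove"]))
```

For the task "Navigate to the stove" with `required_terms=["stove"]`, the drifting variant "wander around the kitchen" is rejected and the punctuation-only duplicate is collapsed.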