From 920c6ef5a2e3eb370b917b42d7c26305692cb39a Mon Sep 17 00:00:00 2001
From: pepijn
Date: Tue, 26 May 2026 04:42:10 +0000
Subject: [PATCH 01/45] docs(annotate): disable phase-0 vocabulary discovery by default in run_hf_job

Heterogeneous datasets (different tasks/scenes across episodes) don't
share a single small subtask + memory vocabulary, so the canonical
vocabulary phase narrowed every episode to the wrong target
distribution. Flip the example to free-form generation by default and
document the ``--vocabulary.enabled=true`` switch for homogeneous
datasets where the canonical vocabulary still helps the downstream
policy.

No pipeline-code changes: ``VocabularyConfig.enabled`` already gates
phase 0 (see ``executor.py:_run_vocabulary_phase`` and
``VocabularyConfig`` docstring) and falls back to free-form generation.

Co-authored-by: Cursor
---
 examples/annotations/run_hf_job.py | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/examples/annotations/run_hf_job.py b/examples/annotations/run_hf_job.py
index f3e497039..c8219d9e4 100644
--- a/examples/annotations/run_hf_job.py
+++ b/examples/annotations/run_hf_job.py
@@ -5,13 +5,16 @@ Spawns one ``h200x2`` job that:

   1. installs this branch of ``lerobot`` plus the annotation extras,
   2. boots two vllm servers (one per GPU) with Qwen3.6-35B-A3B-FP8,
-  3. discovers the dataset's canonical subtask + memory vocabulary
-     from the first 3 sample episodes (phase 0),
-  4. runs the plan / interjections / vqa modules across the dataset
-     (subtasks + memory are constrained to the canonical vocabulary),
-  5. uploads the annotated dataset to ``--dest_repo_id`` (when set)
+  3. runs the plan / interjections / vqa modules across the dataset
+     in free-form mode (phase 0 canonical-vocabulary discovery is
+     disabled — each episode generates its own subtasks + memory),
+  4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
      or back to ``--repo_id``.

+Re-enable phase 0 with ``--vocabulary.enabled=true`` (optionally
+``--vocabulary.sample_episodes=N``) when the dataset is homogeneous
+enough to share one subtask + memory vocabulary across all episodes.
+
 Usage:

     HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
@@ -54,12 +57,14 @@ CMD = (
     "--executor.episode_parallelism=16 "
     "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
     "--vlm.camera_key=observation.images.wrist "
-    # Phase 0 — canonical vocabulary discovery from the first N sample
-    # episodes. The VLM picks the right number of subtask + memory
-    # entries itself from what it sees; the resulting
-    # meta/canonical_vocabulary.json constrains every subtask + memory
-    # string to a small repeatable target distribution.
-    "--vocabulary.sample_episodes=3 "
+    # Phase 0 — canonical vocabulary discovery DISABLED by default.
+    # Heterogeneous datasets (different tasks/scenes across episodes)
+    # don't share a single small subtask + memory vocabulary, so each
+    # episode generates its subtasks + memory free-form. Flip to
+    # ``--vocabulary.enabled=true`` (optionally ``--vocabulary.sample_episodes=N``)
+    # for homogeneous datasets where a shared canonical vocabulary
+    # helps the downstream policy.
+    "--vocabulary.enabled=false "
     # Phase 1 — plan module (subtasks + plan + memory + task_aug).
     "--plan.frames_per_second=1.0 "
     "--plan.use_video_url=true "

From 1e7c0d6aa18c44503ae9e0e3af2bc6a16b897ab8 Mon Sep 17 00:00:00 2001
From: pepijn
Date: Tue, 26 May 2026 05:14:30 +0000
Subject: [PATCH 02/45] annotate(plan): force composite-action subtasks; ban ultra-fine splits
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tighten ``module_1_subtasks.txt`` so the VLM emits one composite
atomic action per subtask instead of decomposing every pick into
``move to X`` / ``grasp X`` / ``lift X``:

- Lock the verb vocabulary to the composite set the low-level policy
  actually learns end-to-end: ``pick up`` (approach + grasp + lift),
  ``put``/``place`` (transport + release), ``push``, ``pull``,
  ``turn``, ``press``, ``open``, ``close``, ``pour``, ``insert``.
  ``go to`` is allowed only as a pure relocation between phases.
- Add an explicit ``Forbidden ultra-fine splits`` block enumerating
  the patterns the VLM was tempted to emit (``move to X``,
  ``reach for X``, ``grasp X``, ``lift X``, ``release X``) and
  instructing it to fold each into its parent composite.
- Rewrite the Good/Bad examples to match the composite contract; the
  previous
  ``"move to blue cube" / "grasp blue cube" / "lift blue cube"``
  Good list was actively encouraging the over-segmentation pattern
  this prompt is supposed to prevent.
- Tighten the duration rule: candidates shorter than
  ``min_subtask_seconds`` must be merged into a neighbour rather than
  emitted. Pairs with bumping the runtime floor to 3 s so composites
  have room to land.

Pure prompt change — no code or schema change. Existing
canonical-vocabulary retry path is unaffected (the new verb whitelist
lives in prose, not in the validator).

Co-authored-by: Cursor
---
 .../prompts/module_1_subtasks.txt             | 58 +++++++++++++++----
 1 file changed, 46 insertions(+), 12 deletions(-)

diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
index 12bbcfba2..9314282be 100644
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_1_subtasks.txt
@@ -8,14 +8,42 @@ the robot performs.
 {vocabulary_block}Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
-- Each subtask = one atomic skill the low-level policy can execute.
-- Write each subtask as an IMPERATIVE COMMAND, starting with a verb:
-  move, reach, pick up, grasp, place, put, push, pull, open, close,
-  turn, press, lift, insert, pour...
+- Each subtask = one COMPOSITE atomic skill the low-level policy can
+  execute end-to-end. A "skill" bundles its own approach motion with
+  its terminal action — do NOT split the approach off as its own
+  subtask. The whole-arm policy already learns to reach as part of
+  every manipulation primitive.
+- Write each subtask as an IMPERATIVE COMMAND, starting with one of
+  these verbs (extend only when none fits):
+    pick up      — approach + grasp + lift in one subtask
+    put on/in    — transport + release in one subtask
+    place on/in  — synonym of "put"; pick one and stay consistent
+    push         — contact + linear shove
+    pull         — contact + linear retract
+    turn         — rotary actuation
+    press