annotate(plan): force composite-action subtasks; tune run_hf_job for robocasa_smoke

Subtask prompt (``module_1_subtasks.txt``):
- Lock the verb vocabulary to composite atomic actions (``pick up``,
  ``put``/``place``, ``push``/``pull``, ``turn``, ``press``, ``open``/
  ``close``, ``pour``, ``insert``, ``go to``).
- Add an explicit ``Forbidden ultra-fine splits`` block instructing
  the VLM to fold ``move to X`` / ``reach for X`` / ``grasp X`` /
  ``lift X`` / ``release X`` into the parent composite. Previous
  examples actively encouraged the over-segmentation pattern.
- Rewrite the Good/Bad examples around the composite contract.

Job config (``examples/annotations/run_hf_job.py``):
- Point at ``pepijn223/robocasa_smoke_2atomic_v3`` on ``h200x4``.
- ``--vlm.camera_key=robot0_agentview_left`` (real key for the
  dataset; the prior ``observation.images.wrist`` did not exist
  and would have silenced the VQA module).
- ``--vlm.serve_command`` ``--max-model-len 131072`` (4x): keeps
  90 s @ 1 Hz episode video blocks under context even at full
  Qwen vision resolution. On 1x H200 (144 GB) the 35B-FP8 model
  has plenty of room for the bigger KV cache.
- ``--vocabulary.enabled=false`` — heterogeneous dataset, no
  benefit from a single canonical vocabulary.
- ``--plan.derive_task_from_video=off``, ``--plan.n_task_rephrasings=0``
  — reuse the dataset's own ``episode_task`` strings as-is.
- ``--plan.min_subtask_seconds=3.0``, ``--plan.plan_max_steps=6`` —
  give the new composite-action rules room to land (1.5 s floor
  was too small to host a full grasp-or-place composite).
- ``--vqa.vqa_emission_hz=3.0`` — denser VQA grounding.
- Timeout 24h, episode_parallelism=64, client_concurrency=256 to
  scale to the 25k-trajectory regime when the same recipe is
  pointed at a larger dataset.

Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
pepijn
2026-05-26 05:14:23 +00:00
parent 4913356564
commit 2686450d68
2 changed files with 110 additions and 55 deletions
+64 -43
View File
@@ -1,15 +1,16 @@
#!/usr/bin/env python
"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6 MoE).
Spawns one ``h200x2`` job that:
Spawns one ``h200x4`` job that:
1. installs this branch of ``lerobot`` plus the annotation extras,
2. boots two vllm servers (one per GPU) with Qwen3.6-35B-A3B-FP8,
3. discovers the dataset's canonical subtask + memory vocabulary
from the first 3 sample episodes (phase 0),
4. runs the plan / interjections / vqa modules across the dataset
(subtasks + memory are constrained to the canonical vocabulary),
5. uploads the annotated dataset to ``--dest_repo_id`` (when set)
2. boots four vllm servers (one per H200) with Qwen3.6-35B-A3B-FP8,
3. runs the plan + vqa modules across the dataset in free-form
mode — phase 0 (canonical vocabulary discovery) is disabled so
every episode's subtasks + memory are generated independently;
interjections is also disabled, which short-circuits the
plan_update phase that depends on it,
4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
or back to ``--repo_id``.
Usage:
@@ -37,60 +38,80 @@ CMD = (
"export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
"export VLLM_VIDEO_BACKEND=pyav && "
"lerobot-annotate "
"--repo_id=imstevenpmwork/super_poulain_draft "
"--dest_repo_id=pepijn223/super_poulain_vocab "
"--repo_id=pepijn223/robocasa_smoke_2atomic_v3 "
"--dest_repo_id=pepijn223/robocasa_smoke_2atomic_v3_annotated "
"--push_to_hub=true "
"--vlm.backend=openai "
"--vlm.model_id=Qwen/Qwen3.6-35B-A3B-FP8 "
"--vlm.parallel_servers=2 "
"--vlm.num_gpus=2 "
"--vlm.parallel_servers=4 "
"--vlm.num_gpus=4 "
'--vlm.serve_command="vllm serve Qwen/Qwen3.6-35B-A3B-FP8 '
"--tensor-parallel-size 1 --max-model-len 32768 "
'--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
# 4× the context (32768 → 131072) so long episodes at 1 Hz fit even
# at full Qwen vision resolution: 90 frames @ ~700 vision tokens/frame
# ≈ 63 k tokens, comfortably under 131 k. On 1× H200 (144 GB) the
# 35B-FP8 model leaves plenty of room for the bigger KV cache.
"--tensor-parallel-size 1 --max-model-len 131072 "
'--gpu-memory-utilization 0.85 --uvicorn-log-level warning --port {port}" '
"--vlm.serve_ready_timeout_s=1800 "
"--vlm.client_concurrency=128 "
"--vlm.client_concurrency=256 "
"--vlm.max_new_tokens=512 "
"--vlm.temperature=0.7 "
"--executor.episode_parallelism=16 "
"--executor.episode_parallelism=64 "
"--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
"--vlm.camera_key=observation.images.wrist "
<<<<<<< HEAD:examples/annotation/run_hf_job.py
"--module_1.frames_per_second=1.0 "
"--module_1.use_video_url=true "
"--module_1.use_video_url_fps=1.0 "
"--module_1.derive_task_from_video=always "
"--module_1.n_task_rephrasings=30 "
"--module_2.max_interjections_per_episode=6 "
"--module_3.K=3 "
"--module_3.vqa_emission_hz=1.0 "
"--push_to_hub=pepijn223/super_poulain_full_tool3"
=======
# Phase 0 — canonical vocabulary discovery from the first N sample
# episodes. The VLM picks the right number of subtask + memory
# entries itself from what it sees; the resulting
# meta/canonical_vocabulary.json constrains every subtask + memory
# string to a small repeatable target distribution.
"--vocabulary.sample_episodes=3 "
# Whole-scene agentview is the right choice for subtask reasoning +
# VQA on robocasa: the wrist (``robot0_eye_in_hand``) usually only
# sees the gripper + nearby object, which hurts "what is happening
# in this episode" decomposition. Override per-dataset if your
# cameras are named differently (inspect ``meta/info.json``).
"--vlm.camera_key=observation.images.robot0_agentview_left "
# Phase 0 — canonical vocabulary discovery DISABLED. This dataset's
# episodes span heterogeneous tasks/scenes, so a single shared
# subtask + memory vocabulary would be too narrow — each episode
# generates its subtasks + memory free-form instead.
"--vocabulary.enabled=false "
# Phase 1 — plan module (subtasks + plan + memory + task_aug).
"--plan.enabled=true "
"--plan.frames_per_second=1.0 "
"--plan.use_video_url=true "
"--plan.use_video_url_fps=1.0 "
"--plan.derive_task_from_video=always "
"--plan.n_task_rephrasings=30 "
# Phase 2 — interjections + speech.
"--interjections.max_interjections_per_episode=6 "
# Phase 4 — general VQA.
"--vqa.K=3 "
"--vqa.vqa_emission_hz=1.0"
>>>>>>> origin/feat/language-annotation-pipeline:examples/annotations/run_hf_job.py
# Force coarse, composite subtasks (``pick up X`` = approach + grasp
# + lift in one span, not three). 3 s is large enough to host a
# full grasp-or-place composite at typical 20 fps robocasa speeds;
# any candidate span shorter than this gets merged into a neighbour
# by the prompt's authoring rules (see module_1_subtasks.txt).
"--plan.min_subtask_seconds=3.0 "
# Cap so the VLM can't drift into micro-segmentation. Combined with
# the composite-action rules in the prompt, this targets ~3-6
# meaningful spans per episode for typical pick-and-place demos.
"--plan.plan_max_steps=9 "
# ``off`` keeps the dataset's canonical ``record.episode_task`` as-is
# — no per-episode VLM "what is this video about" call. Switch to
# ``if_short`` (default) only if some episodes have placeholder /
# missing canonical tasks; ``always`` overrides every episode's task.
"--plan.derive_task_from_video=off "
# 0 disables the task_aug pass entirely (see PlanConfig.n_task_rephrasings
# docstring) — no per-episode paraphrase generation, no task_aug rows.
"--plan.n_task_rephrasings=0 "
# Phase 2 — interjections OFF (also skips phase 3 plan_update,
# see executor.py:_run_plan_update_phase guard).
"--interjections.enabled=false "
# Phase 4 — general VQA. K=1 keeps each VQA answer on its own
# emission frame (no temporal smear); see VqaConfig.K docstring.
# 3 Hz cadence: at 20 fps source, that's a VQA tick every ~7 frames.
# NOTE: VQA emits per-camera, so for robocasa (3 cameras) each tick
# produces 3 (user, assistant) row pairs — total call volume ~= 3 *
# 3 Hz * mean_episode_seconds * n_episodes.
"--vqa.enabled=true "
"--vqa.K=1 "
"--vqa.vqa_emission_hz=3.0"
)
job = run_job(
image="vllm/vllm-openai:latest",
command=["bash", "-c", CMD],
flavor="h200x2",
flavor="h200x4",
secrets={"HF_TOKEN": token},
timeout="2h",
timeout="24h",
)
print(f"Job URL: {job.url}")
print(f"Job ID: {job.id}")
@@ -8,14 +8,42 @@ the robot performs.
{vocabulary_block}Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
- Each subtask = one atomic skill the low-level policy can execute.
- Write each subtask as an IMPERATIVE COMMAND, starting with a verb:
move, reach, pick up, grasp, place, put, push, pull, open, close,
turn, press, lift, insert, pour...
- Each subtask = one COMPOSITE atomic skill the low-level policy can
execute end-to-end. A "skill" bundles its own approach motion with
its terminal action — do NOT split the approach off as its own
subtask. The whole-arm policy already learns to reach as part of
every manipulation primitive.
- Write each subtask as an IMPERATIVE COMMAND, starting with one of
these verbs (extend only when none fits):
pick up <obj> — approach + grasp + lift in one subtask
put <obj> on/in <loc> — transport + release in one subtask
place <obj> on/in <loc> — synonym of "put"; pick one and stay consistent
push <obj> — contact + linear shove
pull <obj> — contact + linear retract
turn <knob/dial/handle> — rotary actuation
press <button> — single-press contact
open <drawer/door/lid> — full open motion
close <drawer/door/lid> — full close motion
pour <src> into <dst> — tilt + flow
insert <obj> into <slot>— alignment + push-fit
go to <loc> — ONLY when no grasp / actuation follows
(e.g. a pure relocation between phases).
If the next subtask grasps something at
that location, drop "go to ..." and just
write "pick up ..." instead.
- Forbidden ultra-fine splits — the VLM is NOT allowed to emit these
as standalone subtasks; fold them into the parent composite:
"move to X" → fold into "pick up X" (or whatever follows)
"reach for X" → fold into "pick up X"
"grasp X" → fold into "pick up X"
"lift X" → fold into "pick up X" (or "put X on Y" if it's
the transport phase of a place)
"release X" → fold into "put X on Y" (or "place X in Y")
- Keep it SHORT — a verb phrase, not a sentence. Drop articles
("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
detail (which hand, which grasp point) ONLY when it is needed to
disambiguate.
disambiguate. Every subtask must begin with one of the verbs
above (no leading nouns, no "then", no "first").
- NEVER use third person. Never write "the robot", "the arm", "the
gripper moves", "it picks up" — the robot is implied. Command it,
do not describe it.
@@ -23,16 +51,22 @@ the robot performs.
"cube", every subtask says "cube" — never switch to "block". If it
says "box", never switch to "bin"/"container". Keep vocabulary
consistent across the whole episode.
- Good: "move to blue cube", "grasp blue cube", "lift blue cube",
"place blue cube in box", "open drawer", "release yellow cube".
- Bad: "the robot arm moves towards the blue cube" (third person,
too long), "carefully pick up the cube" (adverb, article),
"release the yellow block" ("block" when the task said "cube").
- Good: "pick up blue cube", "put blue cube in box", "open drawer",
"turn red knob", "press start button", "go to sink".
- Bad: "move to blue cube" (approach as its own subtask — forbidden,
must be folded into "pick up blue cube"); "the robot arm moves
towards the blue cube" (third person, too long); "carefully pick
up the cube" (adverb, article); "release the yellow block"
("block" when the task said "cube", and "release" must be folded
into a "put"/"place" subtask).
- Subtasks are non-overlapping and cover the full episode in order.
Choose the cut points yourself based on what you see in the video
(gripper open/close events, contact, regrasps, transitions).
- Each subtask spans at least {min_subtask_seconds} seconds.
- Do not exceed {max_steps} subtasks total.
- Each subtask spans at least {min_subtask_seconds} seconds. If a
candidate span would be shorter, merge it into its neighbour
rather than emitting it.
- Do not exceed {max_steps} subtasks total. Fewer, larger composites
are preferred over many micro-steps.
- Every subtask's [start_time, end_time] must lie within
[0.0, {episode_duration}] seconds.