feat(annotate): compact steerable annotation prompts

Co-authored-by: Cursor <cursoragent@cursor.com>
Pepijn
2026-05-04 15:56:52 +02:00
parent 223cc8a9e2
commit 5f7c6ba61d
7 changed files with 40 additions and 30 deletions
@@ -21,9 +21,7 @@ from huggingface_hub import get_token, run_job
 token = os.environ.get("HF_TOKEN") or get_token()
 if not token:
-    raise RuntimeError(
-        "No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`"
-    )
+    raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
 CMD = (
     "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
@@ -46,13 +44,16 @@ CMD = (
     "--vlm.client_concurrency=256 "
     "--vlm.max_new_tokens=512 "
     "--executor.episode_parallelism=32 "
-    "--vlm.chat_template_kwargs='{enable_thinking: false}' "
+    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
     "--vlm.camera_key=observation.images.wrist "
     "--module_1.frames_per_second=1.0 "
     "--module_1.use_video_url=true "
     "--module_1.use_video_url_fps=1.0 "
-    "--module_3.K=1 --module_3.vqa_emission_hz=0.2 "
-    "--push_to_hub=pepijn223/super_poulain_qwen36moe-3"
+    "--module_1.derive_task_from_video=always "
+    "--module_1.n_task_rephrasings=10 "
+    "--module_3.K=1 "
+    "--module_3.vqa_emission_hz=1.0 "
+    "--push_to_hub=pepijn223/super_poulain_full_tool2"
 )
 job = run_job(
@@ -18,8 +18,14 @@ Previous memory: {prior_memory}
 Just-completed subtask: "{completed_subtask}"
 Remaining subtasks (for relevance judgement only): {remaining_subtasks}
-Update the memory. Drop irrelevant detail. Compress completed steps.
-Keep WHAT happened, drop HOW. Shorter is better.
+Update the memory as a compact state note.
+Rules:
+- Keep only facts needed later.
+- Keep WHAT changed; drop HOW it was done.
+- Use fragments when clear.
+- Prefer: "bowl in box; lid still open"
+- Avoid: "The robot placed the bowl into the box and the lid remains open."
 Output strictly valid JSON:
-{{ "memory": "<one or two short sentences>" }}
+{{ "memory": "<brief state note>" }}
@@ -1,18 +1,18 @@
 You are the high-level planner for a robot demonstrating: "{episode_task}".
-Given the subtask decomposition below, write a concise hierarchical PLAN
-the robot should follow. Format the plan as a numbered list, one line per
-high-level step. The plan describes the full task; subtasks are the atomic
-skills used to execute it.
+Given the subtask decomposition below, write a compact hierarchical PLAN.
+Use short imperative fragments, like pi0.7 context prompts.
 Subtasks for context:
 {subtasks_text}
 Authoring rules:
 - 3 to {plan_max_steps} steps.
-- Each step describes one logical chunk of the task, not one motion.
+- Each step is one logical chunk, not one motion.
 - Steps must be in execution order.
-- Plain prose, no JSON, no markdown headers.
+- Brief commands, not full sentences.
+- Prefer: "open air fryer"; avoid: "The robot should open the air fryer."
+- Plain text, no markdown headers.
 Output strictly valid JSON:
 {{ "plan": "1. ...\n2. ...\n3. ..." }}
@@ -4,17 +4,18 @@ The user originally asked: "{episode_task}"
 You are shown the entire demonstration as a single video. Watch the
 whole clip, then segment it into a list of consecutive atomic subtasks
-the robot performs.
+the robot performs. Write compact action labels, not prose.
 Authoring rules — based on Hi Robot (Shi 2025) atom granularity and
-Pi0.7 (Physical Intelligence 2025) "how, not what" detail:
+pi0.7 (Physical Intelligence 2025) compact context prompts:
 - Each subtask is one atomic skill the low-level policy can execute,
   e.g. "pick up one piece of lettuce", "place the bowl into the box",
   "move the right arm to the left".
-- Capture HOW the subtask is performed, not only WHAT — e.g. prefer
+- Capture HOW when useful, but keep it brief — e.g. prefer
   "grasp the handle of the sponge with the left hand" to "pick up the
   sponge".
+- Use verb phrases, not full sentences.
 - Subtasks are non-overlapping and cover the full episode in order.
 Choose the cut points yourself based on what you see in the video
 (gripper open/close events, contact, regrasps, transitions).
@@ -9,7 +9,7 @@ Original task:
 Generate exactly {n} alternative phrasings of the same task. Vary:
 - formality (casual / polite / curt)
-- verbosity (short imperative vs longer polite request)
+- verbosity (mostly short imperative; occasional polite request)
 - word choice (synonyms, different verbs)
 - sentence structure (imperative / question / suggestion)
@@ -17,7 +17,7 @@ Hard rules:
 - Each phrasing MUST preserve the exact meaning of the original task.
   Do not change which object is involved, the destination, or the
   action. Do not add extra steps. Do not invent new objects.
-- Each phrasing must be a single short sentence, plain prose, no
+- Each phrasing must be a short phrase or sentence, plain prose, no
   markdown, no quotes, no list numbers.
 - Phrasings must be distinct — no near-duplicates.
 - Output exactly {n} entries.
@@ -1,10 +1,12 @@
 The user just asked the robot: "{episode_task}".
 Generate a short verbal acknowledgement the robot would speak back before
-beginning the task. Style: confident, friendly, single short sentence.
+beginning the task. Style: compact, confident, friendly.
 Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
 "OK, starting with the sponge.", "Got it.".
+Prefer very short replies: "Got it.", "On it.", "OK."
 Output strictly valid JSON:
 {{ "text": "<the spoken acknowledgement>" }}
@@ -14,12 +14,10 @@ subtask boundary in the demonstration:
 - Subtask the robot is about to start: "{next_subtask}"
 - Time into episode: {timestamp:.2f}s
-Write ONE interjection the user would naturally say at this moment to
-prompt / confirm / encourage the robot to do "{next_subtask}". Phrase it
-like a real human mid-task remark — conversational, varied, sometimes
-just a nudge, sometimes a clarification, sometimes a small constraint
-that the upcoming motion happens to satisfy. Plus the robot's verbal
-acknowledgement.
+Write ONE compact interjection the user would naturally say at this
+moment to prompt / confirm / encourage the robot to do "{next_subtask}".
+Keep it like a mid-task coaching cue, not a full instruction paragraph.
+Also write the robot's compact verbal acknowledgement.
 Hard rules:
@@ -29,7 +27,9 @@ Hard rules:
   instead", DO NOT — those would contradict the demonstration.
 - The interjection must reference an object, location, or action that
   is plausible given the visible scene and the next subtask text.
-- One sentence each. Conversational, not robotic.
+- One short phrase or sentence each. Conversational, not robotic.
+- Prefer direct cues: "{next_subtask}, please."; "Now {next_subtask}."
+- Keep robot speech very short: "OK.", "On it.", "Doing that."
 Style examples (vary the phrasing — don't reuse these verbatim):
 - "Now go ahead and {next_subtask}."
@@ -41,6 +41,6 @@ Style examples (vary the phrasing — don't reuse these verbatim):
 Output strictly valid JSON:
 {{
-  "interjection": "<single sentence the user says, asking for the next subtask>",
-  "speech": "<single sentence the robot speaks back, confirming and starting>"
+  "interjection": "<short cue from the user, asking for the next subtask>",
+  "speech": "<short robot acknowledgement>"
 }}
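Side note on the `--vlm.chat_template_kwargs` fix in this commit: the old value `'{enable_thinking: false}'` has an unquoted key, which is not valid JSON. A minimal sketch of why the new value parses and the old one does not, assuming the receiving side uses `json.loads` (the actual parser is not shown in this diff):

```python
import json

def try_parse(s):
    """Return the parsed dict, or None if `s` is not valid JSON."""
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        return None

# Old flag value: bare key, so json.loads rejects it.
old_value = "{enable_thinking: false}"
# New flag value from this commit: quoted key, lowercase JSON literal.
new_value = '{"enable_thinking": false}'

print(try_parse(old_value))  # None
print(try_parse(new_value))  # {'enable_thinking': False}
```

The single quotes around the value in the CMD string are shell quoting, so the inner double quotes reach the program intact.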