feat(annotate): let the VLM decide vocabulary size

Hardcoding ``n_subtask_target=10`` and ``n_memory_target=6`` baked task
complexity into the config: a simple pick-and-place needs ~6 subtask
labels, a multi-step recipe needs ~20. The VLM already sees the clips,
so let it pick the counts itself from what recurs across episodes.

Drop both knobs from ``VocabularyConfig`` and the ``module_0_vocabulary``
prompt template. The prompt now says "decide the count yourself based
on what you see — the smallest set that still covers every recurring
phase" and adds an "each label must recur across the demos" rule so
the VLM filters out one-off motions.
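
A minimal sketch of what the template change means for callers, assuming
the prompt is filled with ``str.format`` as in the ``format(...)`` call
this commit touches. The two template strings and the helper names
``render`` / ``stale_placeholder`` are hypothetical stand-ins, not the
real ``module_0_vocabulary`` text or API. It illustrates one thing worth
checking: a customised template that still contains
``{n_subtask_target}`` raises ``KeyError`` once that key is no longer
passed.

```python
from typing import Optional

# Hypothetical stand-ins for the prompt template; the real
# module_0_vocabulary text is longer and worded differently.
OLD_TEMPLATE = (
    "Task: {episode_task}. You see {n_episodes} demos. "
    "Aim for ~{n_subtask_target} labels."
)
NEW_TEMPLATE = (
    "Task: {episode_task}. You see {n_episodes} demos. "
    "Decide the count yourself."
)


def render(template: str, task_hint: Optional[str], n_episodes: int) -> str:
    """Fill the template with the only two keys still supplied."""
    return template.format(
        episode_task=task_hint or "(unspecified)",
        n_episodes=n_episodes,
    )


def stale_placeholder(template: str) -> Optional[str]:
    """Return the first placeholder that is no longer supplied, if any."""
    try:
        render(template, None, 1)
    except KeyError as err:
        # str.format reports the missing field name as the KeyError argument.
        return err.args[0]
    return None
```

Running ``stale_placeholder`` over any locally modified prompt file is a
cheap way to spot leftover count placeholders before a long annotation
run.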

Update the launcher script + docs to remove the old knobs.
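
For launcher commands kept outside this repo, a small sketch of the same
cleanup. It assumes flags are written as ``--flag=value`` (the form the
launcher script uses); ``strip_removed_flags`` and the ``annotate``
command name in the usage line are hypothetical. How the CLI reacts to
the removed flags if they are left in is not covered here.

```python
import shlex

# Flags dropped from VocabularyConfig by this commit.
REMOVED_FLAGS = (
    "--vocabulary.n_subtask_target",
    "--vocabulary.n_memory_target",
)


def strip_removed_flags(cmd: str) -> str:
    """Drop `--flag=value` tokens for the removed knobs, keeping order.

    Only the `--flag=value` spelling is handled; a space-separated
    `--flag value` pair would need extra logic.
    """
    kept = [
        tok
        for tok in shlex.split(cmd)
        if tok.split("=", 1)[0] not in REMOVED_FLAGS
    ]
    return shlex.join(kept)
```

Usage: ``strip_removed_flags("annotate --vocabulary.sample_episodes=3
--vocabulary.n_subtask_target=10")`` keeps only the ``sample_episodes``
flag.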

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
pepijn
2026-05-22 11:46:31 +00:00
parent 369ab17110
commit 54221ceea2
5 changed files with 31 additions and 27 deletions
+11 -10
@@ -24,16 +24,17 @@ rewrites the data shards in place:
 The `plan` module is constrained to a **canonical vocabulary** discovered
 once per dataset by the `vocabulary` module (phase 0). It watches a few
 sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
-asks the VLM to derive a small set of imperative subtask labels
-(~`--vocabulary.n_subtask_target`, default `10`) and first-person memory
-milestones (~`--vocabulary.n_memory_target`, default `6`) that recur
-across the demos. The result lands at
-`meta/canonical_vocabulary.json` (human-readable / hand-editable) and is
-reused on every subsequent run. The `plan` module then constrains both
-subtask + memory generation to those exact strings — the downstream
-low-level policy sees a small, repeatable target distribution instead of
-thousands of LLM paraphrases. Disable with `--vocabulary.enabled=False`
-to fall back to free-form generation.
+asks the VLM to derive a small set of imperative subtask labels and
+first-person memory milestones that recur across the demos. The VLM
+picks the right number of entries itself based on what it sees in the
+clips — short pick-and-place demos get ~6 subtask labels, longer
+multi-step recipes get more. The result lands at
+`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
+is reused on every subsequent run. The `plan` module then constrains
+both subtask + memory generation to those exact strings — the
+downstream low-level policy sees a small, repeatable target
+distribution instead of thousands of LLM paraphrases. Disable with
+`--vocabulary.enabled=False` to fall back to free-form generation.
 
 The writer does **not** add a `tools` column to the parquet — the tool
 catalog lives at `meta/info.json["tools"]` instead (see
+4 -5
@@ -55,12 +55,11 @@ CMD = (
     "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
     "--vlm.camera_key=observation.images.wrist "
     # Phase 0 — canonical vocabulary discovery from the first N sample
-    # episodes. The resulting meta/canonical_vocabulary.json constrains
-    # every subtask + memory string to a small repeatable target
-    # distribution; tune the counts for your task complexity.
+    # episodes. The VLM picks the right number of subtask + memory
+    # entries itself from what it sees; the resulting
+    # meta/canonical_vocabulary.json constrains every subtask + memory
+    # string to a small repeatable target distribution.
     "--vocabulary.sample_episodes=3 "
-    "--vocabulary.n_subtask_target=10 "
-    "--vocabulary.n_memory_target=6 "
     # Phase 1 — plan module (subtasks + plan + memory + task_aug).
     "--plan.frames_per_second=1.0 "
     "--plan.use_video_url=true "
@@ -26,12 +26,13 @@ class VocabularyConfig:
     """Phase 0 — dataset-level canonical vocabulary discovery.
 
     Watches the first ``sample_episodes`` episode videos and asks the VLM
-    to derive a small canonical vocabulary (~``n_subtask_target`` subtask
-    labels + ~``n_memory_target`` memory milestones) that every episode
-    in the dataset will reuse. The output lands at
-    ``meta/canonical_vocabulary.json`` and feeds phase 1's subtask +
-    memory generation as both a prompt-side constraint and a post-VLM
-    validation gate.
+    to derive a small canonical vocabulary (subtask labels + memory
+    milestones) that every episode in the dataset will reuse. The VLM
+    decides the count itself from what it sees in the clips — short
+    pick-and-place demos get ~6 labels, longer multi-step recipes more.
+    The output lands at ``meta/canonical_vocabulary.json`` and feeds
+    phase 1's subtask + memory generation as both a prompt-side
+    constraint and a post-VLM validation gate.
 
     Why this exists: free-form LLM rephrasing per episode produces near-
     unique subtask strings, which makes the downstream low-level policy's
@@ -48,8 +49,6 @@ class VocabularyConfig:
 
     enabled: bool = True
     sample_episodes: int = 3
-    n_subtask_target: int = 10
-    n_memory_target: int = 6
     max_video_frames_per_episode: int = 32
     # When True (default), an existing meta/canonical_vocabulary.json is
     # loaded as-is and no VLM call is made — lets operators hand-edit the
@@ -8,6 +8,13 @@ conditioned on these strings — duplicate phrasings (e.g. "grasp blue
 cube" vs "pick up the blue cube") would destroy the conditioning, so
 pick one wording per concept and reuse it everywhere.
 
+Decide how many entries each list needs YOURSELF based on what you see —
+the smallest set that still covers every recurring phase in the demos.
+A simple two-object pick-and-place might need ~6 subtask labels and 2
+memory milestones; a long multi-step recipe needs more. Err on the side
+of FEWER — extra entries that don't recur across episodes weaken the
+conditioning.
+
 You output two lists:
 
 1. `subtasks`: imperative, telegraphic commands the robot can execute.
@@ -16,7 +23,8 @@ You output two lists:
 "cube" — never "block" / "object").
 - Atomic — one skill per subtask (gripper-open events, contact, regrasps,
 transitions all become cut points).
-- Aim for ~{n_subtask_target} labels. Fewer is better than more.
+- Each label must recur across the demos. If you see a motion only
+once across all sample clips, it probably isn't a canonical phase.
 - Good: "move to blue cube", "grasp blue cube", "lift blue cube",
 "place blue cube in box", "release blue cube", "retract arm".
 - Bad: "the robot arm moves towards the blue cube" (third person,
@@ -30,7 +38,6 @@ You output two lists:
 should NOT.
 - First person, past tense. Start with "I".
 - One sentence. Functional outcome only — no grasp / motion detail.
-- Aim for ~{n_memory_target} milestones.
 - Good: "I picked up the blue cube.", "I placed the blue cube in
 the green box.", "I wiped the counter."
 - Bad: "The robot arm grasped the blue cube." (third person),
@@ -190,8 +190,6 @@ class VocabularyDiscoveryModule:
         prompt = load_prompt("module_0_vocabulary").format(
             episode_task=task_hint or "(unspecified)",
             n_episodes=len(sample),
-            n_subtask_target=int(self.config.n_subtask_target),
-            n_memory_target=int(self.config.n_memory_target),
         )
         # Pack one video block per sample episode so the VLM sees the
         # variation across episodes (different starting poses, different