mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-24 13:09:43 +00:00
feat(annotate): phase 0 — derive canonical vocabulary from sample episodes
The pipeline previously emitted near-unique subtask + memory phrasings
per episode (free-form LLM rephrasing). On the downstream low-level
policy that collapses the action expert's conditioning to noise: every
episode pairs a different paraphrase with similar motions, so the
expert learns a flat scene-prior that ignores the subtask string —
then at inference the high-level head invents *yet another* paraphrase
and the expert produces tiny "uncertain hover" chunks.
Add a vocabulary-discovery phase (phase 0) that runs once per dataset:
- watches the first ``vocabulary.sample_episodes`` (default 3)
episode videos as one Qwen-VL prompt,
- asks the VLM to derive ~``n_subtask_target`` canonical imperative
subtask labels and ~``n_memory_target`` first-person past-tense
memory milestones that recur across the demos,
- persists them to ``meta/canonical_vocabulary.json`` (human-
inspectable, hand-editable), and
- wires the resulting ``Vocabulary`` into the ``plan`` module so
every per-episode subtask + memory call is constrained to those
exact strings (both as prompt-side instructions *and* post-VLM
validation: paraphrases snap to the closest canonical entry via
token-set overlap; below a 0.5 Jaccard floor the subtask is
dropped rather than warped into something semantically wrong).
Operator workflow:
- first run discovers the vocabulary, writes the JSON, and runs
the ``plan`` module against it,
- subsequent runs reuse the on-disk file (``reuse_existing=True``
default) so hand-edits stick,
- set ``--vocabulary.enabled=False`` to fall back to free-form
generation (the original behaviour).
The discovery prompt forbids gerunds / third-person / adverbs and
caps the lists to the requested counts, matching the Hi-Robot /
π0.6-MEM convention of small per-environment vocabularies. The
``plan`` module's subtask + memory prompts grow a conditional
``{vocabulary_block}`` slot rendered only when a vocabulary is
present; without one the templates collapse to their previous
free-form form.
Tests: 11 new unit tests under tests/annotations/test_vocabulary.py
cover the on-disk round-trip, discovery against the fixture dataset,
``reuse_existing`` short-circuit, paraphrase canonicalisation, off-
vocab subtask dropping, and the no-vocabulary pass-through path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
@@ -7,7 +7,8 @@
|
||||
|
||||
## What the pipeline produces
|
||||
|
||||
Three modules write into a per-episode staging tree, then a single writer
|
||||
A vocabulary-discovery phase derives a small canonical wording, then three
|
||||
modules write into a per-episode staging tree, then a single writer
|
||||
rewrites the data shards in place:
|
||||
|
||||
| Style / atom | Column | Module |
|
||||
@@ -20,6 +21,20 @@ rewrites the data shards in place:
|
||||
| speech tool-call atom (`style=null`, `say`) | `language_events` | `interjections`|
|
||||
| `vqa` (user / assistant pair) | `language_events` | `vqa` |
|
||||
|
||||
The `plan` module is constrained to a **canonical vocabulary** discovered
|
||||
once per dataset by the `vocabulary` module (phase 0). It watches a few
|
||||
sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
|
||||
asks the VLM to derive a small set of imperative subtask labels
|
||||
(~`--vocabulary.n_subtask_target`, default `10`) and first-person memory
|
||||
milestones (~`--vocabulary.n_memory_target`, default `6`) that recur
|
||||
across the demos. The result lands at
|
||||
`meta/canonical_vocabulary.json` (human-readable / hand-editable) and is
|
||||
reused on every subsequent run. The `plan` module then constrains both
|
||||
subtask + memory generation to those exact strings — the downstream
|
||||
low-level policy sees a small, repeatable target distribution instead of
|
||||
thousands of LLM paraphrases. Disable with `--vocabulary.enabled=False`
|
||||
to fall back to free-form generation.
|
||||
|
||||
The writer does **not** add a `tools` column to the parquet — the tool
|
||||
catalog lives at `meta/info.json["tools"]` instead (see
|
||||
[Tools](./tools)). After every annotation run the pipeline ensures the
|
||||
|
||||
Reference in New Issue
Block a user