Merge origin/feat/language-annotation-pipeline (8 fix(annotate) commits + vocabulary phase)

2026-07-16 14:32:03 +00:00 · 2026-05-25 15:47:25 +02:00
parent 9020635b14 471b2b1b1d
commit c37b1fc7d0
13 changed files with 1139 additions and 50 deletions
@@ -7,7 +7,8 @@

 ## What the pipeline produces

-Three modules write into a per-episode staging tree, then a single writer
+A vocabulary-discovery phase derives a small canonical vocabulary, three
+modules then write into a per-episode staging tree, and a single writer
 rewrites the data shards in place:

 | Style / atom                                | Column                | Module         |
@@ -20,6 +21,21 @@ rewrites the data shards in place:
 | speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections`|
 | `vqa` (user / assistant pair)               | `language_events`     | `vqa`          |

+The `plan` module is constrained to a **canonical vocabulary** discovered
+once per dataset by the `vocabulary` module (phase 0). That module watches
+a few sample episode videos (`--vocabulary.sample_episodes`, default `3`)
+and asks the VLM to derive a small set of imperative subtask labels and
+first-person memory milestones that recur across the demos. The VLM
+chooses the number of entries itself based on what it sees in the
+clips: short pick-and-place demos get ~6 subtask labels, longer
+multi-step recipes get more. The result lands at
+`meta/canonical_vocabulary.json` (human-readable and hand-editable) and
+is reused on every subsequent run. The `plan` module then constrains
+both subtask and memory generation to those exact strings, so the
+downstream low-level policy sees a small, repeatable target
+distribution instead of thousands of LLM paraphrases. Set
+`--vocabulary.enabled=False` to fall back to free-form generation.
+
 The writer does **not** add a `tools` column to the parquet — the tool
 catalog lives at `meta/info.json["tools"]` instead (see
 [Tools](./tools)). After every annotation run the pipeline ensures the
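
The "exact strings" constraint described in the added paragraph can be illustrated with a small standalone check. This is only a sketch, not the pipeline's code: the JSON layout (top-level `subtasks` and `memory` lists) and the helper names `load_vocabulary` and `snap_to_canonical` are assumptions made for illustration, and the diff does not show the real schema of `meta/canonical_vocabulary.json` or how the `plan` module enforces it.

```python
import difflib
import json
from pathlib import Path


def load_vocabulary(dataset_root: Path) -> dict[str, list[str]]:
    """Read the hand-editable vocabulary file from the dataset's meta/ folder.

    Assumed (hypothetical) layout: {"subtasks": [...], "memory": [...]}.
    """
    path = dataset_root / "meta" / "canonical_vocabulary.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def snap_to_canonical(label: str, allowed: list[str]) -> str:
    """Map a free-form label onto the closest canonical string.

    Exact matches pass through unchanged; anything else is replaced by
    the most similar allowed entry (case-insensitive comparison).
    """
    if not allowed:
        raise ValueError("canonical vocabulary is empty")
    if label in allowed:
        return label
    # Compare in lowercase, but return the entry exactly as stored.
    lowered = {entry.lower(): entry for entry in allowed}
    # cutoff=0.0 guarantees a result whenever `allowed` is non-empty.
    best = difflib.get_close_matches(label.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[best[0]]


if __name__ == "__main__":
    subtasks = ["pick up the cup", "place the cup on the shelf"]
    print(snap_to_canonical("Pick up the red cup", subtasks))
```

A post-hoc check like this could flag or normalize labels that drift from the vocabulary after a hand edit. Constraining the VLM prompt itself, as the paragraph describes, avoids the drift in the first place.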