If two consecutive VLM-emitted subtask spans have ``start`` timestamps
that round to the same source frame after ``snap_to_frame`` (e.g. on
short episodes the VLM sometimes nominates two ~adjacent action
boundaries within one 30 Hz step), the writer emits two
``style=subtask`` rows at the identical persistent timestamp. The
training-time renderer's default binding
``subtask: active_at(t, style=subtask)`` then raises:
ValueError: Ambiguous resolver for style='subtask';
add role=..., tool_name=..., or camera=... to disambiguate.
… and the whole training run dies on the first batch.
Observed concretely on ``pepijn223/super_poulain_vocab2`` (job
22159979): episodes 3 and 30 each had two subtask rows at the same
timestamp (``release yellow cube`` + ``retract arm`` snapping to the
same frame).
Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list
and, whenever a snapped start collides with one already used, push the
later span onto the next free frame timestamp. Both subtasks survive
on distinct timestamps; the renderer can now disambiguate. If the
episode genuinely has no later free frame (extremely unlikely — would
require a same-timestamp collision on the very last frame of the
episode), the later span is dropped with a warning rather than left
to poison the render.
New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames``
locks in the contract; full vocabulary suite is 14/14 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The Jaccard-overlap snap was warping VLM output into wrong canonical
labels — e.g. an off-vocab "consult the wizard" span would silently
become "grasp blue cube" if that scored highest. Even with a higher
floor the operator can't tell which subtasks were paraphrases vs
genuine mislabels in the resulting dataset.
Replace with strict exact-match validation + a single targeted retry:
1. Generate subtasks as before.
2. If any returned subtask's normalised form (lowercased, articles
stripped, whitespace collapsed) isn't in the canonical vocab,
fire one retry call naming the offending strings and re-sending
the full canonical list. The retry prompt requires byte-identical
output from the vocab.
3. After the retry, validate again. Spans still off-vocab are
dropped — no fuzzy snapping ever produces a different canonical
label than the VLM actually emitted.
4. If every span ends up off-vocab even after the retry, warn loudly
so the operator extends ``meta/canonical_vocabulary.json`` to
cover the missing phase. The episode is left with empty subtasks
rather than silently fabricated ones — visibility > sweep-under-
the-rug.
Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the
normalisation helper out so the retry-validation and the final
canonicalisation share one source of truth.
Tests:
- test_plan_module_accepts_article_only_difference: "grasp the blue
cube" still maps to canonical "grasp blue cube" (article-tolerant).
- test_plan_module_retries_when_subtask_off_vocab: paraphrase
triggers the retry which the VLM corrects in pass 2.
- test_plan_module_drops_off_vocab_subtask_after_retry: VLM that
refuses to correct → bad span dropped, in-vocab span kept.
- test_plan_module_empty_when_all_off_vocab_after_retry: every
span off-vocab → episode left empty (no warping).
All 13 vocabulary tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When the canonical vocabulary is enabled and the VLM produces spans
that don't overlap any canonical label, the previous Jaccard-floor
(0.5) dropped them and the episode came out with no subtasks at all
— invisible to the downstream policy. Observed on
``pepijn223/super_poulain_vocab``: some episodes had empty subtask
columns because every VLM-emitted phrase scored below 0.5 against
the discovered vocabulary.
Two-pass canonicalisation:
- First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to
let mild paraphrases through) and drops everything below.
- If that first pass leaves the episode with **zero** subtasks,
fall back to a second pass that always snaps each VLM span to
its nearest canonical label by Jaccard (no floor). The episode
ends up with subtasks even when the vocabulary missed a phase
— a slightly-wrong canonical label is still closer to the right
motion than nothing at all.
- Log loudly when the fallback fires so the operator can spot
coverage gaps in ``meta/canonical_vocabulary.json``.
- Log a per-episode count at INFO when some (but not all) spans
were dropped so it's visible without spamming the run output.
Promote the Jaccard floor + ignore-tokens to class constants so
they're a single edit point. Add ``force=True`` parameter to
``_canonicalize_subtask`` for the no-floor fallback path.
New test ``test_plan_module_snaps_when_all_off_vocab`` covers the
fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is
adjusted to keep at least one in-vocab span so the floor path can
still fire and is exercised. All 12 vocabulary tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The pipeline previously emitted near-unique subtask + memory phrasings
per episode (free-form LLM rephrasing). On the downstream low-level
policy that collapses the action expert's conditioning to noise: every
episode pairs a different paraphrase with similar motions, so the
expert learns a flat scene-prior that ignores the subtask string —
then at inference the high-level head invents *yet another* paraphrase
and the expert produces tiny "uncertain hover" chunks.
Add a vocabulary-discovery phase (phase 0) that runs once per dataset:
- watches the first ``vocabulary.sample_episodes`` (default 3)
episode videos as one Qwen-VL prompt,
- asks the VLM to derive ~``n_subtask_target`` canonical imperative
subtask labels and ~``n_memory_target`` first-person past-tense
memory milestones that recur across the demos,
- persists them to ``meta/canonical_vocabulary.json`` (human-
inspectable, hand-editable), and
- wires the resulting ``Vocabulary`` into the ``plan`` module so
every per-episode subtask + memory call is constrained to those
exact strings (both as prompt-side instructions *and* post-VLM
validation: paraphrases snap to the closest canonical entry via
token-set overlap; below a 0.5 Jaccard floor the subtask is
dropped rather than warped into something semantically wrong).
Operator workflow:
- first run discovers the vocabulary, writes the JSON, and runs
the ``plan`` module against it,
- subsequent runs reuse the on-disk file (``reuse_existing=True``
default) so hand-edits stick,
- set ``--vocabulary.enabled=False`` to fall back to free-form
generation (the original behaviour).
The discovery prompt forbids gerunds / third-person / adverbs and
caps the lists to the requested counts, matching the Hi-Robot /
π0.6-MEM convention of small per-environment vocabularies. The
``plan`` module's subtask + memory prompts grow a conditional
``{vocabulary_block}`` slot rendered only when a vocabulary is
present; without one the templates collapse to their previous
free-form form.
Tests: 11 new unit tests under tests/annotations/test_vocabulary.py
cover the on-disk round-trip, discovery against the fixture dataset,
``reuse_existing`` short-circuit, paraphrase canonicalisation, off-
vocab subtask dropping, and the no-vocabulary pass-through path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>