lerobot

admin/lerobot

Fork 0

mirror of https://github.com/huggingface/lerobot.git synced 2026-05-26 05:59:52 +00:00

Commit Graph

Author	SHA1	Message	Date
pepijn	a15e16c072	fix(annotate): replace fuzzy subtask snapping with strict match + one-shot retry The Jaccard-overlap snap was warping VLM output into wrong canonical labels — e.g. an off-vocab "consult the wizard" span would silently become "grasp blue cube" if that scored highest. Even with a higher floor the operator can't tell which subtasks were paraphrases vs genuine mislabels in the resulting dataset. Replace with strict exact-match validation + a single targeted retry: 1. Generate subtasks as before. 2. If any returned subtask's normalised form (lowercased, articles stripped, whitespace collapsed) isn't in the canonical vocab, fire one retry call naming the offending strings and re-sending the full canonical list. The retry prompt requires byte-identical output from the vocab. 3. After the retry, validate again. Spans still off-vocab are dropped — no fuzzy snapping ever produces a different canonical label than the VLM actually emitted. 4. If every span ends up off-vocab even after the retry, warn loudly so the operator extends ``meta/canonical_vocabulary.json`` to cover the missing phase. The episode is left with empty subtasks rather than silently fabricated ones — visibility > sweep-under- the-rug. Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the normalisation helper out so the retry-validation and the final canonicalisation share one source of truth. Tests: - test_plan_module_accepts_article_only_difference: "grasp the blue cube" still maps to canonical "grasp blue cube" (article-tolerant). - test_plan_module_retries_when_subtask_off_vocab: paraphrase triggers the retry which the VLM corrects in pass 2. - test_plan_module_drops_off_vocab_subtask_after_retry: VLM that refuses to correct → bad span dropped, in-vocab span kept. - test_plan_module_empty_when_all_off_vocab_after_retry: every span off-vocab → episode left empty (no warping). All 13 vocabulary tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-23 09:57:27 +00:00
pepijn	336af85c09	fix(annotate): never leave an episode with zero canonical subtasks When the canonical vocabulary is enabled and the VLM produces spans that don't overlap any canonical label, the previous Jaccard-floor (0.5) dropped them and the episode came out with no subtasks at all — invisible to the downstream policy. Observed on ``pepijn223/super_poulain_vocab``: some episodes had empty subtask columns because every VLM-emitted phrase scored below 0.5 against the discovered vocabulary. Two-pass canonicalisation: - First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to let mild paraphrases through) and drops everything below. - If that first pass leaves the episode with zero subtasks, fall back to a second pass that always snaps each VLM span to its nearest canonical label by Jaccard (no floor). The episode ends up with subtasks even when the vocabulary missed a phase — a slightly-wrong canonical label is still closer to the right motion than nothing at all. - Log loudly when the fallback fires so the operator can spot coverage gaps in ``meta/canonical_vocabulary.json``. - Log a per-episode count at INFO when some (but not all) spans were dropped so it's visible without spamming the run output. Promote the Jaccard floor + ignore-tokens to class constants so they're a single edit point. Add ``force=True`` parameter to ``_canonicalize_subtask`` for the no-floor fallback path. New test ``test_plan_module_snaps_when_all_off_vocab`` covers the fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is adjusted to keep at least one in-vocab span so the floor path can still fire and is exercised. All 12 vocabulary tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 12:44:03 +00:00
pepijn	86a7edc590	feat(annotate): phase 0 — derive canonical vocabulary from sample episodes The pipeline previously emitted near-unique subtask + memory phrasings per episode (free-form LLM rephrasing). On the downstream low-level policy that collapses the action expert's conditioning to noise: every episode pairs a different paraphrase with similar motions, so the expert learns a flat scene-prior that ignores the subtask string — then at inference the high-level head invents yet another paraphrase and the expert produces tiny "uncertain hover" chunks. Add a vocabulary-discovery phase (phase 0) that runs once per dataset: - watches the first ``vocabulary.sample_episodes`` (default 3) episode videos as one Qwen-VL prompt, - asks the VLM to derive ~``n_subtask_target`` canonical imperative subtask labels and ~``n_memory_target`` first-person past-tense memory milestones that recur across the demos, - persists them to ``meta/canonical_vocabulary.json`` (human- inspectable, hand-editable), and - wires the resulting ``Vocabulary`` into the ``plan`` module so every per-episode subtask + memory call is constrained to those exact strings (both as prompt-side instructions and post-VLM validation: paraphrases snap to the closest canonical entry via token-set overlap; below a 0.5 Jaccard floor the subtask is dropped rather than warped into something semantically wrong). Operator workflow: - first run discovers the vocabulary, writes the JSON, and runs the ``plan`` module against it, - subsequent runs reuse the on-disk file (``reuse_existing=True`` default) so hand-edits stick, - set ``--vocabulary.enabled=False`` to fall back to free-form generation (the original behaviour). The discovery prompt forbids gerunds / third-person / adverbs and caps the lists to the requested counts, matching the Hi-Robot / π0.6-MEM convention of small per-environment vocabularies. The ``plan`` module's subtask + memory prompts grow a conditional ``{vocabulary_block}`` slot rendered only when a vocabulary is present; without one the templates collapse to their previous free-form form. Tests: 11 new unit tests under tests/annotations/test_vocabulary.py cover the on-disk round-trip, discovery against the fixture dataset, ``reuse_existing`` short-circuit, paraphrase canonicalisation, off- vocab subtask dropping, and the no-vocabulary pass-through path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:40:05 +00:00

Author

SHA1

Message

Date

pepijn

a15e16c072

fix(annotate): replace fuzzy subtask snapping with strict match + one-shot retry

The Jaccard-overlap snap was warping VLM output into wrong canonical
labels — e.g. an off-vocab "consult the wizard" span would silently
become "grasp blue cube" if that scored highest. Even with a higher
floor the operator can't tell which subtasks were paraphrases vs
genuine mislabels in the resulting dataset.

Replace with strict exact-match validation + a single targeted retry:

  1. Generate subtasks as before.
  2. If any returned subtask's normalised form (lowercased, articles
     stripped, whitespace collapsed) isn't in the canonical vocab,
     fire one retry call naming the offending strings and re-sending
     the full canonical list. The retry prompt requires byte-identical
     output from the vocab.
  3. After the retry, validate again. Spans still off-vocab are
     dropped — no fuzzy snapping ever produces a different canonical
     label than the VLM actually emitted.
  4. If every span ends up off-vocab even after the retry, warn loudly
     so the operator extends ``meta/canonical_vocabulary.json`` to
     cover the missing phase. The episode is left with empty subtasks
     rather than silently fabricated ones — visibility > sweep-under-
     the-rug.

Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the
normalisation helper out so the retry-validation and the final
canonicalisation share one source of truth.

Tests:
  - test_plan_module_accepts_article_only_difference: "grasp the blue
    cube" still maps to canonical "grasp blue cube" (article-tolerant).
  - test_plan_module_retries_when_subtask_off_vocab: paraphrase
    triggers the retry which the VLM corrects in pass 2.
  - test_plan_module_drops_off_vocab_subtask_after_retry: VLM that
    refuses to correct → bad span dropped, in-vocab span kept.
  - test_plan_module_empty_when_all_off_vocab_after_retry: every
    span off-vocab → episode left empty (no warping).
All 13 vocabulary tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-23 09:57:27 +00:00

pepijn

336af85c09

fix(annotate): never leave an episode with zero canonical subtasks

When the canonical vocabulary is enabled and the VLM produces spans
that don't overlap any canonical label, the previous Jaccard-floor
(0.5) dropped them and the episode came out with no subtasks at all
— invisible to the downstream policy. Observed on
``pepijn223/super_poulain_vocab``: some episodes had empty subtask
columns because every VLM-emitted phrase scored below 0.5 against
the discovered vocabulary.

Two-pass canonicalisation:

  - First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to
    let mild paraphrases through) and drops everything below.
  - If that first pass leaves the episode with **zero** subtasks,
    fall back to a second pass that always snaps each VLM span to
    its nearest canonical label by Jaccard (no floor). The episode
    ends up with subtasks even when the vocabulary missed a phase
    — a slightly-wrong canonical label is still closer to the right
    motion than nothing at all.
  - Log loudly when the fallback fires so the operator can spot
    coverage gaps in ``meta/canonical_vocabulary.json``.
  - Log a per-episode count at INFO when some (but not all) spans
    were dropped so it's visible without spamming the run output.

Promote the Jaccard floor + ignore-tokens to class constants so
they're a single edit point. Add ``force=True`` parameter to
``_canonicalize_subtask`` for the no-floor fallback path.

New test ``test_plan_module_snaps_when_all_off_vocab`` covers the
fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is
adjusted to keep at least one in-vocab span so the floor path can
still fire and is exercised. All 12 vocabulary tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-22 12:44:03 +00:00

pepijn

86a7edc590

feat(annotate): phase 0 — derive canonical vocabulary from sample episodes

The pipeline previously emitted near-unique subtask + memory phrasings
per episode (free-form LLM rephrasing). On the downstream low-level
policy that collapses the action expert's conditioning to noise: every
episode pairs a different paraphrase with similar motions, so the
expert learns a flat scene-prior that ignores the subtask string —
then at inference the high-level head invents *yet another* paraphrase
and the expert produces tiny "uncertain hover" chunks.

Add a vocabulary-discovery phase (phase 0) that runs once per dataset:

  - watches the first ``vocabulary.sample_episodes`` (default 3)
    episode videos as one Qwen-VL prompt,
  - asks the VLM to derive ~``n_subtask_target`` canonical imperative
    subtask labels and ~``n_memory_target`` first-person past-tense
    memory milestones that recur across the demos,
  - persists them to ``meta/canonical_vocabulary.json`` (human-
    inspectable, hand-editable), and
  - wires the resulting ``Vocabulary`` into the ``plan`` module so
    every per-episode subtask + memory call is constrained to those
    exact strings (both as prompt-side instructions *and* post-VLM
    validation: paraphrases snap to the closest canonical entry via
    token-set overlap; below a 0.5 Jaccard floor the subtask is
    dropped rather than warped into something semantically wrong).

Operator workflow:

  - first run discovers the vocabulary, writes the JSON, and runs
    the ``plan`` module against it,
  - subsequent runs reuse the on-disk file (``reuse_existing=True``
    default) so hand-edits stick,
  - set ``--vocabulary.enabled=False`` to fall back to free-form
    generation (the original behaviour).

The discovery prompt forbids gerunds / third-person / adverbs and
caps the lists to the requested counts, matching the Hi-Robot /
π0.6-MEM convention of small per-environment vocabularies. The
``plan`` module's subtask + memory prompts grow a conditional
``{vocabulary_block}`` slot rendered only when a vocabulary is
present; without one the templates collapse to their previous
free-form form.

Tests: 11 new unit tests under tests/annotations/test_vocabulary.py
cover the on-disk round-trip, discovery against the fixture dataset,
``reuse_existing`` short-circuit, paraphrase canonicalisation, off-
vocab subtask dropping, and the no-vocabulary pass-through path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-22 11:40:05 +00:00

3 Commits