If two consecutive VLM-emitted subtask spans have ``start`` timestamps
that round to the same source frame after ``snap_to_frame`` (e.g. on
short episodes the VLM sometimes nominates two ~adjacent action
boundaries within one 30 Hz step), the writer emits two
``style=subtask`` rows at the identical persistent timestamp. The
training-time renderer's default binding
``subtask: active_at(t, style=subtask)`` then raises:
ValueError: Ambiguous resolver for style='subtask';
add role=..., tool_name=..., or camera=... to disambiguate.
… and the whole training run dies on the first batch.
Observed concretely on ``pepijn223/super_poulain_vocab2`` (job
22159979): episodes 3 and 30 each had two subtask rows at the same
timestamp (``release yellow cube`` + ``retract arm`` snapping to the
same frame).
Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list
and, whenever a snapped start collides with one already used, push the
later span onto the next free frame timestamp. Both subtasks survive
on distinct timestamps; the renderer can now disambiguate. If the
episode genuinely has no later free frame (extremely unlikely — would
require a same-timestamp collision on the very last frame of the
episode), the later span is dropped with a warning rather than left
to poison the render.
New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames``
locks in the contract; full vocabulary suite is 14/14 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The Jaccard-overlap snap was warping VLM output into wrong canonical
labels — e.g. an off-vocab "consult the wizard" span would silently
become "grasp blue cube" if that scored highest. Even with a higher
floor the operator can't tell which subtasks were paraphrases vs
genuine mislabels in the resulting dataset.
Replace with strict exact-match validation + a single targeted retry:
1. Generate subtasks as before.
2. If any returned subtask's normalised form (lowercased, articles
stripped, whitespace collapsed) isn't in the canonical vocab,
fire one retry call naming the offending strings and re-sending
the full canonical list. The retry prompt requires byte-identical
output from the vocab.
3. After the retry, validate again. Spans still off-vocab are
dropped — no fuzzy snapping ever produces a different canonical
label than the VLM actually emitted.
4. If every span ends up off-vocab even after the retry, warn loudly
so the operator extends ``meta/canonical_vocabulary.json`` to
cover the missing phase. The episode is left with empty subtasks
rather than silently fabricated ones — visibility > sweep-under-
the-rug.
Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the
normalisation helper out so the retry-validation and the final
canonicalisation share one source of truth.
Tests:
- test_plan_module_accepts_article_only_difference: "grasp the blue
cube" still maps to canonical "grasp blue cube" (article-tolerant).
- test_plan_module_retries_when_subtask_off_vocab: paraphrase
triggers the retry which the VLM corrects in pass 2.
- test_plan_module_drops_off_vocab_subtask_after_retry: VLM that
refuses to correct → bad span dropped, in-vocab span kept.
- test_plan_module_empty_when_all_off_vocab_after_retry: every
span off-vocab → episode left empty (no warping).
All 13 vocabulary tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When the canonical vocabulary is enabled and the VLM produces spans
that don't overlap any canonical label, the previous Jaccard-floor
(0.5) dropped them and the episode came out with no subtasks at all
— invisible to the downstream policy. Observed on
``pepijn223/super_poulain_vocab``: some episodes had empty subtask
columns because every VLM-emitted phrase scored below 0.5 against
the discovered vocabulary.
Two-pass canonicalisation:
- First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to
let mild paraphrases through) and drops everything below.
- If that first pass leaves the episode with **zero** subtasks,
fall back to a second pass that always snaps each VLM span to
its nearest canonical label by Jaccard (no floor). The episode
ends up with subtasks even when the vocabulary missed a phase
— a slightly-wrong canonical label is still closer to the right
motion than nothing at all.
- Log loudly when the fallback fires so the operator can spot
coverage gaps in ``meta/canonical_vocabulary.json``.
- Log a per-episode count at INFO when some (but not all) spans
were dropped so it's visible without spamming the run output.
Promote the Jaccard floor + ignore-tokens to class constants so
they're a single edit point. Add ``force=True`` parameter to
``_canonicalize_subtask`` for the no-floor fallback path.
New test ``test_plan_module_snaps_when_all_off_vocab`` covers the
fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is
adjusted to keep at least one in-vocab span so the floor path can
still fire and is exercised. All 12 vocabulary tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The pipeline previously emitted near-unique subtask + memory phrasings
per episode (free-form LLM rephrasing). On the downstream low-level
policy that collapses the action expert's conditioning to noise: every
episode pairs a different paraphrase with similar motions, so the
expert learns a flat scene-prior that ignores the subtask string —
then at inference the high-level head invents *yet another* paraphrase
and the expert produces tiny "uncertain hover" chunks.
Add a vocabulary-discovery phase (phase 0) that runs once per dataset:
- watches the first ``vocabulary.sample_episodes`` (default 3)
episode videos as one Qwen-VL prompt,
- asks the VLM to derive ~``n_subtask_target`` canonical imperative
subtask labels and ~``n_memory_target`` first-person past-tense
memory milestones that recur across the demos,
- persists them to ``meta/canonical_vocabulary.json`` (human-
inspectable, hand-editable), and
- wires the resulting ``Vocabulary`` into the ``plan`` module so
every per-episode subtask + memory call is constrained to those
exact strings (both as prompt-side instructions *and* post-VLM
validation: paraphrases snap to the closest canonical entry via
token-set overlap; below a 0.5 Jaccard floor the subtask is
dropped rather than warped into something semantically wrong).
Operator workflow:
- first run discovers the vocabulary, writes the JSON, and runs
the ``plan`` module against it,
- subsequent runs reuse the on-disk file (``reuse_existing=True``
default) so hand-edits stick,
- set ``--vocabulary.enabled=False`` to fall back to free-form
generation (the original behaviour).
The discovery prompt forbids gerunds / third-person / adverbs and
caps the lists to the requested counts, matching the Hi-Robot /
π0.6-MEM convention of small per-environment vocabularies. The
``plan`` module's subtask + memory prompts grow a conditional
``{vocabulary_block}`` slot rendered only when a vocabulary is
present; without one the templates collapse to their previous
free-form form.
Tests: 11 new unit tests under tests/annotations/test_vocabulary.py
cover the on-disk round-trip, discovery against the fixture dataset,
``reuse_existing`` short-circuit, paraphrase canonicalisation, off-
vocab subtask dropping, and the no-vocabulary pass-through path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Port the steerable-pipeline refinements developed on feat/smolvla-on-
steerable back into the annotation pipeline itself:
- module_1_subtasks: imperative verb-first telegraphic labels with a
consistent-object-noun rule and good/bad examples (no hard word cap).
- _generate_plan: drop the VLM round-trip; the plan is now a
deterministic numbered list of still-todo subtasks, re-emitted at
every subtask boundary so it shrinks as work progresses. Removes
module_1_plan.txt.
- VqaConfig.K 3 -> 1: a VQA pair anchors exactly its emission frame, no
stale-label temporal smear.
- lerobot-annotate: tag the pushed dataset with its codebase_version so
LeRobotDataset can resolve a revision and load it.
- module_2_interjection: shorter, more natural mid-task cues.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PyAV segfaulted (exit 139) decoding the AV1 streams modern LeRobot
datasets use — a SIGSEGV that the per-episode try/except cannot catch,
killing the whole job when the interjections phase started.
Replace the PyAV fallback with _decode_frames_ffmpeg, which shells out
to the ffmpeg CLI: a full ffmpeg build decodes AV1, and a child-process
crash is a catchable non-zero exit rather than a segfault. Decoder chain
is now torchcodec -> ffmpeg. _decode_frames_av stays available behind
video_backend="pyav" for callers that want it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pyav fallback routed through lerobot's decode_video_frames(backend=
"pyav"), which uses torchvision.io.VideoReader — removed in torchvision
0.23+. On modern torch stacks (e.g. vllm-openai with torchvision 0.26)
both torchcodec and that path fail, leaving interjection/vqa prompts
without visual context.
Add _decode_frames_av: a self-contained PyAV decoder that picks the
nearest frame per timestamp. It is the always-available tail of the
decoder chain (torchcodec -> pyav) and the target of --video_backend=pyav.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses three of CarolinePascal's frames.py comments (the fourth, the
subprocess re-encode, waits on #3611):
- replace the bespoke _decode_pyav_direct PyAV decoder with
lerobot.datasets.video_utils.decode_video_frames (torchcodec backend,
PyAV fallback) — torchvision's VideoReader removal no longer applies
- frames flow through the provider as torch.Tensor (C, H, W uint8); PIL
is materialised only at the VLM-message boundary in to_image_blocks /
to_video_block, where the chat backends need it
- _decode now returns exactly one frame per timestamp (or [] on failure),
so frames_at pairs them with strict=True
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- name the three modules everywhere (plan / interjections / vqa) instead
of module_1/2/3 — config classes, config fields, executor params,
staging keys and phase names now carry the module name
- rename examples/annotation -> examples/annotations; add the Apache
header to run_hf_job.py
- drop the unused GeneralVqaModule._generate_one
- remove "PR 1" references from comments/docstrings
- frames.py: rely on the always-defined LeRobotDatasetMetadata.camera_keys
- executor.py: read/write meta/info.json via load_info / write_info
- reader.py: load meta/tasks.parquet via io_utils.load_tasks
- make --push_to_hub a bool; push the annotated dataset back to --repo_id
- move the on-disk test dataset builder into tests/fixtures
(build_annotation_dataset); run_e2e_smoke reuses it
- clarify in the docs that the vqa module grounds each pair on a single
frame (K = per-tick anchor count)
- hoist stdlib dynamic imports to module scope
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both tests were stale relative to design changes that landed earlier on
this branch. Update the tests to match the current production contract.
**``test_module1_attaches_video_block_to_subtask_prompt``**
The test took ``captured[0]`` and asserted on its content blocks, but
Module 1 issues several sub-prompts and the rephrasings call (which is
text-only, no video block) usually lands first. Two fixes:
* The test's intent is "the subtask prompt carries the video block" —
not "the first prompt carries it". Pick the call by content
(``"atomic subtasks"`` keyword in the text block) so the test is
resilient to future reordering of unrelated sub-prompts.
* Set ``n_task_rephrasings=0`` so the rephrasings call is skipped
entirely — keeps the test focused on ``_generate_subtasks``.
**``test_module2_mid_episode_emits_paired_interjection_and_speech``**
Two issues both rooted in design changes on the branch:
1. ``InterjectionsAndSpeechModule._mid_episode_interjections`` now
anchors interjections on subtask boundaries from Module 1's staging
tree, bailing out with zero rows when no spans exist. The production
executor runs Module 1 first; the test ran Module 2 in isolation.
Reproduce the contract by seeding two ``style=subtask`` rows in the
staging before calling Module 2 — gives it the single ``0 → 1``
boundary it needs.
2. The test's stub responder used the marker ``"ONE realistic
interruption"`` to match the interjection prompt, but that string is
from a previous prompt version. The current
``module_2_interjection.txt`` says ``"Write ONE interjection..."`` —
the old prompt asked for counterfactual interjections (e.g. "skip the
wipe"), the new one anchors on the upcoming subtask. Marker updated
to ``"Write ONE interjection"``; canned response wording aligned to
the new design.
Sweep on the language stack: 66 passed, 0 failed (was 64 passed, 2
failed). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve conflicts and pull in the latest PR 1 fixes.
Conflicts:
- pyproject.toml: PR 1 added `lerobot-rollout` and PR 2 added
`lerobot-annotate` to the same `[project.scripts]` block. Kept both.
- uv.lock: dropped both sides and regenerated against the merged
`pyproject.toml` (PR 2 dropped the `datatrove` dep when distribution
moved to HF Jobs; PR 1's lock didn't have it).
Test follow-up:
- `tests/annotations/test_pipeline_recipe_render.py` — PR 1 deleted
`src/lerobot/configs/recipes/pi05_hirobot.yaml` (review feedback:
remove the canonical-recipe file; recipes are user-supplied). The
cross-PR contract this test guards is "the recipe DSL renders
non-empty messages from pipeline output", which doesn't depend on
any specific YAML, so the test now builds an inline blend recipe
with the same coverage. Passes.
Sweep: 82 passed, 2 failed (pre-existing module-impl bugs:
`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`).
The PR 1 carryover (`test_emitted_at_raises_on_ambiguous_per_camera_vqa`)
is now passing — the merge brought in PR 1's tightened `_select_one`
ambiguity check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ensure annotated datasets advertise language columns in meta/info.json so non-streaming dataset loads cast against the rewritten parquet schema.
Co-authored-by: Cursor <cursoragent@cursor.com>
PR 2 used to write a top-level ``tools`` column on every parquet shard
holding the JSON schema for the ``say`` tool, broadcast identically
across every row. That extends PR 1's schema for no real information
gain — the schema is a fixed code constant, parquet's RLE/dict encoding
collapses it on disk anyway, and HF/TRL chat-template consumers can
just import the constant directly.
PR 2 should fill in PR 1's existing schema, not add to it. So:
- ``writer.py``: stop emitting the ``tools`` column. Strip any legacy
``tools`` column from older shards on rerun so the schema converges to
v3.1. ``SAY_TOOL_SCHEMA`` stays as a public constant (now joined by
``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``); chat-template policies and the
visualizer import them directly.
- ``test_writer.py``: replace the "tools column present" assertion with
one that explicitly checks the column is absent, plus a new test
asserting the constant's shape.
- ``test_pipeline_recipe_render.py``: drop the tools-column read; assert
it's not present in the rewritten parquet.
- ``annotation_pipeline.mdx``: update the writer description to note the
parquet stays small and the schema lives as a code constant.
If multi-tool-set support ever becomes real (datasets with different
tool inventories), the right home is ``meta/info.json["tools"]`` —
adding it later is non-breaking; ripping out a parquet column already
shipped is not.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.
- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
and a `camera_key=` argument on `frames_at` / `video_for_episode`.
`VideoFrameProvider` exposes every `observation.images.*` key the
dataset declares (not just the first) and keys its decode cache on
`(episode, camera, timestamp)` so per-camera reads don't collide.
Module 1 / 2 keep their old single-camera behaviour by leaving
`camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
.camera_keys` for each emission tick, builds one prompt per camera,
batches all of them through the VLM, and stamps the resulting rows
with `camera=<that key>`. Empty `camera_keys` (null provider) makes
the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
carry `camera` through and call `validate_camera_field` so the
invariant is enforced at the writer boundary. Event sort key now
includes `camera` for deterministic ordering when several cameras
share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
field; `_check_camera_field` enforces the invariant and cross-checks
every view-dependent row's `camera` against the dataset's known video
keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
`(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
`camera_keys` into the validator so the cross-check uses the actual
dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
`camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
configures two cameras and asserts both are represented, that every
emitted row has a `camera` tag, and that uniqueness holds per
`(timestamp, camera, role)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks — no stride, no count, no keyframe count knob to
tune.
- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
uniformly across the episode duration; ``_NullProvider`` returns []
for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
the prompt template asks the model to "watch the whole clip, then
segment it" with cut points decided from gripper/contact/regrasp
events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
``max_video_frames: int = 32`` (model-capacity bound, not annotation
logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
of the old "Decompose the demonstration" phrase that no longer
appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.
- New `frames.py`:
- `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
dataset's first `observation.images.*` stream via
`LeRobotDatasetMetadata.get_video_file_path` and
`decode_video_frames`, with the same `from_timestamp` shift the main
dataset uses.
- Per-process LRU cache so co-timestamped Module 1 plan-update + Module
2 calls share decode work.
- `make_frame_provider` falls back to a null provider when the dataset
has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
prepend image blocks before the text block.
- Module 1 attaches `keyframes_per_episode` keyframes to the subtask
decomposition prompt.
- Module 2 attaches the frame at the interjection timestamp.
- Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
`Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
`--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
attachment behaviour and the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>