Resolve conflicts and pull in the latest PR 1 fixes.
Conflicts:
- pyproject.toml: PR 1 added `lerobot-rollout` and PR 2 added
`lerobot-annotate` to the same `[project.scripts]` block. Kept both.
- uv.lock: dropped both sides and regenerated against the merged
`pyproject.toml` (PR 2 dropped the `datatrove` dep when distribution
moved to HF Jobs; PR 1's lock didn't have it).
Test follow-up:
- `tests/annotations/test_pipeline_recipe_render.py` — PR 1 deleted
`src/lerobot/configs/recipes/pi05_hirobot.yaml` (review feedback:
remove the canonical-recipe file; recipes are user-supplied). The
cross-PR contract this test guards is "the recipe DSL renders
non-empty messages from pipeline output", which doesn't depend on
any specific YAML, so the test now builds an inline blend recipe
with the same coverage. Passes.
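A rough sketch of the reworked test's shape, with `render_recipe` and
`make_pipeline_output` as hypothetical stand-ins for the real DSL entry
points:

```python
# Hypothetical names throughout; only the asserted contract is real.
def test_recipe_dsl_renders_nonempty_messages():
    recipe = {  # inline blend recipe replacing the deleted YAML
        "blend": [
            {"style": "vqa", "weight": 0.5},
            {"style": "subtask", "weight": 0.5},
        ]
    }
    messages = render_recipe(recipe, make_pipeline_output())
    assert messages, "recipe DSL must render non-empty messages"
    assert all(m["content"] for m in messages)
```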
Sweep: 82 passed, 2 failed (pre-existing module-impl bugs:
`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`).
The PR 1 carryover (`test_emitted_at_raises_on_ambiguous_per_camera_vqa`)
is now passing — the merge brought in PR 1's tightened `_select_one`
ambiguity check.
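For reference, the tightened check amounts to something like this (a
minimal sketch assuming dict rows and keyword equality filters, not the
repo's exact signature):

```python
def _select_one(rows, **constraints):
    matches = [r for r in rows
               if all(r.get(k) == v for k, v in constraints.items())]
    if len(matches) > 1:
        raise ValueError(f"ambiguous selection: {len(matches)} rows match {constraints}")
    if not matches:
        raise KeyError(f"no row matches {constraints}")
    return matches[0]
```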
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ensure annotated datasets advertise language columns in meta/info.json so non-streaming dataset loads cast against the rewritten parquet schema.
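A minimal sketch of the idea, assuming the v3-style layout where
`meta/info.json` maps column names to dtype descriptors under a
top-level `features` key (descriptor fields are illustrative):

```python
import json
from pathlib import Path

def advertise_language_columns(dataset_root: Path, columns: list[str]) -> None:
    # Non-streaming loads build their cast schema from meta/info.json, so
    # every column the annotation writer adds to the parquet must be declared.
    info_path = dataset_root / "meta" / "info.json"
    info = json.loads(info_path.read_text())
    for col in columns:
        info["features"].setdefault(col, {"dtype": "string", "shape": [1], "names": None})
    info_path.write_text(json.dumps(info, indent=4))
```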
Co-authored-by: Cursor <cursoragent@cursor.com>
PR 2 used to write a top-level ``tools`` column on every parquet shard
holding the JSON schema for the ``say`` tool, broadcast identically
across every row. That extends PR 1's schema for no real information
gain — the schema is a fixed code constant, parquet's RLE/dict encoding
collapses it on disk anyway, and HF/TRL chat-template consumers can
just import the constant directly.
PR 2 should fill in PR 1's existing schema, not add to it. So:
- ``writer.py``: stop emitting the ``tools`` column. Strip any legacy
``tools`` column from older shards on rerun so the schema converges to
v3.1 (see the sketch after this list). ``SAY_TOOL_SCHEMA`` stays as a
public constant (now joined by ``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``);
chat-template policies and the visualizer import them directly.
- ``test_writer.py``: replace the "tools column present" assertion with
one that explicitly checks the column is absent, plus a new test
asserting the constant's shape.
- ``test_pipeline_recipe_render.py``: drop the tools-column read; assert
it's not present in the rewritten parquet.
- ``annotation_pipeline.mdx``: update the writer description to note the
parquet stays small and the schema lives as a code constant.
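A sketch of the writer-side cleanup, assuming pyarrow shards; the
constant's value is abbreviated here:

```python
import pyarrow.parquet as pq

# Fixed code constants; consumers import these instead of reading a column.
SAY_TOOL_SCHEMA = {
    "type": "function",
    "function": {"name": "say", "parameters": {"type": "object"}},  # abbreviated
}
DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]

def strip_legacy_tools_column(shard_path: str) -> None:
    # Older shards broadcast the schema across every row; drop the column
    # on rerun so the on-disk schema converges back to v3.1.
    table = pq.read_table(shard_path)
    if "tools" in table.column_names:
        pq.write_table(table.drop_columns(["tools"]), shard_path)
```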
If multi-tool-set support ever becomes real (datasets with different
tool inventories), the right home is ``meta/info.json["tools"]`` —
adding it later is non-breaking; ripping out a parquet column already
shipped is not.
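If it ever does land, the shape would be roughly (purely illustrative):

```python
import json
info = json.loads((dataset_root / "meta" / "info.json").read_text())
info["tools"] = DEFAULT_TOOLS  # one per-dataset inventory, zero per-row cost
```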
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can select per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.
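Roughly, with two cameras emitting at the same tick (call shapes follow
the commit text; the camera key is a hypothetical example):

```python
emitted_at(t, style="vqa", role="assistant")          # two cameras: ambiguous, raises
emitted_at(t, style="vqa", role="assistant",
           camera="observation.images.wrist")         # resolves to a unique row
```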
- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
and a `camera_key=` argument on `frames_at` / `video_for_episode`
(see the sketch after this list). `VideoFrameProvider` exposes every
`observation.images.*` key the dataset declares (not just the first)
and keys its decode cache on `(episode, camera, timestamp)` so
per-camera reads don't collide. Modules 1/2 keep their old
single-camera behaviour by leaving `camera_key=None` (falls back to
the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
.camera_keys` for each emission tick, builds one prompt per camera,
batches all of them through the VLM, and stamps the resulting rows
with `camera=<that key>`. Empty `camera_keys` (null provider) makes
the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
carry `camera` through and call `validate_camera_field` so the
invariant is enforced at the writer boundary. Event sort key now
includes `camera` for deterministic ordering when several cameras
share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
field; `_check_camera_field` enforces the invariant and cross-checks
every view-dependent row's `camera` against the dataset's known video
keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
`(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
`camera_keys` into the validator so the cross-check uses the actual
dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
`camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
configures two cameras and asserts both are represented, that every
emitted row has a `camera` tag, and that uniqueness holds per
`(timestamp, camera, role)`.
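A sketch of the widened provider surface, inferred from the items
above; exact signatures may differ:

```python
from typing import Protocol, Sequence
from PIL import Image

class FrameProvider(Protocol):
    @property
    def camera_keys(self) -> Sequence[str]:
        """Every observation.images.* key the dataset declares; empty on the null provider."""

    def frames_at(
        self, episode: int, timestamps: Sequence[float], camera_key: str | None = None
    ) -> list[Image.Image]:
        """camera_key=None preserves the old single-camera default (Modules 1/2)."""
```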
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks; no stride, frame-count, or keyframe knob to
tune.
- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
uniformly across the episode duration (see the sketch after this
list); ``_NullProvider`` returns [] for the no-video fallback. New
``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
the prompt template asks the model to "watch the whole clip, then
segment it" with cut points decided from gripper/contact/regrasp
events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
``max_video_frames: int = 32`` (model-capacity bound, not annotation
logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
of the old "Decompose the demonstration" phrase that no longer
appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
the no-video fallback.
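The sampler behind ``video_for_episode`` reduces to something like this
(a sketch; episode-bound names are assumptions):

```python
def sample_timestamps(start_ts: float, end_ts: float, max_frames: int = 32) -> list[float]:
    # Uniform coverage of the whole demonstration, bounded only by model
    # capacity (max_video_frames), not by any annotation-logic knob.
    if max_frames <= 0 or end_ts <= start_ts:
        return []
    if max_frames == 1:
        return [start_ts]
    step = (end_ts - start_ts) / (max_frames - 1)
    return [start_ts + i * step for i in range(max_frames)]
```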
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.
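Concretely, the prompt assembly amounts to this (a sketch; the helper
name is illustrative):

```python
from PIL import Image

def with_image_blocks(text_prompt: str, frames: list[Image.Image]) -> list[dict]:
    # Image blocks are prepended; the original text block stays last.
    blocks: list[dict] = [{"type": "image", "image": f} for f in frames]
    blocks.append({"type": "text", "text": text_prompt})
    return blocks
```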
- New `frames.py`:
- `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
dataset's first `observation.images.*` stream via
`LeRobotDatasetMetadata.get_video_file_path` and
`decode_video_frames`, with the same `from_timestamp` shift the main
dataset uses.
- Per-process LRU cache so co-timestamped Module 1 plan-update + Module
2 calls share decode work (sketched after this list).
- `make_frame_provider` falls back to a null provider when the dataset
has no video tracks, so prompts degrade gracefully to text-only.
- Modules 1/2/3 take an optional `frame_provider` (default null) and
prepend image blocks before the text block.
- Module 1 attaches `keyframes_per_episode` keyframes to the subtask
decomposition prompt.
- Module 2 attaches the frame at the interjection timestamp.
- Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
`Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
`--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
attachment behaviour and the no-video fallback.
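The shared decode cache is essentially a per-process
`functools.lru_cache` over the decode call; `_decode` below is a
hypothetical wrapper around `get_video_file_path` +
`decode_video_frames`:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def _decoded_frame(episode_index: int, camera_key: str, timestamp: float):
    # Co-timestamped Module 1 plan-update and Module 2 calls hit the same
    # cache key, so each frame is decoded once per process.
    return _decode(episode_index, camera_key, timestamp)  # hypothetical helper
```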
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>