feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.
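For concreteness, a minimal sketch of the content-block shape described
above (the helper name and signature are illustrative, not the module
implementation; only the `{"type": "image", ...}` block format comes from
this commit):

```python
from PIL import Image


def build_content(frames: list[Image.Image], text: str) -> list[dict]:
    # Hypothetical helper: one image block per decoded keyframe,
    # prepended before the text block, as described above.
    blocks: list[dict] = [{"type": "image", "image": frame} for frame in frames]
    blocks.append({"type": "text", "text": text})
    return blocks
```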
- New `frames.py`:
- `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
dataset's first `observation.images.*` stream via
`LeRobotDatasetMetadata.get_video_file_path` and
`decode_video_frames`, with the same `from_timestamp` shift the main
dataset uses.
- Per-process LRU cache so co-timestamped Module 1 plan-update + Module
2 calls share decode work.
- `make_frame_provider` falls back to a null provider when the dataset
  has no video tracks → text-only prompts (graceful degradation; see the
  sketch after this list).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
prepend image blocks before the text block.
- Module 1 attaches `keyframes_per_episode` keyframes to the subtask
decomposition prompt.
- Module 2 attaches the frame at the interjection timestamp.
- Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
`Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
`--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
  across 2 GPUs works out of the box (see the vLLM sketch after this list).
- `test_module3_attaches_frame_image_block_to_prompt` asserts Module 3
  emits one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
attachment behaviour and the no-video fallback.
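For orientation, a minimal sketch of the `frames.py` surface described
above. Only the names `FrameProvider`, `VideoFrameProvider`, and
`make_frame_provider` come from this commit; every signature and the
`video_keys` attribute are assumptions, and the actual decode path
(`LeRobotDatasetMetadata.get_video_file_path` + `decode_video_frames`)
is elided:

```python
from functools import lru_cache
from typing import Protocol

from PIL import Image


class FrameProvider(Protocol):
    """What Modules 1/2/3 call to fetch keyframes (signature assumed)."""

    def get_frames(self, episode_index: int, timestamps: tuple[float, ...]) -> list[Image.Image]: ...


class NullFrameProvider:
    """Fallback when the dataset has no video tracks: prompts stay text-only."""

    def get_frames(self, episode_index: int, timestamps: tuple[float, ...]) -> list[Image.Image]:
        return []


class VideoFrameProvider:
    """Decodes keyframes from one `observation.images.*` stream."""

    def __init__(self, meta, camera_key: str):
        self.meta = meta
        self.camera_key = camera_key

    def get_frames(self, episode_index: int, timestamps: tuple[float, ...]) -> list[Image.Image]:
        return [self._decode(episode_index, ts) for ts in timestamps]

    @lru_cache(maxsize=256)  # per-process cache: co-timestamped Module 1/2 calls share decode work
    def _decode(self, episode_index: int, timestamp: float) -> Image.Image:
        raise NotImplementedError  # real code: decode_video_frames(...) with the from_timestamp shift


def make_frame_provider(meta, camera_key: str | None = None) -> FrameProvider:
    # Graceful degradation: no video tracks -> null provider -> text-only prompts.
    video_keys = list(getattr(meta, "video_keys", []) or [])
    if not video_keys:
        return NullFrameProvider()
    return VideoFrameProvider(meta, camera_key or video_keys[0])
```

Timestamps are passed as tuples so the cache key stays hashable; the
real implementation may key its cache differently.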
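And a sketch of the tensor-parallel knob (the function name is a
stand-in for the private `_make_vllm_client`; `LLM(model=...,
tensor_parallel_size=...)` is the standard vLLM constructor):

```python
from vllm import LLM


def make_vllm_client(model_id: str, tensor_parallel_size: int = 1) -> LLM:
    # Forwarding tensor_parallel_size is what lets Qwen3.6-27B-FP8
    # shard across 2 GPUs out of the box.
    return LLM(model=model_id, tensor_parallel_size=tensor_parallel_size)


# Usage matching the README example:
# client = make_vllm_client("Qwen/Qwen3.6-27B-FP8", tensor_parallel_size=2)
```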
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -30,11 +30,18 @@ Install the extra and invoke the console script:
 
 ```bash
 uv sync --extra annotations
 uv run lerobot-annotate \
     --root=/path/to/dataset \
-    --vlm.backend=transformers \
-    --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
+    --repo_id=imstevenpmwork/super_poulain_draft \
+    --vlm.backend=vllm \
+    --vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
+    --vlm.tensor_parallel_size=2
 ```
 
+The pipeline attaches camera keyframes to every Module 1/2/3 prompt by
+default, decoded from the dataset's first `observation.images.*` stream.
+Override with `--vlm.camera_key=observation.images.<name>` to pin a
+specific viewpoint. Datasets with no video tracks fall back to text-only
+prompts automatically.
+
 The executor picks `LocalPipelineExecutor` for small datasets and
 `SlurmPipelineExecutor` for large ones based on
 `--executor.auto_threshold` (default 32 episodes). Force local with