Commit Graph

1459 Commits

Author SHA1 Message Date
Pepijn a5d6ad614f docs(annotate): add HF Jobs runner example for lerobot-annotate
A ready-to-run example of launching the annotation pipeline on a
Hugging Face job (h200x2) with two vllm replicas serving
Qwen3.6-35B-A3B-FP8. Lives next to other end-to-end recipes under
examples/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:00:28 +02:00
Pepijn c0effa02c9 feat(annotate): emit VQA per-camera and propagate camera field
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.

- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
  and a `camera_key=` argument on `frames_at` / `video_for_episode`.
  `VideoFrameProvider` exposes every `observation.images.*` key the
  dataset declares (not just the first) and keys its decode cache on
  `(episode, camera, timestamp)` so per-camera reads don't collide.
  Module 1 / 2 keep their old single-camera behaviour by leaving
  `camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
  .camera_keys` for each emission tick, builds one prompt per camera,
  batches all of them through the VLM, and stamps the resulting rows
  with `camera=<that key>`. Empty `camera_keys` (null provider) makes
  the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
  carry `camera` through and call `validate_camera_field` so the
  invariant is enforced at the writer boundary. Event sort key now
  includes `camera` for deterministic ordering when several cameras
  share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
  field; `_check_camera_field` enforces the invariant and cross-checks
  every view-dependent row's `camera` against the dataset's known video
  keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
  `(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
  `camera_keys` into the validator so the cross-check uses the actual
  dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
  `camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
  configures two cameras and asserts both are represented, that every
  emitted row has a `camera` tag, and that uniqueness holds per
  `(timestamp, camera, role)`.
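
A minimal sketch of the per-camera provider surface described above; the
property and keyword names come from this message, everything else
(signatures, return types) is an assumption:

    from typing import Protocol, Sequence

    class FrameProvider(Protocol):
        @property
        def camera_keys(self) -> Sequence[str]:
            """Every observation.images.* key the dataset declares."""
            ...

        def frames_at(self, episode_index: int, timestamps: Sequence[float],
                      camera_key: str | None = None):
            """camera_key=None falls back to the default (first) camera."""
            ...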

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 4b0354a046 fix(annotate): transcode subclips to H.264 instead of stream-copy
Modern LeRobot datasets store videos in AV1, which vllm's libav build
cannot decode (the video processor returns 0 frames and downstream code
chokes with a ZeroDivisionError). Re-encode each per-episode subclip
with libx264 (preset ultrafast, crf 23) so the resulting mp4 is
universally decodable. Strip audio with -an for a smaller payload.
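
A sketch of the re-encode as a Python subprocess call; the ffmpeg flags are
the ones named above, the paths are placeholders:

    import subprocess

    def transcode_subclip(src: str, dst: str) -> None:
        # Re-encode the AV1 subclip to H.264 so any libav build can decode it.
        subprocess.run(
            ["ffmpeg", "-y", "-i", src,
             "-c:v", "libx264", "-preset", "ultrafast", "-crf", "23",
             "-an",  # strip audio for a smaller payload
             dst],
            check=True,
        )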

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn ac183f019e feat(annotate): pack multiple vllm replicas per GPU via num_gpus
Adds VlmConfig.num_gpus so parallel_servers can exceed the physical
GPU count. Replicas are round-robin-assigned to GPUs (e.g.
parallel_servers=4 + num_gpus=2 → replicas pinned to GPUs 0,1,0,1).
Backward-compatible: num_gpus=0 keeps the existing 1-replica-per-GPU
behavior.
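
The assignment reduces to a modulo over the replica index; a sketch assuming
one CUDA device id per replica:

    # parallel_servers=4, num_gpus=2 -> replicas pinned to GPUs 0, 1, 0, 1
    def gpu_for_replica(replica_index: int, num_gpus: int) -> int:
        if num_gpus <= 0:  # backward-compatible: one replica per GPU
            return replica_index
        return replica_index % num_gpus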

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn bda829af2f feat(annotate): forward chat_template_kwargs to OpenAI extra_body
Lets callers pass per-request template flags such as
{"enable_thinking": false} for Qwen3.5/Qwen3.6 models, where the
default thinking preamble otherwise consumes the entire max_new_tokens
budget before any JSON is emitted.
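
A sketch of the forwarding with the openai Python client; extra_body is the
client's standard escape hatch, the model id is a placeholder:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # placeholder model id
        messages=[{"role": "user", "content": "Return a JSON object."}],
        max_tokens=512,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )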

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 824aac9ad8 fix(annotate): include prompt .txt files in wheel
The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.
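
A sketch of the kind of declaration involved; the package and glob names
here are hypothetical, not the repository's actual ones:

    from setuptools import setup, find_packages

    setup(
        packages=find_packages(),
        package_data={
            # list the prompt .txt templates alongside the existing envs/*.json
            "lerobot.annotate": ["envs/*.json", "prompts/*.txt"],
        },
    )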

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 7e91482e3a refactor(annotate): drop HF Inference Providers code path
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 1d66444577 feat(annotate): --vlm.push_to_hub uploads the annotated dataset
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.
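
A sketch of the upload step with huggingface_hub; the knob names come from
this message, the repo id and folder path are placeholders:

    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo("user/annotated-dataset", repo_type="dataset",
                    private=True, exist_ok=True)  # push_private
    api.upload_folder(
        repo_id="user/annotated-dataset",
        repo_type="dataset",
        folder_path="/path/to/dataset/root",
        commit_message="Add language annotations",  # push_commit_message
        ignore_patterns=[".annotate_staging/**"],   # keep staging out of the repo
    )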

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn d0ad7ffb21 feat(annotate): parallelize episodes within each module phase
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.
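
A minimal sketch of the fan-out, assuming a per-episode worker callable; the
default of 16 mirrors episode_parallelism above:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_phase(episodes, run_episode, episode_parallelism: int = 16):
        # Keep several episodes in flight so every server replica stays busy.
        with ThreadPoolExecutor(max_workers=episode_parallelism) as pool:
            futures = [pool.submit(run_episode, ep) for ep in episodes]
            for fut in as_completed(futures):
                fut.result()  # re-raise worker errors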

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 994ad880ee fix(annotate): probe /v1/models for spawn-helper readiness
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 25bb0f6d6a fix(annotate): lock-protect per-line writes for parallel server streams
Eight server-streaming threads writing characters without synchronization
let UTF-8 sequences from different servers interleave mid-byte, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.
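
A sketch of the line-buffered, lock-protected streaming; the helper name and
ready callback are illustrative:

    import threading

    _print_lock = threading.Lock()

    def stream_server_output(proc, tag, on_ready):
        # proc opened with subprocess.Popen(..., stdout=PIPE, stderr=STDOUT, text=True).
        # Whole-line writes keep UTF-8 sequences from different servers intact.
        for line in iter(proc.stdout.readline, ""):
            with _print_lock:
                print(f"[{tag}] {line}", end="", flush=True)
            if "Uvicorn running" in line or "Application startup complete" in line:
                on_ready()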

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn c8c76c598c feat(annotate): client_concurrency for parallel in-flight requests
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.

Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.
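
A sketch of the client-side fan-out; vllm's continuous batching turns the
concurrent requests into one large server-side batch. Names are illustrative:

    from concurrent.futures import ThreadPoolExecutor

    def generate_json_batch(client, requests, client_concurrency: int = 16):
        # requests: list of kwargs dicts for chat.completions.create
        with ThreadPoolExecutor(max_workers=client_concurrency) as pool:
            return list(pool.map(
                lambda req: client.chat.completions.create(**req), requests))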

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 281c0f3921 feat(annotate): parallel_servers spawns N independent vllm replicas
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.

Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).
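
A sketch of the spawn loop; the GPU pinning and port offsetting follow the
description above, the helper itself is an assumption:

    import os
    import subprocess

    def spawn_replicas(model_id: str, parallel_servers: int, serve_port: int = 8000):
        procs = []
        for i in range(parallel_servers):
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i))  # pin replica i to GPU i
            cmd = ["vllm", "serve", model_id,
                   "--tensor-parallel-size", "1",
                   "--max-model-len", "32768",
                   "--uvicorn-log-level", "warning",
                   "--port", str(serve_port + i)]
            procs.append(subprocess.Popen(cmd, env=env))
        return procs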

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 66ec839bb8 feat(annotate): per-episode progress logs in executor 2026-04-30 10:55:19 +02:00
Pepijn 4cbba00bcd fix(annotate): don't crash pipeline on persistent JSON parse failure
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 4dce8f1bb1 fix(annotate): robust JSON extraction (think tags + first balanced object)
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).
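
A sketch of that extraction order (strip think tags, direct parse, first
balanced object while skipping braces inside strings); the function name is
illustrative:

    import json
    import re

    def extract_json(text: str):
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass
        start = text.find("{")
        while start != -1:
            depth, in_str, esc = 0, False, False
            for i in range(start, len(text)):
                ch = text[i]
                if esc:
                    esc = False
                elif ch == "\\":
                    esc = True
                elif ch == '"':
                    in_str = not in_str
                elif not in_str:
                    if ch == "{":
                        depth += 1
                    elif ch == "}":
                        depth -= 1
                        if depth == 0:
                            try:
                                return json.loads(text[start:i + 1])
                            except json.JSONDecodeError:
                                break  # malformed candidate; try the next brace
            start = text.find("{", start + 1)
        return None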

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 8389f302e7 fix(annotate): stream child stdout char-by-char so tqdm \r progress flushes 2026-04-30 10:55:19 +02:00
Pepijn 035b470fd8 test(annotate): adjust video-block test for fps-based frame sampling 2026-04-30 10:55:19 +02:00
Pepijn 3dee7bf762 feat(annotate): Module 1 samples image frames at fps rate
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.

Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.
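
The sampling rule reduces to a small calculation; a sketch using the config
names from this message:

    def frame_timestamps(duration_s: float,
                         frames_per_second: float = 1.0,
                         max_video_frames: int = 128) -> list[float]:
        # 30 s at 1 fps -> 30 frames; 5 s -> 5; never more than max_video_frames.
        n = max(1, min(int(duration_s * frames_per_second), max_video_frames))
        step = duration_s / n
        return [i * step for i in range(n)]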

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 33053e6996 feat(annotate): use cached HF token from huggingface-cli login
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.
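
A sketch of the fallback chain; get_token() is the huggingface_hub helper
that reads the cached login token:

    import os
    from huggingface_hub import get_token

    def resolve_hf_token():
        return (os.environ.get("HF_TOKEN")
                or os.environ.get("HUGGINGFACE_API_KEY")
                or get_token())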

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 7bb550a5a4 feat(annotate): default to HF Inference Providers, no local GPU needed
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.

Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 358acfc794 feat(annotate): one-flag HF Inference Providers backend
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.
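
A sketch of what that routing looks like from the client side; the model id
with a provider suffix is the one quoted above:

    from openai import OpenAI
    from huggingface_hub import get_token

    client = OpenAI(
        base_url="https://router.huggingface.co/v1",
        api_key=get_token(),  # HF_TOKEN serves as the API key
    )
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct:novita",  # or a plain id to let HF route
        messages=[{"role": "user", "content": "Describe the scene."}],
    )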

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 3916f17c4a fix(annotate): omit mm_processor_kwargs by default; transformers serve rejects it
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 8807e0b41e fix(annotate): mm_processor_kwargs in extra_body; inline file URLs as data URLs
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
  block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
  and inline it as a base64 data:video/mp4 URL so the server sees the
  bytes directly.

Both mistakes surface as std::bad_alloc on the server side, which is an
unhelpful error but explains what we hit.
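
A sketch of the data-URL inlining for the second fix; the content-block shape
follows the OpenAI-style video_url convention, the helper name is illustrative:

    import base64
    from pathlib import Path

    def video_block_from_file(path: str) -> dict:
        # Inline the local mp4 so the server never has to fetch a file:// URL.
        # (fps goes in extra_body.mm_processor_kwargs, not in this block.)
        b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        return {"type": "video_url",
                "video_url": {"url": f"data:video/mp4;base64,{b64}"}}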

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 43d3ba1d4e fix(annotate): detect server ready via stdout banner, not /v1/models polls
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn d1849d1b63 fix(annotate): visible auto_serve via stdout prints + live server log stream
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 1a4358ed8b fix(annotate): auto_serve defaults to True; probe before spawning
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 5003989ae6 feat(annotate): auto_serve mode spawns and tears down inference server
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.

Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn cd12665e85 feat(annotate): video_url block for openai backend
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).

A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.

Enable with --module_1.use_video_url=true (and --vlm.backend=openai).
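
A hedged sketch of the content Module 1 sends in this mode; the subclip path
and prompt text are placeholders:

    subclip = "/data/my_dataset/.annotate_staging/.video_clips/episode_000001.mp4"
    messages = [{
        "role": "user",
        "content": [
            # the server samples frames from the clip at the configured fps
            {"type": "video_url", "video_url": {"url": f"file://{subclip}"}},
            {"type": "text", "text": "Segment this demonstration into subtasks ..."},
        ],
    }]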

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 91dedcad1e feat(annotate): openai-compatible backend for transformers/ktransformers serve
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:

- launch the server out-of-process (transformers serve, vllm serve,
  ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
  --vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...

Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 0782bbcd38 fix(annotate): use vllm.chat() API for multimodal prompts
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 89f15e1731 fix(annotate): drop guided_decoding=dict (api differs across vllm)
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn a36287329e fix(annotate): tolerate decoder returning fewer frames than requested
pyav (and sometimes torchcodec) decoding can return fewer frames than
requested when some of the requested timestamps fall outside the video file's
content range. Drop the strict=True on the zip and rely on the
None-filter to discard missing frames.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 2b79e7e1dd fix(annotate): default video decode backend to pyav
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 7ab3cf3407 fix(annotate): default trust_remote_code=False for HF loaders
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 8ea488080e fix(annotate): default transformers backend to manual GPU placement
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.

LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.
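
A sketch of the two placement paths; the env var name comes from this
message, the model id and dtype are placeholders:

    import os
    import torch
    from transformers import AutoModelForImageTextToText

    model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"  # placeholder
    if os.environ.get("LEROBOT_TRANSFORMERS_DEVICE_MAP") == "auto":
        model = AutoModelForImageTextToText.from_pretrained(
            model_id, torch_dtype=torch.bfloat16, device_map="auto")
    else:
        # Load on CPU first, then move manually; this bypasses accelerate's
        # post-load dispatch path that bad-allocs on some hosts.
        model = AutoModelForImageTextToText.from_pretrained(
            model_id, torch_dtype=torch.bfloat16)
        model = model.to("cuda")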

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 41f26d3b2e fix(annotate): LEROBOT_DISABLE_CUDNN escape hatch for conv3d crash
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.

Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.
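
In PyTorch terms the escape hatch amounts to:

    import os
    import torch

    if os.environ.get("LEROBOT_DISABLE_CUDNN") == "1":
        # Fall back to native PyTorch conv3d kernels (slower but functional)
        # while the torch/cuDNN combo is broken.
        torch.backends.cudnn.enabled = False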

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 50ec7ed2e1 fix(annotate): expose gpu_memory_utilization and max_model_len for vllm
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.

- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
  encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
  default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().
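
A sketch of how those knobs reach the engine; gpu_memory_utilization,
max_model_len and trust_remote_code are standard vllm LLM() arguments, the
values are just the ones suggested above:

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # placeholder model id
        gpu_memory_utilization=0.7,  # leave headroom for the cuDNN workspace
        max_model_len=32768,         # annotation prompts don't need 262k context
        trust_remote_code=True,
    )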

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 6be6b929d8 fix(annotate): pass trust_remote_code=True to HF auto-classes
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 093e580672 fix(annotate): low_cpu_mem_usage=True on transformers load path
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:06 +02:00
Pepijn b4dc785db1 fix(annotate): use device_map='auto' for transformers backend
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:06 +02:00
Pepijn f583307590 fix(annotate): try AutoModelForImageTextToText first, fall back to AutoModelForVision2Seq
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.
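
A sketch of the graceful fallback:

    try:
        from transformers import AutoModelForImageTextToText as AutoVLClass
    except ImportError:
        from transformers import AutoModelForVision2Seq as AutoVLClass

    model = AutoVLClass.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")  # placeholder id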

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 69f049b877 fix(annotate): replace Literal types with str for older draccus
Older draccus versions (e.g. 0.10.x bundled in some envs) lack a decoder
for typing.Literal and raise:
  No decoding function for type typing.Literal['vllm', 'transformers', 'stub']

Switching VlmConfig.backend from Literal to str works under every
draccus version. The runtime branch in vlm_client.make_vlm_client
already validates the value and raises ValueError on unknown backends,
so the constraint stays enforced.
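
A sketch of the pattern: keep the field a plain str for draccus, keep the
constraint in the factory. Field and function names follow this message,
the rest is assumed:

    from dataclasses import dataclass

    @dataclass
    class VlmConfig:
        backend: str = "vllm"  # was Literal["vllm", "transformers", "stub"]

    def make_vlm_client(cfg: VlmConfig):
        if cfg.backend not in ("vllm", "transformers", "stub"):
            raise ValueError(f"Unknown VLM backend: {cfg.backend!r}")
        ...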

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 9d5aa1c63e feat(annotate): Module 1 sees the whole episode as one video block
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model does the temporal pooling itself and
chooses where to cut subtasks — no stride, no count, no keyframe-count knob
to tune.

- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
  max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
  uniformly across the episode duration; ``_NullProvider`` returns []
  for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
  ``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
  the prompt template asks the model to "watch the whole clip, then
  segment it" with cut points decided from gripper/contact/regrasp
  events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
  ``max_video_frames: int = 32`` (model-capacity bound, not annotation
  logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
  the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
  of the old "Decompose the demonstration" phrase that no longer
  appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
  the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 4f4dd49972 feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.

- New `frames.py`:
  - `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
    dataset's first `observation.images.*` stream via
    `LeRobotDatasetMetadata.get_video_file_path` and
    `decode_video_frames`, with the same `from_timestamp` shift the main
    dataset uses.
  - Per-process LRU cache so co-timestamped Module 1 plan-update + Module
    2 calls share decode work.
  - `make_frame_provider` falls back to a null provider when the dataset
    has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
  prepend image blocks before the text block.
  - Module 1 attaches `keyframes_per_episode` keyframes to the subtask
    decomposition prompt.
  - Module 2 attaches the frame at the interjection timestamp.
  - Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
  `Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
  `--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
  on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
  emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
  Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
  attachment behaviour and the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 785cee429e feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.

Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
  refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
  mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
  configurable cadence with one-retry JSON validation.

Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.

Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.

Branched from PR 1 (`feat/language-columns`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 1ca38d9748 fix(language): drop motion from VIEW_DEPENDENT_STYLES
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.

The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:54:12 +02:00
Pepijn 5a6aa64570 feat(language): per-camera tagging on view-dependent styles
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.

- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
  to both Arrow struct types and the HF datasets feature mappings;
  introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
  `is_view_dependent_style` and `validate_camera_field` helpers (camera
  required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
  resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
  `_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
  VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
  Without a `camera` filter, multi-row matches keep raising the existing
  ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
  `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
  the matching image block), keeping the original 0.20 budget and
  documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
  `is_view_dependent_style`, `validate_camera_field` (both required and
  forbidden directions), per-camera `emitted_at` filtering, and the
  ambiguity error when two cameras emit `(vqa, assistant)` at the same
  timestamp without a `camera=` filter. RenderMessagesStep + dataset
  passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
  the per-camera resolver pattern, and the canonical recipe convention.
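
A sketch of the invariant the helpers enforce; the set contents follow this
message (a later commit drops motion from the view-dependent set):

    VIEW_DEPENDENT_STYLES = {"vqa", "motion", "trace"}

    def is_view_dependent_style(style: str) -> bool:
        return style in VIEW_DEPENDENT_STYLES

    def validate_camera_field(style: str, camera) -> None:
        # camera is required iff the style is view-dependent.
        if is_view_dependent_style(style) and camera is None:
            raise ValueError(f"style {style!r} requires a camera tag")
        if not is_view_dependent_style(style) and camera is not None:
            raise ValueError(f"style {style!r} must not carry a camera tag")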

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:48:17 +02:00
Pepijn 0b06790da0 feat(language): add motion (persistent) and trace (event-only) styles
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:21:49 +02:00
Pepijn b43dc39ba4 Add docstrings to all new helpers; revert uv.lock
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:15:03 +02:00