Commit Graph

1459 Commits

Author SHA1 Message Date
Pepijn a5d6ad614f docs(annotate): add HF Jobs runner example for lerobot-annotate
A ready-to-run example of launching the annotation pipeline on a
Hugging Face job (h200x2) with two vllm replicas serving
Qwen3.6-35B-A3B-FP8. Lives next to other end-to-end recipes under
examples/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 11:00:28 +02:00
Pepijn c0effa02c9 feat(annotate): emit VQA per-camera and propagate camera field
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.

- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
  and a `camera_key=` argument on `frames_at` / `video_for_episode`.
  `VideoFrameProvider` exposes every `observation.images.*` key the
  dataset declares (not just the first) and keys its decode cache on
  `(episode, camera, timestamp)` so per-camera reads don't collide.
  Module 1 / 2 keep their old single-camera behaviour by leaving
  `camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
  .camera_keys` for each emission tick, builds one prompt per camera,
  batches all of them through the VLM, and stamps the resulting rows
  with `camera=<that key>`. Empty `camera_keys` (null provider) makes
  the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
  carry `camera` through and call `validate_camera_field` so the
  invariant is enforced at the writer boundary. Event sort key now
  includes `camera` for deterministic ordering when several cameras
  share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
  field; `_check_camera_field` enforces the invariant and cross-checks
  every view-dependent row's `camera` against the dataset's known video
  keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
  `(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
  `camera_keys` into the validator so the cross-check uses the actual
  dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
  `camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
  configures two cameras and asserts both are represented, that every
  emitted row has a `camera` tag, and that uniqueness holds per
  `(timestamp, camera, role)`.
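
A minimal sketch of the per-camera provider surface described above; the
property and keyword names come from this message, everything else
(signatures, return types) is an assumption:

    from typing import Protocol, Sequence

    class FrameProvider(Protocol):
        @property
        def camera_keys(self) -> Sequence[str]:
            """Every observation.images.* key the dataset declares."""
            ...

        def frames_at(self, episode_index: int, timestamps: Sequence[float],
                      camera_key: str | None = None):
            """camera_key=None falls back to the default (first) camera."""
            ...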

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 4b0354a046 fix(annotate): transcode subclips to H.264 instead of stream-copy
Modern LeRobot datasets store videos in AV1, which vllm's libav build
cannot decode (the video processor returns 0 frames and downstream code
chokes with a ZeroDivisionError). Re-encode each per-episode subclip
with libx264 (preset ultrafast, crf 23) so the resulting mp4 is
universally decodable. Strip audio with -an for a smaller payload.
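
A sketch of the re-encode as a Python subprocess call; the ffmpeg flags are
the ones named above, the paths are placeholders:

    import subprocess

    def transcode_subclip(src: str, dst: str) -> None:
        # Re-encode the AV1 subclip to H.264 so any libav build can decode it.
        subprocess.run(
            ["ffmpeg", "-y", "-i", src,
             "-c:v", "libx264", "-preset", "ultrafast", "-crf", "23",
             "-an",  # strip audio for a smaller payload
             dst],
            check=True,
        )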

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn ac183f019e feat(annotate): pack multiple vllm replicas per GPU via num_gpus
Adds VlmConfig.num_gpus so parallel_servers can exceed the physical
GPU count. Replicas are round-robin-assigned to GPUs (e.g.
parallel_servers=4 + num_gpus=2 → replicas pinned to GPUs 0,1,0,1).
Backward-compatible: num_gpus=0 keeps the existing 1-replica-per-GPU
behavior.
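
The assignment reduces to a modulo over the replica index; a sketch assuming
one CUDA device id per replica:

    # parallel_servers=4, num_gpus=2 -> replicas pinned to GPUs 0, 1, 0, 1
    def gpu_for_replica(replica_index: int, num_gpus: int) -> int:
        if num_gpus <= 0:  # backward-compatible: one replica per GPU
            return replica_index
        return replica_index % num_gpus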

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn bda829af2f feat(annotate): forward chat_template_kwargs to OpenAI extra_body
Lets callers pass per-request template flags such as
{"enable_thinking": false} for Qwen3.5/Qwen3.6 models, where the
default thinking preamble otherwise consumes the entire max_new_tokens
budget before any JSON is emitted.
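
A sketch of the forwarding with the openai Python client; extra_body is the
client's standard escape hatch, the model id is a placeholder:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # placeholder model id
        messages=[{"role": "user", "content": "Return a JSON object."}],
        max_tokens=512,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )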

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 824aac9ad8 fix(annotate): include prompt .txt files in wheel
The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.
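
A sketch of the kind of declaration involved; the package and glob names
here are hypothetical, not the repository's actual ones:

    from setuptools import setup, find_packages

    setup(
        packages=find_packages(),
        package_data={
            # list the prompt .txt templates alongside the existing envs/*.json
            "lerobot.annotate": ["envs/*.json", "prompts/*.txt"],
        },
    )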

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 7e91482e3a refactor(annotate): drop HF Inference Providers code path
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 1d66444577 feat(annotate): --vlm.push_to_hub uploads the annotated dataset
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.
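
A sketch of the upload step with huggingface_hub; the knob names come from
this message, the repo id and folder path are placeholders:

    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo("user/annotated-dataset", repo_type="dataset",
                    private=True, exist_ok=True)  # push_private
    api.upload_folder(
        repo_id="user/annotated-dataset",
        repo_type="dataset",
        folder_path="/path/to/dataset/root",
        commit_message="Add language annotations",  # push_commit_message
        ignore_patterns=[".annotate_staging/**"],   # keep staging out of the repo
    )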

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn d0ad7ffb21 feat(annotate): parallelize episodes within each module phase
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.
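
A minimal sketch of the fan-out, assuming a per-episode worker callable; the
default of 16 mirrors episode_parallelism above:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_phase(episodes, run_episode, episode_parallelism: int = 16):
        # Keep several episodes in flight so every server replica stays busy.
        with ThreadPoolExecutor(max_workers=episode_parallelism) as pool:
            futures = [pool.submit(run_episode, ep) for ep in episodes]
            for fut in as_completed(futures):
                fut.result()  # re-raise worker errors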

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:20 +02:00
Pepijn 994ad880ee fix(annotate): probe /v1/models for spawn-helper readiness
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 25bb0f6d6a fix(annotate): lock-protect per-line writes for parallel server streams
Eight server-streaming threads writing characters without synchronization
let UTF-8 sequences from different servers interleave mid-byte, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.
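
A sketch of the line-buffered, lock-protected streaming; the helper name and
ready callback are illustrative:

    import threading

    _print_lock = threading.Lock()

    def stream_server_output(proc, tag, on_ready):
        # proc opened with subprocess.Popen(..., stdout=PIPE, stderr=STDOUT, text=True).
        # Whole-line writes keep UTF-8 sequences from different servers intact.
        for line in iter(proc.stdout.readline, ""):
            with _print_lock:
                print(f"[{tag}] {line}", end="", flush=True)
            if "Uvicorn running" in line or "Application startup complete" in line:
                on_ready()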

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn c8c76c598c feat(annotate): client_concurrency for parallel in-flight requests
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.

Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.
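
A sketch of the client-side fan-out; vllm's continuous batching turns the
concurrent requests into one large server-side batch. Names are illustrative:

    from concurrent.futures import ThreadPoolExecutor

    def generate_json_batch(client, requests, client_concurrency: int = 16):
        # requests: list of kwargs dicts for chat.completions.create
        with ThreadPoolExecutor(max_workers=client_concurrency) as pool:
            return list(pool.map(
                lambda req: client.chat.completions.create(**req), requests))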

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 281c0f3921 feat(annotate): parallel_servers spawns N independent vllm replicas
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.

Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).
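
A sketch of the spawn loop; the GPU pinning and port offsetting follow the
description above, the helper itself is an assumption:

    import os
    import subprocess

    def spawn_replicas(model_id: str, parallel_servers: int, serve_port: int = 8000):
        procs = []
        for i in range(parallel_servers):
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i))  # pin replica i to GPU i
            cmd = ["vllm", "serve", model_id,
                   "--tensor-parallel-size", "1",
                   "--max-model-len", "32768",
                   "--uvicorn-log-level", "warning",
                   "--port", str(serve_port + i)]
            procs.append(subprocess.Popen(cmd, env=env))
        return procs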

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 66ec839bb8 feat(annotate): per-episode progress logs in executor 2026-04-30 10:55:19 +02:00
Pepijn 4cbba00bcd fix(annotate): don't crash pipeline on persistent JSON parse failure
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 4dce8f1bb1 fix(annotate): robust JSON extraction (think tags + first balanced object)
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).
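
A sketch of that extraction order (strip think tags, direct parse, first
balanced object while skipping braces inside strings); the function name is
illustrative:

    import json
    import re

    def extract_json(text: str):
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            pass
        start = text.find("{")
        while start != -1:
            depth, in_str, esc = 0, False, False
            for i in range(start, len(text)):
                ch = text[i]
                if esc:
                    esc = False
                elif ch == "\\":
                    esc = True
                elif ch == '"':
                    in_str = not in_str
                elif not in_str:
                    if ch == "{":
                        depth += 1
                    elif ch == "}":
                        depth -= 1
                        if depth == 0:
                            try:
                                return json.loads(text[start:i + 1])
                            except json.JSONDecodeError:
                                break  # malformed candidate; try the next brace
            start = text.find("{", start + 1)
        return None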

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 8389f302e7 fix(annotate): stream child stdout char-by-char so tqdm \r progress flushes 2026-04-30 10:55:19 +02:00
Pepijn 035b470fd8 test(annotate): adjust video-block test for fps-based frame sampling 2026-04-30 10:55:19 +02:00
Pepijn 3dee7bf762 feat(annotate): Module 1 samples image frames at fps rate
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.

Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.
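
The sampling rule reduces to a small calculation; a sketch using the config
names from this message:

    def frame_timestamps(duration_s: float,
                         frames_per_second: float = 1.0,
                         max_video_frames: int = 128) -> list[float]:
        # 30 s at 1 fps -> 30 frames; 5 s -> 5; never more than max_video_frames.
        n = max(1, min(int(duration_s * frames_per_second), max_video_frames))
        step = duration_s / n
        return [i * step for i in range(n)]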

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 33053e6996 feat(annotate): use cached HF token from huggingface-cli login
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.
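
A sketch of the fallback chain; get_token() is the huggingface_hub helper
that reads the cached login token:

    import os
    from huggingface_hub import get_token

    def resolve_hf_token():
        return (os.environ.get("HF_TOKEN")
                or os.environ.get("HUGGINGFACE_API_KEY")
                or get_token())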

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 7bb550a5a4 feat(annotate): default to HF Inference Providers, no local GPU needed
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.

Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 358acfc794 feat(annotate): one-flag HF Inference Providers backend
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.
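
A sketch of what that routing looks like from the client side; the model id
with a provider suffix is the one quoted above:

    from openai import OpenAI
    from huggingface_hub import get_token

    client = OpenAI(
        base_url="https://router.huggingface.co/v1",
        api_key=get_token(),  # HF_TOKEN serves as the API key
    )
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct:novita",  # or a plain id to let HF route
        messages=[{"role": "user", "content": "Describe the scene."}],
    )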

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 3916f17c4a fix(annotate): omit mm_processor_kwargs by default; transformers serve rejects it
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 8807e0b41e fix(annotate): mm_processor_kwargs in extra_body; inline file URLs as data URLs
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
  block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
  and inline it as a base64 data:video/mp4 URL so the server sees the
  bytes directly.

Both mistakes surface as std::bad_alloc on the server side, which is an
unhelpful error but explains what we hit.
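
A sketch of the data-URL inlining for the second fix; the content-block shape
follows the OpenAI-style video_url convention, the helper name is illustrative:

    import base64
    from pathlib import Path

    def video_block_from_file(path: str) -> dict:
        # Inline the local mp4 so the server never has to fetch a file:// URL.
        # (fps goes in extra_body.mm_processor_kwargs, not in this block.)
        b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        return {"type": "video_url",
                "video_url": {"url": f"data:video/mp4;base64,{b64}"}}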

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 43d3ba1d4e fix(annotate): detect server ready via stdout banner, not /v1/models polls
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn d1849d1b63 fix(annotate): visible auto_serve via stdout prints + live server log stream
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:19 +02:00
Pepijn 1a4358ed8b fix(annotate): auto_serve defaults to True; probe before spawning
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 5003989ae6 feat(annotate): auto_serve mode spawns and tears down inference server
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.

Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn cd12665e85 feat(annotate): video_url block for openai backend
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).

A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.

Enable with --module_1.use_video_url=true (and --vlm.backend=openai).
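
A hedged sketch of the content Module 1 sends in this mode; the subclip path
and prompt text are placeholders:

    subclip = "/data/my_dataset/.annotate_staging/.video_clips/episode_000001.mp4"
    messages = [{
        "role": "user",
        "content": [
            # the server samples frames from the clip at the configured fps
            {"type": "video_url", "video_url": {"url": f"file://{subclip}"}},
            {"type": "text", "text": "Segment this demonstration into subtasks ..."},
        ],
    }]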

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 91dedcad1e feat(annotate): openai-compatible backend for transformers/ktransformers serve
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:

- launch the server out-of-process (transformers serve, vllm serve,
  ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
  --vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...

Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 0782bbcd38 fix(annotate): use vllm.chat() API for multimodal prompts
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 89f15e1731 fix(annotate): drop guided_decoding=dict (api differs across vllm)
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn a36287329e fix(annotate): tolerate decoder returning fewer frames than requested
pyav (and sometimes torchcodec) decoding can return fewer frames than
requested when some of the requested timestamps fall outside the video file's
content range. Drop the strict=True on the zip and rely on the
None-filter to discard missing frames.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 2b79e7e1dd fix(annotate): default video decode backend to pyav
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 7ab3cf3407 fix(annotate): default trust_remote_code=False for HF loaders
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 8ea488080e fix(annotate): default transformers backend to manual GPU placement
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.

LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.
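
A sketch of the two placement paths; the env var name comes from this
message, the model id and dtype are placeholders:

    import os
    import torch
    from transformers import AutoModelForImageTextToText

    model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"  # placeholder
    if os.environ.get("LEROBOT_TRANSFORMERS_DEVICE_MAP") == "auto":
        model = AutoModelForImageTextToText.from_pretrained(
            model_id, torch_dtype=torch.bfloat16, device_map="auto")
    else:
        # Load on CPU first, then move manually; this bypasses accelerate's
        # post-load dispatch path that bad-allocs on some hosts.
        model = AutoModelForImageTextToText.from_pretrained(
            model_id, torch_dtype=torch.bfloat16)
        model = model.to("cuda")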

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 41f26d3b2e fix(annotate): LEROBOT_DISABLE_CUDNN escape hatch for conv3d crash
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.

Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.
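
In PyTorch terms the escape hatch amounts to:

    import os
    import torch

    if os.environ.get("LEROBOT_DISABLE_CUDNN") == "1":
        # Fall back to native PyTorch conv3d kernels (slower but functional)
        # while the torch/cuDNN combo is broken.
        torch.backends.cudnn.enabled = False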

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 50ec7ed2e1 fix(annotate): expose gpu_memory_utilization and max_model_len for vllm
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.

- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
  encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
  default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().
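
A sketch of how those knobs reach the engine; gpu_memory_utilization,
max_model_len and trust_remote_code are standard vllm LLM() arguments, the
values are just the ones suggested above:

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # placeholder model id
        gpu_memory_utilization=0.7,  # leave headroom for the cuDNN workspace
        max_model_len=32768,         # annotation prompts don't need 262k context
        trust_remote_code=True,
    )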

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 6be6b929d8 fix(annotate): pass trust_remote_code=True to HF auto-classes
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:18 +02:00
Pepijn 093e580672 fix(annotate): low_cpu_mem_usage=True on transformers load path
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:06 +02:00
Pepijn b4dc785db1 fix(annotate): use device_map='auto' for transformers backend
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:06 +02:00
Pepijn f583307590 fix(annotate): try AutoModelForImageTextToText first, fall back to AutoModelForVision2Seq
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.
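
A sketch of the graceful fallback:

    try:
        from transformers import AutoModelForImageTextToText as AutoVLClass
    except ImportError:
        from transformers import AutoModelForVision2Seq as AutoVLClass

    model = AutoVLClass.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")  # placeholder id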

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 69f049b877 fix(annotate): replace Literal types with str for older draccus
Older draccus versions (e.g. 0.10.x bundled in some envs) lack a decoder
for typing.Literal and raise:
  No decoding function for type typing.Literal['vllm', 'transformers', 'stub']

Switching VlmConfig.backend from Literal to str works under every
draccus version. The runtime branch in vlm_client.make_vlm_client
already validates the value and raises ValueError on unknown backends,
so the constraint stays enforced.
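
A sketch of the pattern: keep the field a plain str for draccus, keep the
constraint in the factory. Field and function names follow this message,
the rest is assumed:

    from dataclasses import dataclass

    @dataclass
    class VlmConfig:
        backend: str = "vllm"  # was Literal["vllm", "transformers", "stub"]

    def make_vlm_client(cfg: VlmConfig):
        if cfg.backend not in ("vllm", "transformers", "stub"):
            raise ValueError(f"Unknown VLM backend: {cfg.backend!r}")
        ...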

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 9d5aa1c63e feat(annotate): Module 1 sees the whole episode as one video block
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model does the temporal pooling itself and
chooses where to cut subtasks — no stride, no count, no keyframe-count knob
to tune.

- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
  max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
  uniformly across the episode duration; ``_NullProvider`` returns []
  for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
  ``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
  the prompt template asks the model to "watch the whole clip, then
  segment it" with cut points decided from gripper/contact/regrasp
  events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
  ``max_video_frames: int = 32`` (model-capacity bound, not annotation
  logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
  the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
  of the old "Decompose the demonstration" phrase that no longer
  appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
  the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 4f4dd49972 feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.

- New `frames.py`:
  - `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
    dataset's first `observation.images.*` stream via
    `LeRobotDatasetMetadata.get_video_file_path` and
    `decode_video_frames`, with the same `from_timestamp` shift the main
    dataset uses.
  - Per-process LRU cache so co-timestamped Module 1 plan-update + Module
    2 calls share decode work.
  - `make_frame_provider` falls back to a null provider when the dataset
    has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
  prepend image blocks before the text block.
  - Module 1 attaches `keyframes_per_episode` keyframes to the subtask
    decomposition prompt.
  - Module 2 attaches the frame at the interjection timestamp.
  - Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
  `Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
  `--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
  on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
  emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
  Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
  attachment behaviour and the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 785cee429e feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.

Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
  refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
  mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
  configurable cadence with one-retry JSON validation.

Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.

Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.

Branched from PR 1 (`feat/language-columns`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:55:05 +02:00
Pepijn 1ca38d9748 fix(language): drop motion from VIEW_DEPENDENT_STYLES
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.

The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:54:12 +02:00
Pepijn 5a6aa64570 feat(language): per-camera tagging on view-dependent styles
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.

- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
  to both Arrow struct types and the HF datasets feature mappings;
  introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
  `is_view_dependent_style` and `validate_camera_field` helpers (camera
  required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
  resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
  `_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
  VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
  Without a `camera` filter, multi-row matches keep raising the existing
  ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
  `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
  the matching image block), keeping the original 0.20 budget and
  documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
  `is_view_dependent_style`, `validate_camera_field` (both required and
  forbidden directions), per-camera `emitted_at` filtering, and the
  ambiguity error when two cameras emit `(vqa, assistant)` at the same
  timestamp without a `camera=` filter. RenderMessagesStep + dataset
  passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
  the per-camera resolver pattern, and the canonical recipe convention.
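
A sketch of the invariant the helpers enforce; the set contents follow this
message (a later commit drops motion from the view-dependent set):

    VIEW_DEPENDENT_STYLES = {"vqa", "motion", "trace"}

    def is_view_dependent_style(style: str) -> bool:
        return style in VIEW_DEPENDENT_STYLES

    def validate_camera_field(style: str, camera) -> None:
        # camera is required iff the style is view-dependent.
        if is_view_dependent_style(style) and camera is None:
            raise ValueError(f"style {style!r} requires a camera tag")
        if not is_view_dependent_style(style) and camera is not None:
            raise ValueError(f"style {style!r} must not carry a camera tag")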

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:48:17 +02:00
Pepijn 0b06790da0 feat(language): add motion (persistent) and trace (event-only) styles
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:21:49 +02:00
Pepijn b43dc39ba4 Add docstrings to all new helpers; revert uv.lock
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:15:03 +02:00