Commit Graph

1475 Commits

Author SHA1 Message Date
Pepijn 223cc8a9e2 feat(smolvla2): inference runtime — select_message + multi-rate REPL
Closes the loop on PR 3: SmolVLA2 can now be queried interactively at
inference, dispatching the same five sub-recipe shapes it was trained
on (action chunks, subtask gen, memory updates, plan/speech on
interjection, VQA on questions).

Modeling fixes + additions
--------------------------

- ``_compute_text_loss``: standard next-token CE shift was missing
  (logits at position t were CE'd against the label at t — identity-
  mapped, learning nothing). Adds ``logits[:, :-1]`` /
  ``labels[:, 1:]`` shift to match HuggingFace ``LlamaForCausalLM``.

- New ``select_message`` on ``SmolVLA2Policy``: AR text generation
  with KV caching, mirroring SmolVLA's ``select_action`` pattern.
  Single prefix forward fills the cache, then per-token forwards
  reuse it. Greedy + top-p nucleus sampling. Returns the decoded
  string with the prompt stripped.

Runtime package — ``src/lerobot/policies/smolvla2/inference/``
-------------------------------------------------------------

- ``triggers.py`` — ``Trigger`` Protocol + ``HzTrigger`` /
  ``EventTrigger`` + ``TickClock``. The whole runtime ticks at
  ``max_rate_hz=50`` and each step gates itself off its own
  cadence.

- ``runtime_state.py`` — runtime state dict factory plus tiny
  helpers (``take_event``, ``set_if_changed``, ``push_log``).
  Stable keys are documented at the top of the module.

- ``steps.py`` — :class:`InferenceStep` base + concrete steps:
  ``LowLevelForward`` / ``DispatchAction`` (action path),
  ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` /
  ``UserInterjectionFwd`` / ``AskVQAFwd`` (text paths),
  ``DispatchToolCalls`` (tool registry → ``Tool.call``). Each
  text step builds a chat-template prompt from current
  ``RuntimeState`` (task / plan / memory / subtask) matching
  what ``smolvla2_hirobot.yaml`` renders during training.
  Includes a tiny ``<say>...</say>`` parser for the
  ``user_interjection_response`` branch's combined plan + speech
  output.

- ``runtime.py`` — :class:`SmolVLA2Runtime` composes the pipeline,
  drives ticks via ``TickClock``, polls a user-supplied
  ``event_collector`` per tick, and prints state-change log lines.

- ``repl.py`` — :class:`StdinReader` non-blocking line reader
  with simple intent classification: ``stop`` / ``quit`` /
  ``exit`` → terminate; ``?`` suffix → ``user_vqa_query`` event;
  first line → set task; other lines → ``user_interjection``.

CLI
---

- ``src/lerobot/scripts/lerobot_smolvla2_runtime.py``: console
  script ``lerobot-smolvla2-runtime`` that loads a checkpoint,
  optionally instantiates ``SayTool`` (pocket-tts), wires up
  ``SmolVLA2Runtime`` + ``StdinReader``, and runs.

  Real-robot wiring (observation_provider / robot_executor) is
  intentionally left as a follow-up — v1 is dry-run / language-
  only so the REPL works without robot hardware.

  Registered in ``pyproject.toml`` ``[project.scripts]``.

Known follow-ups
----------------

- Real-robot integration: today ``LowLevelForward`` only fires when
  an observation_provider is wired. The CLI prints a warning if
  ``--no_robot`` is omitted.
- ``select_message`` runs an extra prefix forward; could share with
  the action path's prefix when both are needed in the same tick.
- Tests: no end-to-end runtime test yet (would need a tiny SmolVLM
  fixture). The components compile and the public surface is
  exercised by the CLI's argument-parsing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:04:00 +02:00
Pepijn af6d8ebd5b feat(smolvla2): dual-head forward — flow loss + lm_head text loss
The third and final commit of PR 3's SmolVLA2 work. Wires the actual
training signal through:

* ``predict_actions[i] = True``  → sample i contributes to flow loss
* ``text_labels[i, t] != -100``  → token t of sample i contributes to
                                    LM-head cross-entropy

Both routing knobs come from ``SmolVLA2ChatTokenizerStep`` (previous
commit on this branch), which builds them from the recipe's
``message_streams`` / ``target_message_indices``. The per-sample
``predict_actions`` mask preserves the Pi0.5 convention from the
plan's Section I.7: "True iff any low_level target exists".

Implementation:

- ``forward`` reads ``text_labels`` and ``predict_actions`` from the
  batch. When neither is present (vanilla SmolVLA usage with no
  recipe), delegates to ``SmolVLAPolicy.forward`` so unannotated
  datasets keep training as before — full backward compatibility.
- ``flow_loss``: super().forward(reduction="none") returns the
  per-sample (B,) flow loss; we mask non-action samples with the
  ``predict_actions`` bool and renormalize by the count of action
  samples. ``flow_loss_weight = 0`` in the config disables this
  branch entirely (text-only training).
- ``text_loss``: a prefix-only forward through the VLM (no action
  expert / suffix), slicing the lang-token range out of the
  resulting hidden states (``embed_prefix`` orders the prefix as
  ``[image_blocks..., lang, state]`` so the slice is unambiguous).
  Apply ``vlm.lm_head`` to those hidden states, cross-entropy with
  ``text_labels`` (ignore_index=-100). ``text_loss_weight = 0``
  disables this branch (reverts to flow-only behaviour, matching
  SmolVLA exactly).
- The two losses are summed with the config-supplied weights.

Mixed-stream samples (one batch containing both action targets and
text-only sub-recipes) are handled correctly: each sample contributes
where its labels are valid and is masked elsewhere.

Limitations / known follow-ups:

- Text loss runs an additional prefix-only forward separate from the
  flow path's prefix forward. The forwards could share their prefix
  computation; for clarity of this first commit they don't.
  Optimization is straightforward when needed.
- Per-sample loss for ``reduction="none"`` is not yet meaningfully
  defined for the dual path — we broadcast the scalar to (B,) for
  caller compatibility (e.g. RA-BC weighting will need follow-up).
- Inference ``select_action`` is unchanged from SmolVLA today —
  it predicts actions only. A separate "generate text"
  ``select_message`` path is the natural next step for runtime
  use of the LM head (memory updates, plan refreshes, VQA answers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:54:57 +02:00
Pepijn 37b1eb218a feat(smolvla2): chat-template processor + label mask + predict_actions
Wires PR 1's recipe stack into the SmolVLA2 pipeline so multi-target
sub-recipes (memory_update, ask_vqa, user_interjection_response,
high_level_subtask) carry meaningful supervision through to the model.

- New ``chat_processor_smolvla2.py`` with
  ``SmolVLA2ChatTokenizerStep``: reads ``messages`` /
  ``message_streams`` / ``target_message_indices`` from the rendered
  sample (PR 1 ``RenderMessagesStep``), calls
  ``apply_chat_template(messages, tools=DEFAULT_TOOLS, ...)`` on the
  SmolVLM tokenizer, and writes:

    OBS_LANGUAGE_TOKENS / _ATTENTION_MASK   ← chat-templated prompt
    text_labels                              ← -100 except target msg tokens
    predict_actions                          ← True iff any low_level target

  Builds the label mask robustly by re-rendering the chat through
  each target's prefix and reading off the prefix length — same
  tokenizer, same tools, so the prefix tokens are guaranteed to be
  a prefix of the full sequence. Image/video content blocks
  (LeRobot ``feature``-keyed) are stripped before tokenizing; the
  actual image tensors flow through SmolVLA's existing
  ``OBS_IMAGES_*`` channels and ``embed_prefix`` puts them before
  the language embeddings, matching the chat-template-stripped
  text order.

- ``processor_smolvla2.py``: when ``config.recipe_path`` is set,
  build a new pipeline with ``RenderMessagesStep`` +
  ``SmolVLA2ChatTokenizerStep`` instead of SmolVLA's plain
  ``TokenizerProcessorStep``. When ``recipe_path`` is ``None``,
  fall back to SmolVLA's pipeline so unannotated datasets still
  work unchanged. Resolves recipe paths relative to
  ``src/lerobot/configs/`` so ``recipes/smolvla2_hirobot.yaml``
  works directly.

The next commit on this branch picks up ``text_labels`` and
``predict_actions`` from the batch and routes them through the
SmolVLM ``lm_head`` for the actual dual-loss training.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:21:03 +02:00
Pepijn 52e1fd35cb feat(tools): src/lerobot/tools/ — runnable tool registry + SayTool
Ships the runtime side of the OpenAI-style function-calling stack
introduced in PR 1 (catalog in ``meta/info.json["tools"]``) and PR 2
(annotation pipeline writes the catalog after a run). One file per
tool — heavy deps stay isolated.

Layout:

- ``base.py`` — :class:`Tool` Protocol: ``name``, ``schema``,
  ``call(arguments)``. Runtime-checkable so tests can use
  ``isinstance(...)``.
- ``registry.py`` — :data:`TOOL_REGISTRY` (name → class) plus
  ``get_tools(meta, **kwargs)`` that instantiates every entry whose
  ``function.name`` is registered. Tools whose name is unknown are
  silently skipped — the schema still rides through the chat
  template, the model just can't actually invoke that tool at
  inference.
- ``say.py`` — :class:`SayTool` wrapping Kyutai's pocket-tts
  (CPU-only, ~100M params, ~6× real-time on a MacBook Air M4).
  Lazy model load: pocket-tts is imported and the voice state
  computed on first ``call(...)`` (or eagerly via ``preload()``).
  Returns the PCM tensor; optionally writes a ``.wav`` to
  ``output_dir`` for offline inspection.
- ``__init__.py`` — re-exports the public surface.

Optional install:

    pip install lerobot[tools]

The ``[tools]`` extra in ``pyproject.toml`` pulls in ``pocket-tts`` +
``scipy`` (for the wav writer). Adding more tools later means a new
file + a registry entry — no new extras unless the tool brings new
deps.

To add your own tool, follow the three-step guide in
``docs/source/tools.mdx`` (PR 1):

  1. Drop ``src/lerobot/tools/<my_tool>.py`` with a ``Tool``-conforming
     class.
  2. Register the class in ``TOOL_REGISTRY`` (this file).
  3. Pre-populate ``meta/info.json["tools"]`` with the schema (or let
     ``lerobot-annotate`` add it on the next run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:58:04 +02:00
Pepijn 7459dfccb6 feat(policies): scaffold smolvla2 (smolvla + lm_head re-enabled)
PR 3 of the steerable-annotation plan retargeted from Pi0.5 to SmolVLA
because the recipe stack (PR 1 + PR 2) outputs HF/TRL-compatible chat
which a chat-pretrained backbone consumes natively. SmolVLA strips the
SmolVLM ``lm_head`` though, so it can only do flow-matching action
prediction. SmolVLA2 keeps the LM head so the same model can train on
the full Hi Robot / MEM / ECoT blend defined in the plan:

  * action-only sub-recipes  (low_level_execution)        flow loss
  * text-only sub-recipes    (memory_update / ask_vqa /   CE loss on
                              user_interjection_response)  lm_head
  * mixed sub-recipes                                      both summed

This first commit lays down the structural scaffold:

- ``src/lerobot/policies/smolvla2/`` — new package with thin subclasses
  of ``SmolVLAConfig`` / ``SmolVLAPolicy`` so we don't fork the 900-line
  modeling code. ``SmolVLA2Config`` adds ``recipe_path``,
  ``apply_chat_template``, ``text_loss_weight``, ``flow_loss_weight``,
  and ``unfreeze_lm_head``. ``SmolVLA2Policy`` unfreezes the SmolVLM
  ``lm_head`` (and the surrounding norm + last text-model layer SmolVLA
  freezes) when ``unfreeze_lm_head=True`` and ``text_loss_weight>0``.
- ``factory.py`` registers ``smolvla2`` in ``get_policy_class``,
  ``make_policy_config``, and the pre/post-processor builder. Important:
  the ``smolvla2`` branch lives BEFORE the ``isinstance(config,
  SmolVLAConfig)`` check because ``SmolVLA2Config`` subclasses
  ``SmolVLAConfig`` — without the ordering, SmolVLA2 would silently
  pick up SmolVLA's processor.
- ``configs/recipes/smolvla2_hirobot.yaml`` — canonical Hi Robot blend
  for SmolVLA2. Same shape as ``pi05_hirobot.yaml`` (PR 1) so the
  recipe stack stays uniform across policy backbones.

Behaviour today is identical to SmolVLA: the modeling forward
delegates to ``SmolVLAPolicy.forward`` and the processor delegates to
``make_smolvla_pre_post_processors``. The next commit on this branch
adds the chat-template processor + ``text_labels`` / ``predict_actions``
batch keys; the commit after that wires the actual text-loss path
through ``vlm.lm_head``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:55:23 +02:00
Pepijn 73740ecf4b feat(annotate): write tool catalog to meta/info.json after annotation
After every ``lerobot-annotate`` run, the executor ensures
``meta/info.json["tools"]`` contains at minimum the canonical ``say``
schema, while preserving any tools the user pre-declared on the
dataset. Chat-template consumers (PR 3 SmolVLA2 / Pi0.5 / dataset
visualizer) read the catalog through
``LeRobotDatasetMetadata.tools`` and pass it to
``apply_chat_template(messages, tools=meta.tools, ...)``.

- ``executor.py``: new ``_ensure_tools_in_info`` helper called
  after the parquet rewrite. Idempotent and additive — merges by
  ``function.name``, only writes back if the list changed.
- ``writer.py``: drops the duplicated ``SAY_TOOL_SCHEMA`` /
  ``DEFAULT_TOOLS`` constants in favour of importing from
  ``lerobot.datasets.language`` (PR 1's single source of truth).
  Re-exported so existing imports keep working.
- ``annotation_pipeline.mdx``: replace the "code constant only" note
  with a pointer to the new Tools doc and a description of the
  meta/info.json behaviour, including how to pre-declare custom
  tools before annotation runs.

This is the storage half of the tools work; PR 3 ships the runnable
implementations under ``src/lerobot/tools/`` (one file per tool,
first up: ``say.py`` wired to Kyutai's pocket-tts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:51:38 +02:00
Pepijn 1b81e49214 feat(annotate): task rephrasings + video-derived task fallback
Module 1 now produces ``task_aug`` rows (registered in PR 1) so the
PR-1 ``${task}`` resolver can rotate phrasings deterministically per
``sample_idx``. Plus an opt-in video-derived task that bypasses the
canonical ``meta/tasks.parquet`` task when it's empty, low-quality, or
explicitly disabled — every downstream Module-1 prompt then uses the
derived task as its grounding.

- ``Module1Config``: adds ``n_task_rephrasings`` (default 10) and
  ``derive_task_from_video`` ∈ ``{off, if_short, always}`` (default
  ``if_short``: triggers when canonical is empty, < 3 words, or matches
  a placeholder string like ``debug`` / ``unnamed`` / ``tbd``).
- ``plan_subtasks_memory.py``: ``run_episode`` now resolves an
  ``effective_task`` (canonical OR video-derived) and threads it
  through ``_generate_subtasks`` / ``_generate_plan`` /
  ``_generate_memory`` so subtasks, plans, and memory are all grounded
  in the same task string. Then generates ``n`` rephrasings of the
  effective task and writes them as ``task_aug`` rows at ``t=0`` with
  ``role=user``. The effective task itself is included as the first
  variant so the rotation is guaranteed to cover the source-of-truth
  phrasing.
- New prompts: ``module_1_video_task.txt`` (one-shot video → task),
  ``module_1_task_rephrasings.txt`` (text-only paraphraser, ``n`` per
  call).
- ``meta/tasks.parquet`` is NOT modified — derived tasks live only in
  ``language_persistent``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d813c75b76 fix(annotate): align interjections with the actual demo trajectory
qwen36moe-11 surfaced a deeper semantic problem with mid-episode
interjections: they were generated as *counterfactual* user requests
("actually skip the wipe", "use the blue one instead") but teleop data
is frozen — the robot in the video already executed everything,
including the steps the user "asked to skip". The training signal was
therefore self-contradictory: interjection text said one thing, the
robot's subsequent action stream did the opposite.

Flip the framing. Anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask. The
robot's visible next behavior IS the interjection's effect, so:

  interjection text → plan refresh → action stream

are all consistent with the same observed video.

Concretely:

- ``interjections_and_speech.py``: instead of sampling random
  timestamps from ``frame_timestamps``, walk Module 1's subtask spans
  and sample from the (subtask N → subtask N+1) transitions. Pass both
  the just-finished and the upcoming subtask texts into the prompt.

- ``_window_timestamps``: re-center the multi-frame video window on
  the boundary itself (half the frames cover the end of the previous
  subtask, half cover the start of the next one) so the VLM has the
  same visual conditioning the policy will see at training time.

- ``module_2_interjection.txt``: rewritten. The prompt now states
  explicitly that this is offline data, the robot already committed to
  the next subtask, and the interjection must be a natural request
  that aligns with — not contradicts — the next subtask. Removes the
  "negative task / situated correction" Hi Robot framing because those
  scenarios require online execution to be coherent.

Plan-refresh logic from the previous commit (forwarding interjection
text into the refresh prompt) is unchanged and now reinforces the same
direction: the refreshed plan emphasizes the upcoming subtask the
interjection just asked for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 3434d2ef22 fix(annotate): ground interjections in video + propagate text to plan refresh
qwen36moe-10 showed three Module-2 / plan-refresh quality issues that
are not architecture problems — they're prompt-grounding bugs:

1. Interjection prompt passed ``current_subtask = record.episode_task``
   (the WHOLE-episode task), not the actual subtask in force at the
   chosen timestamp. The VLM had no signal about what was visible at
   that moment, so its interjections were generic ("actually skip X"
   where X had nothing to do with the visible activity).

2. Interjection prompt only attached a single frame
   (``frames_at(record, [t_snap])``). With one frozen image the VLM
   couldn't read the ongoing motion. Module 1 already gets the whole
   episode video for subtask decomposition, which is why subtasks are
   well-grounded; Module 2 was the outlier.

3. The plan-refresh prompt told the model "a plan refresh after a user
   interjection at t=X.YZs" but never showed it the interjection
   *text*. So the refreshed plan couldn't actually reflect the user's
   correction — at best it recombined the same step list.

Fix:

- ``interjections_and_speech.py``: Module 2 reads Module 1's subtask
  rows from the same staging tree (executor orders module_1 → module_2
  so they're already there) and resolves the actual ``current_subtask``
  at each chosen timestamp. Pulls a small clip
  (``interjection_window_seconds`` × ``interjection_window_frames``,
  defaulting to 4 frames over the leading 2 s) instead of one frame.
  Drops the silently-zeroing ``len(candidate_ts) // 4`` cap on the
  interjection count.

- ``module_2_interjection.txt``: prompt is rewritten to reference the
  multi-frame visual context and require the interjection to mention
  something visible OR named in the current subtask, not invented.

- ``plan_subtasks_memory.py``: ``run_plan_updates`` now accepts and
  threads through interjection texts. ``_generate_plan(refresh_t,
  interjection)`` injects both the current subtask AND the interjection
  text into the prompt so the refreshed plan can drop / reorder /
  constrain steps to match the user's correction. (Plan still refreshes
  ONLY at user interjections — subtask generation runs ~1 Hz at
  inference, plan re-emission is event-driven.)

- ``executor.py``: forwards ``interjection_texts`` alongside
  ``interjection_times`` to ``run_plan_updates``.

- ``Module2Config``: bumps ``max_interjections_per_episode`` default
  from 1 to 3 and exposes ``interjection_window_seconds`` /
  ``interjection_window_frames``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b71e10da6b refactor(annotate): drop dataset-level `tools` parquet column
PR 2 used to write a top-level ``tools`` column on every parquet shard
holding the JSON schema for the ``say`` tool, broadcast identically
across every row. That extends PR 1's schema for no real information
gain — the schema is a fixed code constant, parquet's RLE/dict encoding
collapses it on disk anyway, and HF/TRL chat-template consumers can
just import the constant directly.

PR 2 should fill in PR 1's existing schema, not add to it. So:

- ``writer.py``: stop emitting the ``tools`` column. Strip any legacy
  ``tools`` column from older shards on rerun so the schema converges to
  v3.1. ``SAY_TOOL_SCHEMA`` stays as a public constant (now joined by
  ``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``); chat-template policies and the
  visualizer import them directly.
- ``test_writer.py``: replace the "tools column present" assertion with
  one that explicitly checks the column is absent, plus a new test
  asserting the constant's shape.
- ``test_pipeline_recipe_render.py``: drop the tools-column read; assert
  it's not present in the rewritten parquet.
- ``annotation_pipeline.mdx``: update the writer description to note the
  parquet stays small and the schema lives as a code constant.

If multi-tool-set support ever becomes real (datasets with different
tool inventories), the right home is ``meta/info.json["tools"]`` —
adding it later is non-breaking; ripping out a parquet column already
shipped is not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 0f6e3230df fix(annotate): decode video frames with PyAV directly
``lerobot.datasets.video_utils.decode_video_frames`` routes
``backend="pyav"`` through ``decode_video_frames_torchvision`` →
``torchvision.io.VideoReader``, but ``VideoReader`` was removed in
torchvision >= 0.22 (the vllm/vllm-openai:latest container ships with
torchvision 0.25). That made every Module 3 frame decode raise
``AttributeError: module 'torchvision.io' has no attribute 'VideoReader'``,
which the previous catch-all silently turned into an empty image list,
which then made every Module 3 prompt skip via the
``not _has_image_block(messages)`` branch and produce zero VQA rows.

Bypass ``video_utils`` entirely. The annotation pipeline only needs
a handful of PIL frames per (episode, ts), so a direct PyAV decode is
both simpler and insulated from torchvision API churn. ``av`` is already
in the install set, no new dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 2f2e42c4aa log(annotate): warn loudly on first video decode failure
VideoFrameProvider._decode used to swallow every exception silently and
return []. That made Module 3 (VQA) produce zero rows whenever local
video decoding broke (codec, backend, missing file, ...) because every
prompt got skipped via the ``not _has_image_block(messages)`` branch in
general_vqa.py — without any signal in the job log.

Log the first failure with full exception info (subsequent failures
stay quiet to avoid log spam) so this fast-path is debuggable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 5ee0104739 log(annotate): surface resolved frame-provider cameras at startup
Print the default and full camera list once at the top of every run so a
silent Module-3-no-op (cam_keys=[]) is visible in the job log instead of
only being discoverable by counting parquet rows after upload.

Also warn loudly when Module 3 is enabled but no cameras resolved, with
a hint about the --vlm.camera_key fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn e064cfcb04 fix(annotate): seed Module 3 cameras from camera_keys + camera_key fallback
Module 3 fast-pathed out (50 episodes in 0.6s) when
``frame_provider.camera_keys`` came back empty even though Module 1/2
worked, because they use ``frame_provider.camera_key`` (singular) and
were happy with the explicit ``--vlm.camera_key=...`` override.

Two fixes:

- ``frames.py``: read ``meta.camera_keys`` (covers both video- and
  image-stored cameras) instead of ``meta.video_keys`` (video-only),
  matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If
  metadata still surfaces nothing but the caller explicitly passed
  ``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by
  definition known to exist on the dataset.
- ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees
  zero cameras so this never silently produces zero VQA again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b3d9494831 docs(annotate): add HF Jobs runner example for lerobot-annotate
A ready-to-run example of launching the annotation pipeline on a
Hugging Face job (h200x2) with two vllm replicas serving
Qwen3.6-35B-A3B-FP8. Lives next to other end-to-end recipes under
examples/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 1217fdb6f0 feat(annotate): emit VQA per-camera and propagate camera field
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.

- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
  and a `camera_key=` argument on `frames_at` / `video_for_episode`.
  `VideoFrameProvider` exposes every `observation.images.*` key the
  dataset declares (not just the first) and keys its decode cache on
  `(episode, camera, timestamp)` so per-camera reads don't collide.
  Module 1 / 2 keep their old single-camera behaviour by leaving
  `camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
  .camera_keys` for each emission tick, builds one prompt per camera,
  batches all of them through the VLM, and stamps the resulting rows
  with `camera=<that key>`. Empty `camera_keys` (null provider) makes
  the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
  carry `camera` through and call `validate_camera_field` so the
  invariant is enforced at the writer boundary. Event sort key now
  includes `camera` for deterministic ordering when several cameras
  share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
  field; `_check_camera_field` enforces the invariant and cross-checks
  every view-dependent row's `camera` against the dataset's known video
  keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
  `(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
  `camera_keys` into the validator so the cross-check uses the actual
  dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
  `camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
  configures two cameras and asserts both are represented, that every
  emitted row has a `camera` tag, and that uniqueness holds per
  `(timestamp, camera, role)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d0388e1142 fix(annotate): transcode subclips to H.264 instead of stream-copy
Modern LeRobot datasets store videos in AV1, which vllm's libav build
cannot decode (the video processor returns 0 frames and downstream
chokes with ZeroDivisionError). Re-encode each per-episode subclip
with libx264 (preset ultrafast, crf 23) so the resulting mp4 is
universally decodable. Strip audio with -an for a smaller payload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 524aa59faa feat(annotate): pack multiple vllm replicas per GPU via num_gpus
Adds VlmConfig.num_gpus so parallel_servers can exceed the physical
GPU count. Replicas are round-robin-assigned to GPUs (e.g.
parallel_servers=4 + num_gpus=2 → replicas pinned to GPUs 0,1,0,1).
Backward-compatible: num_gpus=0 keeps the existing 1-replica-per-GPU
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 27f7829b09 feat(annotate): forward chat_template_kwargs to OpenAI extra_body
Lets callers pass per-request template flags such as
{"enable_thinking": false} for Qwen3.5/Qwen3.6 models, where the
default thinking preamble otherwise consumes the entire max_new_tokens
budget before any JSON is emitted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7f8bf108e8 fix(annotate): include prompt .txt files in wheel
The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 855ff027f8 refactor(annotate): drop HF Inference Providers code path
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3b797bb118 feat(annotate): --vlm.push_to_hub uploads the annotated dataset
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn aea04721ae feat(annotate): parallelize episodes within each module phase
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn ab5479129a fix(annotate): probe /v1/models for spawn-helper readiness
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn e6d4ac6f02 fix(annotate): lock-protect per-line writes for parallel server streams
8 server-streaming threads writing chars unsynchronized cause UTF-8
sequences from different servers to interleave mid-byte, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 5722d365c5 feat(annotate): client_concurrency for parallel in-flight requests
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.

Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3d7e60cee4 feat(annotate): parallel_servers spawns N independent vllm replicas
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.

Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7b767d4d60 feat(annotate): per-episode progress logs in executor 2026-04-30 18:48:35 +02:00
Pepijn f1e3ab7794 fix(annotate): don't crash pipeline on persistent JSON parse failure
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 585341ba9f fix(annotate): robust JSON extraction (think tags + first balanced object)
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 23ff346027 fix(annotate): stream child stdout char-by-char so tqdm \\r progress flushes 2026-04-30 18:48:35 +02:00
Pepijn 3c5cbe7af4 test(annotate): adjust video-block test for fps-based frame sampling 2026-04-30 18:48:35 +02:00
Pepijn f2cbd97635 feat(annotate): Module 1 samples image frames at fps rate
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.

Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn c06c8d594a feat(annotate): use cached HF token from huggingface-cli login
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn cd495a3a9d feat(annotate): default to HF Inference Providers, no local GPU needed
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.

Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c99ac45cd1 feat(annotate): one-flag HF Inference Providers backend
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 13aaafeae0 fix(annotate): omit mm_processor_kwargs by default; transformers serve rejects it
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 2129648bf4 fix(annotate): mm_processor_kwargs in extra_body; inline file URLs as data URLs
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
  block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
  and inline it as a base64 data:video/mp4 URL so the server sees the
  bytes directly.

Both surface as std::bad_alloc on the server side when wrong, which is
unhelpful but explains what we hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn f5cd3f6e4e fix(annotate): detect server ready via stdout banner, not /v1/models polls
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ecf5766301 fix(annotate): visible auto_serve via stdout prints + live server log stream
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 11597d4f71 fix(annotate): auto_serve defaults to True; probe before spawning
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 8b9c598cf4 feat(annotate): auto_serve mode spawns and tears down inference server
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.

Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn b325475b38 feat(annotate): video_url block for openai backend
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).

A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.

Enable with --module_1.use_video_url=true (and --vlm.backend=openai).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ef137ff86a feat(annotate): openai-compatible backend for transformers/ktransformers serve
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:

- launch the server out-of-process (transformers serve, vllm serve,
  ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
  --vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...

Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c5df821a96 fix(annotate): use vllm.chat() API for multimodal prompts
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 7ec3d7999c fix(annotate): drop guided_decoding=dict (api differs across vllm)
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 712d63abbd fix(annotate): tolerate decoder returning fewer frames than requested
pyav (and sometimes torchcodec) decode can return fewer frames than
requested timestamps when some timestamps fall outside the video file's
content range. Drop the strict=True on the zip and rely on the
None-filter to discard missing frames.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 6653999983 fix(annotate): default video decode backend to pyav
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 4bdbedc9a0 fix(annotate): default trust_remote_code=False for HF loaders
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn e240305e8e fix(annotate): default transformers backend to manual GPU placement
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.

LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00