Compare commits

...

179 Commits

Author SHA1 Message Date
Pepijn f60babc946 refactor(inference): subtask = head of plan, drop AR generation
Replaced the LM-head subtask generator with a deterministic
plan-walker. The plan is already a numbered list of still-todo
subtasks (produced by ``plan_generation``); the current subtask is
just the head of that list, popped one position per chunk-dispatch
boundary.

Why
---
The previous design called ``select_message`` at ~1 Hz to
auto-regressively generate the current subtask string. On a small
training set with a chat-pretrained backbone, the LM head collapses
into n-gram loops under greedy decoding —

  "the robot arm extends and retracts and retracts and retracts
   from the beige surface and retracts from the surface"

— which then poisons the action expert's language conditioning,
because the same garbage string is what gets rendered into its
prefix. Repetition penalty / no_repeat_ngram_size help, but they
are band-aids.

Walking the plan deterministically eliminates the failure mode at
the root: the action expert always sees a real, recipe-aligned
subtask phrase. Inference is also interpretable now — every robot
motion traces back to a specific plan line — and we save the per-
chunk ``select_message`` round trip (~2 s on MPS).

The LM head is still trained (``high_level_subtask`` text-CE) and
still used at runtime for one-shot plan generation, interjection
replanning, and VQA. We just don't sit on the AR subtask loop.

Advance trigger: the ``actions_dispatched`` increment — each new chunk
the action expert produces and dispatches advances ``plan_index``
by one. Clamps to the last plan line so the robot keeps executing
the final subtask if it runs off the end instead of going silent.
Falls back to the task string when no plan is set yet (e.g. before
``plan_generation`` has fired this episode).

Added ``_parse_plan_lines`` helper for stripping the ``"N. "``
prefix off plan items.
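
A minimal sketch of the plan-walker, assuming a ``state`` dict with
hypothetical ``plan_lines`` / ``plan_index`` / ``task`` keys; the helper
body is a guess at the described behaviour, and the real advance happens
on the ``actions_dispatched`` trigger above:

    import re

    def _parse_plan_lines(plan: str) -> list[str]:
        """Strip the leading "N. " numbering off each non-empty plan line."""
        return [
            re.sub(r"^\d+\.\s*", "", line.strip())
            for line in plan.splitlines()
            if line.strip()
        ]

    def current_subtask(state: dict) -> str:
        """Head of the still-todo plan, clamped to the last line; task as fallback."""
        lines = state.get("plan_lines") or []
        if not lines:
            return state.get("task", "")
        return lines[min(state.get("plan_index", 0), len(lines) - 1)]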

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:39:42 +02:00
Pepijn db2972fb6c feat(inference): repetition_penalty + no_repeat_ngram_size for select_message
Under-trained LM heads (small dataset + a few thousand steps on a
chat-pretrained backbone) collapse into n-gram loops under greedy
decoding — observed in a real-robot run as

    "the robot arm extends and retracts and retracts from the
     beige surface and retracts from the surface"

repeating the same trigram across the whole 256-token budget.

Added the two standard HF generation knobs to
``SmolVLA2Policy.select_message``:

* ``repetition_penalty`` (1.0 = off) — divides positive logits /
  multiplies negative logits for already-emitted token ids.
* ``no_repeat_ngram_size`` (0 = off) — hard-bans any token that
  would complete an n-gram already present in the generated suffix.
  Implemented via a small ``_ngram_banned_ids`` helper that mirrors
  HF's ``_get_ngrams`` semantics.
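
A sketch of the ban computation, mirroring HF's ``_get_ngrams`` semantics
(the helper's exact signature is assumed):

    def _ngram_banned_ids(generated: list[int], ngram_size: int) -> set[int]:
        # Collect every emitted n-gram keyed by its (n-1)-token prefix, then ban
        # the continuations of the prefix that ends the current sequence.
        if ngram_size <= 0 or len(generated) < ngram_size:
            return set()
        seen: dict[tuple[int, ...], set[int]] = {}
        for i in range(len(generated) - ngram_size + 1):
            prefix = tuple(generated[i : i + ngram_size - 1])
            seen.setdefault(prefix, set()).add(generated[i + ngram_size - 1])
        tail = tuple(generated[-(ngram_size - 1):]) if ngram_size > 1 else ()
        return seen.get(tail, set())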

Wired through ``_generate_with_policy`` to all four call sites
(subtask, memory, plan/say, vqa) and exposed as
``--text_repetition_penalty=1.2-1.5`` and
``--text_no_repeat_ngram_size=3`` on the runtime CLI.

Empirically ``--text_no_repeat_ngram_size=3`` alone usually breaks
the trigram-loop failure mode without distorting the next-token
distribution; combine with ``--text_repetition_penalty=1.2`` for
heavier collapses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:35:46 +02:00
Pepijn 95033733fc deps: add sentencepiece to the pi extra (FAST action tokenizer)
PI052 and PI0_FAST both load ``physical-intelligence/fast`` as
their action tokenizer. That tokenizer's HF backend requires
``sentencepiece`` to instantiate (or ``tiktoken``); without it
``AutoProcessor.from_pretrained`` raises:

  ValueError: Couldn't instantiate the backend tokenizer from one of:
  (1) a tokenizers library serialization file,
  (2) a slow tokenizer instance to convert or
  (3) an equivalent slow tokenizer class to instantiate and convert.
  You need to have sentencepiece or tiktoken installed [...]

It wasn't listed in pyproject.toml, so fresh installs missed it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:52:55 +02:00
Pepijn c3503b774f fix(debug): dumper now shows real stream + target flags
The dumper was printing ``stream=None target=None`` for every
message because it read those fields off the message dicts, but
the recipe renderer keeps them in parallel arrays
(``message_streams`` / ``target_message_indices`` in
COMPLEMENTARY_DATA) so the chat template doesn't see unknown
keys. Zip them back into the dump-time dicts so the printed
metadata is accurate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:43:51 +02:00
Pepijn 99ebee4d16 annotate: tighter subtask + memory prompts (≤5 / ≤10 words)
Both feed into the high-level prompt and the plan rendering, so
keeping them short directly reduces the rendered ``${task}\nPlan:
…\nMemory: …`` prefix the model has to chew through at inference.

Subtasks
* Hard cap: ≤ 5 words. Verb + object only, drop articles/adverbs.
* Concrete good/bad examples to anchor the VLM.

Memory
* Hard cap: ≤ 10 words. Telegraphic noun→location fragments
  ("bowl in box, lid open"), no past-tense verbs, drop attributes
  that don't matter for downstream subtasks.
* Allow empty string when no material change occurred — keeps the
  rendered memory line literally blank instead of forcing a no-op
  sentence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:28:09 +02:00
Pepijn a8ca5128b8 fix(annotate): re-emit plan at every subtask boundary
Previously only emitted a plan at t=0 and on interjections, so the
active plan rendered into training carried "done" subtasks until
the next interjection. With the new "plan = remaining subtasks"
summariser this meant the plan was stale between boundaries.

Emit a fresh plan row at every subtask start. ``active_at(t)`` then
returns a plan that contains exactly the subtasks whose start ≥
the current span's start — completed subtasks fall off the plan
the moment the next subtask begins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:26:49 +02:00
Pepijn dd97c33814 refactor(annotate): plan = summary of still-todo subtasks, drop VLM call
The plan was being generated by a separate VLM call (one per
episode + one per interjection refresh) with a prompt that asked
the model to "compress the subtasks into a compact hierarchical
plan". In practice the plans came out longer than necessary and
sometimes drifted from the actual subtask sequence the runtime
would execute.

Replaced ``_generate_plan`` with a deterministic numbered list
of the upcoming subtasks. At a refresh time the list shrinks to
subtasks whose start ≥ refresh_t — the plan describes what's
*left* to do, so it gets shorter as work progresses.
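
In code the plan renderer reduces to roughly this (subtask spans as
(start_time, text) pairs; names hypothetical):

    def render_plan(spans: list[tuple[float, str]], refresh_t: float = 0.0) -> str:
        # Numbered list of the subtasks still to come at refresh_t; completed
        # spans (start < refresh_t) simply fall off the plan.
        todo = [text for start, text in spans if start >= refresh_t]
        return "\n".join(f"{i}. {text}" for i, text in enumerate(todo, start=1))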

Saves the per-episode + per-interjection VLM round-trip in the
annotation pipeline and keeps train-time plan text bit-aligned
with the subtask annotations the rest of Module 1 emits.

Removed the now-unused ``prompts/module_1_plan.txt``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:55:02 +02:00
Pepijn fa45ba631b fix(policies,recipe): register PI052Config + allow flow-only sub-recipes
Two regressions surfaced by the first training run:

1. ``--policy.type=pi052`` failed with ``invalid choice``. PI052Config
   wasn't imported in ``policies/__init__.py``, so its
   ``@register_subclass("pi052")`` decorator never ran and draccus
   didn't see it as a valid policy type. Mirror PI05Config /
   SmolVLA2Config in the top-level imports + __all__.

2. ``low_level_execution`` (user-only ``${subtask}`` recipe used for
   π0.5-style flow conditioning) tripped
   ``ValueError: Message recipes must contain at least one target
   turn.`` The validator was too strict — a recipe with only a
   ``stream: low_level`` turn still drives meaningful supervision
   (flow MSE on the action expert via ``predict_actions=True``).
   Allow either ``target: true`` OR ``stream: low_level`` to satisfy
   the "supervises something" requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:51:47 +02:00
Pepijn ffd8c92ce5 fix(inference): always emit Plan:/Memory: labels in the high-level prompt
The recipe renders ``"${task}\nPlan: ${plan}\nMemory: ${memory}"``
unconditionally — when a binding resolves to None,
``language_render._substitute`` substitutes an empty string, so the
training-time user turn always contains the literal ``Plan: `` /
``Memory: `` prefixes even with empty values.

The inference message builders were skipping those lines entirely
when ``state['current_plan']`` / ``state['current_memory']`` was
empty, producing a different prompt shape on early frames (before
the plan-generation step runs) and on datasets without plan/memory
annotations.

Factored a shared ``_hirobot_user_head`` helper used by
``_msgs_for_subtask``, ``_msgs_for_memory``, and the legacy
``_control_context_messages`` so they all match training byte-for-
byte regardless of which bindings are populated.
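
The shared builder reduces to roughly this (state keys assumed); the point is
that the literal ``Plan:`` / ``Memory:`` labels never disappear, only their
values may be empty:

    def _hirobot_user_head(state: dict) -> str:
        # Mirrors the training-time render: unresolved bindings become empty
        # strings, but the label lines are always present.
        return (
            f"{state.get('task', '')}\n"
            f"Plan: {state.get('current_plan') or ''}\n"
            f"Memory: {state.get('current_memory') or ''}"
        )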

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:42:29 +02:00
Pepijn 841d3c47e1 feat(debug): LEROBOT_DUMP_RECIPE_SAMPLES=N dumps the first N rendered samples
Adds a one-shot debug dumper to both chat processors. When the env
var ``LEROBOT_DUMP_RECIPE_SAMPLES`` is set to a positive integer N,
the next N samples processed (rank-0 only) get pretty-printed:

* the recipe-rendered messages (role / stream / target / content),
* the full tokenized prompt (decoded back),
* inline ``[TGT]...[/TGT]`` markers over the spans the LM head is
  supervised on,
* token count + target-token count,
* ``predict_actions`` flag.

Usage:

  LEROBOT_DUMP_RECIPE_SAMPLES=5 sbatch train_smolvla2.slurm

After N dumps the helper becomes a no-op; training continues
unaffected. Works for both smolvla2 (chat-template renderer) and
pi052 (plain ``Role: content`` concat renderer); each processor has
its own copy to avoid cross-package imports.
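
The gating is just an env-var countdown, roughly (rank-0 check and the
pretty-printing elided; names hypothetical):

    import os

    _dumps_left = int(os.environ.get("LEROBOT_DUMP_RECIPE_SAMPLES", "0") or 0)

    def _maybe_dump_sample(render_sample) -> None:
        # Prints the next rendered sample while the budget lasts, then no-ops.
        global _dumps_left
        if _dumps_left <= 0:
            return
        _dumps_left -= 1
        print(render_sample())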

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:21:46 +02:00
Pepijn 2c920ab178 refactor(recipes): consolidate to shared hirobot.yaml + audit fixes
The smolvla2 and pi052 recipe blends had converged to identical content
twice in a row; collapse them into a single ``recipes/hirobot.yaml``
both policies point at. Each backbone's text tokenizer (chat-template
for SmolVLA2, plain ``Role: content`` for PI052) handles the
rendering differences downstream — the recipe spec is shared.

Audit fixes folded into the same commit:

* **Train/inference prefix mismatch on the action expert**
  ``_build_text_batch`` always passed ``add_generation_prompt=True``,
  appending ``<|im_start|>assistant\n`` tokens that the action
  expert never saw at training (the chat tokenizer renders with
  ``add_generation_prompt=False``). Parameterized the helper and
  pass ``False`` from ``LowLevelForward``; ``select_message`` paths
  still default to ``True`` for AR text generation (see the sketch
  after this list).

* **PI052 fallthrough could silently train flow on text-only frames**
  When ``text_loss_weight=0`` AND every sample was high-level
  (``predict_actions.any()==False``), the previous heuristic
  delegated to ``PI05Policy.forward``, which ignores
  ``predict_actions`` and runs flow on every sample. Reverted to
  delegating only on fully unannotated batches.

* **SmolVLA2 silent zero-loss training**
  ``forward`` returned ``loss=0`` (no error) when neither flow nor
  text path fired. Now raises ``RuntimeError`` with the weights and
  routing flags — fails loud like PI052 already does.

* **PI052 dropout-seed key**
  Was reading ``complementary["dataset_index"]`` (only set by
  ``MultiDataset`` and means "which sub-dataset", not row index)
  with fallback to ``frame_index`` (never set) — every sample got
  seed=0, so per-component dropout was deterministic across the
  epoch. Switched to ``complementary["index"]`` to match SmolVLA2
  and the canonical ``BatchProcessor`` convention.

* **Dead ``DEFAULT_TOOLS`` import**
  Removed from ``chat_processor_smolvla2.py`` — unused since the
  default-tools list was switched to ``[]`` in the prior commit.
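
A sketch of the parameterized helper from the first audit fix, built on the
standard ``apply_chat_template`` API (exact argument set and return handling
assumed):

    def _build_text_batch(messages, tokenizer, add_generation_prompt: bool):
        # Training renders with add_generation_prompt=False, so the action-expert
        # prefix (LowLevelForward) passes False; select_message keeps True for
        # AR text generation.
        return tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=add_generation_prompt,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        )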

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:28 +02:00
Pepijn 9f630e2a41 fix(recipes,training): stop tool prompt leak + drop subtask copy-supervision
CRITICAL (smolvla2) — the SmolVLM2 chat template was rendering the
``say`` tool's JSON schema as a system message on every training
sample because ``DEFAULT_TOOLS`` was the default in
``SmolVLA2ChatTokenizerStep``. That schema was only relevant to
the now-removed ``user_interjection_response`` recipe; with it
gone the schema is dead weight that polluted every action-expert
prefix AND created a train/inference mismatch (the inference
``_build_text_batch`` doesn't pass ``tools=``). Default is now
``[]``; callers needing tools can still set them via
``with_tools(meta.tools)``.

LIKELY-BUG — ``low_level_execution`` had ``target: true`` on its
assistant turn, so text-CE trained the LM head to predict the
same subtask string the user just stated (trivial "copy previous
turn" supervision that diluted LM head capacity). Dropped the
assistant turn entirely; ``high_level_subtask`` (w=0.50) already
owns subtask prediction from real context.

The chat-tokenizer's ``predict_actions`` detection used to scan
target streams only. With the new no-target low_level recipe it
would mis-fire as False. Switched both
``chat_processor_smolvla2.py`` and ``text_processor_pi052.py`` to
scan all message streams — any ``stream: low_level`` on the
sample is enough to trigger flow loss.

Inference: the low-level loop sends only ``[user(subtask)]`` now,
matching the new recipe shape.

PI052 — hardened the forward fallthrough so a degenerate batch
where every sample's recipe is text-only AND text supervision is
disabled (text_loss_weight<=0 or text_labels missing) cleanly
delegates to ``PI05Policy.forward`` instead of raising
"nothing to train".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 14:59:01 +02:00
Pepijn 7a32f8a72a refactor(recipes): π0.5-style split — action expert conditions on subtask only
Previously ``action_execution`` rendered ``task + plan + memory +
subtask`` into one prefix and ran the flow loss on it. That meant
the action expert was conditioned on the full hierarchical context
(closer to π0.7 §V.A), not just the subtask.

The π0.5 paper's hierarchical inference has the action expert see
only the *subtask* (plus images and state). Split the recipe to
match:

  high_level_subtask  (0.50)
    user(task + plan + memory) → assistant(subtask)
    [+ assistant(new_memory) at boundary frames]
    All ``stream: high_level`` → text-CE only, no flow loss.

  low_level_execution (0.30)
    user(subtask) → assistant(subtask)
    Both ``stream: low_level`` → flow loss fires; text CE on the
    subtask is a small redundant extra signal. Prefix the action
    expert sees: [images, subtask, state].

  plan_generation (0.10) — unchanged.
  ask_vqa_{top,wrist} (0.05 each) — unchanged.

Runtime: the low-level loop in ``smolvla2/inference/steps.py``
now sends ``[user(subtask), assistant(subtask)]`` to
``predict_action_chunk`` instead of the full task+plan+memory
context. Falls back to ``state['task']`` when no subtask has been
generated yet so the first frame still has something to condition
on.
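
A sketch of that message shape (helper name hypothetical; images and state
are passed alongside, not shown here):

    def _low_level_messages(state: dict) -> list[dict]:
        # π0.5-style conditioning: the action expert sees only the subtask,
        # falling back to the task string before the first subtask exists.
        subtask = state.get("current_subtask") or state.get("task", "")
        return [
            {"role": "user", "content": subtask},
            {"role": "assistant", "content": subtask},
        ]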

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 14:13:07 +02:00
Pepijn 129aa207e3 fix(smolvla2,pi052): training-correctness audit fixes
CRITICAL (smolvla2) — text-CE was applied to the wrong prefix slice.
``num_state`` was being read from ``state.shape[1]`` (the raw
max_state_dim, ~14-32) instead of the *number of state tokens*
(always 1). Compounded by the trailing-padding issue (state is
not at the end of the padded prefix when ``seq_len < prefix_length``),
the lang slice was landing on image / padding hidden states.

New ``_locate_lang_range`` finds the state position via
``att_masks.nonzero()`` (the only ``1`` in the mask), making the
slice robust to both bugs. Used by ``_compute_text_loss`` and
``_compute_fused_loss``.
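
The idea, sketched for a single sample (the real helper handles the batch
dimension): the lone ``1`` in ``att_masks`` marks the state token, and the
language block sits directly before it in the ``[image_blocks..., lang, state]``
layout.

    import torch

    def _locate_lang_range(att_masks: torch.Tensor, num_lang_tokens: int) -> tuple[int, int]:
        # att_masks: 1-D mask whose only nonzero entry is the state token.
        state_pos = int(att_masks.nonzero(as_tuple=True)[0][0])
        return state_pos - num_lang_tokens, state_pos  # [lang_start, lang_end)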

LIKELY-BUG (smolvla2) — ``_unfreeze_lm_head`` only re-enabled
``lm_head`` and ``text_model.model.norm.weight``. SmolVLA's parent
ALSO freezes the last 1-2 transformer layers, so text-loss
gradients died in a frozen final block. Now mirrors the parent's
freeze targets and unfreezes the matching ``layers.{N-1}`` (and
``N-2`` when num_vlm % num_expert == 0).

CRITICAL (pi052) — flow and FAST CE were not per-sample masked
under per-sample-routing. Text-only recipe samples
(``plan_generation``, ``ask_vqa_*``) contributed to flow/FAST
loss with prompts that deliberately omit the subtask, corrupting
the signal. Threaded ``predict_actions_t`` through both
``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``;
flow uses ``(per_sample * mask).sum() / mask.sum()``, FAST uses
``shift_valid & sample_mask`` before ``masked_fill(-100)``.

OTHER
* PI052Policy.forward now falls through to PI05Policy.forward on
  unannotated batches (no text_labels, no predict_actions, no FAST).
* fit_fast_tokenizer cache key now includes ``chunk_size`` — changing
  the chunk size no longer silently loads a wrongly-fit tokenizer.
* Removed dead ``_compute_text_loss`` / ``_compute_fast_action_loss``
  in pi052 (superseded by the fused helpers).
* Fixed stale "no-op stub" docstring on ``knowledge_insulation`` —
  it's been fully wired since the per-layer KI forward port.
* Stripped unused ``copy`` / ``resize_with_pad`` imports.
* Extracted ``_shifted_ce`` / ``_mask_per_sample`` / ``_fast_ce``
  helpers shared between fused and prefix-only paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 14:08:06 +02:00
Pepijn e3ad1c59fc feat(recipes): add plan_generation sub-recipe to smolvla2 + pi052 blends
New text-only sub-recipe at 0.10 weight on both blends:

    user      :  ${task}
    assistant :  ${current_plan}   (high_level target)

Bound to ``active_at(t, style=plan)`` so it supervises the
currently-active plan on every frame, gated by ``if_present`` to
skip frames without a plan annotation.

Weights rebalanced: action_execution 0.85 → 0.75, plan_generation
0.10, VQA top/wrist 0.075 each (sums to 1.0).

Added matching runtime builder ``_msgs_for_plan`` in
``smolvla2/inference/steps.py`` so the high-level loop can call
``select_message`` with the bare-task prompt at episode start /
replanning events.

Closes a gap vs. Pi 0.7 §V — without this recipe the model could
read ``${plan}`` from the prompt but never had to produce one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:51:37 +02:00
Pepijn 9ff62cb08c docs(recipes): trim header comments, drop diversity-knobs note in run_hf_job
Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a
short header + a one-line note on the boundary-frame memory tail.

Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about
a since-merged experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:55:03 +02:00
Pepijn b2aa372fcf refactor(recipes): fold memory into action_execution, drop interjection, fuse smolvla2 forward
Recipe changes:
* action_execution now bundles the memory update as a second
  assistant target gated on a new ``new_memory`` binding (fires
  only at subtask-boundary frames). No "Completed subtask: X"
  filler — the model emits the new subtask AND the updated
  memory back-to-back in one prefix.
* user_interjection_response sub-recipe removed (current
  datasets don't have interjection / say() annotations).
* Standalone memory_update sub-recipe removed (folded above).
* Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist
  0.075 each (sums to 1.0).

Runtime ``_msgs_for_memory`` updated to match the new
boundary-frame prompt layout.

Modeling:
* SmolVLA2Policy now fuses the flow + text losses into a SINGLE
  backbone forward via ``_compute_fused_loss`` (one
  vlm_with_expert pass with [prefix, suffix] embeds, then both
  lm_head CE on lang slice + action_out_proj MSE on suffix).
  Mirrors pi052's existing ``_compute_all_losses_fused`` —
  saves one backbone pass per training step.

Examples:
* Removed the two training SLURM scaffolds; they were
  out-of-date with the recipe refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:51:09 +02:00
Pepijn 058b8f3958 refactor(recipes): two-flavor design — one fused action_execution + text-only events
Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.

Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------

One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:

  * subtask prediction (text CE on the assistant span via lm_head)
  * action chunks (flow MSE on the action expert via
    stream: low_level, target: true; plus FAST CE on action tokens
    when enable_fast_action_loss=True)

At inference, the *same* prompt structure drives both inference
modes:

  * select_message(user_prompt_only) → LM head generates the next
    subtask. Matches action_execution's training distribution
    exactly (prompt is the user turn, target is the subtask).
  * predict_action_chunk(user_prompt + assistant_subtask) → action
    expert produces the chunk. Matches action_execution's full
    prompt+target.
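
  The call pattern, sketched (method names from this commit; message
  building simplified, images and state omitted):

    def hierarchical_step(policy, task: str, plan: str, memory: str):
        user = {"role": "user", "content": f"{task}\nPlan: {plan}\nMemory: {memory}"}
        subtask = policy.select_message([user])              # LM head: next subtask
        chunk = policy.predict_action_chunk(                 # action expert: chunk
            [user, {"role": "assistant", "content": subtask}]
        )
        return subtask, chunk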

This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.

Flavor 2 — event-driven text-only recipes
-----------------------------------------

Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.

  * memory_update           (10%)  new memory at subtask boundary
  * user_interjection_response (15%) new plan + say(...) on input
  * ask_vqa_top             (7.5%) front-camera VQA
  * ask_vqa_wrist           (7.5%) wrist-camera VQA

Total weight = 1.0.

Prompt format consistency
-------------------------

User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.

What changed structurally
-------------------------

  - low_level_execution            DROPPED  (folded into action_execution)
  - high_level_subtask             DROPPED  (subtask supervision moved into action_execution)
  + action_execution               NEW      (the fused main recipe)
    memory_update                  kept, prompt cleaned up
    user_interjection_response     kept, prompt cleaned up
    ask_vqa_top / ask_vqa_wrist    kept

Runtime compatibility
---------------------

No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:35:51 +02:00
Pepijn b873fe454c perf(pi052): full fusion — text + FAST + flow in ONE backbone forward
Previously the forward did 2 backbone passes when all heads were
active: one for flow (via super().forward) and one for the fused
text+FAST helper. This commit reduces it to **one pass** — same
compute as flow-only training.

New ``_compute_all_losses_fused`` builds:

    prefix = [images, language, FAST (when provided)]
    suffix = [noisy_actions]  (action expert via gemma_expert)

and runs a single ``paligemma_with_expert.forward`` with
``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active
in the same call). Captures *both* prefix_out and suffix_out, slices
each for its respective loss:

    flow MSE     ← suffix_out  (existing action_out_proj + MSE path)
    text  CE     ← prefix_out at language positions (lm_head + CE)
    FAST  CE     ← prefix_out at FAST positions (lm_head + CE)

Critical attention mask override
--------------------------------

``make_att_2d_masks`` produces a cumulative-block attention mask in
which suffix tokens (highest cumsum) attend to *every* lower-cumsum
position by default, including FAST tokens. If we let that stand the
action expert reads the discrete FAST tokens and trivially decodes
them back to the same continuous actions the flow head is supposed
to predict from noise — the entire training signal collapses to a
copy operation.

The fix is a single line right after make_att_2d_masks:

    att_2d_masks[:, fast_end:, fast_start:fast_end] = False

Explicitly zeros out *suffix → FAST* attention. Everything else
remains correct under the cumsum semantics:

  * prefix images/language stay bidirectional among themselves
  * FAST stays causal within itself, attending bidirectionally
    to images+language
  * FAST cannot see suffix (cumsum < suffix cumsum, default)
  * suffix attends bidirectionally among itself, to images+language,
    and now NOT to FAST (this override)

Bit-equivalent to the previous separated forward path for text+FAST
losses (the prefix hidden states at language and FAST positions are
unchanged whether suffix is present or not — the prefix doesn't
attend to suffix). For flow loss, suffix→FAST being masked is the
correct behaviour we *want* — if anything the previous separated
path was less correct for production use because the joint
gradient signal through the action expert was missing the prefix
extension.

Forward routing in ``forward()``
--------------------------------

  * run_flow=True  →  _compute_all_losses_fused (one forward, all
                      three losses)
  * run_flow=False, run_text or run_fast → _compute_text_and_fast_loss
                      (one prefix-only forward, two CE losses, no
                      suffix → cheaper than fusion)
  * neither       →  RuntimeError (explicit; both losses disabled)

Wall-time per step
------------------

  Before this commit:  flow + (text+FAST fused) = 2 forwards
  After this commit:   (flow+text+FAST fused)   = 1 forward

Compute parity with flow-only training when all three heads active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:28:38 +02:00
Pepijn 83d7250a22 fix(recipes): low_level_execution needs if_present:subtask guard too
Same bug we fixed for high_level_subtask, just on the other
subtask-supervised sub-recipe. ``low_level_execution`` targets
``${subtask}`` (the current active span) but had no
``if_present`` guard. When ``active_at(t, style=subtask)`` returned
None at a frame (gaps in the annotation, or the very first/last
frames of an episode if the annotator's spans don't fully tile),
the assistant message rendered with empty content. The chat
tokenizer still included it in ``target_message_indices`` → text CE
supervised whatever the chat-template's empty assistant turn
decoded to (usually a single ``\n``). That trains the LM head's
prior at the first generation position toward ``\n``, the same
collapse we observed with the original ``${next_subtask}`` target.

Fix: ``if_present: subtask`` on the assistant target in
``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and
``pi052_hirobot.yaml``.

Side effect: frames without an active subtask span no longer
contribute to the flow loss either (the only ``low_level`` target
is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))``
becomes False). For a well-annotated dataset where subtask spans
tile the whole episode this is a no-op. For datasets with gaps,
those gap frames lose flow supervision — strictly better than the
degenerate text-CE alternative.

Sub-recipe audit summary (no other changes needed):

  * memory_update                 — all if_present guards present, OK
  * user_interjection_response    — all if_present guards present, OK
  * high_level_subtask            — fixed earlier, OK
  * low_level_execution           — fixed by this commit
  * ask_vqa_top / ask_vqa_wrist   — query+answer both guarded, OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:22:45 +02:00
Pepijn 35f9063a6c perf(pi052): fuse text + FAST loss into a single prefix forward
Previously the forward did three backbone passes per training step
when all heads were active: one for flow (via super().forward), one
for text CE, and one for FAST CE. That's ~3× the compute of
flow-only training.

The text and FAST losses share their prefix forward exactly — both
are CE on the LM head, evaluated at different slices of the same
hidden states. Adding FAST tokens after language in the prefix is
bit-equivalent for the text loss because the mask_ar convention in
``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal
block: language tokens never see FAST, so their hidden states are
unchanged.

New ``_compute_text_and_fast_loss``:

  * embeds [images, language] once
  * optionally appends [FAST] (when run_fast is True)
  * one backbone forward
  * slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for
    language hidden states (or ``vlm_out[:, -lang_len:]`` when no
    FAST) → text CE
  * slices ``vlm_out[:, -fast_len:]`` for FAST hidden states →
    FAST CE
  * returns both losses, either of which can be None when the
    caller doesn't want that head.
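
The slicing reduces to roughly this (next-token shift and per-sample masking
elided; hidden states assumed ``[B, T, D]`` with ``-100``-padded labels):

    import torch.nn.functional as F

    def text_and_fast_ce(vlm_out, lm_head, lang_len, fast_len, text_labels, fast_labels):
        if fast_len > 0:
            lang_hidden = vlm_out[:, -(fast_len + lang_len):-fast_len]
            fast_hidden = vlm_out[:, -fast_len:]
        else:
            lang_hidden, fast_hidden = vlm_out[:, -lang_len:], None
        text_loss = F.cross_entropy(
            lm_head(lang_hidden).flatten(0, 1), text_labels.flatten(), ignore_index=-100
        )
        fast_loss = None
        if fast_hidden is not None:
            fast_loss = F.cross_entropy(
                lm_head(fast_hidden).flatten(0, 1), fast_labels.flatten(), ignore_index=-100
            )
        return text_loss, fast_loss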

forward() now calls this fused helper instead of running the two
separate ``_compute_text_loss`` / ``_compute_fast_action_loss``
methods. Those remain in the file for callers that only want one
head (e.g. ablations).

Why flow isn't fused
--------------------

Flow MSE comes from the action-expert (suffix) hidden states, which
attend to the prefix. If we just concat FAST onto the prefix and let
the action expert attend to it, the expert can trivially decode FAST
back to continuous actions — overfitting via shortcut. Preventing
that requires a custom segment-aware attention mask (action expert
can attend to images+language but NOT to subtask/FAST), which is
what pi05_full does in ``compute_layer_complete_knowledge_insulation``.
That's the full-fusion path; deferred as a follow-up since the
text+FAST fusion already recovers most of the compute.

End-to-end forward pass count
-----------------------------

Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards
After:  1 (flow) + 1 (text+FAST fused) = 2 backbone forwards

~33% wall-time reduction per training step when all three heads
are active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:08:34 +02:00
Pepijn 17c0800461 fix(pi052): FAST loss masking + predict_actions gating + smolvla2 review
FAST loss changes
-----------------

1. Gate by ``predict_actions`` (same routing as flow loss). The
   ActionTokenizerProcessorStep tokenises actions for *every*
   sample regardless of which sub-recipe rendered it; for text-only
   recipes (high_level_subtask, memory_update, ...) the action
   tokens are still in the batch but mustn't be supervised. Skip
   the FAST forward+CE entirely when no sample in the batch has
   ``predict_actions=True``.

2. Switch from "multiply-by-mask" masking to ``ignore_index=-100``.
   The old pattern computed per-token CE for all positions, then
   zeroed out invalid ones. Two issues: (a) any out-of-vocab target
   id at a padded position would have crashed cross_entropy before
   the mask got a chance to zero it out, and (b) the pattern is
   needlessly clever. Now ``shift_targets.masked_fill(~mask, -100)``
   followed by ``ignore_index=-100`` cleanly drops invalid positions.
   Matches the smolvla2 text-loss convention.

3. Clean up unused ``bsize`` variable in _compute_fast_action_loss
   and expand the attention-mask docstring with the
   ``make_att_2d_masks`` mask_ar convention spec (causal vs
   bidirectional blocks).

smolvla2 audit (reference review, no code change)
-------------------------------------------------

Compared smolvla2/modeling_smolvla2.py against pi052/modeling_pi052.py
to catch parallel bugs. Findings:

* No ``paligemma.language_model`` vs ``paligemma.model.language_model``
  issue — smolvla2 uses SmolVLM (different class, different attribute
  layout) so the bug doesn't apply.

* ``fill_kv_cache=True`` is correctly passed to smolvla's
  ``vlm_with_expert.forward`` — that class *does* accept the kwarg
  (unlike pi05's PaliGemmaWithExpertModel.forward, which is why
  pi052 must omit it).

* Text-loss alignment is correct: ``_compute_text_loss`` computes
  ``lang_start`` / ``lang_end`` from the known prefix layout
  (``[image_blocks..., lang, state]``) and slices ``prefix_out``
  to just the language positions before applying ``lm_head``. The
  parallel bug I fixed in pi052 (lm_head over the full prefix,
  shape-mismatched against text_labels) was *not* present in
  smolvla2.

* Per-sample flow routing via ``predict_actions``: correctly masks
  per-sample by calling the parent ``forward(..., reduction='none')``
  and applying the predict_actions mask before the mean. pi052 only
  has the batch-level any() gate — a parallel improvement for pi052
  would require modifying PI05Pytorch.forward to support per-sample
  reduction, deferred.

* ``reduction="none"`` returns ``total.expand(bsize)``: identical
  scalar-broadcast limitation in both policies. Acknowledged but
  low priority (only RA-BC weighting uses the per-sample path and
  it's documented as a known approximation in smolvla2).

* Chat tokenizer correctly handles batched/unbatched messages,
  pads with -100 for label positions, builds attention masks. No
  bugs found.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:05:37 +02:00
Pepijn c8763e0ad5 fix(pi052): four real bugs in the modeling code + flip defaults
Defaults
--------
* enable_fast_action_loss: False -> True   (match paper §III.B-C Eq.1)
* auto_fit_fast_tokenizer: True -> False   (opt-in; needs base.fit())

Bug fixes
---------

1. Wrong attribute path on PaliGemma. The KI port copied
   pi05_full's ``paligemma.language_model.layers[...]`` literally,
   but the production pi05 wrapper exposes the text model at
   ``paligemma.model.language_model``. With KI enabled, every layer
   would have raised AttributeError on first forward. Fixed all
   references in _compute_layer_ki + _paligemma_forward_ki.

2. ``fill_kv_cache=True`` passed to PaliGemmaWithExpertModel.forward.
   That kwarg is a SmolVLA-only concept; pi05's signature has no
   such argument, so every forward call from pi052 (text loss, FAST
   loss, select_message) would have crashed with TypeError. Dropped
   from all four call sites — pi05's forward already handles the
   cache via past_key_values, and re-forwarding the cumulative
   sequence each step in select_message is fine for our short
   subtask completions.

3. Text-loss shape mismatch. _compute_text_loss applied lm_head to
   the *full* vlm_out (image tokens + language tokens), then tried
   to cross-entropy that against text_labels which only covers the
   language portion — the .view(-1) calls would produce two
   tensors of different lengths and CE would fail. Now slices
   vlm_out to the last text_labels.shape[1] positions before
   running lm_head, matching the [images, language] order
   embed_prefix produces.

4. Dead-code conditional in _paligemma_forward_ki's single-expert
   fallback. The ``if hasattr(...) else self._pi052_orig_forward``
   ternary always took the wrong branch because the attribute is
   always set (we save it in PI052Policy.__init__). Simplified to
   just call self._pi052_orig_forward directly.

After this commit, pi052 should be runnable end-to-end for the
first time with all three loss heads + KI active. Still worth a
100-step smoke test before kicking off a long run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:58:40 +02:00
Pepijn 0f4faddc01 feat(pi052): auto-fit FAST tokenizer per-dataset before training
Per Pertsch et al. 2025 (FAST paper, [64] in π0.5) and π0.5 §III.C,
the recommended practice is to *fit* the FAST action tokenizer on
the specific dataset's action distribution rather than using the
published universal codebook off the shelf. The universal tokenizer
works on any 6-DoF action sequence but produces suboptimal
compression, which slows CE convergence and wastes vocab capacity.

New utility ``lerobot.policies.pi052.fit_fast_tokenizer``:

  * samples N action chunks from the LeRobotDataset (default 1024)
  * loads ``physical-intelligence/fast`` as the base
  * calls ``.fit(actions)`` (the AutoProcessor API the HF model card
    documents) — produces a per-dataset codebook
  * saves to ``{cache_dir}/{sha256(dataset, base, n_samples)[:16]}/``
  * returns the local path, ready to feed
    ``ActionTokenizerProcessorStep(action_tokenizer_name=...)``.
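
The fit-and-cache contract, sketched (``.fit`` is the AutoProcessor API noted
above; the chunk-sampling helper is hypothetical and ``trust_remote_code`` is
an assumption):

    import hashlib
    from pathlib import Path

    from transformers import AutoProcessor

    def fit_fast_tokenizer(dataset_repo_id: str,
                           base: str = "physical-intelligence/fast",
                           n_samples: int = 1024,
                           cache_dir: Path = Path.home() / ".cache/lerobot/fast") -> Path:
        key = hashlib.sha256(f"{dataset_repo_id}|{base}|{n_samples}".encode()).hexdigest()[:16]
        out = cache_dir / key
        if not out.exists():
            tok = AutoProcessor.from_pretrained(base, trust_remote_code=True)
            actions = sample_action_chunks(dataset_repo_id, n_samples)  # hypothetical helper
            tok.fit(actions)            # per-dataset codebook
            tok.save_pretrained(out)
        return out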

Cache is keyed on (dataset, base tokenizer, sample count) so changing
any of them re-runs the fit. Re-running training on the same dataset
re-uses the cache (one fit per dataset per machine).

Auto-fit wiring:

  * PI052Config gets ``auto_fit_fast_tokenizer`` (default True),
    ``fast_tokenizer_cache_dir`` (default ~/.cache/lerobot/...),
    ``fast_tokenizer_fit_samples`` (default 1024).
  * make_pi052_pre_post_processors now takes ``dataset_repo_id``;
    when ``enable_fast_action_loss`` and ``auto_fit_fast_tokenizer``
    are both True and a repo_id is provided, the factory calls
    ``fit_fast_tokenizer`` before constructing the processor step
    and points it at the fitted path.
  * ProcessorConfigKwargs gains ``dataset_repo_id``; the global
    factory dispatch threads it through for ``pi052`` policies.
  * lerobot_train.py populates ``processor_kwargs['dataset_repo_id']``
    from ``--dataset.repo_id`` for pi052 runs.

Failure mode: if ``.fit()`` fails (e.g. older transformers without
the method, or no usable action chunks in the dataset), the factory
logs a warning and falls back to the universal base tokenizer. Train
still works; you just lose the compression improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:52:31 +02:00
Pepijn 8dc0af3c28 feat(pi052): FAST action CE loss + knowledge insulation + processor wiring
Three additions ported from ``pi05_full`` on branch ``feat/add-pi05``,
giving pi052 full paper-§III.B-C training capabilities alongside the
recipe-driven text supervision it already had:

* **Config flags** in PI052Config:
    - ``enable_fast_action_loss``  default False
    - ``action_tokenizer_name``    default "physical-intelligence/fast"
    - ``max_action_tokens``        default 256
    - ``fast_skip_tokens``         default 128
    - ``fast_action_loss_weight``  default 1.0
    - ``knowledge_insulation``     default False

* **Processor wiring** (processor_pi052.py): when
  ``enable_fast_action_loss=True``, append an
  ``ActionTokenizerProcessorStep`` after the text tokenizer. It
  tokenises the action tensor with the FAST tokenizer and writes
  ACTION_TOKENS / ACTION_TOKEN_MASK into ``COMPLEMENTARY_DATA`` —
  the existing batch-collation pipeline forwards them as
  ``batch['action.tokens']`` / ``batch['action.token_mask']``.

* **FAST CE loss** (modeling_pi052.py::_compute_fast_action_loss):
  Re-embeds the prefix [images, language], appends the FAST token
  embeddings (using PaliGemma's shared embed_language_tokens),
  forwards through the backbone, slices the trailing
  ``fast_len`` positions, applies the LM head, computes shifted
  next-token CE with the action-mask gating the loss. The loss is
  summed into ``forward()``'s total with ``fast_action_loss_weight``.

* **Knowledge insulation** (modeling_pi052.py::_compute_layer_ki +
  _paligemma_forward_ki): port of pi05_full's per-layer attention
  that detaches VLM K/V on the action-query path so action loss
  gradients cannot flow back into the VLM's K/V projections. Bound
  per-instance via ``types.MethodType`` so it doesn't leak into
  stock ``pi05`` policies that share PaliGemmaWithExpertModel.
  Activated automatically when ``config.knowledge_insulation=True``.
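
  The core of the insulation, sketched for one attention layer (shapes
  ``[B, H, T, D]`` assumed; the real port rewrites the per-layer forward):

    import torch
    import torch.nn.functional as F

    def ki_action_attention(q_action, k_vlm, v_vlm, k_action, v_action):
        # Action queries attend to detached VLM K/V, so the action loss cannot
        # push gradients into the VLM's K/V projections; action-side K/V stay live.
        k = torch.cat([k_vlm.detach(), k_action], dim=2)
        v = torch.cat([v_vlm.detach(), v_action], dim=2)
        return F.scaled_dot_product_attention(q_action, k, v)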

Combined with the existing recipe-driven text head, pi052 now
supports the full three-loss objective:

   L = text_w·H(text) + fast_w·H(FAST actions) + flow_w·MSE(flow)

matching Eq. (1) of arxiv:2504.16054 §IV.D (α=10 by default for the
flow term, 1.0 each for text and FAST CE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:46:21 +02:00
Pepijn 8eba704f15 Revert "chore(training): align pi052_hirobot.slurm with the operator's actual command"
This reverts commit ecbac17196.
2026-05-13 11:03:58 +02:00
Pepijn ecbac17196 chore(training): align pi052_hirobot.slurm with the operator's actual command
Match the working SmolVLA2 launch pattern so the two SLURM scripts
are interchangeable:

  * literal NUM_PROCESSES / BATCH_SIZE / STEPS (no env-var defaults)
  * STEPS=10000 to match the next SmolVLA2 run
  * save_freq=$STEPS so only the final checkpoint is saved
  * dropouts 0.1/0.1/0.1 (mild — matches the operator's iteration)
  * flow_loss_weight / text_loss_weight come from the PI052Config
    defaults (10.0 / 1.0 per Pi 0.5 paper §IV.D), no need to pass
    them explicitly

Job name and policy_repo_id mirror the SmolVLA2 ``_tool-g2`` naming
so the two runs can be compared side-by-side in WandB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:03:09 +02:00
Pepijn 12cce8f2cc fix(smolvla2): align flow_loss_weight default with Pi 0.5 paper's α=10
Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text
CE and flow MSE: actions are the primary output and the flow head
should dominate the gradient signal. SmolVLA2 was defaulting both
weights to 1.0, which inverts that — text CE (~0.5-2.0 nats) ends up
larger than flow MSE (~0.1-1.0), so the action expert gets less
gradient than the LM head despite being the primary task.

Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0.
Same as ``pi052`` (the new full reproduction policy).

Also pin the values explicitly in the SLURM launcher so the choice is
visible and overridable per-run rather than buried in the config
default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:02:17 +02:00
Pepijn ef5879a02a feat(pi052): π0.5 v2 — full reproduction of the π0.5 paper recipe
New ``lerobot.policies.pi052`` (parallel to ``smolvla2``) that adds
text-prediction + hierarchical-inference on top of the existing π0.5
implementation. Mirrors the paper's §IV.D dual-head training:

  L = H(text) + α * ‖ω - a - f_θ_action(...)‖²,  α = 10

Components:

  * ``configuration_pi052.py``     thin PI05Config subclass; adds
                                    recipe_path, text/flow loss weights
                                    (default α=10 per paper), prompt
                                    dropout knobs, ``unfreeze_lm_head``.
  * ``text_processor_pi052.py``    PI052TextTokenizerStep — concatenates
                                    rendered messages as ``Role: ...``
                                    plain text (PaliGemma has no chat
                                    template), tokenises with the
                                    PaliGemma tokenizer, builds a label
                                    mask covering supervised target
                                    spans. Includes Pi 0.7 §V.E
                                    per-component prompt dropout.
  * ``processor_pi052.py``         make_pi052_pre_post_processors —
                                    Rename + Batch + Relative +
                                    Normalize + RenderMessagesStep +
                                    PI052TextTokenizerStep + Device.
                                    Falls back to π0.5's plain pipeline
                                    when recipe_path is unset.
  * ``modeling_pi052.py``          PI052Policy(PI05Policy) — re-enables
                                    PaliGemma ``lm_head``, computes
                                    text_loss via CE on the supervised
                                    span, sums with flow_loss in
                                    forward(), and adds select_message
                                    for AR text generation at inference
                                    (same surface as
                                    SmolVLA2Policy.select_message so
                                    SmolVLA2Runtime drives it unchanged).
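
The objective is just a weighted sum, with the paper's α landing on the flow
term via ``flow_loss_weight`` (a sketch of the combination, not the forward()
plumbing):

    def pi052_total_loss(text_ce, flow_mse, text_loss_weight=1.0, flow_loss_weight=10.0):
        # L = H(text) + alpha * flow MSE, alpha = flow_loss_weight = 10 by default.
        return text_loss_weight * text_ce + flow_loss_weight * flow_mse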

Plus the supporting plumbing:

  * recipe ``configs/recipes/pi052_hirobot.yaml`` — same Hi-Robot blend
    as smolvla2_hirobot.yaml, with the same ``${subtask}`` /
    ``if_present`` supervision fix (current span at every frame, not
    ``${next_subtask}``).
  * SLURM ``examples/training/pi052_hirobot.slurm`` — full training
    command matching the SmolVLA2 launcher.
  * factory registration: ``--policy.type=pi052`` resolves to
    PI052Policy with the new processor.

Same multi-rate runtime (``lerobot.policies.smolvla2.inference``)
drives this policy too — both expose ``predict_action_chunk`` for the
action expert and ``select_message`` for the LM head.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:59:26 +02:00
Pepijn 1d24301b67 chore(training): STEPS=15000 default + dropout walked back to 0.30/0.30/0.20
After _tool-good (2000 steps, 0.50/0.50/0.20 dropout) the LM head's
distribution at position 0 shifted from EOS to subtask-vocabulary
tokens but emitted bag-of-words ("cube arm and") rather than well-
formed sentences. That's the expected mid-fine-tuning phase: token-
level supervision has landed, sequence-level grammar hasn't.

Two changes for the next retrain:

  * STEPS=15000 (from 2000) — chat-pretrained backbones need O(10k+)
    steps to walk their pretraining priors down far enough to commit
    to the fine-tuned distribution structurally, not just at the
    token level. _tool-g2's bag-of-words output proves the model is
    on the right path; it just needs more gradient signal.

  * plan/memory dropout 0.50 -> 0.30 — 0.50 was probably too
    aggressive for a small dataset. Half the training samples had
    crucial context missing, which slows down learning the full
    conditional structure. 0.30 still regularises against prompt
    leakage but lets the model learn proper grammar first; the
    higher dropout can be revisited once the head is solid.

Subtask dropout stays at 0.20 since subtask isn't in the high-level
prompt anyway (recipe fix removed the "Current subtask:" message).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:46:19 +02:00
Pepijn 3a20ea337e feat(smolvla2-runtime): --text_min_new_tokens / --text_temperature CLI debug knobs
The recipe fix (target=${subtask} instead of ${next_subtask}) shifted
the LM head's failure mode from "emit newlines" to "emit EOS at
position 0". On the new ``_tool-good`` checkpoint inference produces
exactly one token (``<end_of_utterance>``, id 49279) and decodes to
empty. That's the chat-pretrained backbone's short-turn EOS prior
not yet being overridden by 2000 steps of fine-tuning supervision.

Expose three knobs so the operator can probe whether the head has
real subtask-token probability mass *under* the EOS argmax without
recompiling or retraining:

  --text_min_new_tokens=N    suppress EOS for the first N tokens
  --text_temperature=T       sample at temperature T
  --text_top_p=P             nucleus filtering at top-p

These are explicitly off-policy (training was greedy / no min-tokens),
so they shouldn't ship in production runs — but they let us tell
whether the model has *learned* subtask prediction (just under EOS)
or hasn't yet. If forcing min_new_tokens=3 with temperature=0.5
produces a sensible subtask, the model is fine and just needs more
training steps to walk EOS down. If it produces gibberish, training
hasn't progressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:39:33 +02:00
Pepijn b6fb536460 chore(training): bump plan/memory dropout to 0.50 to force vision-grounding
After the recipe fix (target=${subtask} at every frame) the model
can still reach low text_loss by reading the answer off the plan in
the prompt: at training time the prompt contains the 6-step plan, and the
current subtask is one of those steps, so the model just learns
"active step N matches subtask N" and never needs to look at the
image. Symptom at inference: subtask string is set but never updates
because the model isn't really conditioning on the visual progress.

Drop plan and memory with p=0.50 each — half of training frames the
prompt is just "${task}" (constant for this dataset) + visual prefix,
which is the only place the answer can come from. Forces the LM head
to actually use vision.

``subtask_dropout`` stays at 0.20 because subtask isn't in the
high-level prompt anymore (recipe fix removed the "Current subtask:
X" message); the knob still affects other sub-recipes that reference
it as context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:31:00 +02:00
pepijn bfd3bb1791 fix(smolvla2): handle batched sample indices in chat tokenizer
Normalize tensor and sequence sample indices before prompt dropout so distributed batched preprocessing does not try to cast full index tensors to scalars.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 16:56:13 +00:00
Pepijn 4908433f9a chore(training): align smolvla2_hirobot.slurm with what's actually run
Match the operator's current training command for the _tool6 retrain:

  * default DATASET / POLICY_REPO_ID / JOB_NAME point at the tool6
    iteration (super_poulain_full_tool3 → smolvla2_hirobot_super_poulain_tool6)
  * STEPS default 2000 (short enough to iterate; bump to 10k for full)
  * save_freq=$STEPS so the only checkpoint is the final one
  * OUTPUT_DIR includes step count so successive runs don't clobber
  * Drop the wider augmentation envelope I added earlier — back to
    default ColorJitter ranges (brightness ±20% etc) since the
    high_level_subtask recipe fix (current-subtask supervision) is
    expected to fix the LM-head collapse on its own; the augmentation
    is just the standard regulariser, not a load-bearing widener.
  * prompt-dropout fractions stay at the original 0.15 / 0.15 / 0.20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:45:38 +02:00
Pepijn 6ce1f36002 fix(smolvla2): supervise high-level head with *current* subtask at every frame
The high_level_subtask recipe targeted ``nth_next(style=subtask, offset=1)``,
which on the last span of any episode resolves to None. The recipe had no
``if_present`` guard on the target, so the renderer emitted an empty
assistant turn and cross-entropy supervised the model on the chat
template's structural newlines (``\n``). Across the dataset this trained
the LM head's argmax at position 0 to collapse to ``\n`` whenever no
transition was imminent (i.e. most frames). Visible failure mode at
inference: the head emits 40+ newlines + ``<end_of_utterance>`` at every
chunk boundary while the action expert keeps working — confirmed by
running the dry-run on dataset frame 0 with the dataset's own image
and seeing the same ``\n × 44`` collapse.

Switch to the Pi 0.5 / Pi 0.7 supervision pattern: at every frame, the
assistant target is the *current* active subtask span text (via
``${subtask}`` → ``active_at(t, style=subtask)``). Always non-empty,
always scene-grounded, ``if_present: subtask`` skips frames with no
active span instead of emitting a degenerate empty turn.

Runtime callsite update: ``_msgs_for_subtask`` no longer feeds a
"Current subtask: X" user message into the prompt (that would be
circular — we'd be telling the model the answer). Transition
detection moves into the runtime — when the predicted subtask differs
from ``state['current_subtask']``, the existing ``set_if_changed``
path fires ``subtask_change`` and downstream memory updates. Same
event surface, supervision target is now always meaningful.

Requires re-annotating the dataset and retraining for the fix to land
in the checkpoint, but the recipe + runtime change is what enables it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:42:59 +02:00
Pepijn 731576be80 chore(smolvla2-runtime): auto-fire one tick at dry-run startup
Previously the dry-run REPL only ticked on user input (empty Enter
just redrew), so the bisection test "does the LM head produce text on
start_frame=0?" required typing something arbitrary to drive a tick.
Just run ``step_once`` at startup — the obs diagnostic *and* the
subtask gen both fire automatically, the diag row populates, and the
operator can read the result before pressing any key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:34:42 +02:00
Pepijn 47fb8318b1 chore(training): widen augmentation envelope after live-robot diagnostic
The tensor-level comparison between dry-run (dataset frame) and live-
robot inference proved the runtime is bug-free — same shape, dtype,
device, channel order, batch dim, and normalization on both paths.
The remaining variable: front-camera mean brightness was 0.26 live vs
0.39 on the dataset frame, ~33% darker. Training augmentation only
covered ±20% brightness, so the live scene sits just outside the
supervised envelope and the LM head collapses to its dominant prior.

Widen the augmentation knobs for the next retrain:

  * brightness    0.8–1.2  → 0.5–1.6   (covers ~30% darker / 60% lighter)
  * contrast      0.8–1.2  → 0.6–1.5
  * saturation    0.5–1.5  → 0.3–1.7
  * hue          ±0.05    → ±0.10
  * affine        ±5°/±5%  → ±15°/±15% (covers cube placement / camera drift)
  * max_num_transforms 3 → 4

And bump prompt-component dropout (subtask 0.20 → 0.30) so the LM
can't lean on stale memorised plan/memory at inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:25:41 +02:00
Pepijn 53172873e3 chore(smolvla2-runtime): probe obs once at dry-run startup
The dry-run REPL only fires a tick when the user types, so the
``_log_obs_tensors_once`` diagnostic never reached stdout (the
provider was never called). Probe the provider once at startup —
the result is discarded; we only care about the obs log it triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:21:58 +02:00
Pepijn fcdae0ce8e chore(smolvla2-runtime): tensor-level obs print for both inference paths
Helper that prints (once per provider lifetime) every
``observation.*`` tensor the policy is about to see, with its shape,
dtype, device, and per-channel min/max/mean/std. Wired into both the
dry-run dataset path and the live-robot path.

Now we can bisect train/inference mismatch *at the tensor level* —
if the same checkpoint produces coherent text on one path's tensors
and ``\n`` on the other's, and the printed tensor stats differ
materially, the bug is in the observation prep, not in the model or
the training distribution.
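
Roughly what such a helper looks like (one-shot flag, overall rather than
per-channel stats for brevity; the real helper's name and output format
differ):

    import torch

    _logged = False

    def log_obs_tensors_once(obs: dict) -> None:
        global _logged
        if _logged:
            return
        _logged = True
        for key, t in sorted(obs.items()):
            if not key.startswith("observation.") or not torch.is_tensor(t):
                continue
            f = t.float()
            print(f"{key}: shape={tuple(t.shape)} dtype={t.dtype} device={t.device} "
                  f"min={float(f.min()):.3f} max={float(f.max()):.3f} "
                  f"mean={float(f.mean()):.3f} std={float(f.std()):.3f}")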

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:19:18 +02:00
Pepijn 4852b9f952 feat(smolvla2-runtime): --dataset.augment_at_inference for the bisection test
Apply the training-time torchvision-v2 ColorJitter / SharpnessJitter /
RandomAffine pipeline to dataset frames in dry-run, so we can isolate
whether the LM head's collapse to '\n' on live frames is:

  * pure scene-content OOD (unaugmented dataset frames work, mildly
    augmented ones still work — model has learned the augmentation
    distribution, only fails when the scene content itself diverges)
  * hyper-specific memorisation (dry-run with augmentation also
    collapses to '\n' — head is nailed to the exact unperturbed
    training samples and only the retrain helps)

Usage:

  lerobot-smolvla2-runtime --no_robot --policy.path=... \
    --dataset.repo_id=... --dataset.episode=0 \
    --dataset.start_frame=1000 \
    --dataset.augment_at_inference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:14:57 +02:00
Pepijn 0410705aff chore(smolvla2-runtime): print live state vector once at startup
So the operator can compare live joint values to the dataset's
``observation.state`` mean/std and spot when the robot's home pose is
several σ off the supervised support region. State OOD is the
remaining viable hypothesis for why the live LM head collapses to
``\n`` even though images are pixel-shape-matched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:12:27 +02:00
Pepijn 398a8cf730 chore(smolvla2-runtime): log first-tick resize so train/inference match is verifiable
Print one warning the first time the robot observation provider runs,
showing live camera resolution and the dataset's training
resolution, plus whether we resized. Lets the operator confirm at a
glance that the visual prefix really is being fed at the same shape
the model saw at training — instead of guessing whether the resize
fired silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:06:00 +02:00
Pepijn ab5c1dc392 fix(smolvla2-runtime): match training visual distribution on robot frames
Root cause for the LM head's empty-completion symptom on the live robot
(while the same checkpoint produced sensible subtask/plan/memory in
``--no_robot`` dry-run on dataset frames): the camera observation was
flowing into the model at its native resolution. A Mac/USB webcam
hands us 1280×720 or 1920×1080; the dataset was recorded at the
feature schema's ``observation.images.*['shape']`` resolution
(typically 480×640). SmolVLA's internal ``resize_with_pad(512, 512)``
*does* fit both — but with very different pad geometry, so visual
tokens at each tile carry different content than at training. The
action expert tolerates this; the tightly-supervised LM head goes OOD and
the head's distribution at position 0 collapses to its dominant mode
(``\n`` ×N then ``<end_of_utterance>`` for this checkpoint).

The fix: in ``_build_robot_observation_provider``, pre-compute the
camera-key → (H, W) target from ``ds_features`` and ``cv2.resize``
each live frame to that shape before tensorising. The downstream
``resize_with_pad`` then sees the same input geometry as training and
the LM head returns to producing readable subtask text under plain
greedy decoding — the same as dry-run.

Also drops the inference-time patches (``min_new_tokens``,
``temperature``, ``top_p`` overrides) on the four high-level callers.
They were band-aids around the visual-distribution shift, not a real
LM problem, and they drift inference off the training distribution.
Greedy argmax is what training matched. The ``select_message``
signature still accepts the knobs for callers that want them.
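
The resize step in sketch form (function names and the feature-schema
layout are illustrative):

  import cv2
  import numpy as np

  def camera_targets(ds_features: dict) -> dict:
      # Map each camera key to the (H, W) the dataset was recorded at.
      targets = {}
      for key, feat in ds_features.items():
          if key.startswith("observation.images."):
              h, w = feat["shape"][:2]            # assumes (H, W, C) shape entries
              targets[key.removeprefix("observation.images.")] = (h, w)
      return targets

  def resize_to_training(frame: np.ndarray, target_hw: tuple) -> np.ndarray:
      h, w = target_hw
      if frame.shape[:2] == (h, w):
          return frame
      return cv2.resize(frame, (w, h), interpolation=cv2.INTER_AREA)  # cv2 takes (W, H)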

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:59:24 +02:00
Pepijn 1292304c42 fix(smolvla2): suppress all special tokens during min_new_tokens window
Previous attempt only masked the tokenizer's eos_token_id during the
min_new_tokens prefix. The empty-completion symptom persisted because a
memorised SmolVLM head doesn't just want EOS — its top-1 at position 0
is *some* special token, and when EOS is masked the argmax shifts to a
sibling (``<|im_end|>``, ``<image>``, ``<fake_token_around_image>``,
``<row_X_col_Y>``, …). Those tokens survive generation but then get
stripped by ``decode(skip_special_tokens=True)``, so the runtime still
saw ``last_raw='(empty)'`` every chunk boundary.

Mask the full ``tokenizer.all_special_ids`` set instead. Forces the
head to commit to a normal vocabulary token before it can close or
quietly poison the turn.

Also: when decode returns empty but tokens *were* generated, expose
the raw token ids and the special-tokens-included decoded string via
``policy._last_select_message_debug``. The runtime surfaces this in
the scrollback so the operator can see what the head is actually
emitting — distinguishing "head EOS-ing" from "head emitting image
placeholders" from "head emitting chat-template fragments".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:49:53 +02:00
Pepijn b95eebff77 fix(smolvla2): force min_new_tokens + sampling so memorised LM emits something
Real-robot run confirmed the LM head is producing 0 tokens at every
chunk boundary (empty:N counter climbing, no exception in scrollback):
the model EOS-es at decode step 0. That's the memorisation collapse —
training reached text_loss=6e-6 by overfitting one trajectory whose
supervised subtask turn ended in EOS, and at inference the head's
argmax for token 0 is EOS regardless of the actual frame.

Two changes in select_message:

  * ``min_new_tokens`` parameter masks the EOS logit to -inf until at
    least N real tokens have been decoded. Without this the head's
    "EOS first" prior produces an empty completion every single time.

  * The runtime callers now pass ``min_new_tokens=5..10`` plus
    ``temperature=0.4..0.5`` + ``top_p=0.9``. Sampling at moderate
    temperature with nucleus filtering also helps break the greedy
    argmax collapse — when the model has memorised one continuation,
    greedy keeps replaying it; nucleus sampling forces it to commit
    to *some* coherent continuation that's well-supported by the
    prefix even when greedy's top-1 is degenerate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:48:08 +02:00
Pepijn fbcac95662 feat(smolvla2-runtime): scrollback in autonomous panel + empty-gen counter
Two improvements for diagnosing why ``last_raw`` stays empty:

1. The autonomous panel-redraw thread calls console.clear() every
   0.5 s, wiping any log lines the runtime printed since the last
   redraw. So warnings from generation (``[warn] subtask gen failed:
   ...``, ``[info] subtask gen rejected (gibberish): ...``) flashed
   for milliseconds and disappeared, leaving the operator blind.

   Capture log_lines from each tick into a bounded scrollback
   (last 12 entries) and render them inside the panel itself, below
   the diag row. They now stick across redraws until rotated out.

2. ``empty`` counter for subtask gen. Persistent empty completions
   are their own failure mode — the LM head EOS-es immediately from
   the chat-template generation prompt, distinct from "generated
   something but filter rejected it". The diag row now reads:

     subtask diag    repeat:0  gibberish:0  empty:14  last_raw: '(empty)'
                                            ^^^^^^^
   plus a periodic log line every 10 empties so the cause is also
   surfaced in the scrollback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:42:13 +02:00
Pepijn b9db4d21a2 fix(smolvla2): high-level steps must run before LowLevelForward refills
Both HighLevelSubtaskFwd and LowLevelForward are gated on
'action queue is empty'. With LowLevelForward listed first, it refilled
the queue on the empty-queue tick before HighLevelSubtaskFwd got to
check — so the gate I added in the previous commit made the high-level
step a permanent no-op after the initial bootstrap. Visible symptom:
subtask string never advances past whatever bootstrap seeded, no
subtask_change events, memory stays unset, and the new overfit
diagnostics never appear on the panel because last_subtask_raw is
never written.

Move all high-level steps (subtask, memory, interjection, vqa) ahead
of LowLevelForward. On an empty-queue tick the subtask refreshes
first, the new string flows into the next chunk's prompt, then
LowLevelForward generates the chunk, then DispatchAction drains it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:38:06 +02:00
Pepijn aecb80a9d2 feat(smolvla2-runtime): overfit/memorisation diagnostics on the panel
The autonomous-mode panel now surfaces what the model is *actually*
producing at every chunk boundary, not just what got accepted:

  * last_subtask_raw       most recent generation (accepted or not)
  * subtask_repeat_count   times the same accepted string regenerated
  * subtask_gibberish_count rejections by the gibberish filter
  * memory_gibberish_count / plan_gibberish_count for the other heads

These let the operator see memorisation collapse without scrolling
back through logs:

  subtask diag    repeat:8  gibberish:0  last_raw: '<same string>'
                  ^^^^^^^^^^ → model can't move past current phase

  subtask diag    repeat:0  gibberish:14  last_raw: 'Ass:::'
                  ^^^^^^^^^^^^^^^^^^^^^^ → LM collapsed to template salad

Also silences the per-action ``Relative goal position magnitude had
to be clamped`` warning. The clamp fires every dispatch tick when the
model emits stale joint targets, flooding the panel at ctrl_hz=30.
Replaced the bare ``logging.warning`` call in robots/utils.py with a
module logger so it can be selectively raised to ERROR. Operators
who need the per-tick clamp detail can use ``-v``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:31:04 +02:00
Pepijn c98c695127 feat(smolvla2-runtime): 'rephrase:' prefix to swap task string in place
Adds a third stdin channel alongside 'task:' and bare interjections:

  rephrase: <text>

Swaps state['task'] with the new string while preserving plan/memory/
subtask. Lets the operator probe how robust the model is to wording
variations of the same task — the training-time task augmentation
provided n_task_rephrasings≈30 wordings per dataset task, and this is the
direct way to exercise that distribution at inference without
generating a fresh plan via user_interjection_response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:26:59 +02:00
Pepijn d528078aca fix(smolvla2-runtime): allow task switching mid-run via 'task:' prefix
Both stdin handlers (autonomous mode and rich REPL) gated 'task:' to
'only if no task is set yet' — once the initial task existed, typing
'task: <new task>' silently fell through to the interjection branch.
Make 'task:' always override the active task and clear stale
plan/memory/subtask so the next high-level pass regenerates context
from scratch for the new task.

For rephrasings within the same task, the interjection path
(user_interjection_response recipe) is still the right channel — it
refreshes the plan and emits a paired <say> in one trained call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:24:16 +02:00
Pepijn a648da0455 fix(smolvla2): unblock action dispatch when high-level LLM stalls loop
The runtime is single-threaded. `HighLevelSubtaskFwd` at HzTrigger(1.0)
fires every loop iteration on MPS because each `select_message` call
takes ~2 s, longer than its 1/hz period. The whole tick stretches to
~2.5 s, so `DispatchAction` (HzTrigger 30) only pops a single action per
loop iteration — the queue drains at ~0.4 actions/sec instead of 30 and
the robot barely moves between chunk refreshes.

Two changes, both purely about scheduling — no threading:

* Gate `HighLevelSubtaskFwd` to fire only when the action queue is
  empty, matching `LowLevelForward`'s refresh condition. The slow LLM
  call now happens during the "think" phase between chunks, not on
  every dispatch tick. Restores a clean sense → think → act cycle.

* `DispatchAction` catches up via wall-clock: when the trigger fires
  after a stall, pop `round(elapsed * hz)` entries and send only the
  most recent. Open-loop chunks are timestamped at ctrl_hz; sending
  stale joint targets one-by-one would just lag the robot further
  behind. The dynamixel smooths to the latest goal anyway.
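
The catch-up pop, sketched with hypothetical names and a deque-backed
queue:

  import time
  from collections import deque

  def dispatch_catch_up(queue: deque, last_dispatch_t: float, ctrl_hz: float):
      # Pop as many actions as wall-clock time says are due; send only the
      # most recent one, dropping stale intermediate targets.
      now = time.monotonic()
      n_due = max(1, round((now - last_dispatch_t) * ctrl_hz))
      latest = None
      for _ in range(min(n_due, len(queue))):
          latest = queue.popleft()
      return latest, now   # caller sends `latest` (if not None) and stores `now`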

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:23:09 +02:00
Pepijn d866c2c9fd fix(smolvla2): only regenerate chunk when queue is fully drained
The previous refresh threshold (queue > chunk_size // 2) made each
new chunk *telescope* past the previous one: at queue=25, we kicked
off a new chunk forward from the current observation, but by the
time the new chunk's first action was actually dispatched, the
robot had executed the remaining 25 actions of the previous chunk
— so the new chunk was planned from an observation 25+ steps stale.

Canonical sense → think → act loop: execute the full chunk, then
re-observe and replan. Refresh only when the queue is empty. Every
step of every chunk still gets dispatched to the robot (no
behaviour change there), but each chunk is now planned from an
observation that's at most one chunk's worth of dispatch latency
old, not "previous chunk's worth of stale state on top of that".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:15:02 +02:00
Pepijn 01e2228b24 feat(smolvla2): per-component prompt dropout + augmented training script
Two complementary regularisers to attack the
``text_loss=6e-6 = memorised one dataset`` failure mode that's
making the model collapse on real-robot input:

1. **Per-component prompt dropout** (Pi0.7 §V.E / plan's
   ``feat/pi05-prompt-dropout`` follow-up).
   ``SmolVLA2ChatTokenizerStep`` gains
   ``plan_dropout_prob`` / ``memory_dropout_prob`` /
   ``subtask_dropout_prob`` knobs (default 0.0 — opt-in). At training,
   non-target messages whose rendered content starts with
   ``Plan:`` / ``Memory:`` / ``Current subtask:`` etc. are dropped
   with their respective probability before tokenisation, with a
   deterministic per-sample RNG keyed off the dataset ``index``.
   ``target_message_indices`` is re-mapped so the supervision still
   lands on the right turn. Forces the model to handle missing
   plan/memory/subtask context — directly attacks the real-robot
   collapse where a stale or empty plan field puts the prompt OOD.

   Surfaced on ``SmolVLA2Config`` as three floats so they're
   ``--policy.<knob>=<value>``-controllable from the train CLI;
   plumbed through ``make_smolvla2_pre_post_processors``.

2. **Image augmentation** is already wired in lerobot via
   ``--dataset.image_transforms.enable=true`` (torchvision v2
   ColorJitter + SharpnessJitter + RandomAffine, default 3 of 6
   sampled per frame). No code change needed — just a CLI flag.

``examples/training/smolvla2_hirobot.slurm`` shows the full
training command with both enabled. Drop-in replacement for the
ad-hoc SLURM script Pepijn was using locally; same args, plus the
three dropout probs and the image-transforms flag.
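
A simplified sketch of the dropout step, assuming plain-string message
content (the real step works on rendered content blocks and also re-maps
``target_message_indices``, which this omits):

  import random

  PREFIX_TO_PROB = {"Plan:": 0.3, "Memory:": 0.3, "Current subtask:": 0.3}

  def drop_prompt_components(messages: list, sample_index: int) -> list:
      # Deterministic per-sample RNG: the same dataset index always drops
      # the same components, so the draw is reproducible across epochs.
      rng = random.Random(sample_index)
      kept = []
      for msg in messages:
          text = msg.get("content") or ""
          prob = next((p for pre, p in PREFIX_TO_PROB.items()
                       if isinstance(text, str) and text.startswith(pre)), 0.0)
          if prob and rng.random() < prob:
              continue                     # drop this component for this sample
          kept.append(msg)
      return kept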

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:52:32 +02:00
Pepijn c36de3a3e8 fix(smolvla2): enqueue full chunk via predict_action_chunk
``LowLevelForward`` was calling ``select_action()`` once per
``chunk_hz`` tick. SmolVLA's ``select_action`` is a thin queue-pop:
it returns one action per call and only re-runs the expensive
flow-matching forward when its private internal queue empties.
Result: we got one action back per chunk_hz tick (1Hz default),
``DispatchAction`` at ctrl_hz=30 popped it instantly, then queue
sat empty for ~1s waiting for the next tick. Net throughput was
1 dispatched action/sec instead of the 30 we wanted.

Switch to ``predict_action_chunk`` and enqueue every step of the
returned ``(batch, n_action_steps, action_dim)`` chunk. Refresh
only when the queue is below half a chunk so we don't burn one
flow-matching forward per chunk_hz tick — saves ~5x inference cost
on this hot path. At ctrl_hz=30, chunk_size=50, the queue drains
in ~1.7s before the next refresh, giving smooth dispatch at the
control rate the robot was trained on.

Side effect: ``state['last_chunk_size']`` records how many actions
the most recent chunk produced — useful for the panel later if we
want to surface "chunks generated" alongside "dispatched".
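
The refill logic in sketch form (hypothetical names; a later commit
tightens the threshold to "refresh only when the queue is fully
drained"):

  def refill_if_needed(policy, obs, queue, chunk_size: int) -> None:
      # One expensive flow-matching forward refills the whole queue; at
      # ctrl_hz=30 the dispatcher then drains it for ~chunk_size/30 s.
      if len(queue) > chunk_size // 2:
          return
      chunk = policy.predict_action_chunk(obs)   # (batch, n_action_steps, action_dim)
      for step in chunk[0]:                      # batch of 1 at inference
          queue.append(step)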

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:27:23 +02:00
Pepijn cbfaf2c544 feat(smolvla2): action-dispatch counter + tighter gibberish filter
Real-robot run was unreadable for two reasons:

1. The panel surfaced ``queued actions: 0`` (always zero — dispatch
   pops faster than chunk_hz generates) and gave no signal that
   actions were actually reaching the robot. The only sign of life
   was the safety-clamp warning lines scrolling past.

2. The text head consistently collapses to ``the`` / ``Ass``
   fragments on real-camera input (memorisation wall). The old
   gibberish filter caught ``":":":"`` JSON salad but let
   single-token fragments through, and the ``[info] subtask gen
   produced no text this tick`` line flooded the panel every second.

Changes:

  * ``DispatchAction`` bumps ``state["actions_dispatched"]`` each
    tick; panel renders it next to queue depth. Operator can see
    the policy IS issuing actions even when text is broken.
  * ``_looks_like_gibberish`` now also rejects:
    - too few unique alphabetic tokens (``the``, ``the the``, ...)
    - chat-template marker leakage (``Assistant:``, ``Ass\\n::``)
    catching the actual failure mode on real-robot frames.
  * Gibberish rejections log only the first occurrence + every 30th
    after that, with a count, so the panel stays legible.
  * Empty completions no longer log at all (was every tick).
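
A simplified sketch of the tightened heuristic (details differ from the
in-tree filter):

  import re

  def looks_like_gibberish(text: str) -> bool:
      stripped = text.strip()
      if not stripped:
          return True
      if sum(c.isalpha() for c in stripped) / len(stripped) < 0.5:
          return True                            # ":":":" JSON salad
      if re.search(r"(.)\1{4,}", stripped):
          return True                            # long single-character runs
      if stripped[0] in '{["':
          return True                            # JSON-shaped output
      if len(set(re.findall(r"[a-z]+", stripped.lower()))) < 3:
          return True                            # "the", "the the", ...
      if "assistant" in stripped.lower():
          return True                            # chat-template marker leakage
      return False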

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:22:36 +02:00
Pepijn d0278ea093 feat(smolvla2): render state panel in autonomous mode too
Dry-run REPL had a clean ANSI-clear-+-rich-panel layout via
``_redraw`` showing task / subtask / plan / memory / queued-actions /
pending-tool-calls; autonomous mode just had bare ``> `` plus log
lines scrolling past the user. Same data, two presentations.

Extract ``_make_state_panel_renderer(runtime, mode_label=...)`` and
use it from both ``_run_repl`` (called per user input) and
``_run_autonomous`` (called both on user input *and* on a 0.5s
background timer so subtask / plan / memory refreshes from the
runtime's own loop become visible without the user typing anything).
Title bar shows ``dry-run`` vs ``autonomous`` so it's obvious which
mode you're in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:16:28 +02:00
Pepijn 15f6b08b0e fix(smolvla2): use canonical _strip_lerobot_blocks for inference msgs
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects:

  * ``list[block]`` → keep text blocks, drop images
  * ``None``        → ``[{type:text, text:""}]``
  * ``str`` / other → ``[{type:text, text:str(content)}]``

Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there's no prior
memory, which then renders as no-content / role-marker-only and the
model hallucinates ``Assistant:`` fragments. Subtask gen got further
because its prompt always has at least the task string.

Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.
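
For reference, the normalisation ``_strip_lerobot_blocks`` performs is
roughly (simplified sketch):

  def normalise_content(content):
      # list[block] -> keep text blocks, drop images; None -> empty text;
      # anything else -> a single stringified text block.
      if isinstance(content, list):
          return [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
      if content is None:
          return [{"type": "text", "text": ""}]
      return [{"type": "text", "text": str(content)}]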

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:07:39 +02:00
Pepijn fc715db4a3 fix(smolvla2): coerce str content to list-of-blocks for chat template
SmolVLM's chat template (and many other multimodal templates) declares
``message['content']`` as a list of typed blocks and iterates it
expecting dicts with a ``'type'`` field:

    {% for line in message['content'] %}
      {% if line['type'] == 'text' %}{{ line['text'] }}
      {% elif line['type'] == 'image' %}{{ '<image>' }}
      {% endif %}
    {% endfor %}

When the caller passes ``content`` as a plain ``str`` (which we did
throughout ``_msgs_for_subtask`` / ``_msgs_for_memory`` etc.), Jinja
silently iterates the string character-by-character. ``'P'['type']``
returns nothing; neither branch fires; *no text tokens get emitted*.
The model receives a prompt containing only role markers
(``User:<end_of_utterance>\nAssistant:``) and predictably continues by
emitting ``Assistant:`` fragments — the gibberish ``subtask: Ass\n::``
on the runtime panel.

Before calling ``apply_chat_template``, walk the messages and rewrite
any string ``content`` into ``[{'type': 'text', 'text': content}]``.
The template's text branch then fires correctly and the model sees
the actual user/assistant text, not just structural tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:01:53 +02:00
Pepijn fe4bd2b6ba fix(smolvla2): pass flat batch dict to preprocessor (no manual wrap)
``PolicyProcessorPipeline.__call__`` already wraps its input via
``to_transition`` (defaulting to ``batch_to_transition``) before
running the steps, and unwraps via ``to_output`` (defaulting to
``transition_to_batch``) afterwards. The input format is therefore a
*flat batch dict* keyed by ``observation.*`` / ``action`` / etc., not
an ``EnvTransition``.

Previous attempt pre-wrapped the observation into a transition with
``TransitionKey.OBSERVATION`` as the key, then handed *that* to the
pipeline — which fed it to ``batch_to_transition``, which looked for
top-level ``observation.*`` entries, found none (they were nested
inside the enum key), and produced an empty observation. Every step
then bailed with ``ObservationProcessorStep requires an observation
in the transition.``

Pass the flat dict from ``build_inference_frame`` straight to the
preprocessor — it does the wrap/unwrap itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:54:48 +02:00
Pepijn 3f7436ff8a fix(smolvla2): use TransitionKey enum (not .value) as transition keys
``EnvTransition`` is declared as a ``TypedDict`` keyed by
``TransitionKey.OBSERVATION.value`` (the string ``'observation'``),
but every concrete ``ProcessorStep`` in the pipeline indexes the
transition with the enum *member* (``transition[TransitionKey.
OBSERVATION]`` / ``transition.get(TransitionKey.OBSERVATION)``).
Those are two different keys in a Python dict — string key vs enum
key — so steps couldn't find the observation we'd placed under the
string variant, and bailed every tick with
``ObservationProcessorStep requires an observation in the
transition``.

Build the transition with the enum members directly. Matches how
``BatchProcessor``, ``RelativeActionProcessor``, ``HilProcessor``,
etc. read the dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:50:22 +02:00
Pepijn 992d13d4e9 fix(smolvla2): use build_inference_frame for raw robot observations
``robot.get_observation()`` on omx_follower (and most lerobot robots)
returns:

  * per-joint scalar floats with ``.pos`` suffix
    (``shoulder_pan.pos: 0.123``, ``shoulder_lift.pos: 0.456``, ...)
  * per-camera ndarrays keyed by the camera config name (``wrist:
    ndarray(H,W,3)``)

But the trained policy expects:

  * single ``observation.state: tensor[N_joints]`` vector
  * image keys prefixed: ``observation.images.<cam_key>:
    tensor[1, 3, H, W]``

``prepare_observation_for_inference`` only handles the tensor /
batch-dim / device step — it crashes on scalar floats with
``expected np.ndarray (got float)``. The right helper is
``build_inference_frame`` which uses the dataset's feature schema
(``ds_meta.features``) to:

  1. extract the right raw keys per dataset feature,
  2. fold ``shoulder_pan.pos`` / ``shoulder_lift.pos`` / ...
     into a single ``observation.state`` ndarray,
  3. prefix camera keys with ``observation.images.``,
  4. delegate to ``prepare_observation_for_inference`` for the
     tensor / batch / device step.

Pass ``ds_meta.features`` into the observation provider and switch
to ``build_inference_frame`` when available; fall back to the bare
``prepare_observation_for_inference`` only when no dataset is
provided (rare — autonomous mode already requires it).
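
The folding step in sketch form (illustrative; the real helper derives
joint and camera keys from ``ds_meta.features`` rather than a
hand-written list):

  import numpy as np

  def fold_raw_observation(raw: dict, joint_names: list) -> dict:
      out = {
          "observation.state": np.array(
              [raw[f"{name}.pos"] for name in joint_names], dtype=np.float32
          )
      }
      for key, value in raw.items():
          if isinstance(value, np.ndarray) and value.ndim == 3:   # (H, W, C) frame
              out[f"observation.images.{key}"] = value
      return out
  # prepare_observation_for_inference then does the tensor / batch / device step.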

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:47:59 +02:00
Pepijn afe40a016b fix(smolvla2): wrap robot obs in EnvTransition before preprocessor
The policy preprocessor pipeline is transition-shaped — its steps
read ``TransitionKey.OBSERVATION`` off an ``EnvTransition`` dict, not
a flat ``RobotObservation`` dict. Passing the raw observation through
made every step bail with
``ObservationProcessorStep requires an observation in the transition``,
which the runtime swallowed at warning level. ``select_message`` then
got called with no ``observation.images.*`` features and crashed
with ``All image features are missing from the batch``.

Mirror ``lerobot-record``'s preamble:
  1. ``prepare_observation_for_inference`` → numpy → torch, ``CHW``
     image layout, ``[0,1]`` scaling, add batch dim, move to device.
  2. Wrap into an ``EnvTransition`` (``{TransitionKey.OBSERVATION.value:
     ...}`` plus ``COMPLEMENTARY_DATA: {}`` and ``None``s for the rest)
     so transition-aware steps see the keys they expect.
  3. Run preprocessor.
  4. Unwrap the transition's ``OBSERVATION`` slot to get the final
     flat dict the policy's ``select_action`` / ``select_message``
     consume.

Image features now reach the policy; the autonomous loop produces
real actions instead of swallowing warnings every tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:44:24 +02:00
Pepijn 41095e3cc3 fix(smolvla2): instantiate CameraConfig subclasses from JSON dicts
``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but
``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner
value must be the actual ``CameraConfig`` subclass instance for the
chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts
blew up in ``RobotConfig.__post_init__`` with
``AttributeError: 'dict' object has no attribute 'width'`` when it
iterated cameras and tried to read attributes.

Look up the right subclass per-camera by its ``"type"`` field via
``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import
dance we already do for ``RobotConfig``: eagerly walk
``lerobot.cameras``'s submodules so the registry is populated
before lookup). Construct an instance with the rest of the dict's
fields. On an unknown camera type, raise a clean ``ValueError``
listing the available choices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:39:28 +02:00
Pepijn e0fa957569 fix(smolvla2): eagerly import robot submodules before get_choice_class
``RobotConfig._choice_registry`` is populated as a side-effect of
each robot's ``@RobotConfig.register_subclass`` decorator running,
and those decorators only fire when the corresponding
``lerobot.robots.<name>`` module is imported. The package's
``__init__.py`` doesn't import them — instead ``make_robot_from_config``
does it lazily in its big if/elif chain.

``_build_robot`` jumped the gun: called ``RobotConfig.get_choice_class
(robot_type)`` before any robot module had been imported, so the
registry was empty and every ``--robot.type=<X>`` produced
``KeyError: 'X'`` (e.g. ``KeyError: 'omx_follower'``).

Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules`` and
``importlib.import_module`` each one before the lookup. ~200ms on the
first invocation, negligible for an autonomous run. On a real
``KeyError`` (typo / unsupported robot), raise a clean ``ValueError``
listing the registry's available choices instead of a bare KeyError.
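
The eager-import dance in sketch form:

  import importlib
  import pkgutil

  import lerobot.robots

  # Import every lerobot.robots.<name> submodule so each
  # @RobotConfig.register_subclass decorator has run before the registry
  # lookup; roughly a 200 ms one-off cost on first invocation.
  for module_info in pkgutil.iter_modules(lerobot.robots.__path__):
      importlib.import_module(f"lerobot.robots.{module_info.name}")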

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:31:58 +02:00
Pepijn c661d81409 fix(smolvla2): use RobotConfig.max_relative_target, drop --max_action_norm
The hand-rolled action-norm safety clip duplicated what every
``RobotConfig`` already exposes — ``max_relative_target`` — and at
the wrong layer (after postprocess but before send_action, instead
of inside the robot driver where every other lerobot entry point
puts it). The norm clip also rejected entire actions instead of
clipping per-motor relative motion, so a single rogue joint would
kill the whole tick.

Replace with ``--robot.max_relative_target``: a string parsed as
either a bare float (uniform per-motor cap) or a JSON object
mapping motor name → cap. Passed through to
``RobotConfig(max_relative_target=...)`` at robot construction;
the driver's ``send_action`` clips each commanded joint position
relative to the current measured one before issuing it on the bus —
same behaviour ``lerobot-record`` ships.
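
The flag parsing, sketched with illustrative names:

  import json

  def parse_max_relative_target(raw: str):
      # A bare float is a uniform per-motor cap; otherwise expect a JSON
      # object mapping motor name -> cap.
      try:
          return float(raw)
      except ValueError:
          value = json.loads(raw)
          if not isinstance(value, dict):
              raise ValueError("expected a float or a JSON object of per-motor caps")
          return {motor: float(cap) for motor, cap in value.items()}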

Also bump ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new
chunk per second is what the trained checkpoint can comfortably
keep up with on common hardware and gives smoother motion than
sub-second chunk regenerations (no RTC interpolation between
chunks yet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:41:57 +02:00
Pepijn 33a4b4a5a0 feat(smolvla2): autonomous robot mode in lerobot-smolvla2-runtime
The runtime CLI was deliberately scoped to dry-run only: it
hard-coded ``robot_executor=None`` and printed a "real-robot
integration is a follow-up" warning even when ``--no_robot`` was
omitted. The runtime *engine* was already structured for real-robot
operation (separate ``LowLevelForward`` chunk-rate generation +
``DispatchAction`` ctrl-rate dispatch with a ``robot_executor``
hook); only the wiring was missing.

Add the wiring:

  * ``_load_policy_and_preprocessor`` now also returns the
    postprocessor (action denormaliser).
  * ``--robot.type`` / ``--robot.port`` / ``--robot.id`` /
    ``--robot.cameras`` (JSON) build a ``Robot`` via
    ``make_robot_from_config`` and connect it.
  * ``_build_robot_observation_provider`` reads
    ``robot.get_observation()`` each call, drops the language
    columns (runtime drives messages itself), and runs the policy's
    preprocessor (rename → batch → device → normalise).
  * ``_build_robot_action_executor`` postprocesses the policy's
    action tensor (denormalise), converts to the ``{joint: value}``
    dict via ``make_robot_action(action, ds_meta.features)``, and
    calls ``robot.send_action(...)``. Optional ``--max_action_norm``
    safety clip rejects ticks whose action L2 norm exceeds the
    threshold (kill-switch when bringing up a new robot).
  * ``_run_autonomous`` runs ``runtime.run()`` in a background
    thread (the policy must keep generating chunks at chunk_hz and
    dispatching at ctrl_hz regardless of stdin) and handles user
    interjections / VQA queries from the foreground stdin loop.
    Confirmation prompt before start (skip with ``--auto_start``);
    Ctrl+C stops the thread and disconnects the robot cleanly.
  * Autonomous mode requires ``--dataset.repo_id`` for action stats
    / feature shapes — pass the same dataset the policy was trained
    on. The bootstrap path that pulls canonical task / plan / memory
    runs in both REPL and autonomous modes so the model's first
    prompt matches training distribution.

Dry-run REPL behaviour is unchanged when ``--robot.type`` is not
passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 18:30:56 +02:00
Pepijn a764c3e1d6 fix(datasets,annotate): tag pushed dataset + clean revision error
Two bugs combining to make the brand-new ``_tool3`` dataset
unloadable:

1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated
   dataset folder but never creates a codebase-version tag, so
   ``api/datasets/<repo>/refs`` returns ``"tags": []``. Then
   ``LeRobotDatasetMetadata`` → ``get_safe_version`` →
   ``get_repo_versions`` returns empty and the loader raises
   ``RevisionNotFoundError``.

2. ``RevisionNotFoundError`` itself was unconstructible: its
   ``HfHubHTTPError.__init__`` indexes ``response.headers``
   unconditionally on current ``huggingface_hub`` versions, so
   constructing it without a real ``Response`` blew up with
   ``AttributeError: 'NoneType' object has no attribute 'headers'``,
   masking the real "no tag" message.

Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and
``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset',
exist_ok=True)`` so the dataset is loadable straight from the Hub on
the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree
``CODEBASE_VERSION`` if info.json is missing/malformed; on tag
creation failure, prints the manual one-liner the user needs.
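
Fix #1 in sketch form (paths and the fallback argument are illustrative):

  import json
  from huggingface_hub import HfApi

  def tag_annotated_dataset(repo_id: str, local_root: str, fallback: str) -> None:
      # Read the codebase version the dataset was written with and tag the
      # repo so get_safe_version can resolve it on the next load.
      try:
          with open(f"{local_root}/meta/info.json") as f:
              version = json.load(f)["codebase_version"]
      except (OSError, KeyError, ValueError):
          version = fallback           # stands in for the in-tree CODEBASE_VERSION
      HfApi().create_tag(repo_id, tag=version, repo_type="dataset", exist_ok=True)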

Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which
inherits HfHubHTTPError) for what is really a config issue, not an
HTTP failure. Raise plain ``RuntimeError`` with the same message —
the caller actually sees what's wrong instead of an upstream
attribute error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:23:18 +02:00
Pepijn b416f287f2 fix(datasets): raise readable error when repo has no version tags
``RevisionNotFoundError`` inherits from
``huggingface_hub.HfHubHTTPError`` which made ``response`` a required
keyword-only argument on recent versions. Constructing it with just a
message string blew up with
``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only
argument: 'response'`` instead of surfacing the actual problem (the
dataset/checkpoint repo doesn't exist on the Hub yet).

Pass ``response=None`` explicitly. Fall back to the bare-message form
for older ``huggingface_hub`` versions that don't accept the kwarg.
Also clarify the message to call out the most common cause: typing a
hub repo id that hasn't been pushed yet (instead of just "needs a
version tag").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:12:40 +02:00
Pepijn aa749d4947 chore(annotate): throttle Module 3 + executor parallelism to fix vLLM stall
Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and
``executor.episode_parallelism=32``. With 2 cameras per dataset that
produced ~12× the original VQA call volume, all submitted concurrently.
Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's
KV cache pegged at 94% with 800+ in-flight requests, and the
multimodal cache corrupted with ``AssertionError: Expected a cached
item for mm_hash='...'`` (a known vLLM bug under image-heavy
concurrency). Modules 1 and 2 ran fine; Module 3 was the bottleneck.

Pull back the multipliers to land in a sustainable spot:

  * module_3.K: 3 (kept) — three diverse questions per emission,
    where the diversity actually helps the LM head.
  * module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original
    emission rate. Net VQA volume is now ~3× original (K alone) on
    a single camera, ~6× across both cameras — manageable.
  * module_2.max_interjections_per_episode: 9 → 6 — still 2× the
    default, fewer than the prior 3× to keep total request volume
    in check.
  * vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the
    multimodal request path so the mm_cache doesn't desync.
  * executor.episode_parallelism: 32 → 16 — half the episodes
    in flight at once, so peak vLLM load is ~half.

n_task_rephrasings stays at 30 (text-only, doesn't load the image
path) and vlm.temperature stays at 0.7. The diversity gains are
preserved; only the throughput knobs come down.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:07:18 +02:00
Pepijn 1394a6ab5d chore(annotate): bump diversity knobs ~3x to fight memorisation
Following Pi0.7 §V (prompt expansion / diverse context conditioning),
push more atom variants per episode and higher VLM sampling
temperature so the training distribution has enough wording diversity
that the LM head is forced to use its parameters rather than memorise
specific (prompt, target) pairs.

Changes vs prior annotation pass:

  * vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call
    now produces diverse phrasings; same prompt yields different
    completions across emissions.
  * module_1.n_task_rephrasings: 10 → 30 — three times as many
    ``task_aug`` rows in language_persistent. ``${task}`` already
    rotates through them deterministically per sample_idx (see
    ``_resolve_task`` in language_render.py).
  * module_2.max_interjections_per_episode: 3 (default) → 9 — more
    ``user_interjection_response`` training samples + more plan
    refresh events.
  * module_3.K: 1 → 3 — three VQA pairs per emission tick instead of
    one. Combined with the hz bump below, ~6× more VQA samples.
  * module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission
    rate within each subtask span.

Pushes to a new hub repo (``_tool3``) so the working ``_tool2``
dataset stays intact for comparison. ``${task}`` already wired to
rotate through ``task_aug`` rows, so no renderer change needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:32:05 +02:00
Pepijn db9118f16f fix(smolvla2): reject gibberish high-level generations
Memorised models can collapse to dominant-mode outputs (the
JSON-token salad ``":":":":...`` from VQA training) when the prompt
drifts even slightly from training distribution. Without a guard,
that gibberish lands in ``current_subtask`` / ``current_plan`` /
``current_memory``, which feeds the next tick's prompt and cascades
into worse outputs. The user observed exactly this: a clean run
followed by a tick that wrote ``" " "`` into plan and memory, then
slow recovery several ticks later.

Add ``_looks_like_gibberish`` heuristic (alpha density, repeating
chars, JSON-prefix sniff) and apply it before mutating state in
``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``.
Bad generations are logged inline (``[info] subtask gen rejected
(gibberish): "":":":..."``) so the user can see what was dropped, but
the state stays at its last-known-good value (typically the dataset
bootstrap) instead of being polluted.

VQA path is intentionally exempt — its training targets *are*
JSON-shaped, so the heuristic would false-positive on them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:07:25 +02:00
Pepijn 7a945d7bdc fix(smolvla2): bootstrap canonical task + plan/memory from dataset
The user-typed task and the dataset's canonical task differ in
wording (capitalisation, ``green box`` vs ``green bin``, etc.). With
``text_loss`` driven down to ~6e-6 across 78 epochs the model is
memorised on the *exact* rendered training prompts: any wording drift
puts the prompt out of distribution and the model collapses to its
dominant training mode (VQA JSON output).

When ``--dataset.repo_id`` is set, automatically:
  * read the canonical task string from the chosen episode (and use
    it as ``--task`` when the user didn't pass one);
  * pull the active ``plan`` / ``memory`` / ``subtask`` rows from the
    persistent slice (latest row whose timestamp ≤ start frame's
    timestamp — same semantics as the renderer's ``active_at``) and
    seed them into the runtime state.

The first prompt the runtime builds at REPL start now mirrors what
the recipe rendered during training (task + active plan + active
memory + optional current subtask). The user can still override any
of these by typing.

Memorisation itself is upstream (training mix collapsed to too few
unique high-level targets); this commit only fixes the inference-side
prompt mismatch that was making the memorisation surface as gibberish.
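
The ``active_at`` lookup, sketched (the row schema here is illustrative):

  def active_at(rows: list, start_ts: float, kind: str):
      # Latest persistent row of this kind whose timestamp is <= the start
      # frame's timestamp; same semantics the renderer uses in training.
      candidates = [r for r in rows if r["kind"] == kind and r["timestamp"] <= start_ts]
      if not candidates:
          return None
      return max(candidates, key=lambda r: r["timestamp"])["text"]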

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:00:36 +02:00
Pepijn a47e535b02 fix(smolvla2): per-recipe inference prompts to match training shape
The four high-level steps shared one generic
``_control_context_messages`` that jammed task + plan + memory +
completed_subtask into a single user message. The recipes in
``smolvla2_hirobot.yaml`` each have a *specific* multi-message layout
(``memory_update``: ``user(task) → assistant(prev memory) →
user(completed subtask)``; ``high_level_subtask``: ``user(task+plan+
memory) → user(current subtask)``; ``user_interjection_response``:
``user(task) → assistant(prev plan) → user(interjection)``). After
``apply_chat_template`` those layouts produce different prompts than
the runtime's flattened single-user-turn version, and the model fell
back to its dominant training mode (VQA JSON output) — generating
``":":":":":":...`` repetition.

Add four per-recipe prompt builders (``_msgs_for_subtask``,
``_msgs_for_memory``, ``_msgs_for_interjection``, ``_msgs_for_vqa``),
each mirroring its sub-recipe's exact message structure including
the ``if_present`` skips. Wire each high-level step to its matching
builder. Inference prompts now line up with what the model saw in
training, so generation should produce coherent text instead of
repeated tokens.

Generic ``_control_context_messages`` is kept (still used by tests
and the no-recipe fallback path).
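
One builder, sketched (message wording is illustrative; the real builders
mirror the recipe YAML exactly, including its ``if_present`` skips):

  def _msgs_for_memory(task: str, prev_memory, completed_subtask: str) -> list:
      # memory_update layout: user(task) -> assistant(prev memory) ->
      # user(completed subtask); the assistant turn is skipped when there
      # is no prior memory.
      msgs = [{"role": "user", "content": f"Task: {task}"}]
      if prev_memory:
          msgs.append({"role": "assistant", "content": f"Previous memory: {prev_memory}"})
      msgs.append({"role": "user", "content": f"Completed subtask: {completed_subtask}"})
      return msgs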

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:47:22 +02:00
Pepijn 6d9b431b54 fix(smolvla2): match training's text-loss forward in select_message
Previous rewrite drove generation through ``vlm.generate()`` (the
standard SmolVLM path), which ignores SmolVLA's custom ``embed_prefix``
that interleaves images + lang + state. Result: the model received a
prompt format it had never been trained on at inference and emitted
JSON-fragment gibberish (``" " " ,",","`` ``cube lift {"...``).

Revert to the cumulative-buffer AR loop driven through
``vlm_with_expert.forward`` — the *same* forward call ``_compute_text_loss``
makes during training (``inputs_embeds=[prefix_embs, None],
use_cache=False, fill_kv_cache=True``). With ``fill_kv_cache=True``,
every layer routes through ``forward_attn_layer``, which gracefully
skips ``None`` expert inputs (``if hidden_states is None or layer is
None: continue``); cross-attention layers — which would otherwise hard-
require a non-None expert input — are bypassed entirely.

Inference now sees the same prefix structure as training: images +
lang + state, with new tokens appended to the lang region. The text
distribution matches what the model was trained to produce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:42:15 +02:00
Pepijn 347e706326 fix(smolvla2): drop pixel_values from select_message generate path
SmolVLA's image preprocessor sizes frames to whatever the action
expert was trained on, but SmolVLM's standard vision tower expects
its own default tile grid (e.g. 384/14 → 27×27 patches). The
mismatch surfaces deep in the post-vision reshape as
``RuntimeError: shape '[2, 34, 34, 768]' is invalid for input of
size 1843200`` — the model has 1200 patches but expects 34×34=1156.

Drop ``pixel_values`` from ``vlm.generate(...)`` so SmolVLM runs as
a text-only LM at REPL time. The high-level branches (subtask /
plan / memory) are dominated by their text context anyway, so this
is acceptable for dry-run inference. VQA loses its image grounding
— a known limitation accepted for the dry-run path until a
follow-up either re-processes images through SmolVLM's own
``ImageProcessor`` to match its tile grid, or gives
``vlm_with_expert`` a real AR text decode mode that handles state
and image embeddings the way training does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:36:53 +02:00
Pepijn fa8ae1e89b fix(smolvla2): drive select_message through SmolVLM.generate
The hand-rolled AR loop in ``select_message`` was fighting the
underlying ``vlm_with_expert.forward`` design, which assumes the
"prefix-once + suffix-always-via-expert" pattern that ``denoise_step``
uses for action chunks. Cross-attn layers (every other layer with
``attention_mode='cross_attn'`` + ``self_attn_every_n_layers=2``)
hard-require an expert input on every call: passing
``inputs_embeds=[current_embs, None]`` crashed at
``expert_layer.input_layernorm(None)`` with ``'NoneType' object has
no attribute 'dtype'``. Earlier KV-cache attempts ran into the
matching ``[15, 139] vs [15, 1]`` shape mismatch because the cache
gets *overwritten*, not appended, on each ``fill_kv_cache=True`` call
— there's just no AR-text-decode mode in this forward.

Stop fighting it: drive AR text generation through the underlying
SmolVLM via ``vlm.generate(input_ids=..., attention_mask=...,
pixel_values=...)``. KV caching, sampling/greedy, EOS handling all
come from HF's standard implementation. Trade-off: ``state`` drops
out of the prefix at inference (no slot for it on the standard
SmolVLM path), so high-level generations may drift from training
distribution slightly. That's acceptable for the dry-run REPL — the
high-level branches (subtask / plan / memory / vqa) are mostly
vision+language conditioned anyway, and the action expert (where
state actually matters) goes through the unchanged ``select_action``
path.

Image features the runtime merged in (``observation.images.*``) are
stacked into the ``[B, num_images, C, H, W]`` shape SmolVLM expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:39:34 +02:00
Pepijn 3ff6c6860e fix(smolvla2): rewrite select_message decode loop without KV cache
SmolVLA's ``vlm_with_expert.forward`` doesn't actually support
incremental KV cache growth — its only ``fill_kv_cache=True`` mode
*overwrites* the cache with the latest call's key/value states, and
its only ``fill_kv_cache=False`` mode concatenates ``cache + new``
into a local ``key_states`` for one matmul without ever updating the
cache itself. The original ``select_message`` decode loop tried to
use ``fill_kv_cache=True`` per step, which clobbered the cache to
1 token after the first decode and threw
``Expected size for first two dimensions of batch2 tensor to be:
[15, 139] but got: [15, 1]`` — the attention mask still expected
139 keys but the cached + new key_states only had 1.

Match the pattern ``denoise_step`` already uses successfully:
maintain a cumulative ``(embs, pad, att)`` buffer that starts as the
prefix and grows by one bool/embedding row per step. Each step
forwards the *full* sequence with ``use_cache=False,
fill_kv_cache=False, past_key_values=None`` so the matmul shapes
always line up. Generated-token rows are tagged ``pad=1, att=1``
which makes them fully causal among themselves while still able to
attend back to the entire prefix (per ``make_att_2d_masks``
semantics: a token can attend to any earlier token whose cumulative
``att`` count is ≤ its own).

Image encoding is still done once via the initial ``embed_prefix``
call — the expensive part doesn't repeat. The remaining cost is
O(n²) text-only transformer forwards, which is fine for the dry-run
REPL's 50–100 token responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:15:28 +02:00
Pepijn fd89efb545 fix(smolvla2): 3D attention mask in select_message decode loop
SmolVLA's ``eager_attention_forward`` does
``masked = torch.where(attention_mask[:, None, :, :], ...)``, which
requires a 3D ``[B, query_len, key_len]`` bool tensor so the
broadcast to 4D works. ``select_message``'s prefix forward got this
right (passes ``prefix_2d`` from ``make_att_2d_masks``), but the
KV-cache decoding loop built ``new_attn = torch.ones((bsize,
cur_pos + 1))`` — 2D — and the very first decode step blew up with
``IndexError: too many indices for tensor of dimension 2``.

During KV-cache decoding ``query_len = 1`` and
``key_len = cur_pos + 1`` (prefix + every token already generated),
so the right shape is ``[B, 1, cur_pos + 1]``. Match the layout
SmolVLA's working ``denoise_step`` uses for the equivalent
``prefix_pad_2d_masks`` build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:08:52 +02:00
Pepijn 2776b57c9e fix(smolvla2): bool attention mask + clean Claude-Code-style REPL
Two issues that combined to make the REPL unusable:

1. ``BatchEncoding.attention_mask`` is a ``Long`` tensor, but SmolVLA's
   ``eager_attention_forward`` does
   ``torch.where(attention_mask[..., None, :, :], ...)`` which
   requires a *bool* condition. Every forward raised ``where expected
   condition to be a boolean tensor, but got a tensor with dtype Long``
   and the diagnostic surfaced it cleanly in the REPL — but generation
   produced nothing useful. Cast to ``bool`` in ``_build_text_batch``
   so the prefix forward goes through.

2. The interactive REPL used ``rich.live.Live`` panels stacked on top
   of ``logging.basicConfig(level=DEBUG)`` HTTP request lines from
   ``httpcore`` / ``httpx`` / ``huggingface_hub``. The two rendering
   loops fought each other in the user's terminal and the output was
   illegible: hundreds of debug lines interleaved with re-rendered
   panels.

   Replace ``Live`` with a simple block redraw — clear screen, print
   the state block, print any robot log lines, then a single ``> ``
   prompt. State changes are visible above the prompt, the way Claude
   Code's REPL renders. No flicker, no re-render races.

   ``_silence_noisy_loggers`` drops the chatty third-party HTTP /
   download / model-init loggers to WARNING. ``-v`` still enables
   DEBUG on the lerobot loggers; if the user needs the HTTP traces,
   they can flip those individually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:03:47 +02:00
Pepijn 0fb5f04965 fix(smolvla2): handle BatchEncoding return from apply_chat_template
``tokenizer.apply_chat_template(..., tokenize=True, return_tensors='pt')``
on newer transformers returns a ``BatchEncoding`` (dict-like) rather
than a raw ``Tensor`` — particularly when the underlying call routes
through a processor. ``_build_text_batch`` only handled the ``Tensor``
and ``list`` shapes, so the encoding object reached SmolVLA's
``embed_language_tokens`` and ``F.embedding`` blew up with
``argument 'indices' must be Tensor, not BatchEncoding`` on every
high-level forward.

Normalise the return:
  * ``BatchEncoding`` / ``dict`` → take ``input_ids`` (and the encoder's
    ``attention_mask`` when present, since ``pad_token_id`` can be
    ``None`` for SmolVLM and the fall-back ``ids != pad_token_id``
    breaks then),
  * ``list[int]`` / ``list[list[int]]`` → wrap in a long tensor,
  * ``Tensor`` → keep as-is.

After unwrapping, ensure shape ``(1, seq)`` and that ``attention_mask``
is a tensor on the same device as ``ids``.
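
The normalisation, sketched (simplified relative to the in-tree version):

  import torch

  def normalise_tokenised(out, pad_token_id=None):
      # Accept BatchEncoding / dict, list, or Tensor; return (ids, mask)
      # with ids shaped (1, seq) and the mask on the same device.
      mask = None
      if hasattr(out, "get") and "input_ids" in out:       # BatchEncoding / dict
          mask = out.get("attention_mask")
          out = out["input_ids"]
      if isinstance(out, list):
          out = torch.as_tensor(out, dtype=torch.long)
      ids = out if out.ndim > 1 else out.unsqueeze(0)
      if mask is None:
          mask = torch.ones_like(ids) if pad_token_id is None else (ids != pad_token_id).long()
      return ids, torch.as_tensor(mask, device=ids.device)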

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:59:57 +02:00
Pepijn 7296ac97af fix(smolvla2): make silent generation failures visible in REPL
Two failure modes were combining to make the runtime "look dead":

1. ``_build_text_batch`` produced lang tokens via
   ``apply_chat_template(return_tensors='pt')`` on CPU, but the policy
   sits on the configured device (mps / cuda). The first prefix-embed
   inside ``select_message`` then raised a device-mismatch on every
   call. The bare ``except Exception`` in ``_generate_with_policy``
   swallowed it at debug level — no logs, no chat output, no visible
   sign anything had run.

2. Even when generation succeeded but returned an empty string
   (greedy EOS, unhappy chat template, etc.), the high-level steps
   silently no-op'd, so users saw nothing.

Move tokens to ``policy.config.device`` in ``_build_text_batch`` so
the prefix forward succeeds in the common case. Bump the swallowing
log level to ``warning`` (with optional traceback under ``-v``), and
when ``state`` is given route the same diagnostic into the REPL log
via ``push_log`` so the user sees ``[warn] subtask gen failed: ...``
inline. Also push an ``[info] ... produced no text this tick`` line
when generation runs but yields nothing, so empty completions are
distinguishable from "step never ran". Apply the same surface to
``LowLevelForward.select_action`` failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:47:34 +02:00
Pepijn 9cbbcfb6a2 fix(smolvla2): tokenize lang prompt inline before select_action
LowLevelForward was handing the observation provider's output straight
to ``policy.select_action``, but SmolVLA's ``_get_action_chunk``
indexes ``batch[OBS_LANGUAGE_TOKENS]`` and crashes with ``KeyError:
'observation.language.tokens'`` when the key isn't there. Our provider
deliberately strips the dataset's language columns (the runtime drives
messages itself), so nothing else was producing those tokens — the
chunk path crashed on the very first tick after task was set.

Build a low-level prompt from current runtime state inline (task /
plan / memory as the user turn, current subtask appended as a
continuation assistant turn when known), tokenize it with the same
helper the high-level steps use, and merge ``lang_tokens`` /
``lang_masks`` into the observation before the call. Skip the step
when no task is set yet, and swallow ``select_action`` exceptions at
debug level so a missing observation feature doesn't kill the REPL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:40:18 +02:00
Pepijn fea41b29f5 fix(datasets): probe parquet for language columns before strict cast
``_load_hf_dataset`` was building the strict cast schema only from
``meta/info.json["features"]``. Datasets annotated by
``lerobot-annotate`` but still tagged at the older codebase version
(no ``language_persistent`` / ``language_events`` entry in
``info.json``) carry both columns in the parquet itself but not in the
features dict, so ``Dataset.from_parquet`` blew up with
``CastError: column names don't match`` when trying to project a
9-column parquet onto a 7-column schema.

Probe one parquet shard's actual schema; if either language column is
present in the parquet but missing from ``features``, graft it on
using PR 1's ``language_persistent_column_feature`` /
``language_events_column_feature`` helpers. No-op when neither column
is present (fully backwards-compatible with v3.0 datasets), no-op when
both are already registered (fully forwards-compatible with future
v3.1 ``info.json`` writes).

This unblocks dry-run inference on PR 2-annotated datasets that
weren't re-tagged to v3.1 — including the ones in the field today.
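
The probe, sketched (the placeholder feature entry stands in for the
real PR 1 column helpers):

  import pyarrow.parquet as pq

  def graft_language_features(shard_path: str, features: dict) -> dict:
      # If a language column exists in the parquet but not in info.json's
      # features dict, register it so the strict cast schema matches.
      schema = pq.read_schema(shard_path)
      for column in ("language_persistent", "language_events"):
          if column in schema.names and column not in features:
              features[column] = {"dtype": "string"}   # placeholder entry
      return features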

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:31:19 +02:00
Pepijn 7b4d281ef5 fix(smolvla2): build preprocessor fresh, don't round-trip the recipe
``PolicyProcessorPipeline.from_pretrained`` reconstructs each saved
step by passing the persisted JSON config back to ``__init__``, but
``RenderMessagesStep.recipe`` (a ``TrainingRecipe``) doesn't survive
the JSON round-trip — the saved entry is ``{}`` and the reconstructor
crashes with ``missing 1 required argument: 'recipe'``.

Bypass the round-trip in the runtime CLI by passing
``pretrained_path=None`` to ``make_pre_post_processors``. That re-runs
``make_smolvla2_pre_post_processors``, which reloads the recipe YAML
referenced by ``cfg.recipe_path`` and wires it back into the step
correctly. ``NormalizerProcessorStep`` still gets stats from
``ds_meta.stats`` so normalization matches training.

Proper fix is to make ``RenderMessagesStep`` serializable (e.g. by
persisting the recipe path / contents); this commit keeps it scoped to
the runtime path so dry-run testing isn't blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:27:12 +02:00
Pepijn 29bb8bb20e fix(tools): unblock pocket-tts resolution (>=1.0.0,<3.0.0)
The previous bound `>=0.1.0,<1.0.0` matched zero published versions —
pocket-tts went straight to 1.0.0 on PyPI, with 0.x never released.
That made `uv sync --extra tools` (and any sync that pulls the `dev` /
`all` superset) fail with "requirements are unsatisfiable" on every
Python version uv tried, including 3.12.

Bump to `>=1.0.0,<3.0.0` so 1.x and 2.x are reachable. SayTool only
touches `TTSModel.load_model()`, `get_state_for_audio_prompt`,
`generate_audio`, and `sample_rate` — small enough surface that 1.x
and 2.x should both work; tighten if a real API break shows up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:15:20 +02:00
Pepijn 3fe686ce9f feat(smolvla2): runtime accepts Hub IDs + dataset-driven dry-run
The runtime CLI's loader was broken — it imported a `make_policy_from_path`
that doesn't exist in `lerobot.policies.factory` — and the high-level text
steps generated plan / subtask / memory / VQA from a text-only batch with
no images or state, so dry-runs drifted from the training distribution.

Switch to the standard `PreTrainedConfig.from_pretrained` +
`make_policy(cfg, ds_meta=...)` flow so `--policy.path` accepts both local
directories and Hub repo ids, and add a `--dataset.repo_id` path that walks
a chosen episode and feeds preprocessed observations into every forward
pass — including the four high-level steps (`HighLevelSubtaskFwd`,
`MemoryUpdateFwd`, `UserInterjectionFwd`, `AskVQAFwd`). Frames are routed
through the saved preprocessor pipeline with `language_persistent` /
`language_events` stripped so the recipe-render step stays a no-op (the
runtime supplies its own messages from current state).

Also wires the rich-based two-zone REPL layout (`ui.py`) that the script
was already importing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:09:19 +02:00
pepijn a1b8134ef1 fix(smolvla2): train on rendered language batches
Keep annotated language columns through collation, render batched recipe samples, and make SmolVLA2 text loss robust enough for distributed training on the steerable dataset.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-05 08:55:56 +00:00
Pepijn 5f7c6ba61d feat(annotate): compact steerable annotation prompts
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-04 15:57:04 +02:00
Pepijn 223cc8a9e2 feat(smolvla2): inference runtime — select_message + multi-rate REPL
Closes the loop on PR 3: SmolVLA2 can now be queried interactively at
inference, dispatching the same five sub-recipe shapes it was trained
on (action chunks, subtask gen, memory updates, plan/speech on
interjection, VQA on questions).

Modeling fixes + additions
--------------------------

- ``_compute_text_loss``: standard next-token CE shift was missing
  (logits at position t were CE'd against the label at t — identity-
  mapped, learning nothing). Adds ``logits[:, :-1]`` /
  ``labels[:, 1:]`` shift to match HuggingFace ``LlamaForCausalLM``.

- New ``select_message`` on ``SmolVLA2Policy``: AR text generation
  with KV caching, mirroring SmolVLA's ``select_action`` pattern.
  Single prefix forward fills the cache, then per-token forwards
  reuse it. Greedy + top-p nucleus sampling. Returns the decoded
  string with the prompt stripped.
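
A minimal sketch of the corrected shift in ``_compute_text_loss`` from the
first bullet above (tensor names illustrative, not the actual modeling code):

    import torch.nn.functional as F

    def shifted_text_ce(logits, labels):
        """Next-token CE: logits at position t are scored against the label at t+1."""
        shift_logits = logits[:, :-1, :]      # (B, T-1, V)
        shift_labels = labels[:, 1:]          # (B, T-1), -100 = ignored
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )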

Runtime package — ``src/lerobot/policies/smolvla2/inference/``
-------------------------------------------------------------

- ``triggers.py`` — ``Trigger`` Protocol + ``HzTrigger`` /
  ``EventTrigger`` + ``TickClock``. The whole runtime ticks at
  ``max_rate_hz=50`` and each step gates itself off its own
  cadence.

- ``runtime_state.py`` — runtime state dict factory plus tiny
  helpers (``take_event``, ``set_if_changed``, ``push_log``).
  Stable keys are documented at the top of the module.

- ``steps.py`` — :class:`InferenceStep` base + concrete steps:
  ``LowLevelForward`` / ``DispatchAction`` (action path),
  ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` /
  ``UserInterjectionFwd`` / ``AskVQAFwd`` (text paths),
  ``DispatchToolCalls`` (tool registry → ``Tool.call``). Each
  text step builds a chat-template prompt from current
  ``RuntimeState`` (task / plan / memory / subtask) matching
  what ``smolvla2_hirobot.yaml`` renders during training.
  Includes a tiny ``<say>...</say>`` parser for the
  ``user_interjection_response`` branch's combined plan + speech
  output.

- ``runtime.py`` — :class:`SmolVLA2Runtime` composes the pipeline,
  drives ticks via ``TickClock``, polls a user-supplied
  ``event_collector`` per tick, and prints state-change log lines.

- ``repl.py`` — :class:`StdinReader` non-blocking line reader
  with simple intent classification: ``stop`` / ``quit`` /
  ``exit`` → terminate; ``?`` suffix → ``user_vqa_query`` event;
  first line → set task; other lines → ``user_interjection``.

CLI
---

- ``src/lerobot/scripts/lerobot_smolvla2_runtime.py``: console
  script ``lerobot-smolvla2-runtime`` that loads a checkpoint,
  optionally instantiates ``SayTool`` (pocket-tts), wires up
  ``SmolVLA2Runtime`` + ``StdinReader``, and runs.

  Real-robot wiring (observation_provider / robot_executor) is
  intentionally left as a follow-up — v1 is dry-run / language-
  only so the REPL works without robot hardware.

  Registered in ``pyproject.toml`` ``[project.scripts]``.

Known follow-ups
----------------

- Real-robot integration: today ``LowLevelForward`` only fires when
  an observation_provider is wired. The CLI prints a warning if
  ``--no_robot`` is omitted.
- ``select_message`` runs an extra prefix forward; could share with
  the action path's prefix when both are needed in the same tick.
- Tests: no end-to-end runtime test yet (would need a tiny SmolVLM
  fixture). The components compile and the public surface is
  exercised by the CLI's argument-parsing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:04:00 +02:00
Pepijn af6d8ebd5b feat(smolvla2): dual-head forward — flow loss + lm_head text loss
The third and final commit of PR 3's SmolVLA2 work. Wires the actual
training signal through:

* ``predict_actions[i] = True``  → sample i contributes to flow loss
* ``text_labels[i, t] != -100``  → token t of sample i contributes to
                                    LM-head cross-entropy

Both routing knobs come from ``SmolVLA2ChatTokenizerStep`` (previous
commit on this branch), which builds them from the recipe's
``message_streams`` / ``target_message_indices``. The per-sample
``predict_actions`` mask preserves the Pi0.5 convention from the
plan's Section I.7: "True iff any low_level target exists".

Implementation:

- ``forward`` reads ``text_labels`` and ``predict_actions`` from the
  batch. When neither is present (vanilla SmolVLA usage with no
  recipe), delegates to ``SmolVLAPolicy.forward`` so unannotated
  datasets keep training as before — full backward compatibility.
- ``flow_loss``: super().forward(reduction="none") returns the
  per-sample (B,) flow loss; we mask non-action samples with the
  ``predict_actions`` bool and renormalize by the count of action
  samples. ``flow_loss_weight = 0`` in the config disables this
  branch entirely (text-only training).
- ``text_loss``: a prefix-only forward through the VLM (no action
  expert / suffix), slicing the lang-token range out of the
  resulting hidden states (``embed_prefix`` orders the prefix as
  ``[image_blocks..., lang, state]`` so the slice is unambiguous).
  Apply ``vlm.lm_head`` to those hidden states, cross-entropy with
  ``text_labels`` (ignore_index=-100). ``text_loss_weight = 0``
  disables this branch (reverts to flow-only behaviour, matching
  SmolVLA exactly).
- The two losses are summed with the config-supplied weights.

Mixed-stream samples (one batch containing both action targets and
text-only sub-recipes) are handled correctly: each sample contributes
where its labels are valid and is masked elsewhere.
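
A minimal sketch of how the two weighted losses combine (names illustrative,
not the actual forward):

    def dual_head_loss(flow_per_sample, predict_actions, text_ce, cfg):
        """flow_per_sample: (B,) from super().forward(reduction="none");
        predict_actions: (B,) bool mask; text_ce: scalar CE over text_labels."""
        n_action = predict_actions.sum().clamp(min=1)
        flow = (flow_per_sample * predict_actions).sum() / n_action
        return cfg.flow_loss_weight * flow + cfg.text_loss_weight * text_ce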

Limitations / known follow-ups:

- Text loss runs an additional prefix-only forward separate from the
  flow path's prefix forward. The forwards could share their prefix
  computation; for clarity of this first commit they don't.
  Optimization is straightforward when needed.
- Per-sample loss for ``reduction="none"`` is not yet meaningfully
  defined for the dual path — we broadcast the scalar to (B,) for
  caller compatibility (e.g. RA-BC weighting will need follow-up).
- Inference ``select_action`` is unchanged from SmolVLA today —
  it predicts actions only. A separate "generate text"
  ``select_message`` path is the natural next step for runtime
  use of the LM head (memory updates, plan refreshes, VQA answers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:54:57 +02:00
Pepijn 37b1eb218a feat(smolvla2): chat-template processor + label mask + predict_actions
Wires PR 1's recipe stack into the SmolVLA2 pipeline so multi-target
sub-recipes (memory_update, ask_vqa, user_interjection_response,
high_level_subtask) carry meaningful supervision through to the model.

- New ``chat_processor_smolvla2.py`` with
  ``SmolVLA2ChatTokenizerStep``: reads ``messages`` /
  ``message_streams`` / ``target_message_indices`` from the rendered
  sample (PR 1 ``RenderMessagesStep``), calls
  ``apply_chat_template(messages, tools=DEFAULT_TOOLS, ...)`` on the
  SmolVLM tokenizer, and writes:

    OBS_LANGUAGE_TOKENS / _ATTENTION_MASK   ← chat-templated prompt
    text_labels                              ← -100 except target msg tokens
    predict_actions                          ← True iff any low_level target

  Builds the label mask robustly by re-rendering the chat through
  each target's prefix and reading off the prefix length — same
  tokenizer, same tools, so the prefix tokens are guaranteed to be
  a prefix of the full sequence. Image/video content blocks
  (LeRobot ``feature``-keyed) are stripped before tokenizing; the
  actual image tensors flow through SmolVLA's existing
  ``OBS_IMAGES_*`` channels and ``embed_prefix`` puts them before
  the language embeddings, matching the chat-template-stripped
  text order.

- ``processor_smolvla2.py``: when ``config.recipe_path`` is set,
  build a new pipeline with ``RenderMessagesStep`` +
  ``SmolVLA2ChatTokenizerStep`` instead of SmolVLA's plain
  ``TokenizerProcessorStep``. When ``recipe_path`` is ``None``,
  fall back to SmolVLA's pipeline so unannotated datasets still
  work unchanged. Resolves recipe paths relative to
  ``src/lerobot/configs/`` so ``recipes/smolvla2_hirobot.yaml``
  works directly.

The next commit on this branch picks up ``text_labels`` and
``predict_actions`` from the batch and routes them through the
SmolVLM ``lm_head`` for the actual dual-loss training.
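
A minimal sketch of the prefix-length trick described above (names are
illustrative; the real step handles message boundaries and generation
prompts more carefully):

    def build_text_labels(tokenizer, messages, target_indices, tools):
        """-100 everywhere except the tokens of each target message."""
        full_ids = tokenizer.apply_chat_template(messages, tools=tools, tokenize=True)
        labels = [-100] * len(full_ids)
        for tgt in target_indices:          # targets are assistant replies, tgt >= 1
            before = tokenizer.apply_chat_template(
                messages[:tgt], tools=tools, tokenize=True
            )
            through = tokenizer.apply_chat_template(
                messages[: tgt + 1], tools=tools, tokenize=True
            )
            labels[len(before) : len(through)] = full_ids[len(before) : len(through)]
        return full_ids, labels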

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:21:03 +02:00
Pepijn 52e1fd35cb feat(tools): src/lerobot/tools/ — runnable tool registry + SayTool
Ships the runtime side of the OpenAI-style function-calling stack
introduced in PR 1 (catalog in ``meta/info.json["tools"]``) and PR 2
(annotation pipeline writes the catalog after a run). One file per
tool — heavy deps stay isolated.

Layout:

- ``base.py`` — :class:`Tool` Protocol: ``name``, ``schema``,
  ``call(arguments)``. Runtime-checkable so tests can use
  ``isinstance(...)``.
- ``registry.py`` — :data:`TOOL_REGISTRY` (name → class) plus
  ``get_tools(meta, **kwargs)`` that instantiates every entry whose
  ``function.name`` is registered. Tools whose name is unknown are
  silently skipped — the schema still rides through the chat
  template, the model just can't actually invoke that tool at
  inference.
- ``say.py`` — :class:`SayTool` wrapping Kyutai's pocket-tts
  (CPU-only, ~100M params, ~6× real-time on a MacBook Air M4).
  Lazy model load: pocket-tts is imported and the voice state
  computed on first ``call(...)`` (or eagerly via ``preload()``).
  Returns the PCM tensor; optionally writes a ``.wav`` to
  ``output_dir`` for offline inspection.
- ``__init__.py`` — re-exports the public surface.

Optional install:

    pip install lerobot[tools]

The ``[tools]`` extra in ``pyproject.toml`` pulls in ``pocket-tts`` +
``scipy`` (for the wav writer). Adding more tools later means a new
file + a registry entry — no new extras unless the tool brings new
deps.

To add your own tool, follow the three-step guide in
``docs/source/tools.mdx`` (PR 1):

  1. Drop ``src/lerobot/tools/<my_tool>.py`` with a ``Tool``-conforming
     class.
  2. Register the class in ``TOOL_REGISTRY`` (this file).
  3. Pre-populate ``meta/info.json["tools"]`` with the schema (or let
     ``lerobot-annotate`` add it on the next run).
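
A hypothetical ``wave`` tool showing the three steps in miniature; only the
``name`` / ``schema`` / ``call(arguments)`` surface comes from this commit,
everything else is illustrative:

    # src/lerobot/tools/wave.py  (hypothetical example tool)
    from typing import Any

    class WaveTool:
        name = "wave"
        schema = {
            "type": "function",
            "function": {
                "name": "wave",
                "description": "Wave the gripper as a greeting.",
                "parameters": {
                    "type": "object",
                    "properties": {"times": {"type": "integer"}},
                    "required": ["times"],
                },
            },
        }

        def call(self, arguments: dict[str, Any]) -> str:
            return f"waved {arguments.get('times', 1)} time(s)"

    # step 2, in registry.py:
    # TOOL_REGISTRY["wave"] = WaveTool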

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:58:04 +02:00
Pepijn 7459dfccb6 feat(policies): scaffold smolvla2 (smolvla + lm_head re-enabled)
PR 3 of the steerable-annotation plan retargeted from Pi0.5 to SmolVLA
because the recipe stack (PR 1 + PR 2) outputs HF/TRL-compatible chat
which a chat-pretrained backbone consumes natively. SmolVLA strips the
SmolVLM ``lm_head`` though, so it can only do flow-matching action
prediction. SmolVLA2 keeps the LM head so the same model can train on
the full Hi Robot / MEM / ECoT blend defined in the plan:

  * action-only sub-recipes  (low_level_execution)        flow loss
  * text-only sub-recipes    (memory_update / ask_vqa /   CE loss on
                              user_interjection_response)  lm_head
  * mixed sub-recipes                                      both summed

This first commit lays down the structural scaffold:

- ``src/lerobot/policies/smolvla2/`` — new package with thin subclasses
  of ``SmolVLAConfig`` / ``SmolVLAPolicy`` so we don't fork the 900-line
  modeling code. ``SmolVLA2Config`` adds ``recipe_path``,
  ``apply_chat_template``, ``text_loss_weight``, ``flow_loss_weight``,
  and ``unfreeze_lm_head``. ``SmolVLA2Policy`` unfreezes the SmolVLM
  ``lm_head`` (and the surrounding norm + last text-model layer SmolVLA
  freezes) when ``unfreeze_lm_head=True`` and ``text_loss_weight>0``.
- ``factory.py`` registers ``smolvla2`` in ``get_policy_class``,
  ``make_policy_config``, and the pre/post-processor builder. Important:
  the ``smolvla2`` branch lives BEFORE the ``isinstance(config,
  SmolVLAConfig)`` check because ``SmolVLA2Config`` subclasses
  ``SmolVLAConfig`` — without the ordering, SmolVLA2 would silently
  pick up SmolVLA's processor.
- ``configs/recipes/smolvla2_hirobot.yaml`` — canonical Hi Robot blend
  for SmolVLA2. Same shape as ``pi05_hirobot.yaml`` (PR 1) so the
  recipe stack stays uniform across policy backbones.

Behaviour today is identical to SmolVLA: the modeling forward
delegates to ``SmolVLAPolicy.forward`` and the processor delegates to
``make_smolvla_pre_post_processors``. The next commit on this branch
adds the chat-template processor + ``text_labels`` / ``predict_actions``
batch keys; the commit after that wires the actual text-loss path
through ``vlm.lm_head``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:55:23 +02:00
Pepijn 73740ecf4b feat(annotate): write tool catalog to meta/info.json after annotation
After every ``lerobot-annotate`` run, the executor ensures
``meta/info.json["tools"]`` contains at minimum the canonical ``say``
schema, while preserving any tools the user pre-declared on the
dataset. Chat-template consumers (PR 3 SmolVLA2 / Pi0.5 / dataset
visualizer) read the catalog through
``LeRobotDatasetMetadata.tools`` and pass it to
``apply_chat_template(messages, tools=meta.tools, ...)``.

- ``executor.py``: new ``_ensure_tools_in_info`` helper called
  after the parquet rewrite. Idempotent and additive — merges by
  ``function.name``, only writes back if the list changed.
- ``writer.py``: drops the duplicated ``SAY_TOOL_SCHEMA`` /
  ``DEFAULT_TOOLS`` constants in favour of importing from
  ``lerobot.datasets.language`` (PR 1's single source of truth).
  Re-exported so existing imports keep working.
- ``annotation_pipeline.mdx``: replace the "code constant only" note
  with a pointer to the new Tools doc and a description of the
  meta/info.json behaviour, including how to pre-declare custom
  tools before annotation runs.

This is the storage half of the tools work; PR 3 ships the runnable
implementations under ``src/lerobot/tools/`` (one file per tool,
first up: ``say.py`` wired to Kyutai's pocket-tts).
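
Roughly what the idempotent merge amounts to (a sketch; the real helper also
handles reading and writing ``meta/info.json``):

    def merge_tools(existing: list[dict], required: list[dict]):
        """Append any required schema whose function.name isn't declared yet."""
        known = {t["function"]["name"] for t in existing}
        changed = False
        for tool in required:
            if tool["function"]["name"] not in known:
                existing.append(tool)
                changed = True
        return existing, changed  # write info.json back only when `changed`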

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:51:38 +02:00
Pepijn 1b81e49214 feat(annotate): task rephrasings + video-derived task fallback
Module 1 now produces ``task_aug`` rows (registered in PR 1) so the
PR-1 ``${task}`` resolver can rotate phrasings deterministically per
``sample_idx``. Plus an opt-in video-derived task that bypasses the
canonical ``meta/tasks.parquet`` task when it's empty, low-quality, or
explicitly disabled — every downstream Module-1 prompt then uses the
derived task as its grounding.

- ``Module1Config``: adds ``n_task_rephrasings`` (default 10) and
  ``derive_task_from_video`` ∈ ``{off, if_short, always}`` (default
  ``if_short``: triggers when canonical is empty, < 3 words, or matches
  a placeholder string like ``debug`` / ``unnamed`` / ``tbd``).
- ``plan_subtasks_memory.py``: ``run_episode`` now resolves an
  ``effective_task`` (canonical OR video-derived) and threads it
  through ``_generate_subtasks`` / ``_generate_plan`` /
  ``_generate_memory`` so subtasks, plans, and memory are all grounded
  in the same task string. Then generates ``n`` rephrasings of the
  effective task and writes them as ``task_aug`` rows at ``t=0`` with
  ``role=user``. The effective task itself is included as the first
  variant so the rotation is guaranteed to cover the source-of-truth
  phrasing.
- New prompts: ``module_1_video_task.txt`` (one-shot video → task),
  ``module_1_task_rephrasings.txt`` (text-only paraphraser, ``n`` per
  call).
- ``meta/tasks.parquet`` is NOT modified — derived tasks live only in
  ``language_persistent``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d813c75b76 fix(annotate): align interjections with the actual demo trajectory
qwen36moe-11 surfaced a deeper semantic problem with mid-episode
interjections: they were generated as *counterfactual* user requests
("actually skip the wipe", "use the blue one instead") but teleop data
is frozen — the robot in the video already executed everything,
including the steps the user "asked to skip". The training signal was
therefore self-contradictory: interjection text said one thing, the
robot's subsequent action stream did the opposite.

Flip the framing. Anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask. The
robot's visible next behavior IS the interjection's effect, so:

  interjection text → plan refresh → action stream

are all consistent with the same observed video.

Concretely:

- ``interjections_and_speech.py``: instead of sampling random
  timestamps from ``frame_timestamps``, walk Module 1's subtask spans
  and sample from the (subtask N → subtask N+1) transitions. Pass both
  the just-finished and the upcoming subtask texts into the prompt.

- ``_window_timestamps``: re-center the multi-frame video window on
  the boundary itself (half the frames cover the end of the previous
  subtask, half cover the start of the next one) so the VLM has the
  same visual conditioning the policy will see at training time.

- ``module_2_interjection.txt``: rewritten. The prompt now states
  explicitly that this is offline data, the robot already committed to
  the next subtask, and the interjection must be a natural request
  that aligns with — not contradicts — the next subtask. Removes the
  "negative task / situated correction" Hi Robot framing because those
  scenarios require online execution to be coherent.

Plan-refresh logic from the previous commit (forwarding interjection
text into the refresh prompt) is unchanged and now reinforces the same
direction: the refreshed plan emphasizes the upcoming subtask the
interjection just asked for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 3434d2ef22 fix(annotate): ground interjections in video + propagate text to plan refresh
qwen36moe-10 showed three Module-2 / plan-refresh quality issues that
are not architecture problems — they're prompt-grounding bugs:

1. Interjection prompt passed ``current_subtask = record.episode_task``
   (the WHOLE-episode task), not the actual subtask in force at the
   chosen timestamp. The VLM had no signal about what was visible at
   that moment, so its interjections were generic ("actually skip X"
   where X had nothing to do with the visible activity).

2. Interjection prompt only attached a single frame
   (``frames_at(record, [t_snap])``). With one frozen image the VLM
   couldn't read the ongoing motion. Module 1 already gets the whole
   episode video for subtask decomposition, which is why subtasks are
   well-grounded; Module 2 was the outlier.

3. The plan-refresh prompt told the model "a plan refresh after a user
   interjection at t=X.YZs" but never showed it the interjection
   *text*. So the refreshed plan couldn't actually reflect the user's
   correction — at best it recombined the same step list.

Fix:

- ``interjections_and_speech.py``: Module 2 reads Module 1's subtask
  rows from the same staging tree (executor orders module_1 → module_2
  so they're already there) and resolves the actual ``current_subtask``
  at each chosen timestamp. Pulls a small clip
  (``interjection_window_seconds`` × ``interjection_window_frames``,
  defaulting to 4 frames over the leading 2 s) instead of one frame.
  Drops the silently-zeroing ``len(candidate_ts) // 4`` cap on the
  interjection count.

- ``module_2_interjection.txt``: prompt is rewritten to reference the
  multi-frame visual context and require the interjection to mention
  something visible OR named in the current subtask, not invented.

- ``plan_subtasks_memory.py``: ``run_plan_updates`` now accepts and
  threads through interjection texts. ``_generate_plan(refresh_t,
  interjection)`` injects both the current subtask AND the interjection
  text into the prompt so the refreshed plan can drop / reorder /
  constrain steps to match the user's correction. (Plan still refreshes
  ONLY at user interjections — subtask generation runs ~1 Hz at
  inference, plan re-emission is event-driven.)

- ``executor.py``: forwards ``interjection_texts`` alongside
  ``interjection_times`` to ``run_plan_updates``.

- ``Module2Config``: bumps ``max_interjections_per_episode`` default
  from 1 to 3 and exposes ``interjection_window_seconds`` /
  ``interjection_window_frames``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b71e10da6b refactor(annotate): drop dataset-level `tools` parquet column
PR 2 used to write a top-level ``tools`` column on every parquet shard
holding the JSON schema for the ``say`` tool, broadcast identically
across every row. That extends PR 1's schema for no real information
gain — the schema is a fixed code constant, parquet's RLE/dict encoding
collapses it on disk anyway, and HF/TRL chat-template consumers can
just import the constant directly.

PR 2 should fill in PR 1's existing schema, not add to it. So:

- ``writer.py``: stop emitting the ``tools`` column. Strip any legacy
  ``tools`` column from older shards on rerun so the schema converges to
  v3.1. ``SAY_TOOL_SCHEMA`` stays as a public constant (now joined by
  ``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``); chat-template policies and the
  visualizer import them directly.
- ``test_writer.py``: replace the "tools column present" assertion with
  one that explicitly checks the column is absent, plus a new test
  asserting the constant's shape.
- ``test_pipeline_recipe_render.py``: drop the tools-column read; assert
  it's not present in the rewritten parquet.
- ``annotation_pipeline.mdx``: update the writer description to note the
  parquet stays small and the schema lives as a code constant.

If multi-tool-set support ever becomes real (datasets with different
tool inventories), the right home is ``meta/info.json["tools"]`` —
adding it later is non-breaking; ripping out a parquet column already
shipped is not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 0f6e3230df fix(annotate): decode video frames with PyAV directly
``lerobot.datasets.video_utils.decode_video_frames`` routes
``backend="pyav"`` through ``decode_video_frames_torchvision`` →
``torchvision.io.VideoReader``, but ``VideoReader`` was removed in
torchvision >= 0.22 (the vllm/vllm-openai:latest container ships with
torchvision 0.25). That made every Module 3 frame decode raise
``AttributeError: module 'torchvision.io' has no attribute 'VideoReader'``,
which the previous catch-all silently turned into an empty image list,
which then made every Module 3 prompt skip via the
``not _has_image_block(messages)`` branch and produce zero VQA rows.

Bypass ``video_utils`` entirely. The annotation pipeline only needs
a handful of PIL frames per (episode, ts), so a direct PyAV decode is
both simpler and insulated from torchvision API churn. ``av`` is already
in the install set, no new dependency.
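
A minimal sketch of the direct PyAV decode (the pipeline's actual helper and
its caching differ):

    import av

    def decode_frames(video_path, timestamps):
        """One PIL.Image per requested timestamp (nearest frame at or after it)."""
        images = []
        with av.open(video_path) as container:
            stream = container.streams.video[0]
            for ts in timestamps:
                container.seek(int(ts / stream.time_base), stream=stream)
                for frame in container.decode(stream):
                    if frame.time is not None and frame.time >= ts:
                        images.append(frame.to_image())
                        break
        return images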

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 2f2e42c4aa log(annotate): warn loudly on first video decode failure
VideoFrameProvider._decode used to swallow every exception silently and
return []. That made Module 3 (VQA) produce zero rows whenever local
video decoding broke (codec, backend, missing file, ...) because every
prompt got skipped via the ``not _has_image_block(messages)`` branch in
general_vqa.py — without any signal in the job log.

Log the first failure with full exception info (subsequent failures
stay quiet to avoid log spam) so this fast-path is debuggable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 5ee0104739 log(annotate): surface resolved frame-provider cameras at startup
Print the default and full camera list once at the top of every run so a
silent Module-3-no-op (cam_keys=[]) is visible in the job log instead of
only being discoverable by counting parquet rows after upload.

Also warn loudly when Module 3 is enabled but no cameras resolved, with
a hint about the --vlm.camera_key fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn e064cfcb04 fix(annotate): seed Module 3 cameras from camera_keys + camera_key fallback
Module 3 fast-pathed out (50 episodes in 0.6s) when
``frame_provider.camera_keys`` came back empty even though Module 1/2
worked, because they use ``frame_provider.camera_key`` (singular) and
were happy with the explicit ``--vlm.camera_key=...`` override.

Two fixes:

- ``frames.py``: read ``meta.camera_keys`` (covers both video- and
  image-stored cameras) instead of ``meta.video_keys`` (video-only),
  matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If
  metadata still surfaces nothing but the caller explicitly passed
  ``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by
  definition known to exist on the dataset.
- ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees
  zero cameras so this never silently produces zero VQA again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b3d9494831 docs(annotate): add HF Jobs runner example for lerobot-annotate
A ready-to-run example of launching the annotation pipeline on a
Hugging Face job (h200x2) with two vllm replicas serving
Qwen3.6-35B-A3B-FP8. Lives next to other end-to-end recipes under
examples/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 1217fdb6f0 feat(annotate): emit VQA per-camera and propagate camera field
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.

- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
  and a `camera_key=` argument on `frames_at` / `video_for_episode`.
  `VideoFrameProvider` exposes every `observation.images.*` key the
  dataset declares (not just the first) and keys its decode cache on
  `(episode, camera, timestamp)` so per-camera reads don't collide.
  Module 1 / 2 keep their old single-camera behaviour by leaving
  `camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
  .camera_keys` for each emission tick, builds one prompt per camera,
  batches all of them through the VLM, and stamps the resulting rows
  with `camera=<that key>`. Empty `camera_keys` (null provider) makes
  the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
  carry `camera` through and call `validate_camera_field` so the
  invariant is enforced at the writer boundary. Event sort key now
  includes `camera` for deterministic ordering when several cameras
  share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
  field; `_check_camera_field` enforces the invariant and cross-checks
  every view-dependent row's `camera` against the dataset's known video
  keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
  `(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
  `camera_keys` into the validator so the cross-check uses the actual
  dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
  `camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
  configures two cameras and asserts both are represented, that every
  emitted row has a `camera` tag, and that uniqueness holds per
  `(timestamp, camera, role)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d0388e1142 fix(annotate): transcode subclips to H.264 instead of stream-copy
Modern LeRobot datasets store videos in AV1, which vllm's libav build
cannot decode (the video processor returns 0 frames and downstream
chokes with ZeroDivisionError). Re-encode each per-episode subclip
with libx264 (preset ultrafast, crf 23) so the resulting mp4 is
universally decodable. Strip audio with -an for a smaller payload.
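
The shape of the re-encode call, as a sketch (the exact invocation in the
pipeline may differ slightly):

    import subprocess

    def transcode_subclip(src, dst, start_s, end_s):
        """Cut [start_s, end_s) out of `src` and re-encode it as H.264, no audio."""
        subprocess.run(
            [
                "ffmpeg", "-y",
                "-ss", f"{start_s:.3f}", "-to", f"{end_s:.3f}", "-i", src,
                "-an",                      # strip audio
                "-c:v", "libx264", "-preset", "ultrafast", "-crf", "23",
                dst,
            ],
            check=True,
        )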

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 524aa59faa feat(annotate): pack multiple vllm replicas per GPU via num_gpus
Adds VlmConfig.num_gpus so parallel_servers can exceed the physical
GPU count. Replicas are round-robin-assigned to GPUs (e.g.
parallel_servers=4 + num_gpus=2 → replicas pinned to GPUs 0,1,0,1).
Backward-compatible: num_gpus=0 keeps the existing 1-replica-per-GPU
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 27f7829b09 feat(annotate): forward chat_template_kwargs to OpenAI extra_body
Lets callers pass per-request template flags such as
{"enable_thinking": false} for Qwen3.5/Qwen3.6 models, where the
default thinking preamble otherwise consumes the entire max_new_tokens
budget before any JSON is emitted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7f8bf108e8 fix(annotate): include prompt .txt files in wheel
The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 855ff027f8 refactor(annotate): drop HF Inference Providers code path
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3b797bb118 feat(annotate): --vlm.push_to_hub uploads the annotated dataset
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn aea04721ae feat(annotate): parallelize episodes within each module phase
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn ab5479129a fix(annotate): probe /v1/models for spawn-helper readiness
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn e6d4ac6f02 fix(annotate): lock-protect per-line writes for parallel server streams
Eight server-streaming threads writing characters without
synchronization let UTF-8 sequences from different servers interleave
mid-byte, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 5722d365c5 feat(annotate): client_concurrency for parallel in-flight requests
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.

Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3d7e60cee4 feat(annotate): parallel_servers spawns N independent vllm replicas
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.

Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7b767d4d60 feat(annotate): per-episode progress logs in executor 2026-04-30 18:48:35 +02:00
Pepijn f1e3ab7794 fix(annotate): don't crash pipeline on persistent JSON parse failure
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 585341ba9f fix(annotate): robust JSON extraction (think tags + first balanced object)
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 23ff346027 fix(annotate): stream child stdout char-by-char so tqdm \r progress flushes 2026-04-30 18:48:35 +02:00
Pepijn 3c5cbe7af4 test(annotate): adjust video-block test for fps-based frame sampling 2026-04-30 18:48:35 +02:00
Pepijn f2cbd97635 feat(annotate): Module 1 samples image frames at fps rate
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.

Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn c06c8d594a feat(annotate): use cached HF token from huggingface-cli login
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn cd495a3a9d feat(annotate): default to HF Inference Providers, no local GPU needed
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.

Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c99ac45cd1 feat(annotate): one-flag HF Inference Providers backend
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 13aaafeae0 fix(annotate): omit mm_processor_kwargs by default; transformers serve rejects it
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 2129648bf4 fix(annotate): mm_processor_kwargs in extra_body; inline file URLs as data URLs
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
  block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
  and inline it as a base64 data:video/mp4 URL so the server sees the
  bytes directly.

Both mistakes surface as an unhelpful std::bad_alloc on the server
side, which explains the crashes we hit.
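
A sketch of the two fixes; the ``video_url`` content-block shape shown in
the comment follows the vLLM/transformers-serve OpenAI-compatible convention
and may need adjusting per server:

    import base64
    from pathlib import Path

    def video_data_url(mp4_path):
        """Inline a local mp4 as a base64 data URL so the server sees the bytes."""
        b64 = base64.b64encode(Path(mp4_path).read_bytes()).decode("ascii")
        return f"data:video/mp4;base64,{b64}"

    # content block:
    #   {"type": "video_url", "video_url": {"url": video_data_url(clip)}}
    # and the sampling rate rides in:
    #   extra_body={"mm_processor_kwargs": {"fps": 1}}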

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn f5cd3f6e4e fix(annotate): detect server ready via stdout banner, not /v1/models polls
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ecf5766301 fix(annotate): visible auto_serve via stdout prints + live server log stream
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 11597d4f71 fix(annotate): auto_serve defaults to True; probe before spawning
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 8b9c598cf4 feat(annotate): auto_serve mode spawns and tears down inference server
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.

Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn b325475b38 feat(annotate): video_url block for openai backend
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).

A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.

Enable with --module_1.use_video_url=true (and --vlm.backend=openai).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ef137ff86a feat(annotate): openai-compatible backend for transformers/ktransformers serve
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:

- launch the server out-of-process (transformers serve, vllm serve,
  ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
  --vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...

Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c5df821a96 fix(annotate): use vllm.chat() API for multimodal prompts
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 7ec3d7999c fix(annotate): drop guided_decoding=dict (api differs across vllm)
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 712d63abbd fix(annotate): tolerate decoder returning fewer frames than requested
pyav (and sometimes torchcodec) decode can return fewer frames than
requested timestamps when some timestamps fall outside the video file's
content range. Drop the strict=True on the zip and rely on the
None-filter to discard missing frames.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 6653999983 fix(annotate): default video decode backend to pyav
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 4bdbedc9a0 fix(annotate): default trust_remote_code=False for HF loaders
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn e240305e8e fix(annotate): default transformers backend to manual GPU placement
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.

LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ccd189b264 fix(annotate): LEROBOT_DISABLE_CUDNN escape hatch for conv3d crash
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.

Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.
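
The escape hatch boils down to this (a sketch of the effect, not the exact
call site):

    import os
    import torch

    if os.environ.get("LEROBOT_DISABLE_CUDNN") == "1":
        # fall back to native PyTorch conv kernels; slower but avoids the
        # CUDNN_STATUS_NOT_INITIALIZED failure in the conv3d patch embedder
        torch.backends.cudnn.enabled = False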

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ef1242bbd4 fix(annotate): expose gpu_memory_utilization and max_model_len for vllm
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.

- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
  encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
  default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn ebf4a04d41 fix(annotate): pass trust_remote_code=True to HF auto-classes
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 4419b4ef1b fix(annotate): low_cpu_mem_usage=True on transformers load path
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn ff06ca82d2 fix(annotate): use device_map='auto' for transformers backend
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn fcb01e73eb fix(annotate): try AutoModelForImageTextToText first, fall back to AutoModelForVision2Seq
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 268f8d1f53 fix(annotate): replace Literal types with str for older draccus
Older draccus versions (e.g. 0.10.x bundled in some envs) lack a decoder
for typing.Literal and raise:
  No decoding function for type typing.Literal['vllm', 'transformers', 'stub']

Switching VlmConfig.backend from Literal to str works under every
draccus version. The runtime branch in vlm_client.make_vlm_client
already validates the value and raises ValueError on unknown backends,
so the constraint stays enforced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 663fff0ae2 feat(annotate): Module 1 sees the whole episode as one video block
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks — no stride, no count, no keyframe count knob to
tune.

- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
  max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
  uniformly across the episode duration; ``_NullProvider`` returns []
  for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
  ``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
  the prompt template asks the model to "watch the whole clip, then
  segment it" with cut points decided from gripper/contact/regrasp
  events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
  ``max_video_frames: int = 32`` (model-capacity bound, not annotation
  logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
  the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
  of the old "Decompose the demonstration" phrase that no longer
  appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
  the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 9d6af804bf feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.

- New `frames.py`:
  - `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
    dataset's first `observation.images.*` stream via
    `LeRobotDatasetMetadata.get_video_file_path` and
    `decode_video_frames`, with the same `from_timestamp` shift the main
    dataset uses.
  - Per-process LRU cache so co-timestamped Module 1 plan-update + Module
    2 calls share decode work.
  - `make_frame_provider` falls back to a null provider when the dataset
    has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
  prepend image blocks before the text block.
  - Module 1 attaches `keyframes_per_episode` keyframes to the subtask
    decomposition prompt.
  - Module 2 attaches the frame at the interjection timestamp.
  - Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
  `Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
  `--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
  on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
  emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
  Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
  attachment behaviour and the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn f763f85213 feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.

Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
  refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
  mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
  configurable cadence with one-retry JSON validation.

Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.

Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.

Branched from PR 1 (`feat/language-columns`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn e3e9374e2c feat(language): tool catalog in meta/info.json + LeRobotDatasetMetadata.tools
Stores OpenAI-style function schemas at ``meta/info.json["tools"]`` so
datasets can declare which tools are available (today: just ``say``;
tomorrow: per-dataset extensions). The ``DEFAULT_TOOLS`` constant
fills in for unannotated datasets so chat-template consumers don't
have to special-case anything.

Three pieces:

- ``language.py``: ``SAY_TOOL_SCHEMA`` and ``DEFAULT_TOOLS``
  constants. Single source of truth — PR 2's writer and PR 3's
  runtime tool registry will both import from here instead of
  duplicating the dict.
- ``dataset_metadata.py``: ``LeRobotDatasetMetadata.tools`` property
  reads ``info.json["tools"]`` and falls back to ``DEFAULT_TOOLS``.
  Returns deep-copied dicts so callers can mutate the result safely.
- ``docs/source/tools.mdx``: spec page covering the catalog, per-row
  invocations, and the three-step "how to add a new tool" workflow
  (declare schema, implement, register). Linked from the docs
  toctree under the Datasets section.

This lays the groundwork for PR 2's pipeline writing the catalog out
during annotation, and PR 3's ``src/lerobot/tools/`` package shipping
runnable implementations (one file per tool — first up:
``say.py`` wrapping Kyutai's pocket-tts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:44:58 +02:00
Pepijn c1a0c601e2 feat(language): task_aug style + automatic ${task} rephrasing rotation
Adds task-prompt diversity (Xiao 2022 / CAST) without touching
``meta/tasks.parquet`` or forcing recipes to opt in. The plan reserved
``task_aug`` as a future style; this lands it now.

- ``language.py``: add ``task_aug`` to ``CORE_STYLES`` and
  ``PERSISTENT_STYLES``. ``column_for_style("task_aug")`` returns
  ``language_persistent`` so PR 2 writers route it correctly.

- ``language_render.py``: ``_resolve_task`` now consults the persistent
  slice for rows of ``style="task_aug", role="user"``. When any exist
  it picks one deterministically by ``sample_idx`` (blake2b-keyed, not
  Python's randomized hash) so an epoch sees every rephrasing of every
  episode while the same sample still resolves identically across
  reruns. Falls back to the canonical ``meta/tasks.parquet`` task when
  no rephrasings are present, so existing datasets and unannotated runs
  keep their behaviour. Explicit ``task=`` overrides still win.

- Tests: rephrasing coverage across samples, determinism on repeat
  ``sample_idx``, fallback when persistent has no ``task_aug`` rows,
  and explicit override priority.

Recipes get this for free: any ``${task}`` placeholder rotates through
the available rephrasings. Recipes that want the literal canonical task
can override the binding.
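
A minimal sketch of the deterministic pick (the real resolver's hash inputs
and keying may differ):

    import hashlib

    def pick_rephrasing(rephrasings, sample_idx):
        """Deterministic across reruns (blake2b-keyed, not Python's randomized hash)."""
        digest = hashlib.blake2b(
            str(sample_idx).encode(), key=b"task_aug", digest_size=8
        ).digest()
        return rephrasings[int.from_bytes(digest, "big") % len(rephrasings)]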

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:45:39 +02:00
Pepijn 1ca38d9748 fix(language): drop motion from VIEW_DEPENDENT_STYLES
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.

The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:54:12 +02:00
Pepijn 5a6aa64570 feat(language): per-camera tagging on view-dependent styles
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.

- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
  to both Arrow struct types and the HF datasets feature mappings;
  introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
  `is_view_dependent_style` and `validate_camera_field` helpers (camera
  required iff style is view-dependent; see the sketch after this list).
- `language_render.py`: thread an optional `camera=` kwarg through every
  resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
  `_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
  VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
  Without a `camera` filter, multi-row matches keep raising the existing
  ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
  `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
  the matching image block), keeping the original 0.20 budget and
  documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
  `is_view_dependent_style`, `validate_camera_field` (both required and
  forbidden directions), per-camera `emitted_at` filtering, and the
  ambiguity error when two cameras emit `(vqa, assistant)` at the same
  timestamp without a `camera=` filter. RenderMessagesStep + dataset
  passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
  the per-camera resolver pattern, and the canonical recipe convention.
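
A minimal sketch of the camera rule from the first bullet — helper names match the
message, signatures and error wording are assumptions:

```python
# Sketch of the "camera required iff view-dependent" rule; not the real module.
VIEW_DEPENDENT_STYLES = {"vqa", "motion", "trace"}  # motion is dropped again in the fix above

def is_view_dependent_style(style: str) -> bool:
    return style in VIEW_DEPENDENT_STYLES

def validate_camera_field(style: str, camera: str | None) -> None:
    if is_view_dependent_style(style) and camera is None:
        raise ValueError(f"style {style!r} is view-dependent and requires a camera")
    if not is_view_dependent_style(style) and camera is not None:
        raise ValueError(f"style {style!r} is camera-agnostic; camera must be None")
```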

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:48:17 +02:00
Pepijn 0b06790da0 feat(language): add motion (persistent) and trace (event-only) styles
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:21:49 +02:00
Pepijn b43dc39ba4 Add docstrings to all new helpers; revert uv.lock
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:15:03 +02:00
Pepijn 2b71221194 Address review: split persistent/event schemas, drop event timestamps
- recipe.py: derive _VALID_ROLES/_VALID_STREAMS from MessageRole/MessageStream Literals
- dataset_metadata.py: keep CODEBASE_VERSION at v3.0
- language.py: remove RESERVED_STYLES; split arrow/feature schemas into
  persistent (with timestamp) and event (without timestamp); add docstrings
- language_render.py: events use frame-row timestamp implicitly; no
  per-event timestamp filtering or sorting
- converters.py: drop unused subtask_key passthrough
- add docstrings to new public APIs (recipe, render_messages_processor, collate)
- update tests for split schemas; revert uv.lock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 13:38:23 +02:00
Pepijn 8833d735a1 Add extensive language support 2026-04-27 10:56:32 +02:00
Pepijn ba27aab79c fix(robotwin): pin compatible curobo in benchmark image (#3427)
* fix(robotwin): pin compatible curobo in benchmark image

* fix(robotwin): make curobo smoke check gpu-free
2026-04-21 19:51:44 +02:00
Pepijn 5adad11128 feat(sim): VLABench benchmark integration (#3396)
feat(sim): add VLABench benchmark integration
Add VLABench as a new simulation benchmark in LeRobot, following the existing LIBERO and MetaWorld patterns.
This PR wires VLABench end-to-end across environment integration, Docker setup, CI smoke evaluation, and documentation. It also fixes a number of upstream packaging and runtime issues required to make VLABench usable and reproducible in CI.
What’s included

Benchmark integration
- Add VLABench as a new simulation benchmark.
- Expose supported VLABench tasks through the LeRobot env interface.
- Follow the established LIBERO / MetaWorld factory patterns.
- Preserve lazy async-env metadata so env.unwrapped.metadata["render_fps"] continues to work.

CI smoke evaluation
- Add a VLABench smoke-eval job using lerobot/smolvla_vlabench.
- Use the correct rename_map for the 3-camera dataset layout.
- Expand smoke coverage from 1 to 10 primitive tasks.
- Extract task descriptions after eval so metrics artifacts include per-task labels.
- Skip Docker Hub login when secrets are unavailable (e.g. fork PRs).

Docker / install fixes
- Install VLABench from GitHub rather than PyPI.
- Use uv pip, not pip, in the base image.
- Fail loudly on install errors instead of masking them.
- Clone VLABench into the non-root user’s home directory.
- Use shallow editable installs for VLABench and rrt-algorithms to work around missing __init__.py issues.
- Pin upstream clones to exact commit SHAs for reproducibility.
- Add undeclared runtime dependencies required by VLABench (open3d, colorlog, scikit-learn, openai).
- Unpin open3d so Python 3.12 wheels resolve.

Assets
- Support downloading VLABench assets from a Hugging Face Hub mirror via VLABENCH_ASSETS_REPO.
- Keep Google Drive download support as fallback.
- Install huggingface_hub[hf_xet] so Xet-backed assets download correctly.
- Validate required mesh/XML asset subtrees at build time.
- Patch VLABench constants to tolerate missing asset directories at import time.

Runtime / env correctness
- Import VLABench robots and tasks explicitly so decorator-based registry population happens.
- Resize and normalize camera observations so they always match the declared (H, W, 3) uint8 observation space.
- Reinstall LeRobot editably inside the image so the new env code is actually used.
- Coerce agent_pos / ee_state to the expected shape.
- Pad actions when needed to match data.ctrl.
- Replace zero-padding fallback with proper dm_control IK for 7D end-effector actions.
- Refetch dm_control physics on each step instead of caching weakrefs.
- Retry unstable resets with reseeding and handle PhysicsError gracefully at step time.

Dataset / policy alignment
Align VLABench observations and actions with Hugging Face dataset conventions used by lerobot/vlabench_unified:
- convert EE position between world frame and robot-base frame at the env boundary,
- expose / consume Euler XYZ instead of raw quaternion layout,
- align gripper semantics with dataset convention (1 = open, 0 = closed).
This fixes policy/env mismatches that previously caused incorrect IK targets and unstable behavior at evaluation time.

Docs
- Add a full docs/source/vlabench.mdx page aligned with the standard benchmark template.
- Document task selection forms (single task, comma list, suite shortcut).
- Document installation, evaluation, training, and result reproduction.
- Point examples at lerobot/smolvla_vlabench.
- Add a benchmark banner image.
- Remove outdated / misleading references to upstream evaluation tracks.
- Document manual install flow instead of a broken vlabench extra.

Packaging cleanup
- Remove the unresolvable vlabench extra from pyproject.toml.
- Remove the no-op VLABench processor step.
- Remove the obsolete env unit test that only covered the dropped gripper remap helper.
- Apply formatting / logging / style cleanup from review feedback.

Why this is needed
VLABench is not currently consumable as a normal Python dependency and requires several upstream workarounds:
- no PyPI release,
- missing package declarations,
- undeclared runtime deps,
- SSH-only submodule references,
- asset downloads outside normal package install flow,
- registry population that depends on import side effects,
- env outputs that do not always match declared observation shapes,
- task resets that can diverge under some random layouts.
This PR makes the benchmark usable in LeRobot despite those constraints, and ensures CI runs are reproducible and informative.
Squashed summary:
feat(sim): integrate VLABench benchmark with CI, Docker, and docs
Add VLABench as a new LeRobot simulation benchmark, following the existing LIBERO / MetaWorld patterns. This includes:
- LeRobot env integration and task exposure,
- CI smoke eval with lerobot/smolvla_vlabench,
- Docker install and asset-download fixes,
- runtime fixes for registry loading, assets, camera obs, action handling, dm_control IK, and PhysicsError recovery,
- alignment of obs/action semantics with HF VLABench datasets,
- docs and packaging cleanup.
The PR also incorporates review feedback, improves reproducibility by pinning upstream commits, and makes VLABench usable in CI despite upstream packaging and asset-management issues.
2026-04-21 17:54:11 +02:00
Pepijn a07f22e22c feat(envs): add LIBERO-plus robustness benchmark (#3313)
* feat(envs): add LIBERO-plus robustness benchmark integration

- LiberoPlusEnv config (subclass of LiberoEnv, same gym interface)
- Docker image installing LIBERO-plus fork via PYTHONPATH
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_libero_plus
- pyproject.toml: libero_plus extra

* fix(libero): use suite's perturbation-aware init_states loader

LIBERO-plus's Benchmark class exposes a `get_task_init_states(i)` method that
strips perturbation suffixes (`_table_N`, `_tb_N`, `_view_`, `_language_`,
`_light_`, `_add_`, `_level`) and loads the underlying base `.pruned_init`
file — the on-disk name for a perturbation variant doesn't exist as a file,
only the base does. lerobot's loader was bypassing that logic and trying to
read the suffix-bearing filename directly, which failed for every non-zero
task id and killed the eval before any rollout video could be written.

Delegate to the suite's method when it exists; fall back to the path-based
loader for vanilla LIBERO (which does not provide the method).

Also drop the hf-libero install + init_files copy from the LIBERO-plus
Dockerfile — the LIBERO-plus clone already ships both `bddl_files/` and
`init_files/` for all five suites, so the copy was unnecessary and the
`cp -r` into an existing dir produced a confusing nested layout.

* fix(libero): resolve LIBERO-plus perturbation init_states path ourselves

Delegating to `task_suite.get_task_init_states(i)` works for path resolution
but LIBERO-plus's method calls `torch.load(path)` without `weights_only=False`,
which fails on PyTorch 2.6+ because the pickled init_states contains numpy
objects not in the default allowlist:

    _pickle.UnpicklingError: Weights only load failed.
    WeightsUnpickler error: Unsupported global:
      GLOBAL numpy.core.multiarray._reconstruct was not an allowed global.

Mirror LIBERO-plus's suffix-stripping logic (`_table_N`, `_tb_N`, `_view_`,
`_language_`, `_light_`, `_add_`, `_level`) in our own helper so we can pass
`weights_only=False` ourselves. Vanilla LIBERO task names don't contain any
of these patterns except for `_table_` when followed by the word `center`
(e.g. `pick_up_the_black_bowl_from_table_center_...`), and the regex
requires `_table_\d+` so semantic uses are preserved.
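
A minimal sketch of that helper, assuming a path layout and regex shape that may
differ from the actual code:

```python
# Sketch only — mirrors the suffix-stripping + weights_only=False load described above.
import re
from pathlib import Path

import torch

# _table_/_tb_ must be followed by digits so names like "..._from_table_center_..." survive.
_PERTURB_SUFFIX = re.compile(r"(_table_\d+|_tb_\d+|_view_.*|_language_.*|_light_.*|_add_.*|_level.*)$")

def load_init_states(init_dir: Path, task_name: str):
    base_name = _PERTURB_SUFFIX.sub("", task_name)
    # The pickled init_states contain numpy objects outside torch 2.6+'s default
    # allowlist, hence weights_only=False.
    return torch.load(init_dir / f"{base_name}.pruned_init", weights_only=False)
```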

* fix(libero-plus): download perturbation assets from Sylvest/LIBERO-plus

LIBERO-plus's bddl_base_domain.py resolves scene XMLs with
`os.path.join(DIR_PATH, "../assets")`, so the `assets` key in config.yaml
has no effect on scene lookup — MuJoCo always opens
`<clone>/libero/libero/assets/scenes/...`. With no such directory present,
every perturbation task fails on:

    FileNotFoundError: No such file or directory:
      .../libero-plus/libero/libero/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml

These textures, views, and extra objects ship only in the 6.4 GB `assets.zip`
published at `Sylvest/LIBERO-plus` (the LIBERO-plus README explicitly says
to download and unzip it into the package dir). Fetch it via `hf_hub_download`,
unzip into `${LIBERO_PLUS_ROOT}/`, install `unzip`, and point config.yaml at
the extracted dir so everything stays consistent. The download lives in its
own Docker layer so subsequent rebuilds reuse the cached assets.

Drops the lerobot/libero-assets snapshot_download — that mirror only has
vanilla LIBERO textures and is ignored for scene loading anyway.

* fix(libero-plus): flatten deep path prefix from Sylvest/LIBERO-plus assets.zip

The 6.4 GB zip ships with every entry prefixed by
`inspire/hdd/project/embodied-multimodality/public/syfei/libero_new/release/dataset/LIBERO-plus-0/assets/...`
(the author's internal filesystem layout, not the layout the LIBERO-plus
README promises), so the previous `unzip -d ${LIBERO_PLUS_ROOT}/` created
`${LIBERO_PLUS_ROOT}/inspire/.../assets/` — robosuite still opened
`${LIBERO_PLUS_ROOT}/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml`
and hit the same FileNotFoundError.

Extract to a scratch dir, then `mv` the nested `assets/` subtree to the
expected location. Verified the target file exists in the zip central
directory under that exact prefix.
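
A minimal Python sketch of the download-then-flatten step — the Dockerfile mixes
`hf_hub_download` with shell `unzip`/`mv`, and the `repo_type` and scratch-dir
handling here are assumptions:

```python
# Sketch only — not the actual Dockerfile commands.
import shutil
from pathlib import Path
from zipfile import ZipFile

from huggingface_hub import hf_hub_download

def fetch_libero_plus_assets(libero_plus_root: Path) -> None:
    zip_path = hf_hub_download("Sylvest/LIBERO-plus", "assets.zip", repo_type="dataset")
    scratch = libero_plus_root / "_assets_scratch"
    with ZipFile(zip_path) as zf:
        zf.extractall(scratch)  # entries carry the author's deep internal prefix
    nested_assets = next(scratch.rglob("assets"))  # locate the nested assets/ subtree
    shutil.move(str(nested_assets), str(libero_plus_root / "assets"))
    shutil.rmtree(scratch)
```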

* refactor(libero): inline init_states resolver behind single regex

Collapse the three-style suffix stripper (split/re.sub/in) into one
compiled regex, drop the (Path, bool) tuple return, and move the
`_add_`/`_level` reshape branch into the caller so each branch loads
its own file and returns directly. Net: -11 lines, one fewer helper.

* refactor(libero-plus): rebase docker image on huggingface/lerobot-gpu

Mirror the libero/metaworld/robomme pattern: start from the nightly GPU
image (apt deps, python, uv, venv, lerobot[all] already there) and only
layer on what LIBERO-plus uniquely needs — its wand/ImageMagick build
deps, the non-extra runtime pips (robosuite==1.4.1, bddl, …), the
PYTHONPATH-shadowed fork, and the 6.4 GB assets.zip.

Drops ~50 lines of duplicated base setup (CUDA FROM, apt python, uv
install, user creation, venv init) the nightly already provides.
123 → 73 lines.

Also:
- Add libero_plus to docs/source/_toctree.yml under Benchmarks so
  doc-builder's TOC integrity check stops failing.
- Repoint the docs dataset link from pepijn223/libero_plus_lerobot to
  the canonical lerobot/libero_plus.
- Revert the stray uv.lock churn (revision/marker diff that crept in
  from an unrelated resolve — unrelated to LIBERO-plus).

* fix(libero-plus): stop touching pyproject + uv.lock

The fast-tests job was rejecting the branch because pyproject.toml had a
[libero_plus] extra whose git dep wasn't represented in uv.lock.

The Docker image no longer needs the extra — it clones LIBERO-plus
directly and PYTHONPATH-shadows hf-libero. Drop [libero_plus] from
pyproject and restore pyproject.toml + uv.lock to exactly what's on
origin/main, so `uv sync --locked --extra test` is a no-op for this PR.

Also repoint the doc/CI/env comments that still mentioned the extra at
the Docker install path.

* fix(libero-plus): strip perturbation metadata from task descriptions

LIBERO-plus builds task.language by space-joining the perturbation-variant
filename, so every non-_language_ variant inherits a trailing blob like
"view 0 0 100 0 0 initstate 0 noise 45" or "add 16". That shows up in the
dashboard video labels and no longer matches the base instruction stored
in the training dataset.

Strip those tokens in extract_task_descriptions.py with an end-anchored
regex over the {view,initstate,noise,add,tb,table,light,level}(+digits)
vocabulary. The anchor preserves mid-sentence literal uses of those words
(e.g. "from table center and place it on the plate") — only the trailing
metadata chain is removed. _language_ variants carry real BDDL-sourced
text and are left untouched.
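
A minimal sketch of the end-anchored strip, with the exact token grammar assumed:

```python
# Sketch only — the real regex lives in scripts/ci/extract_task_descriptions.py.
import re

# Strip only a *trailing* chain of perturbation tokens (each optionally followed by
# digits), so mid-sentence uses like "from table center and place it on the plate" survive.
_TRAILING_META = re.compile(
    r"(?:\s+(?:view|initstate|noise|add|tb|table|light|level)(?:\s+\d+)*)+\s*$"
)

def strip_perturbation_metadata(description: str) -> str:
    return _TRAILING_META.sub("", description).strip()

# "... on the plate view 0 0 100 0 0 initstate 0 noise 45" -> "... on the plate"
# "... from table center and place it on the plate"        -> unchanged
```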

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: integrate PR #3313 review feedback

- docs: fix paper link to arxiv, add benchmark image, add suite descriptions,
  add LIBERO-plus replacement warning, restructure eval section to match
  LIBERO doc style, fix policy I/O section, remove false try/except claim
- docker: fix shell grouping for hf-libero uninstall, replace hardcoded
  asset path with dynamic find
- ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step
- envs: add is_libero_plus param to get_task_init_states so vanilla LIBERO
  always takes the simple path

* fix(docs): use correct LIBERO-plus teaser image URL

* ci(libero-plus): drop redundant hf auth login step

The standalone login step ran `hf auth login` in a throwaway
`docker run --rm` container, so no credentials persisted. Auth is
already performed inside the eval step's container. Removing the
redundant step per PR #3313 review feedback.

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch. Without these attributes eval crashes
when calling `env.unwrapped.metadata["render_fps"]` with async vector
envs. Adds `metadata` / `unwrapped` to `_LazyAsyncVectorEnv` and
caches the metadata alongside obs/action spaces in the LIBERO and
MetaWorld factories.
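
A minimal sketch of the attribute plumbing — constructor arguments and caching are
assumed, not copied from the real class:

```python
# Sketch only — the real _LazyAsyncVectorEnv and factory caching differ in detail.
class _LazyAsyncVectorEnv:
    def __init__(self, env_fns, cached_metadata, cached_obs_space, cached_action_space):
        self._env_fns = env_fns
        self._env = None                       # real AsyncVectorEnv is created on first use
        self.metadata = cached_metadata        # lets env.unwrapped.metadata["render_fps"] work
        self.observation_space = cached_obs_space
        self.action_space = cached_action_space

    @property
    def unwrapped(self):
        # Exposing self keeps .metadata reachable without spinning up async workers.
        return self
```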

* ci: gate Docker Hub login on secret availability

Fork PRs cannot access `secrets.DOCKERHUB_LEROBOT_{USERNAME,PASSWORD}`,
which made every benchmark job fail at the login step before any of
the actual build/eval work could run. Gate the login on the env-var
expansion of the username so the step is skipped (not failed) when
secrets are absent. Mirrors the existing pattern in the VLABench job.

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update scripts/ci/extract_task_descriptions.py

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update docker/Dockerfile.benchmark.libero_plus

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(libero-plus): address review feedback

* ci(libero-plus): fix YAML indentation in upload-artifact steps

The `uses:` key on two upload-artifact steps was at column 0 instead
of nested under the step, causing `pre-commit run check-yaml` to fail
with "expected <block end>, but found '<block mapping start>'".


Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-20 21:07:21 +02:00
Pepijn 282c31cfef feat(envs): add RoboMME benchmark (#3311)
* feat(envs): add RoboMME benchmark integration

- RoboMME env wrapper with image/wrist_image/state observations
- Docker image with Vulkan, SAPIEN, mani-skill deps
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_robomme
- preprocess_observation: handle image/wrist_image/state keys
- pyproject.toml: robomme extra

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(docker): rebase RoboMME image on huggingface/lerobot-gpu

Mirror the libero/metaworld pattern: start from the nightly GPU image
(which already has apt deps, uv, venv, and lerobot[all] preinstalled)
and only layer on what RoboMME uniquely needs — the Vulkan libs
ManiSkill/SAPIEN requires, plus the robomme extra with the
gymnasium/numpy overrides.

Drops 48 lines of duplicated base setup (CUDA FROM, python install,
user creation, venv init, base apt deps) that the nightly image already
provides. Net: 102 → 54 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(robomme): drop prototype-branch note and move dataset to lerobot/robomme

- Remove the "Related work" block referencing the prototype branch
  feat/robomme-integration; the PR stands on its own.
- Point all dataset references at lerobot/robomme (docs, env module
  docstring, RoboMMEEnvConfig docstring) — this is the canonical HF
  location once the dataset is mirrored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(robomme): make docs build + fast tests green

1. Docs: add robomme to _toctree.yml under Benchmarks so doc-builder's
   TOC integrity check stops rejecting the new page.

2. Fast tests: robomme's mani-skill transitively pins numpy<2 which is
   unsatisfiable against the project's numpy>=2 base pin, so `uv sync`
   couldn't resolve a universal lockfile.

   Drop robomme as a pyproject extra entirely — it truly cannot coexist
   with the rest of the dep tree. The Dockerfile installs robomme
   directly from its git URL via `uv pip install --override`, which was
   already the runtime path. pyproject, docs, env docstrings, and the
   CI job comment all now point to the docker-only install.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(robomme): realign unit tests with current env API

The tests were written against an earlier env layout and never updated when
the wrapper was refactored, so CI's fast-test job was failing with:

- KeyError: 'front_rgb' / 'wrist_rgb' — these were renamed to the
  lerobot-canonical 'image' / 'wrist_image' keys (matching the dataset
  columns and preprocess_observation's built-in fallbacks).
- AssertionError: 'robomme' not in result — create_robomme_envs now
  returns {task_name: {task_id: env}}, not {'robomme': {...}}, so
  comma-separated task lists work.
- ModuleNotFoundError: lerobot.envs.lazy_vec_env — LazyVectorEnv was
  removed; create_robomme_envs is straightforward synchronous now.

Rewrite the 7 failing cases against the current API, drop the three
LazyVectorEnv tests, and add a multi-task test so the new comma-separated
task parsing is covered. Stub install/teardown is moved into helpers
(`_install_robomme_stub` / `_uninstall_robomme_stub`) so individual tests
stop repeating six boilerplate lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: integrate PR #3311 review feedback

- envs: rename obs keys to pixels/image, pixels/wrist_image, agent_pos
- envs: add __post_init__ for dynamic action_dim in RoboMMEEnv config
- envs: remove special-case obs conversion in utils.py (no longer needed)
- ci: add Docker Hub login, HF_USER_TOKEN guard, --env.task_ids=[0]
- scripts: extract_task_descriptions supports multiple task_ids
- docs: title to # RoboMME, add image, restructure eval section
- tests: update all key assertions to match new obs naming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(docs): use correct RoboMME teaser image URL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(robomme): smoke-eval 10 tasks instead of 5

Broader coverage on the RoboMME benchmark CI job: bump the smoke eval
from 5 tasks to 10 (one episode each), all drawn from ROBOMME_TASKS.

Tasks now run: PickXtimes, BinFill, StopCube, MoveCube, InsertPeg,
SwingXtimes, VideoUnmask, ButtonUnmask, PickHighlight, PatternLock.

Updated the parse_eval_metrics.py `--task` label from the single
`PickXtimes` stub to the full comma list so the metrics artifact
reflects what was actually run. `parse_eval_metrics.py` already reads
`overall` for multi-task runs, so no parser change is needed.

Made-with: Cursor

* fix(robomme): nest `pixels` as a dict so preprocess_observation picks it up

`_convert_obs` was returning flat keys (`pixels/image`,
`pixels/wrist_image`). `preprocess_observation()` in envs/utils.py
keys off the top-level `"pixels"` entry and, not finding it,
silently dropped every image from the batch. The policy then saw
zero image features and raised

    ValueError: All image features are missing from the batch.

Match the LIBERO layout: return
`{"pixels": {"image": ..., "wrist_image": ...}, "agent_pos": ...}`
and declare the same shape in `observation_space`.

Made-with: Cursor

* fix(robomme): align docs and tests with nested pixels obs layout

Addresses PR #3311 review feedback:
- Docs: correct observation keys to `pixels/image` / `pixels/wrist_image`
  (mapped to `observation.images.image` / `observation.images.wrist_image`)
  and drop the now-obsolete column-rename snippet.
- Tests: assert `result["pixels"]["image"]` instead of flat `pixels/image`,
  matching the nested layout required by `preprocess_observation()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: gate Docker Hub login on secret availability

Fork PRs cannot access `secrets.DOCKERHUB_LEROBOT_{USERNAME,PASSWORD}`,
which made every benchmark job fail at the login step. Gate the login
on the env-var expansion of the username so the step is skipped (not
failed) when secrets are absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robomme): address review feedback

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 20:21:27 +02:00
Pepijn a147fa4439 feat(envs): add RoboCerebra long-horizon manipulation benchmark (#3314)
* feat(ci): add RoboCerebra benchmark eval job

- Docker image with robosuite/libero deps for RoboCerebra eval
- CI workflow: 1-episode eval with pepijn223/smolvla_robocerebra
- Reuses libero env with rename_map + empty_cameras=3

* docs(robocerebra): add benchmark page and toctree entry

Add a dedicated docs page for RoboCerebra that points at the canonical
dataset lerobot/robocerebra_unified and shows how to run eval + fine-tune
against it. Wire it into the Benchmarks section of the toctree so
doc-builder picks it up.

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

* fix(robocerebra): drop alias extra + simplify docker image

Two problems rolled up:

1. `uv sync --locked --extra test` was failing because pyproject.toml added
   a `robocerebra = ["lerobot[libero]"]` alias extra but uv.lock wasn't
   regenerated. Drop the alias — nothing in CI actually needs the extra
   name (the Dockerfile just installs what it needs directly), so this
   restores pyproject.toml and uv.lock to byte-exact origin/main.

2. Rebase docker/Dockerfile.benchmark.robocerebra on
   huggingface/lerobot-gpu:latest (same pattern as libero/metaworld/…).
   The nightly image already ships lerobot[all] which includes [libero],
   so the RoboCerebra image is essentially identical to the LIBERO one:
   fetch libero-assets, write ~/.libero/config.yaml, overlay source.
   92 → 43 lines.

Also repoint the CI workflow comment that referenced the removed extra.

* ci: use dedicated lerobot/smolvla_robocerebra checkpoint for smoke eval

Replace the generic pepijn223/smolvla_libero placeholder with the
purpose-trained lerobot/smolvla_robocerebra model in the RoboCerebra
CI smoke test.

* fix(ci): align RoboCerebra eval with training pipeline

Fixes 5 mismatches that caused 0% success rate:
- env.type: robocerebra (unregistered) → libero
- resolution: 360x360 (default) → 256x256 (matches dataset)
- camera_name_mapping: map eye_in_hand → wrist_image (not image2)
- empty_cameras: 3 → 1 (matches training)
- add HF_USER_TOKEN guard on eval step

* fix(ci): set env.fps=20 and explicit obs_type for RoboCerebra eval

Match the dataset's 20 FPS (LiberoEnv defaults to 30) and make
obs_type=pixels_agent_pos explicit for safety against future changes.

* docs(robocerebra): align page with adding_benchmarks template

Rework docs/source/robocerebra.mdx to follow the standard benchmark
doc structure: intro + links + available tasks + installation + eval
+ recommended episodes + policy I/O + training + reproducing results.

- Point everything at lerobot/smolvla_robocerebra (the released
  checkpoint), not the personal pepijn223 mirror.
- Add the --env.fps=20 and --env.obs_type=pixels_agent_pos flags
  that CI actually uses, so copy-paste eval reproduces CI.
- Split the "Training" block out of the recipe section into its own
  section with the feature table.
- Add an explicit "Reproducing published results" section pointing
  at the CI smoke eval.

* fix: integrate PR #3314 review feedback

- ci(robocerebra): drop redundant hf auth login step (auth is
  already performed inside the eval step's container).
- ci(robocerebra): add Docker Hub login before the image build
  to pick up the authenticated rate limit.
- docs(robocerebra): align eval snippet with the CI command
  (observation size, camera_name_mapping, use_async_envs, device,
  empty_cameras=1).

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch.

* ci: gate Docker Hub login on secret availability

* Update .github/workflows/benchmark_tests.yml

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update .github/workflows/benchmark_tests.yml

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-20 19:12:15 +02:00
Pepijn 0f1c9b0851 feat(envs): add RoboTwin 2.0 benchmark (#3315)
* feat(envs): add RoboTwin 2.0 benchmark integration

- RoboTwinEnvConfig with 4-camera setup (head/front/left_wrist/right_wrist)
- Docker image with SAPIEN, mplib, CuRobo, pytorch3d (Python 3.12)
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_robotwin
- RoboTwinProcessorStep for state float32 casting
- Camera rename_map: head_camera/front_camera/left_wrist -> camera1/2/3

* fix(robotwin): re-enable autograd for CuRobo planner warmup and take_action

lerobot_eval wraps the full rollout in torch.no_grad() (lerobot_eval.py:566),
but RoboTwin's setup_demo → load_robot → CuroboPlanner(...) runs
motion_gen.warmup(), which invokes Newton's-method trajectory optimization.
That optimizer calls cost.backward() internally, which raises

    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

when autograd is disabled. take_action() hits the same planner path at every
step. Wrap both setup_demo and take_action in torch.enable_grad() so CuRobo's
optimizer can build its computation graph. Policy inference is unaffected —
rollout()'s inner torch.inference_mode() block around select_action() is
untouched, so we still don't allocate grad buffers during policy forward.
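
A minimal sketch of the wrapping — function names are simplified stand-ins for the
wrapper's reset/step paths:

```python
# Sketch only — the real RoboTwin wrapper wires this through its own methods.
import torch

def setup_with_grad(task, **setup_kwargs):
    # CuRobo's warmup runs Newton's-method trajectory optimization, which calls
    # cost.backward() internally — re-enable autograd under the eval's no_grad() scope.
    with torch.enable_grad():
        task.setup_demo(**setup_kwargs)

def step_with_grad(task, action):
    with torch.enable_grad():
        task.take_action(action)
```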

* fix(robotwin): read nested get_obs() output and use aloha-agilex camera names

RoboTwin's base_task.get_obs() returns a nested dict:

    {"observation": {cam: {"rgb": ..., "intrinsic_matrix": ...}},
     "joint_action": {"left_arm": ..., "left_gripper": ...,
                      "right_arm": ..., "right_gripper": ...,
                      "vector": np.ndarray},
     "endpose": {...}}

Our _get_obs was reading raw["{cam}_rgb"] / raw["{cam}"] and raw["joint_action"]
as if they were flat, so np.asarray(raw["joint_action"], dtype=float64) tripped
on a dict and raised

    TypeError: float() argument must be a string or a real number, not 'dict'

Fix (see the sketch after this list):
- Pull images from raw["observation"][cam]["rgb"]
- Pull joint state from raw["joint_action"]["vector"] (the flat array)
- Update the default camera tuple to (head_camera, left_camera, right_camera)
  to match RoboTwin's actual wrist-camera names (envs/camera/camera.py:135-151)
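
A minimal sketch of the unpacking — key names follow the schema quoted above, while
the missing-camera handling and output keys are simplified assumptions:

```python
# Sketch only — the real _get_obs also handles cameras the task doesn't expose.
import numpy as np

def _get_obs(raw: dict, camera_names: tuple[str, ...]) -> dict:
    images = {
        cam: np.asarray(raw["observation"][cam]["rgb"], dtype=np.uint8)
        for cam in camera_names
    }
    # "vector" is the already-flattened joint array; the per-arm/gripper entries are nested dicts.
    agent_pos = np.asarray(raw["joint_action"]["vector"], dtype=np.float32).ravel()
    return {"pixels": images, "agent_pos": agent_pos}
```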

* refactor(robotwin): drop defensive dict guards, cache black fallback frame

_get_obs was guarding every dict access with isinstance(..., dict) in case
RoboTwin's get_obs returned something else — but the API contract
(envs/_base_task.py:437) always returns a dict, so the guards were silently
masking real failures behind plausible-looking zero observations. Drop them.

Also:
- Cache a single black fallback frame in __init__ instead of allocating
  a fresh np.zeros((H, W, 3), uint8) for every missing camera on every
  step — the "camera not exposed" set is static per env.
- Only allocate the zero joint_state on the fallback path (not unconditionally
  before the real value overwrites it).
- Replace .flatten() with .ravel() (no copy when already 1-D).
- Fold the nested-dict schema comment and two identical torch.enable_grad()
  rationales into a single Autograd section in the class docstring.
- Fix stale `left_wrist` camera name in the observation docstring.

* fix(robotwin): align observation_space dims with D435 camera output

lerobot_eval crashed in gym.vector's SyncVectorEnv.reset with:

    ValueError: Output array is the wrong shape

because RoboTwinEnvConfig declared observation_space = (480, 640, 3) but
task_config/demo_clean.yml specifies head_camera_type=D435, which renders
(240, 320, 3). gym.vector.concatenate pre-allocates a buffer from the
declared space, so the first np.stack raises on shape mismatch.

Changes:
- Config defaults now 240×320 (the D435 dims in _camera_config.yml), with
  a comment pointing at the source of truth.
- RoboTwinEnv.__init__ accepts observation_height/width as Optional and
  falls back to setup_kwargs["head_camera_h/w"] so the env is self-consistent
  even if the config is not in sync.
- Config camera_names / features_map use the actual aloha-agilex camera
  names (head_camera, left_camera, right_camera). Drops the stale
  "front_camera" and "left_wrist"/"right_wrist" entries that never matched
  anything RoboTwin exposes.
- CI workflow's rename_map updated to match the new camera names.

* fix(robotwin): expose _max_episode_steps for lerobot_eval.rollout

rollout() does `env.call("_max_episode_steps")` (lerobot_eval.py:157) to
know when to stop stepping. LiberoEnv and MetaworldEnv set this attribute;
RoboTwinEnv was tracking the limit under `episode_length` only, so the call
raised AttributeError once CuRobo finished warming up.

* fix(robotwin): install av-dep so lerobot_eval can write rollout MP4s

write_video (utils/io_utils.py:53) lazily imports PyAV via require_package
and raises silently inside the video-writing thread when the extra is not
installed — so the eval itself succeeds with pc_success=100 but no MP4
ever lands in videos/, and the artifact upload reports "No files were
found". Add av-dep to the install line (same pattern as the RoboMME image).

* feat(robotwin): eval 5 diverse tasks per CI run with NL descriptions

Widen the smoke eval from a single task (beat_block_hammer) to five:
click_bell, handover_block, open_laptop, stack_blocks_two on top of the
original. Each gets its own rollout video in videos/<task>_0/ so the
dashboard can surface visually distinct behaviours.

extract_task_descriptions.py now has a RoboTwin branch that reads
`description/task_instruction/<task>.json` (already shipped in the clone
at /opt/robotwin) and pulls the `full_description` field. CI cds into
the clone before invoking the script so the relative path resolves.
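
A minimal sketch of that branch — the JSON layout beyond `full_description` is
assumed:

```python
# Sketch only — the real branch lives in scripts/ci/extract_task_descriptions.py and
# runs from inside the RoboTwin clone so the relative path resolves.
import json
from pathlib import Path

def robotwin_description(task: str, clone_root: Path = Path(".")) -> str:
    path = clone_root / "description" / "task_instruction" / f"{task}.json"
    return json.loads(path.read_text())["full_description"]
```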

parse_eval_metrics.py is invoked with the same 5-task list so the
metrics.json embeds one entry per task.

* ci: point benchmark eval checkpoints at the lerobot/ org mirrors

pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.

* refactor(robotwin): rebase docker image on huggingface/lerobot-gpu

Mirror the libero/metaworld/libero_plus/robomme pattern: start from the
nightly GPU image (apt deps, python, uv, venv, lerobot[all] already
there) and layer on only what RoboTwin 2.0 uniquely needs —
cuda-nvcc + cuda-cudart-dev (CuRobo builds from source), Vulkan libs +
NVIDIA ICD (SAPIEN renderer), sapien/mplib/open3d/pytorch3d/curobo
installs, the mplib + sapien upstream patches, and the TianxingChen
asset download.

Drops ~90 lines of duplicated base setup (CUDA FROM, apt python, uv
install, user creation, venv init, base lerobot install). 199 → 110.

Also repoint the docs + env docstring dataset link from
hxma/RoboTwin-LeRobot-v3.0 to the canonical lerobot/robotwin_unified.

* docs(robotwin): add robotwin to _toctree.yml under Benchmarks

doc-builder's TOC integrity check was rejecting the branch because
docs/source/robotwin.mdx existed but wasn't listed in _toctree.yml.


* fix(robotwin): defer YAML lookup and realign tests with current API

__init__ was eagerly calling _load_robotwin_setup_kwargs just to read
head_camera_h/w from the YAML. That import (`from envs import CONFIGS_PATH`)
required a real RoboTwin install, so constructing the env — and thus every
test in tests/envs/test_robotwin.py — blew up with ModuleNotFoundError
on fast-tests where RoboTwin isn't installed.

Replace the eager lookup with DEFAULT_CAMERA_H/W constants (240×320, the
D435 dims baked into task_config/demo_clean.yml). reset() still resolves
the full setup_kwargs lazily — that's fine because reset() is only
called inside the benchmark Docker image where RoboTwin is present.

Also resync the test file with the current env API:
  - mock get_obs() as the real nested {"observation": {cam: {"rgb": …}},
    "joint_action": {"vector": …}} shape
  - patch both _load_robotwin_task and _load_robotwin_setup_kwargs
    (_patch_load → _patch_runtime)
  - drop `front_camera` / `left_wrist` from assertions — aloha-agilex
    exposes head_camera + left_camera + right_camera, not those
  - black-frame test now uses left_camera as the missing camera
  - setup_demo call check loosened to the caller-provided seed/is_test
    bits (full kwargs include the YAML-derived blob)

* fix: integrate PR #3315 review feedback

- ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step
- docker: tie patches to pinned versions with removal guidance, remove
  unnecessary HF_TOKEN for public dataset, fix hadolint warnings
- docs: fix paper link to arxiv, add teaser image, fix camera names
  (4→3 cameras), fix observation dims (480x640→240x320)


* fix(docs): correct RoboTwin 2.0 paper arxiv link


* fix(docs): use correct RoboTwin 2.0 teaser image URL


* fix(docs): use plain markdown image to fix MDX build

* ci(robotwin): smoke-eval 10 tasks instead of 5

Broader coverage on the RoboTwin 2.0 benchmark CI job: bump the smoke
eval from 5 tasks to 10 (one episode each). Added tasks are all drawn
from ROBOTWIN_TASKS and mirror the shape/complexity of the existing
set (simple single-object or single-fixture manipulations).

Tasks now run: beat_block_hammer, click_bell, handover_block,
open_laptop, stack_blocks_two, click_alarmclock, close_laptop,
close_microwave, open_microwave, place_block.

`parse_eval_metrics.py` reads `overall` for multi-task runs so no
parser change is needed. Bumped the step name and the metrics label
to reflect the 10-task layout.


* fix(ci): swap 4 broken RoboTwin tasks in smoke eval

The smoke eval hit two upstream issues:
- `open_laptop`: bug in OpenMOSS/RoboTwin main — `check_success()` uses
  `self.arm_tag`, but that attribute is only set inside `play_once()`
  (the scripted-expert path). During eval `take_action()` calls
  `check_success()` directly, hitting `AttributeError: 'open_laptop'
  object has no attribute 'arm_tag'`.
- `close_laptop`, `close_microwave`, `place_block`: not present in
  upstream RoboTwin `envs/` at all — our ROBOTWIN_TASKS tuple drifted
  from upstream and these names leaked into CI.

Replace the four broken tasks with upstream-confirmed equivalents
that exist both in ROBOTWIN_TASKS and in RoboTwin's `envs/`:
`adjust_bottle`, `lift_pot`, `stamp_seal`, `turn_switch`.

New 10-task smoke set: beat_block_hammer, click_bell, handover_block,
stack_blocks_two, click_alarmclock, open_microwave, adjust_bottle,
lift_pot, stamp_seal, turn_switch.


* fix(robotwin): sync ROBOTWIN_TASKS + doc with upstream (50 tasks)

The local ROBOTWIN_TASKS tuple drifted from upstream
RoboTwin-Platform/RoboTwin. Users passing names like `close_laptop`,
`close_microwave`, `dump_bin`, `place_block`, `pour_water`,
`fold_cloth`, etc. got past our validator (the names were in the
tuple) but then crashed inside robosuite with a confusing error,
because those tasks don't exist in upstream `envs/`.

- Replace ROBOTWIN_TASKS with a verbatim mirror of upstream's
  `envs/` directory: 50 tasks as of main (was 60 with many
  stale entries). Added a `gh api`-based one-liner comment so
  future bumps are mechanical.
- Update the `60 tasks` claims in robotwin.mdx and
  RoboTwinEnvConfig's docstring to `50`.
- Replace the stale example-task table in robotwin.mdx with ten
  upstream-confirmed examples, and flag `open_laptop` as
  temporarily broken (its `check_success()` uses `self.arm_tag`
  which is only set inside `play_once()`; eval-mode callers hit
  AttributeError).
- Rebuild the "Full benchmark" command with the actual 50-task
  list, omitting `open_laptop`.


* test(robotwin): lower task-count floor from 60 to 50

ROBOTWIN_TASKS was trimmed to 50 tasks (see comment in
`src/lerobot/envs/robotwin.py:48`), but the assertion still
required ≥60, causing CI failures. Align the test with the
current upstream task count.


* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch.

* ci: gate Docker Hub login on secret availability


* fix: integrate PR #3315 review feedback

- envs(robotwin): default `observation_height/width` in
  `create_robotwin_envs` to `DEFAULT_CAMERA_H/W` (240/320) so they
  match the D435 dims baked into `task_config/demo_clean.yml`.
- envs(robotwin): resolve `task_config/demo_clean.yml` via
  `CONFIGS_PATH` instead of a cwd-relative path; works regardless
  of where `lerobot-eval` is invoked.
- envs(robotwin): replace `print()` calls in `create_robotwin_envs`
  with `logger.info(...)` (module-level `logger = logging.getLogger(__name__)`).
- envs(robotwin): use `_LazyAsyncVectorEnv` for the async path so
  async workers start lazily (matches LIBERO / RoboCasa / VLABench).
- envs(robotwin): cast `agent_pos` space + joint-state output to
  float32 end-to-end (was mixed float64/float32).
- envs(configs): use the existing `_make_vec_env_cls(use_async,
  n_envs)` helper in `RoboTwinEnvConfig.create_envs`; drop the
  `get_env_processors` override so RoboTwin uses the identity
  processor inherited from `EnvConfig`.
- processor: delete `RoboTwinProcessorStep` — the float32 cast now
  happens in the wrapper itself, so the processor is redundant.
- tests: drop the `TestRoboTwinProcessorStep` suite; update the
  mock obs fixture to use float32 `joint_action.vector`.
- ci: hoist `ROBOTWIN_POLICY` and `ROBOTWIN_TASKS` to job-level
  env vars so the task list and policy aren't duplicated across
  eval / extract / parse steps.
- docker: pin RoboTwin + CuRobo upstream clones to commit SHAs
  (`RoboTwin@0aeea2d6`, `curobo@ca941586`) for reproducibility.
2026-04-20 17:46:39 +02:00
Pepijn e699e52388 feat(envs): add RoboCasa365 benchmark integration (#3375)
* feat(envs): add RoboCasa365 benchmark integration

Add RoboCasa365 (arXiv:2603.04356) as a new simulation benchmark with
365 everyday kitchen manipulation tasks across 2,500 diverse environments.

New files:
- src/lerobot/envs/robocasa.py: gym.Env wrapper with deferred env creation,
  flat 12D action / 16D state vectors, 3-camera support
- docs/source/robocasa.mdx: user-facing documentation
- docker/Dockerfile.benchmark.robocasa: CI benchmark image

Modified files:
- src/lerobot/envs/configs.py: RoboCasaEnv config (--env.type=robocasa)
- pyproject.toml: robocasa optional dependency group
- docs/source/_toctree.yml: sidebar entry
- .github/workflows/benchmark_tests.yml: integration test job

Refs: https://arxiv.org/abs/2603.04356, https://robocasa.ai
Related: huggingface/lerobot#321

* fix(docker): use uv pip to install robocasa in benchmark image

The huggingface/lerobot-gpu base image uses `uv` with a venv at
/lerobot/.venv — `pip` is not on PATH, so `pip install` fails with
"pip: not found". Switch to `uv pip install` which installs into the
existing venv.

Also drop the @v1.0.0 tag pin from the robocasa git URL since the
upstream repo may not have that tag; use default branch instead.

* fix(robocasa): editable install + switch to lerobot/smolvla_robocasa

- pip install from git omits data files like box_links_assets.json
  (not declared in package_data). Clone and install editable so the
  source tree is used at runtime.
- Download only tex + fixtures_lw asset types (smoke test doesn't need
  objaverse/aigen objects). Pipe 'y' to auto-accept download prompt.
- Switch CI policy from pepijn223/smolvla_robocasa to lerobot/smolvla_robocasa.

* fix(docker): re-install lerobot editably after COPY

The nightly huggingface/lerobot-gpu image predates the RoboCasaEnv
registration — so `lerobot-eval --env.type=robocasa` fails at argparse
with "invalid choice" even after COPY . . overlays the new source.
Force an editable reinstall so the venv picks up the current configs.py.


* fix(ci): add rename_map for robocasa eval (image* -> camera*)

Policy lerobot/smolvla_robocasa expects observation.images.camera1/2/3,
but RoboCasaEnv produces observation.images.image/image2/image3.

* fix(robocasa): override RoboCasaGymEnv default split (test -> all)

RoboCasaGymEnv defaults split="test", but create_env only accepts
{None, "all", "pretrain", "target"}, so the out-of-the-box default
crashes with ValueError. Always pass "all" when split is None.


* fix(docker): also download objs_lw (lightwheel objects) for robocasa

Kitchen tasks (e.g. CloseFridge) reference lightwheel object meshes
like Stool022/model.xml. fixtures_lw alone isn't enough — we also
need objs_lw. Still skipping objaverse/aigen to keep image size down.

Made-with: Cursor

* feat(robocasa): raw camera names + benchmark-group task shortcuts

Align the LeRobot env with RoboCasa's native conventions so policies
trained on the upstream datasets don't need a --rename_map at eval
time, and expose the standard task groups as first-class --env.task
values.

- Preserve raw RoboCasa camera names (e.g. robot0_agentview_left)
  as observation.images.<name> end-to-end. Drops camera_name_mapping
  and DEFAULT_CAMERA_NAME_MAPPING; features/features_map are now
  built dynamically from the parsed camera list.
- Accept benchmark-group names as --env.task: atomic_seen,
  composite_seen, composite_unseen, pretrain50/100/200/300. Expanded
  lazily via robocasa.utils.dataset_registry and auto-sets the
  split ("target" | "pretrain").
- Update CI smoke-eval rename_map to map raw cam names to the
  camera1/2/3 keys expected by lerobot/smolvla_robocasa.


* docs(robocasa): single-task smolvla train+eval recipe on pepijn223/robocasa_CloseFridge

- Rewrite observation section to use raw RoboCasa camera keys
  (observation.images.robot0_agentview_{left,right},
  observation.images.robot0_eye_in_hand).
- Add a "Training on a single task" section with a full smolvla
  training command on pepijn223/robocasa_CloseFridge, plus matching
  single-task eval command.
- Document benchmark-group task shortcuts (atomic_seen, composite_seen,
  composite_unseen, pretrain50/100/200/300) as valid --env.task values.


* fix(robocasa): restrict obj_registries to lightwheel by default

CloseFridge (and most kitchen tasks) crashed at reset with
`ValueError: Probabilities contain NaN` coming out of
`sample_kitchen_object_helper`. RoboCasa's upstream default
`obj_registries=("objaverse", "lightwheel")` normalizes per-registry
candidate counts as probabilities; when a sampled category has zero
mjcf paths in every configured registry (because the objaverse asset
pack isn't on disk — ~30GB, skipped by our Docker build), the 0/0
divide yields NaNs and `rng.choice` raises.

- Add `obj_registries: list[str] = ["lightwheel"]` to `RoboCasaEnv`
  config; thread it through `create_robocasa_envs`, `_make_env_fns`,
  and the gym.Env wrapper to the underlying `RoboCasaGymEnv` (which
  forwards to `create_env` → `robosuite.make` → kitchen env).
- Default matches what `download_kitchen_assets --type objs_lw`
  actually ships, so the env works out of the box without a 30GB
  objaverse download.
- Document the override (`--env.obj_registries='[objaverse,lightwheel]'`)
  for users who have downloaded the full asset set.


* fix(docker): also download tex_generative for robocasa benchmark

RoboCasa's lightwheel kitchen fixtures embed references to
`generative_textures/wall/tex*.png` directly in their MuJoCo XML, so
`MjModel.from_xml_string` errors out at reset time with
"No such file or directory" even when the env is constructed with
`generative_textures=None`. The generative textures live under a
separate asset registry key (`tex_generative`) in
`download_kitchen_assets`, distinct from the base `tex` pack we were
already fetching.

- Add `tex_generative` to the download list so the fixture XMLs
  resolve.
- Document the remaining omissions (objaverse/aigen, ~30GB) and how
  the runtime side pairs this with obj_registries=["lightwheel"] to
  avoid sampling from categories whose assets aren't on disk.

* ci(robocasa): smoke-eval 10 atomic tasks instead of 1

Broader coverage in the benchmark CI job: evaluate SmolVLA on ten
fixture-centric atomic RoboCasa tasks (one episode each) instead of
just CloseFridge. The tasks are all drawn from TARGET_TASKS.atomic_seen
and selected to avoid object-manipulation categories that would require
the objaverse/aigen asset packs (we only ship objs_lw in the Docker
image, paired with obj_registries=["lightwheel"] on the runtime side).

Tasks: CloseFridge, OpenCabinet, OpenDrawer, TurnOnMicrowave,
TurnOffStove, CloseToasterOvenDoor, SlideDishwasherRack,
TurnOnSinkFaucet, NavigateKitchen, TurnOnElectricKettle.

`scripts/ci/parse_eval_metrics.py` already handles multi-task output
via the `overall` key, so no parser changes needed. Bumped the metrics
artifact's task label to `atomic_smoke_10` to reflect the grouping.

* fix(pyproject): drop unresolvable robocasa extra

robocasa's upstream setup.py hardcodes `lerobot==0.3.3` in
install_requires. Exposing it as the `lerobot[robocasa]` extra made
uv's dep resolver cycle: `lerobot[robocasa]` -> robocasa -> lerobot
(a different version) -> unsolvable. This broke every `uv sync` — even
invocations with an unrelated extra like `--extra test` — because uv
validates the whole lockfile graph.

- Remove the `robocasa` extra from pyproject.toml. Installation
  instructions in docs/source/robocasa.mdx now walk users through the
  manual `git clone` + `pip install --no-deps` flow, which matches
  what the Docker image already does and sidesteps the cyclic dep
  entirely.
- Dockerfile: `uv pip install -e ~/robocasa --no-deps` so the
  shadowed lerobot==0.3.3 never lands in the image; install
  robocasa's actual runtime deps (numpy, numba, scipy, mujoco,
  tianshou, etc.) explicitly.

* docs(robocasa): align page with adding_benchmarks template

Rework docs/source/robocasa.mdx to follow the standard benchmark doc
structure: intro + links + available tasks (with family breakdown and
first-class benchmark-group shortcuts) + installation + eval +
recommended episodes + policy I/O + training + reproducing results.

- Fix the paper link (was pointing at a non-existent arxiv ID).
- Surface lerobot/smolvla_robocasa and pepijn223/robocasa_CloseFridge
  in the top-of-page links so they're findable without reading the
  training section.
- Add an explicit "Object registries" subsection explaining the
  `--env.obj_registries=[objaverse,lightwheel]` override path.
- Add an explicit "Reproducing published results" section pointing
  at the CI smoke eval.

* fix: integrate PR #3375 review feedback

- envs(robocasa): hoist the duplicated `_parse_camera_names` helper
  out of `libero.py` and `robocasa.py` into `envs/utils.py` as the
  public `parse_camera_names`; call sites updated.
- envs(robocasa): give each factory a distinct `episode_index`
  (`0..n_envs-1`) and derive a per-worker seed series in `reset()`
  so n_envs workers don't all roll the same scene under a shared
  outer seed.
- envs(robocasa): drop the unused `**kwargs` on `_make_env`; declare
  `visualization_height` / `visualization_width` on both the wrapper
  and the `RoboCasaEnv` config + propagate via `gym_kwargs`.
- envs(robocasa): emit `info["final_info"]` on termination (matching
  MetaWorld) so downstream vector-env auto-reset keeps the terminal
  task/success flags.
- docs(robocasa): add `--rename_map` (robot0_agentview_left/
  eye_in_hand/agentview_right → camera1/2/3) plus CI-parity flags to
  all three eval snippets.
- docker(robocasa): pin robocasa + robosuite git SHAs and the pip
  dep versions (pygame, Pillow, opencv-python, pyyaml, pynput, tqdm,
  termcolor, imageio, h5py, lxml, hidapi, gymnasium) for
  reproducible benchmark images.
- ci(robocasa): update the workflow comment — there is no
  `lerobot[robocasa]` extra; robocasa/robosuite are installed
  manually because upstream's `lerobot==0.3.3` pin shadows ours.

* docs(robocasa): add benchmark banner image

* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs

Port of #3416 onto this branch. Also threads the cached metadata
through the RoboCasa factory so async eval on `--env.type=robocasa`
keeps the same improvement.


* fix: integrate PR #3375 review feedback (round 2)

- envs(robocasa): when the caller passes `seed=None` to `reset()`,
  fall back to `self.episode_index` for the inner env seed so each
  worker still samples a distinct trajectory instead of all workers
  inheriting the same global RNG state.
- envs(robocasa): replace the two module-level `print()` calls in
  `create_robocasa_envs` with `logger.info(...)` via a module-level
  `logger = logging.getLogger(__name__)`.
- ci(robocasa): run `scripts/ci/extract_task_descriptions.py` after
  the eval so `metrics.json` carries per-task natural-language
  labels, matching LIBERO / MetaWorld / VLABench jobs. Added a
  `_robocasa_descriptions()` extractor that splits CamelCase task
  names into word-level labels keyed by `<task>_0` (sketched below).
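
A minimal sketch of that extractor — the regex and output casing are assumptions:

```python
# Sketch only — the real helper lives in scripts/ci/extract_task_descriptions.py.
import re

def _robocasa_descriptions(task_names: list[str]) -> dict[str, str]:
    # "TurnOnSinkFaucet" -> "turn on sink faucet", keyed as "<task>_0" like the other extractors.
    return {
        f"{name}_0": " ".join(re.findall(r"[A-Z][a-z0-9]*", name)).lower()
        for name in task_names
    }
```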
2026-04-20 17:10:53 +02:00
Haoming Song b2765b39b8 Cache lazy async env metadata for eval (#3416)
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-04-20 15:33:13 +02:00
Pepijn 777b808c70 ci: skip Docker Hub login step on fork PRs (#3417)
On fork PRs, `secrets.DOCKERHUB_LEROBOT_*` expand to empty strings,
which fails `docker/login-action@v3` with `Error: Username and
password required` before any of the actual build/eval work runs.

Gate the login step on the env-var expansion of the username so the
step is skipped (not failed) when secrets are absent. On the main
repo + maintainer-approved fork runs (`pull_request_review` path),
the secrets resolve normally, the step runs, and image pulls get
the authenticated Docker Hub rate limit.

Scope: only `benchmark_tests.yml`, the lone benchmark workflow that
triggers on `pull_request` from forks. `full_tests.yml` and
`latest_deps_tests.yml` run under `pull_request_review` / schedule /
workflow_dispatch, where secrets are already guaranteed.

Context: surfaced on #3416 where an external contributor's PR failed
at the login step before any test could run.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:14:35 +02:00
Defalt 5c43fa1cce fix(policies): replace deprecated torch.cuda.amp.autocast with torch.amp.autocast (#3167)
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-04-19 16:25:08 +02:00
k1000dai 3f16d98a9b episods→episodes (#3410)
Fixing typo
2026-04-19 12:58:06 +02:00
whats2000 52f508c51c fix(dataset): cleanup_interrupted_episode wipes image temp dirs (#3405) 2026-04-19 12:04:24 +02:00
Steven Palma a8b72d9615 feat(dataset): 2x faster dataloader via parallel decode, uint8 transport, and persistent workers (#3406)
* feat(dataset): 2x faster dataloader

* fix(dataset): streaming return uint8 decode

* fix(tests): adjust normalization step comparison

* fix(dataset): with threadexecutor + False default

* chore(dataset): make it a config

* fix(test): account for uint8 in training path testing
2026-04-19 00:08:22 +02:00
Steven Palma 760220d532 chore(dependencies): update uv.lock (#3365)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-04-18 22:32:05 +02:00
Shu Jiuhe a99943ca26 Improve loading performance in _absolute_to_relative_idx when remapping indices (#3279) 2026-04-18 19:28:50 +02:00
Cheng Yin a9821af61b fix(record): pass rename_map to make_policy in lerobot-record (#3240)
* fix(record): pass rename_map to make_policy in lerobot-record

Fixes #3181. The rename_map from dataset config was used for preprocessor
construction but not passed to make_policy(), causing feature mismatch
errors when camera key names differ between dataset and model config.

make_policy() already accepts a rename_map parameter and uses it to skip
visual feature consistency validation when remapping is active, but
lerobot_record.py was not passing it through.
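A rough sketch of the call-site change being described — everything except the `rename_map` keyword is an assumption, not the actual `lerobot_record.py` code:

```python
def make_record_policy(policy_cfg, ds_meta, rename_map, make_policy):
    # The fix amounts to forwarding rename_map (which make_policy already
    # accepts) instead of dropping it at the record call site.
    return make_policy(policy_cfg, ds_meta=ds_meta, rename_map=rename_map)
```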

* style: fix ruff format for ternary expression

---------

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-04-17 16:40:08 +02:00
Steven Palma d4a229444b fix(ci): not fail when skipped (#3399) 2026-04-17 12:02:38 +02:00
Steven Palma 098ebb4d72 feat(ci): send slack notification if latest dependency test is broken (#3398) 2026-04-17 11:28:24 +02:00
Maxime Ellerbach 9bc2df80bb chore(docs): adding a jupyter notebook that gives you ready-to-paste commands (#3395)
* chore(docs): adding an example quickstart jupyter notebook that gives you ready-to-paste commands

* some fixes in the commands

* uv lock

* Adding notebook to all

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* uv lock again

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-04-16 17:53:35 +02:00
Remy bd74f6733d chore: bump doc-builder SHA for PR upload workflow (#3386) 2026-04-15 12:15:24 +02:00
Steven Palma 6f4a96333e chore(docs): update contributing (#3387) 2026-04-15 11:02:37 +02:00
Steven Palma 9021d2d240 refactor(imports): enforce guard pattern (#3382)
* refactor(imports): enforce guard pattern

* fix(tests): skip reachy2 if not installed

* Address review feedback
2026-04-14 22:54:05 +02:00
Khalil Meftah 60e7d67cb8 fix: catch KeyboardInterrupt in safe_stop_image_writer to prevent corrupted frames (#3381) 2026-04-14 18:22:56 +02:00
163 changed files with 23006 additions and 1062 deletions
+4 -22
@@ -2,11 +2,6 @@
Short, imperative summary (e.g., "fix(robots): handle None in sensor parser"). See [CONTRIBUTING.md](../CONTRIBUTING.md) for PR conventions.
## Type / Scope
- **Type**: (Bug | Feature | Docs | Performance | Test | CI | Chore)
- **Scope**: (optional — name of module or package affected)
## Summary / Motivation
- One-paragraph description of what changes and why.
@@ -19,28 +14,14 @@ Short, imperative summary (e.g., "fix(robots): handle None in sensor parser"). S
## What changed
- Short, concrete bullets of the modifications (files/behaviour).
- Short, concrete bullets explaining the functional changes (how the behavior or output differs now).
- Short note if this introduces breaking changes and migration steps.
## How was this tested (or how to run locally)
- Tests added: list new tests or test files.
- Tests added: list new tests or test files. `pytest -q tests/ -k <keyword>`
- Manual checks / dataset runs performed.
- Instructions for the reviewer
Example:
- Ran the relevant tests:
```bash
pytest -q tests/ -k <keyword>
```
- Reproduce with a quick example or CLI (if applicable):
```bash
lerobot-train --some.option=true
```
- Instructions for the reviewer for reproducing with a quick example or CLI (if applicable)
## Checklist (required before merge)
@@ -48,6 +29,7 @@ Example:
- [ ] All tests pass locally (`pytest`)
- [ ] Documentation updated
- [ ] CI is green
- [ ] Community Review: I have reviewed another contributor's open PR and linked it here: # (insert PR number/link)
## Reviewer notes
+637 -4
@@ -83,10 +83,13 @@ jobs:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
# Build the benchmark-specific image. The Dockerfile separates dep-install
# from source-copy, so code-only changes skip the slow uv-sync layer
@@ -115,7 +118,7 @@ jobs:
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--policy.path=lerobot/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.batch_size=1 \
@@ -144,7 +147,7 @@ jobs:
--artifacts-dir /tmp/libero-artifacts \
--env libero \
--task libero_spatial \
--policy pepijn223/smolvla_libero
--policy lerobot/smolvla_libero
- name: Upload Libero rollout video
if: always()
@@ -238,10 +241,13 @@ jobs:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build MetaWorld benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
@@ -264,7 +270,7 @@ jobs:
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=pepijn223/smolvla_metaworld \
--policy.path=lerobot/smolvla_metaworld \
--env.type=metaworld \
--env.task=metaworld-push-v3 \
--eval.batch_size=1 \
@@ -293,7 +299,7 @@ jobs:
--artifacts-dir /tmp/metaworld-artifacts \
--env metaworld \
--task metaworld-push-v3 \
--policy pepijn223/smolvla_metaworld
--policy lerobot/smolvla_metaworld
- name: Upload MetaWorld rollout video
if: always()
@@ -310,3 +316,630 @@ jobs:
name: metaworld-metrics
path: /tmp/metaworld-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOTWIN 2.0 ──────────────────────────────────────────────────────────
# Isolated image: full RoboTwin 2.0 stack — SAPIEN, mplib, CuRobo,
# pytorch3d, + simulation assets (~4 GB).
# Build takes ~20 min on first run; subsequent runs hit the layer cache.
# Requires an NVIDIA GPU runner with CUDA 12.1 drivers.
robotwin-integration-test:
name: RoboTwin 2.0 — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
ROBOTWIN_POLICY: lerobot/smolvla_robotwin
ROBOTWIN_TASKS: beat_block_hammer,click_bell,handover_block,stack_blocks_two,click_alarmclock,open_microwave,adjust_bottle,lift_pot,stamp_seal,turn_switch
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
# Build the full-install image: SAPIEN, mplib, CuRobo, pytorch3d +
# simulation assets (~4 GB). Layer cache lives in the runner's local
# Docker daemon — reused across re-runs on the same machine.
- name: Build RoboTwin 2.0 benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robotwin
push: false
load: true
tags: lerobot-benchmark-robotwin:ci
cache-from: type=local,src=/tmp/.buildx-cache-robotwin
cache-to: type=local,dest=/tmp/.buildx-cache-robotwin,mode=max
- name: Run RoboTwin 2.0 smoke eval (10 tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
# Named container (no --rm) so we can docker cp artifacts out.
docker run --name robotwin-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e ROBOTWIN_POLICY="${ROBOTWIN_POLICY}" \
-e ROBOTWIN_TASKS="${ROBOTWIN_TASKS}" \
lerobot-benchmark-robotwin:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
cd /opt/robotwin && lerobot-eval \
--policy.path=\"\$ROBOTWIN_POLICY\" \
--env.type=robotwin \
--env.task=\"\$ROBOTWIN_TASKS\" \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.head_camera\": \"observation.images.camera1\", \"observation.images.left_camera\": \"observation.images.camera2\", \"observation.images.right_camera\": \"observation.images.camera3\"}' \
--output_dir=/tmp/eval-artifacts
python /lerobot/scripts/ci/extract_task_descriptions.py \
--env robotwin \
--task \"\$ROBOTWIN_TASKS\" \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboTwin artifacts from container
if: always()
run: |
mkdir -p /tmp/robotwin-artifacts
docker cp robotwin-eval:/tmp/eval-artifacts/. /tmp/robotwin-artifacts/ 2>/dev/null || true
docker rm -f robotwin-eval || true
- name: Parse RoboTwin eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robotwin-artifacts \
--env robotwin \
--task "${ROBOTWIN_TASKS}" \
--policy "${ROBOTWIN_POLICY}"
- name: Upload RoboTwin rollout video
if: always()
uses: actions/upload-artifact@v4
with:
name: robotwin-rollout-video
path: /tmp/robotwin-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboTwin eval metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: robotwin-metrics
path: /tmp/robotwin-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOCASA365 ──────────────────────────────────────────────────────────
# Isolated image: robocasa + robosuite installed manually as editable
# clones (no `lerobot[robocasa]` extra — robocasa's setup.py pins
# `lerobot==0.3.3`, which would shadow this repo's lerobot).
robocasa-integration-test:
name: RoboCasa365 — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build RoboCasa365 benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robocasa
push: false
load: true
tags: lerobot-benchmark-robocasa:ci
- name: Run RoboCasa365 smoke eval (10 atomic tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name robocasa-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e MUJOCO_GL=egl \
lerobot-benchmark-robocasa:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove,CloseToasterOvenDoor,SlideDishwasherRack,TurnOnSinkFaucet,NavigateKitchen,TurnOnElectricKettle \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.robot0_agentview_left\": \"observation.images.camera1\", \"observation.images.robot0_eye_in_hand\": \"observation.images.camera2\", \"observation.images.robot0_agentview_right\": \"observation.images.camera3\"}' \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env robocasa \
--task CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove,CloseToasterOvenDoor,SlideDishwasherRack,TurnOnSinkFaucet,NavigateKitchen,TurnOnElectricKettle \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboCasa365 artifacts from container
if: always()
run: |
mkdir -p /tmp/robocasa-artifacts
docker cp robocasa-eval:/tmp/eval-artifacts/. /tmp/robocasa-artifacts/ 2>/dev/null || true
docker rm -f robocasa-eval || true
- name: Parse RoboCasa365 eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robocasa-artifacts \
--env robocasa \
--task atomic_smoke_10 \
--policy lerobot/smolvla_robocasa
- name: Upload RoboCasa365 rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocasa-rollout-video
path: /tmp/robocasa-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboCasa365 eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocasa-metrics
path: /tmp/robocasa-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOCEREBRA ───────────────────────────────────────────────────────────
# Reuses the LIBERO simulator (libero_10 suite) with RoboCerebra camera
# defaults (image/wrist_image). The image is layered on
# huggingface/lerobot-gpu, which already ships [libero] as part of [all].
robocerebra-integration-test:
name: RoboCerebra — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build RoboCerebra benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robocerebra
push: false
load: true
tags: lerobot-benchmark-robocerebra:ci
cache-from: type=local,src=/tmp/.buildx-cache-robocerebra
cache-to: type=local,dest=/tmp/.buildx-cache-robocerebra,mode=max
- name: Run RoboCerebra smoke eval (1 episode)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name robocerebra-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e LIBERO_DATA_FOLDER=/tmp/libero_data \
lerobot-benchmark-robocerebra:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_robocerebra \
--env.type=libero \
--env.task=libero_10 \
--env.fps=20 \
--env.obs_type=pixels_agent_pos \
--env.observation_height=256 \
--env.observation_width=256 \
'--env.camera_name_mapping={\"agentview_image\": \"image\", \"robot0_eye_in_hand_image\": \"wrist_image\"}' \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.wrist_image\": \"observation.images.camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env libero --task libero_10 \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboCerebra artifacts from container
if: always()
run: |
mkdir -p /tmp/robocerebra-artifacts
docker cp robocerebra-eval:/tmp/eval-artifacts/. /tmp/robocerebra-artifacts/ 2>/dev/null || true
docker rm -f robocerebra-eval || true
- name: Parse RoboCerebra eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robocerebra-artifacts \
--env robocerebra \
--task libero_10 \
--policy lerobot/smolvla_robocerebra
- name: Upload RoboCerebra rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocerebra-rollout-video
path: /tmp/robocerebra-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboCerebra eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robocerebra-metrics
path: /tmp/robocerebra-artifacts/metrics.json
if-no-files-found: warn
# ── ROBOMME ───────────────────────────────────────────────────────────────
# Isolated image: mani-skill/SAPIEN/Vulkan chain with gymnasium and numpy
# overrides (robomme can't be a pyproject extra due to numpy<2 pin).
robomme-integration-test:
name: RoboMME — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
ROBOMME_POLICY: lerobot/smolvla_robomme
ROBOMME_TASKS: PickXtimes,BinFill,StopCube,MoveCube,InsertPeg,SwingXtimes,VideoUnmask,ButtonUnmask,PickHighlight,PatternLock
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build RoboMME benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robomme
push: false
load: true
tags: lerobot-benchmark-robomme:ci
- name: Run RoboMME smoke eval (10 tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name robomme-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e ROBOMME_POLICY="${ROBOMME_POLICY}" \
-e ROBOMME_TASKS="${ROBOMME_TASKS}" \
lerobot-benchmark-robomme:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=\"\$ROBOMME_POLICY\" \
--env.type=robomme \
--env.task=\"\$ROBOMME_TASKS\" \
--env.dataset_split=test \
--env.task_ids=[0] \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.wrist_image\": \"observation.images.camera2\"}' \
--policy.empty_cameras=3 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env robomme --task \"\$ROBOMME_TASKS\" \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboMME artifacts from container
if: always()
run: |
mkdir -p /tmp/robomme-artifacts
docker cp robomme-eval:/tmp/eval-artifacts/. /tmp/robomme-artifacts/ 2>/dev/null || true
docker rm -f robomme-eval || true
- name: Parse RoboMME eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robomme-artifacts \
--env robomme \
--task "${ROBOMME_TASKS}" \
--policy "${ROBOMME_POLICY}"
- name: Upload RoboMME rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robomme-rollout-video
path: /tmp/robomme-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboMME eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robomme-metrics
path: /tmp/robomme-artifacts/metrics.json
if-no-files-found: warn
# ── LIBERO-plus ───────────────────────────────────────────────────────────
# Isolated image: LIBERO-plus fork cloned into /home/user_lerobot on top of
# huggingface/lerobot-gpu (see docker/Dockerfile.benchmark.libero_plus).
libero-plus-integration-test:
name: LIBERO-plus — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
LIBERO_PLUS_SUITE: libero_spatial
LIBERO_PLUS_POLICY: lerobot/smolvla_libero_plus
LIBERO_PLUS_TASK_IDS: "[0,100,260,500,1000,1500,2000,2400]"
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build LIBERO-plus benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.libero_plus
push: false
load: true
tags: lerobot-benchmark-libero-plus:ci
cache-from: type=local,src=/tmp/.buildx-cache-libero-plus
cache-to: type=local,dest=/tmp/.buildx-cache-libero-plus,mode=max
- name: Run LIBERO-plus smoke eval (1 episode)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name libero-plus-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e LIBERO_PLUS_SUITE="${LIBERO_PLUS_SUITE}" \
-e LIBERO_PLUS_POLICY="${LIBERO_PLUS_POLICY}" \
-e LIBERO_PLUS_TASK_IDS="${LIBERO_PLUS_TASK_IDS}" \
lerobot-benchmark-libero-plus:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=\"\$LIBERO_PLUS_POLICY\" \
--env.type=libero_plus \
--env.task=\"\$LIBERO_PLUS_SUITE\" \
--env.task_ids=\"\$LIBERO_PLUS_TASK_IDS\" \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env libero_plus --task \"\$LIBERO_PLUS_SUITE\" \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy LIBERO-plus artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-plus-artifacts
docker cp libero-plus-eval:/tmp/eval-artifacts/. /tmp/libero-plus-artifacts/ 2>/dev/null || true
docker rm -f libero-plus-eval || true
- name: Parse LIBERO-plus eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/libero-plus-artifacts \
--env libero_plus \
--task "${LIBERO_PLUS_SUITE}" \
--policy "${LIBERO_PLUS_POLICY}"
- name: Upload LIBERO-plus rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: libero-plus-rollout-video
path: /tmp/libero-plus-artifacts/videos/
if-no-files-found: warn
- name: Upload LIBERO-plus eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: libero-plus-metrics
path: /tmp/libero-plus-artifacts/metrics.json
if-no-files-found: warn
# ── VLABENCH ─────────────────────────────────────────────────────────────
# Isolated image: lerobot[vlabench] only (VLABench, mujoco==3.2.2, dm-control chain)
vlabench-integration-test:
name: VLABench — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Login to Docker Hub
if: ${{ env.DOCKERHUB_USERNAME != '' }}
uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
with:
username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
- name: Build VLABench benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.vlabench
push: false
load: true
tags: lerobot-benchmark-vlabench:ci
build-args: |
VLABENCH_ASSETS_REPO=lerobot/vlabench-assets
- name: Run VLABench smoke eval (10 tasks, 1 episode each)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name vlabench-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
-e MUJOCO_GL=egl \
lerobot-benchmark-vlabench:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.second_image\": \"observation.images.camera2\", \"observation.images.wrist_image\": \"observation.images.camera3\"}' \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env vlabench \
--task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy VLABench artifacts from container
if: always()
run: |
mkdir -p /tmp/vlabench-artifacts
docker cp vlabench-eval:/tmp/eval-artifacts/. /tmp/vlabench-artifacts/ 2>/dev/null || true
docker rm -f vlabench-eval || true
- name: Parse VLABench eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/vlabench-artifacts \
--env vlabench \
--task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
--policy lerobot/smolvla_vlabench
- name: Upload VLABench rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: vlabench-rollout-video
path: /tmp/vlabench-artifacts/videos/
if-no-files-found: warn
- name: Upload VLABench eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: vlabench-metrics
path: /tmp/vlabench-artifacts/metrics.json
if-no-files-found: warn
@@ -33,7 +33,7 @@ jobs:
github.event.workflow_run.event == 'pull_request' &&
github.event.workflow_run.conclusion == 'success' &&
github.repository == 'huggingface/lerobot'
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@9ad2de8582b56c017cb530c1165116d40433f1c6 # main
with:
package_name: lerobot
secrets:
+18
@@ -217,6 +217,24 @@ jobs:
- name: Run end-to-end tests
run: make test-end-to-end
slack-notification:
name: Slack Notification
needs: [cpu-tests, gpu-tests, upgrade-lock]
if: always() && needs.upgrade-lock.outputs.changed == 'true'
runs-on: ubuntu-latest
permissions:
contents: read
env:
CI_SLACK_CHANNEL: ${{ secrets.CI_SLACK_CHANNEL }}
steps:
- name: Post to a Slack channel
uses: huggingface/hf-workflows/.github/actions/post-slack@a88e7fa2eaee28de5a4d6142381b1fb792349b67 # main
with:
slack_channel: ${{ env.CI_SLACK_CHANNEL }}
title: "Results of the latest dependency tests (CPU + GPU)"
status: ${{ (needs.cpu-tests.result == 'success' && needs.gpu-tests.result == 'success') && 'success' || 'failure' }}
slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
# This job creates or updates a PR with the upgraded lockfile
open-pr:
name: Open PR
+4 -1
@@ -78,6 +78,9 @@ Use the templates for required fields and examples.
- **Issues:** Follow the [ticket template](https://github.com/huggingface/lerobot/blob/main/.github/ISSUE_TEMPLATE/bug-report.yml).
- **Pull requests:** Rebase on `upstream/main`, use a descriptive branch (don't work on `main`), run `pre-commit` and tests locally, and follow the [PR template](https://github.com/huggingface/lerobot/blob/main/.github/PULL_REQUEST_TEMPLATE.md).
One member of the LeRobot team will then review your contribution.
> [!IMPORTANT]
> Community Review Policy: To help scale our efforts and foster a collaborative environment, we ask contributors to review at least one other person's open PR before their own receives attention. This shared responsibility multiplies our review capacity and helps everyone's code get merged faster!
Once you have submitted your PR and completed a peer review, a member of the LeRobot team will review your contribution.
Thank you for contributing to LeRobot!
+6
@@ -178,3 +178,9 @@ test-smolvla-ete-eval:
--env.episode_length=5 \
--eval.n_episodes=1 \
--eval.batch_size=1
# E2E annotation pipeline smoke test against a tiny in-memory fixture
# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
# backend, so it does not require a real model checkpoint or GPU.
annotation-e2e:
uv run python -m tests.annotations.run_e2e_smoke
+84
@@ -0,0 +1,84 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for LIBERO-plus integration tests.
# Extends the nightly GPU image (which has lerobot[all]) with the LIBERO-plus
# fork source + its 6.4 GB perturbation assets.
#
# Build: docker build -f docker/Dockerfile.benchmark.libero_plus -t lerobot-benchmark-libero-plus .
# Run: docker run --gpus all --rm lerobot-benchmark-libero-plus lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
ENV MUJOCO_GL=egl
# unzip for the 6.4 GB assets.zip; the rest are LIBERO-plus build-time extras
# (wand / ImageMagick / fontconfig) not in the nightly base.
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
unzip libexpat1 libfontconfig1-dev libmagickwand-dev \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
# robosuite==1.4.1 is mandatory (the fork uses `single_arm_env` removed in
# v1.5+). The rest are LIBERO-plus runtime deps pulled from its setup.py.
# We install these explicitly instead of via the [libero_plus] extra because
# the extra's `libero @ git+...` dep installs as a namespace package; we
# clone the fork and PYTHONPATH-override it below instead.
RUN uv pip install --no-cache \
"robosuite==1.4.1" \
"bddl==1.0.1" \
"easydict==1.13" \
"mujoco==3.7.0" \
"matplotlib==3.10.8" \
"Wand==0.6.13" \
"scikit-image==0.25.2" \
"gym==0.26.2"
# Clone LIBERO-plus and make it importable as `libero`. The nightly base has
# hf-libero (10 tasks) preinstalled via lerobot[libero]; uninstall it so
# Python resolves `import libero` to the 2402-task LIBERO-plus module instead.
# Pinned to the current upstream main SHA so benchmark builds stay reproducible.
ARG LIBERO_PLUS_SHA=4976dc3
ENV LIBERO_PLUS_ROOT=/home/user_lerobot/libero-plus/libero/libero
RUN git clone https://github.com/sylvestf/LIBERO-plus.git /home/user_lerobot/libero-plus \
&& git -C /home/user_lerobot/libero-plus checkout ${LIBERO_PLUS_SHA} \
&& cd /home/user_lerobot/libero-plus && uv pip install --no-cache --no-deps -e "." \
&& (uv pip uninstall hf-libero 2>/dev/null || true)
ENV PYTHONPATH="/home/user_lerobot/libero-plus:${PYTHONPATH}"
# Perturbation textures/scenes: bddl_base_domain.py resolves XMLs via
# DIR_PATH/../assets (package-relative, ignoring ~/.libero/config.yaml). All
# 2402 tasks reference files that ship only in Sylvest/LIBERO-plus's
# assets.zip (6.4 GB) under a deep author-internal prefix — extract and
# flatten it under ${LIBERO_PLUS_ROOT}/assets.
RUN python -c "\
from huggingface_hub import hf_hub_download; \
hf_hub_download(repo_id='Sylvest/LIBERO-plus', repo_type='dataset', \
filename='assets.zip', local_dir='/tmp/libero-plus-dl')" \
&& unzip -q /tmp/libero-plus-dl/assets.zip -d /tmp/libero-plus-dl/extract \
&& ASSETS_DIR=$(find /tmp/libero-plus-dl/extract -type d -name assets | head -1) \
&& mv "${ASSETS_DIR}" ${LIBERO_PLUS_ROOT}/assets \
&& rm -rf /tmp/libero-plus-dl
# Point ~/.libero/config.yaml at the clone so LIBERO-plus's imports are
# non-interactive (it calls input() when the config is missing).
RUN mkdir -p /home/user_lerobot/.libero \
&& printf "assets: ${LIBERO_PLUS_ROOT}/assets\nbddl_files: ${LIBERO_PLUS_ROOT}/bddl_files\ndatasets: ${LIBERO_PLUS_ROOT}/../datasets\ninit_states: ${LIBERO_PLUS_ROOT}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
+71
@@ -0,0 +1,71 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboCasa365 integration tests.
# Extends the nightly GPU image (which already has all extras installed)
# with the PR's source code and RoboCasa-specific asset setup.
#
# Build: docker build -f docker/Dockerfile.benchmark.robocasa -t lerobot-benchmark-robocasa .
# Run: docker run --gpus all --rm lerobot-benchmark-robocasa lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Install robocasa + robosuite as editable clones. pip-installing from git
# omits data files like robocasa/models/assets/box_links/box_links_assets.json
# (not declared in package_data), which download_kitchen_assets needs at import.
#
# `--no-deps` on robocasa is deliberate: its setup.py pins `lerobot==0.3.3`
# in install_requires, which would shadow the editable lerobot baked into
# this image. We install robocasa's actual runtime deps explicitly instead.
# Pinned SHAs for reproducible benchmark runs. Bump when you need an
# upstream fix; don't rely on `main`/`master` drift.
ARG ROBOCASA_SHA=56e355ccc64389dfc1b8a61a33b9127b975ba681
ARG ROBOSUITE_SHA=aaa8b9b214ce8e77e82926d677b4d61d55e577ab
RUN git clone https://github.com/robocasa/robocasa.git ~/robocasa && \
git -C ~/robocasa checkout ${ROBOCASA_SHA} && \
git clone https://github.com/ARISE-Initiative/robosuite.git ~/robosuite && \
git -C ~/robosuite checkout ${ROBOSUITE_SHA} && \
uv pip install --no-cache -e ~/robocasa --no-deps && \
uv pip install --no-cache -e ~/robosuite && \
uv pip install --no-cache \
"numpy==2.2.5" "numba==0.61.2" "scipy==1.15.3" "mujoco==3.3.1" \
"pygame==2.6.1" "Pillow==12.2.0" "opencv-python==4.13.0.92" \
"pyyaml==6.0.3" "pynput==1.8.1" "tqdm==4.67.3" "termcolor==3.3.0" \
"imageio==2.37.3" "h5py==3.16.0" "lxml==6.0.4" "hidapi==0.14.0.post4" \
"tianshou==0.4.10" "gymnasium==1.2.3"
# Set up robocasa macros and download kitchen assets. We need:
# - tex : base environment textures
# - tex_generative : AI-generated textures; kitchen fixture XMLs embed
# refs to generative_textures/wall/tex*.png
# unconditionally, so MjModel.from_xml_string fails
# at reset time without them (even if the env is
# constructed with generative_textures=None).
# - fixtures_lw : lightwheel kitchen fixtures (fridge, counters...)
# - objs_lw : lightwheel object meshes (stools, misc props)
# We skip the objaverse/aigen object packs (~30GB combined) by pairing
# this with --env.obj_registries=["lightwheel"] on the lerobot side.
# The download script prompts interactively, so pipe 'y' to auto-accept.
RUN python -m robocasa.scripts.setup_macros && \
yes y | python -m robocasa.scripts.download_kitchen_assets \
--type tex tex_generative fixtures_lw objs_lw
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
# Re-install lerobot editably so the new source (with RoboCasaEnv registration)
# replaces the stale package baked into the nightly image.
RUN uv pip install --no-cache --no-deps -e .
CMD ["/bin/bash"]
+43
@@ -0,0 +1,43 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboCerebra integration tests.
# RoboCerebra reuses LIBERO's simulator (libero_10 suite) with a different
# rename_map, so this image is identical to the LIBERO benchmark image —
# extends the nightly GPU base with LIBERO assets + the PR's source code.
#
# Build: docker build -f docker/Dockerfile.benchmark.robocerebra -t lerobot-benchmark-robocerebra .
# Run: docker run --gpus all --rm lerobot-benchmark-robocerebra lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Pre-download lerobot/libero-assets from HF Hub so nothing is fetched at
# runtime (which times out on CI). Point the libero config at the cached path.
# libero/libero/__init__.py calls input() when ~/.libero/config.yaml is missing,
# so we write the config before any libero import can happen.
RUN LIBERO_DIR=$(python -c \
"import importlib.util, os; s=importlib.util.find_spec('libero'); \
print(os.path.join(os.path.dirname(s.origin), 'libero'))") && \
mkdir -p /home/user_lerobot/.libero && \
python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='lerobot/libero-assets', repo_type='dataset', \
local_dir='/home/user_lerobot/.libero/assets')" && \
printf "assets: /home/user_lerobot/.libero/assets\nbddl_files: ${LIBERO_DIR}/bddl_files\ndatasets: ${LIBERO_DIR}/../datasets\ninit_states: ${LIBERO_DIR}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
+56
@@ -0,0 +1,56 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboMME integration tests.
# Extends the nightly GPU image (which has lerobot[all]) with Vulkan system
# libs for ManiSkill/SAPIEN and the robomme extra. robomme isn't in [all]
# because mani-skill hard-pins gymnasium==0.29.1 and numpy<2.0.0 which
# conflict with lerobot's defaults; both are safe at runtime:
# - gymnasium 0.29.x has the same 5-tuple step() API as 1.x (since 0.26)
# - numpy 1.26.4 is API-compatible with lerobot's actual usage.
#
# Build: docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-benchmark-robomme .
# Run: docker run --gpus all --rm lerobot-benchmark-robomme lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# NVIDIA Container Toolkit: expose Vulkan driver capability for headless rendering.
ENV NVIDIA_DRIVER_CAPABILITIES=all \
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
# ManiSkill/SAPIEN's renderer needs Vulkan, which isn't in the base image.
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
libvulkan1 libvulkan-dev mesa-vulkan-drivers \
&& mkdir -p /usr/share/vulkan/icd.d \
&& echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
> /usr/share/vulkan/icd.d/nvidia_icd.json \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
# Install smolvla + av-dep via the PR's pyproject, then layer robomme on top
# with gymnasium/numpy overrides. robomme isn't a pyproject extra because its
# mani-skill pin conflicts with lerobot's base numpy>=2 (see pyproject.toml).
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
RUN printf 'gymnasium==0.29.1\nnumpy==1.26.4\n' > /tmp/robomme_override.txt \
&& uv pip install --no-cache --override /tmp/robomme_override.txt \
-e ".[smolvla,av-dep]" \
"robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main" \
&& python -c "import robomme; print('robomme import OK')"
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
+138
@@ -0,0 +1,138 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for RoboTwin 2.0 integration tests.
# Extends the nightly GPU image with the RoboTwin simulator stack:
# sapien/mplib/pytorch3d + NVlabs CuRobo + embodiments.zip + objects.zip
# (~3.96 GB of assets; background_texture.zip ~11 GB skipped for smoke eval).
#
# Build: docker build -f docker/Dockerfile.benchmark.robotwin -t lerobot-benchmark-robotwin .
# Run: docker run --gpus all --rm lerobot-benchmark-robotwin \
# lerobot-eval --env.type=robotwin --env.task=beat_block_hammer ...
FROM huggingface/lerobot-gpu:latest
ENV NVIDIA_DRIVER_CAPABILITIES=all \
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json \
ROBOTWIN_ROOT=/opt/robotwin
# The nightly base is CUDA -base (no compiler, no Vulkan loader). CuRobo's
# `pip install -e .` runs nvcc, and SAPIEN renders via Vulkan — add both.
USER root
# Pinned upstream SHA for reproducible benchmark runs. Bump when we need
# an upstream fix; don't rely on `main` drift.
ARG ROBOTWIN_SHA=0aeea2d669c0f8516f4d5785f0aa33ba812c14b4
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
cuda-nvcc-12-4 cuda-cudart-dev-12-4 \
libvulkan1 vulkan-tools \
&& mkdir -p /usr/share/vulkan/icd.d \
&& echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
> /usr/share/vulkan/icd.d/nvidia_icd.json \
&& git clone https://github.com/RoboTwin-Platform/RoboTwin.git ${ROBOTWIN_ROOT} \
&& git -C ${ROBOTWIN_ROOT} checkout ${ROBOTWIN_SHA} \
&& chown -R user_lerobot:user_lerobot ${ROBOTWIN_ROOT} \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
# RoboTwin runtime deps (av is already in the base via [av-dep]).
RUN uv pip install --no-cache \
"sapien==3.0.0b1" "mplib==0.2.1" "transforms3d==0.4.2" "trimesh==4.4.3" \
"open3d==0.19.0" "imageio==2.34.2" termcolor zarr pydantic h5py
# pytorch3d has no universal wheel; must be built from source (~10 min, cached).
RUN uv pip install --no-cache --no-build-isolation \
"git+https://github.com/facebookresearch/pytorch3d.git@stable"
# CuRobo — NVlabs motion generator; TORCH_CUDA_ARCH_LIST must be set or the
# build aborts on an empty arch list. RoboTwin's own installer pins v0.7.8,
# which still exposes the v1 API (`curobo.types.math`) that RoboTwin imports.
ARG CUROBO_REF=v0.7.8
RUN cd ${ROBOTWIN_ROOT}/envs \
&& git clone --branch ${CUROBO_REF} --depth 1 https://github.com/NVlabs/curobo.git \
&& cd curobo \
&& TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0" \
uv pip install -e . --no-build-isolation --no-cache
# Upstream patches (mirror RoboTwin's script/_install.sh).
# These patches target the exact versions pinned above; re-check when upgrading.
# mplib==0.2.1: drop a broken `or collide` clause in planner.py.
# Safe to remove once mplib > 0.2.1 ships with the fix upstream.
# sapien==3.0.0b1: fix URDF loader encoding + .srdf extension check.
# Safe to remove once sapien > 3.0.0b1 ships with the fix upstream.
RUN python - <<'EOF'
import pathlib, re, site
for d in site.getsitepackages():
p = pathlib.Path(d) / "mplib" / "planner.py"
if p.exists():
p.write_text(re.sub(r"\bor collide\b", "", p.read_text(), count=1))
print(f"mplib patch applied: {p}")
p = pathlib.Path(d) / "sapien" / "wrapper" / "urdf_loader.py"
if p.exists():
src = p.read_text().replace(
"with open(srdf_path) as f:", 'with open(srdf_path, encoding="utf-8") as f:'
).replace('"srdf"', '".srdf"')
p.write_text(src)
print(f"sapien patch applied: {p}")
EOF
# Simulation assets from TianxingChen/RoboTwin2.0: embodiments (~220 MB) +
# objects (~3.74 GB). background_texture (~11 GB) is intentionally skipped.
# The dataset is public — no auth token needed.
RUN python - <<'EOF'
import os, pathlib, zipfile
from huggingface_hub import hf_hub_download
assets_dir = pathlib.Path(os.environ["ROBOTWIN_ROOT"]) / "assets"
assets_dir.mkdir(parents=True, exist_ok=True)
for fname in ("embodiments.zip", "objects.zip"):
local = hf_hub_download(
repo_id="TianxingChen/RoboTwin2.0",
repo_type="dataset",
filename=fname,
local_dir=str(assets_dir),
)
with zipfile.ZipFile(local, "r") as z:
z.extractall(str(assets_dir))
pathlib.Path(local).unlink()
EOF
WORKDIR ${ROBOTWIN_ROOT}
RUN python script/update_embodiment_config_path.py
ENV PYTHONPATH="${ROBOTWIN_ROOT}"
# Fail the image build early if the CuRobo package layout regresses. Importing
# RoboTwin's planner here is too eager because CuRobo constructs CUDA-backed
# defaults at import time, while Docker builds don't have access to an NVIDIA
# driver.
RUN python - <<'EOF'
from pathlib import Path
from curobo.types.math import Pose
planner_src = (Path("/opt/robotwin/envs/robot/planner.py")).read_text()
assert "from curobo.types.math import Pose as CuroboPose" in planner_src
print("CuRobo import OK:", Pose.__name__)
print("RoboTwin planner import references curobo.types.math")
EOF
# Return to the lerobot source directory (set by base image) before overlaying.
WORKDIR /lerobot
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
+99
@@ -0,0 +1,99 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for VLABench integration tests.
# Extends the nightly GPU image with the PR's source code and VLABench setup.
#
# Build: docker build -f docker/Dockerfile.benchmark.vlabench -t lerobot-benchmark-vlabench .
# Run: docker run --gpus all --rm lerobot-benchmark-vlabench lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Install VLABench from GitHub (not on PyPI) and pin MuJoCo/dm-control.
# Shallow-clone without submodule recursion (nested SSH-only submodules fail in CI).
# Editable install (-e) because VLABench/utils/ has no __init__.py, so
# find_packages() omits it from wheels; editable mode uses the source tree directly.
# rrt-algorithms has the same packaging issue (rrt/ dir missing __init__.py).
# Patch: constant.py calls os.listdir on ~100 asset/obj/meshes/* dirs at import
# time. Guard the call so missing dirs return [] instead of crashing (in case
# the asset download is partial).
#
# Pinned upstream SHAs for reproducible benchmark runs. Bump when you need
# an upstream fix; don't rely on `main`/`develop` drift.
ARG VLABENCH_SHA=cf588fe60c0c7282174fe979f5913170cfe69017
ARG RRT_ALGORITHMS_SHA=e51d95ee489a225220d6ae2a764c4111f6ba7d85
RUN git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench && \
git -C ~/VLABench checkout ${VLABENCH_SHA} && \
git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms && \
git -C ~/rrt-algorithms checkout ${RRT_ALGORITHMS_SHA} && \
python3 -c "\
import pathlib; \
p = pathlib.Path.home() / 'VLABench/VLABench/configs/constant.py'; \
t = p.read_text(); \
p.write_text(t.replace( \
'subdirs = os.listdir(xml_dir)', \
'if not os.path.isdir(xml_dir): return []\n subdirs = os.listdir(xml_dir)'))" && \
uv pip install --no-cache -e ~/VLABench -e ~/rrt-algorithms \
mujoco==3.2.2 dm-control==1.0.22 \
open3d colorlog scikit-learn openai gdown
# Download VLABench mesh assets. Task configs reference object meshes
# (obj/meshes/fruit/, containers/basket/, tablewares/plates/, etc.); without
# them the task builder picks from an empty mesh list and crashes with
# IndexError at task-build time (random.choice([]) in config_manager.py).
#
# Preferred source: an HF Hub mirror. Set VLABENCH_ASSETS_REPO at build time
# (e.g. --build-arg VLABENCH_ASSETS_REPO=lerobot/vlabench-assets) and we'll
# snapshot_download the repo into VLABench's assets dir. This is the reliable
# path for CI — Google Drive frequently returns HTTP 429 ("Too many users have
# viewed or downloaded this file recently") on shared academic files.
#
# After download we *validate* that at least one XML exists under each
# task-critical subtree and fail the build loudly if not. Silent-empty asset
# dirs are the #1 cause of VLABench runtime crashes in CI, so we surface them
# here rather than after a 10-minute eval build.
#
# Fallback: VLABench's own gdown-based script. Best-effort only.
ARG VLABENCH_ASSETS_REPO=""
RUN ASSETS_DIR="$HOME/VLABench/VLABench/assets" && \
if [ -n "${VLABENCH_ASSETS_REPO}" ]; then \
echo "Downloading VLABench assets from HF Hub: ${VLABENCH_ASSETS_REPO}" && \
uv pip install --no-cache "huggingface_hub[hf_xet]>=0.26" && \
python -c "from huggingface_hub import snapshot_download; \
p = snapshot_download(repo_id='${VLABENCH_ASSETS_REPO}', repo_type='dataset', \
local_dir='${ASSETS_DIR}', allow_patterns=['obj/**', 'scenes/**']); \
print('snapshot_download returned:', p)"; \
else \
echo "No VLABENCH_ASSETS_REPO set — falling back to gdown" && \
python ~/VLABench/scripts/download_assets.py --choice all; \
fi && \
python -c "\
from pathlib import Path; \
import sys; \
root = Path('${ASSETS_DIR}'); \
checks = ['obj/meshes/tablewares/plates', 'obj/meshes/containers/basket', 'obj/meshes/fruit', 'obj/meshes/containers/tray']; \
failed = []; \
print(f'Validating VLABench assets under {root}'); \
[print(f' {c}: {len(list((root/c).rglob(\"*.xml\")))} XMLs') for c in checks]; \
[failed.append(c) for c in checks if not any((root/c).rglob('*.xml'))]; \
sys.exit(f'Empty asset dirs (no *.xml): {failed}') if failed else print('All asset dirs populated.')"
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
# Re-install lerobot editably so the new source (with VLABenchEnv registration
# and updated obs handling) replaces the stale package baked into the nightly image.
RUN uv pip install --no-cache --no-deps -e .
CMD ["/bin/bash"]
+18 -2
@@ -31,8 +31,12 @@
title: Porting Large Datasets
- local: using_dataset_tools
title: Using the Dataset Tools
- local: dataset_subtask
title: Using Subtasks in the Dataset
- local: language_and_recipes
title: Language Columns and Recipes
- local: tools
title: Tools
- local: annotation_pipeline
title: Annotation Pipeline
- local: streaming_video_encoding
title: Streaming Video Encoding
title: "Datasets"
@@ -77,10 +81,22 @@
title: Adding a New Benchmark
- local: libero
title: LIBERO
- local: libero_plus
title: LIBERO-plus
- local: metaworld
title: Meta-World
- local: robotwin
title: RoboTwin 2.0
- local: robocasa
title: RoboCasa365
- local: robocerebra
title: RoboCerebra
- local: robomme
title: RoboMME
- local: envhub_isaaclab_arena
title: NVIDIA IsaacLab Arena Environments
- local: vlabench
title: VLABench
title: "Benchmarks"
- sections:
- local: introduction_processors
+161
@@ -0,0 +1,161 @@
# Annotation Pipeline
`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
file tree: multiple revisions of a dataset mean multiple dataset copies.
## What the pipeline produces
Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection) | `language_persistent` | Module 1 |
| `memory` (MEM-style compression) | `language_persistent` | Module 1 |
| `interjection` | `language_events` | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events` | Module 2 |
| `vqa` (user / assistant pair) | `language_events` | Module 3 |
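A minimal sketch of the style-to-column routing this table implies (the `column_for_style` helper referenced in the validation section below is assumed to behave like this):

```python
PERSISTENT_STYLES = {"subtask", "plan", "memory"}
EVENT_STYLES = {"interjection", "vqa", None}  # None = the tool-call-only speech atom (style=null)


def column_for_style(style):
    if style in PERSISTENT_STYLES:
        return "language_persistent"
    if style in EVENT_STYLES:
        return "language_events"
    raise ValueError(f"unknown annotation style: {style!r}")
```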
The writer drops the legacy `subtask_index` column. It does **not** add a
`tools` column to the parquet — the tool catalog lives at
`meta/info.json["tools"]` instead (see [Tools](./tools)). After every
annotation run the pipeline ensures the canonical `say` schema is
present in that list, preserving any tools the user pre-declared. Chat-template
consumers read the catalog through
`LeRobotDatasetMetadata.tools` and pass it to
`apply_chat_template(messages, tools=meta.tools, ...)`.
If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
anything already there. Implementations of those tools live under
`src/lerobot/tools/`; one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
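A hedged example of the consumer side (the metadata import path is an assumption; the dataset and model ids reuse the run example below):

```python
from transformers import AutoTokenizer

from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata  # import path is an assumption

meta = LeRobotDatasetMetadata("imstevenpmwork/super_poulain_draft")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B-FP8")

messages = [{"role": "user", "content": "Move the bowl to the box."}]

# meta.tools mirrors meta/info.json["tools"]; after an annotation run it always
# contains the canonical `say` schema plus anything the user pre-declared.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=meta.tools,
    tokenize=False,
    add_generation_prompt=True,
)
```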
## How to run it locally or on SLURM
Install the extra and invoke the console script:
```bash
uv sync --extra annotations
uv run lerobot-annotate \
--repo_id=imstevenpmwork/super_poulain_draft \
--vlm.backend=vllm \
--vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
--vlm.tensor_parallel_size=2
```
The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.
**Module 1 sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 32) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
single still frame at the interjection timestamp; Module 3 attaches the
exact emission frame to each VQA pair.
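A sketch of the Module 1 payload shape this implies (frame decoding and prompt text are placeholders, not the shipped templates):

```python
def build_module_1_messages(frames, max_video_frames=32):
    """frames: decoded images from the dataset's first observation.images.* stream."""
    frames = frames[:max_video_frames]  # capacity cap only; the pipeline may subsample differently
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": frames},  # the whole episode as one video block
                {"type": "text", "text": "Decompose this demonstration into subtasks."},
            ],
        }
    ]
```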
The executor picks `LocalPipelineExecutor` for small datasets and
`SlurmPipelineExecutor` for large ones based on
`--executor.auto_threshold` (default 32 episodes). Force local with
`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
`--executor.slurm_gpus`, and `--executor.slurm_time`.
## Style-to-recipe consumer mapping
The pipeline produces exactly the styles consumed by
`src/lerobot/configs/recipes/pi05_hirobot.yaml`:
- `low_level_execution`, `high_level_subtask`, `memory_update` consume
`subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
paired speech atom (merged into one assistant target turn via
`tool_calls_from`) and the same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
`language_events`.
## Why the design is scoped to the canonical recipe
Two things drive the scope:
1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
`plan`, `memory`) broadcast per episode and answer "what state is in
force at this frame?". Event rows (`interjection`, `vqa`, speech) only
appear on the exact frame whose timestamp matches the emission. The
pipeline writes timestamps taken straight from the source parquet — no
floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client
(vLLM if available, transformers fallback) so the cost is one model
load per dataset, not three.
## Module independence and staged reruns
Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
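For example, to iterate on the VQA prompt alone, a re-run could disable the other two modules with the enable flags documented above (values illustrative):

```bash
uv run lerobot-annotate \
    --repo_id=imstevenpmwork/super_poulain_draft \
    --module_1.enabled=false \
    --module_2.enabled=false
```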
## Validation/report checks before final write
Before the writer runs, `StagingValidator` checks:
- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.
Errors abort the writer (`--skip_validation=true` overrides for debugging).
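As a sketch of the routing check, using the style-to-column mapping from the table at the top of this page (`column_for_style` is defined in `lerobot.datasets.language`):

```python
from lerobot.datasets.language import column_for_style

# Persistent state broadcasts per episode; events sit on their exact frame.
assert column_for_style("subtask") == "language_persistent"
assert column_for_style("memory") == "language_persistent"
assert column_for_style("interjection") == "language_events"
assert column_for_style("vqa") == "language_events"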
## Paper inspirations per module
- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
atom granularity ("pick up one piece of lettuce", "place bowl to box");
Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
compression directive: keep only minimal relevant information; functional
outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable Policies' multi-abstraction grounding.
Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.
## Compute and list-size estimates
Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` adds at most a few tens of
kilobytes per episode (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.
## Reproducibility via seed and prompt hashes
`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
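A minimal sketch of pinning a prompt hash next to the seed — the hashing scheme, the prompt filename, and the card layout here are assumptions, not the shipped tooling:

```python
import hashlib
from pathlib import Path

def prompt_hash(prompt_path: str) -> str:
    """Content hash of a prompt template file."""
    return hashlib.sha256(Path(prompt_path).read_bytes()).hexdigest()[:12]

# Record the pair a run was produced with, e.g. in the dataset card.
pinned = {"seed": 1729, "module_1_prompt": prompt_hash("prompts/module_1_subtasks.txt")}
```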
-277
@@ -1,277 +0,0 @@
# Using Subtasks in LeRobot Datasets
Subtask support in robotics datasets has proven effective in improving robot reasoning and understanding. Subtasks are particularly useful for:
- **Hierarchical policies**: Building policies that include subtask predictions to visualize robot reasoning in real time
- **Reward modeling**: Helping reward models understand task progression (e.g., SARM-style stage-aware reward models)
- **Task decomposition**: Breaking down complex manipulation tasks into atomic, interpretable steps
LeRobotDataset now supports subtasks as part of its dataset structure, alongside tasks.
## What are Subtasks?
While a **task** describes the overall goal (e.g., "Pick up the apple and place it in the basket"), **subtasks** break down the execution into finer-grained steps:
1. "Approach the apple"
2. "Grasp the apple"
3. "Lift the apple"
4. "Move to basket"
5. "Release the apple"
Each frame in the dataset can be annotated with its corresponding subtask, enabling models to learn and predict these intermediate stages.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/subtask-asset.png"
alt="An overview of subtask annotation showing how frames are labeled with intermediate subtask stages"
width="80%"
/>
<p>
<em>Figure: Overview of subtask annotation.</em>
</p>
**Reference:** _Subtask-learning based for robot self-assembly in flexible collaborative assembly in manufacturing_, published 19 April 2022.
## Dataset Structure
Subtask information is stored in the dataset metadata:
```
my-dataset/
├── data/
│ └── ...
├── meta/
│ ├── info.json
│ ├── stats.json
│ ├── tasks.parquet
│ ├── subtasks.parquet # Subtask index → subtask string mapping
│ └── episodes/
│ └── ...
└── videos/
└── ...
```
### Subtasks Parquet File
The `meta/subtasks.parquet` file maps subtask indices to their natural language descriptions:
| subtask_index | subtask (index column) |
| ------------- | ---------------------- |
| 0 | "Approach the apple" |
| 1 | "Grasp the apple" |
| 2 | "Lift the apple" |
| ... | ... |
### Frame-Level Annotations
Each frame in the dataset can include a `subtask_index` field that references the subtasks parquet file:
```python
# Example frame data in the parquet file
{
"index": 42,
"timestamp": 1.4,
"episode_index": 0,
"task_index": 0,
"subtask_index": 2, # References "Lift the apple"
"observation.state": [...],
"action": [...],
}
```
## Annotating Datasets with Subtasks
We provide a HuggingFace Space for easily annotating any LeRobotDataset with subtasks:
**[https://huggingface.co/spaces/lerobot/annotate](https://huggingface.co/spaces/lerobot/annotate)**
After completing your annotation:
1. Click "Push to Hub" to upload your annotated dataset
2. You can also run the annotation space locally by following the instructions at [github.com/huggingface/lerobot-annotate](https://github.com/huggingface/lerobot-annotate)
## Loading Datasets with Subtasks
When you load a dataset with subtask annotations, the subtask information is automatically available:
```python
from lerobot.datasets import LeRobotDataset
# Load a dataset with subtask annotations
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
# Access a sample
sample = dataset[100]
# The sample includes both task and subtask information
print(sample["task"]) # "Collect the fruit"
print(sample["subtask"]) # "Grasp the apple"
print(sample["task_index"]) # tensor(0)
print(sample["subtask_index"]) # tensor(2)
```
### Checking for Subtask Support
You can check if a dataset has subtask annotations:
```python
# Check if subtasks are available
has_subtasks = (
"subtask_index" in dataset.features
and dataset.meta.subtasks is not None
)
if has_subtasks:
print(f"Dataset has {len(dataset.meta.subtasks)} unique subtasks")
print("Subtasks:", list(dataset.meta.subtasks.index))
```
## Using Subtasks for Training
### With the Tokenizer Processor
The `TokenizerProcessor` automatically handles subtask tokenization for Vision-Language Action (VLA) models:
```python
from lerobot.processor import TokenizerProcessorStep
# Create a tokenizer processor step
tokenizer_processor = TokenizerProcessorStep(
tokenizer_name_or_path="google/paligemma-3b-pt-224",
padding="max_length",
max_length=64,
)
# The processor will automatically tokenize subtasks if present in the batch
# and add them to the observation under:
# - "observation.subtask.tokens"
# - "observation.subtask.attention_mask"
```
When subtasks are available in the batch, the tokenizer processor adds:
- `observation.subtask.tokens`: Tokenized subtask text
- `observation.subtask.attention_mask`: Attention mask for the subtask tokens
### DataLoader with Subtasks
```python
import torch
from lerobot.datasets import LeRobotDataset
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
dataloader = torch.utils.data.DataLoader(
dataset,
batch_size=16,
shuffle=True,
)
for batch in dataloader:
# Access subtask information in the batch
subtasks = batch["subtask"] # List of subtask strings
subtask_indices = batch["subtask_index"] # Tensor of subtask indices
# Use for training hierarchical policies or reward models
print(f"Batch subtasks: {set(subtasks)}")
```
## Example Datasets with Subtask Annotations
Try loading a dataset with subtask annotations:
```python
from lerobot.datasets import LeRobotDataset
# Example dataset with subtask annotations
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
# Explore the subtasks
print("Available subtasks:")
for subtask_name in dataset.meta.subtasks.index:
print(f" - {subtask_name}")
# Get subtask distribution
subtask_counts = {}
for i in range(len(dataset)):
sample = dataset[i]
subtask = sample["subtask"]
subtask_counts[subtask] = subtask_counts.get(subtask, 0) + 1
print("\nSubtask distribution:")
for subtask, count in sorted(subtask_counts.items(), key=lambda x: -x[1]):
print(f" {subtask}: {count} frames")
```
## Use Cases
### 1. Hierarchical Policy Training
Train policies that predict both actions and current subtask:
```python
import torch.nn as nn

class HierarchicalPolicy(nn.Module):
    def __init__(self, obs_dim, hidden_dim, action_dim, num_subtasks):
        super().__init__()
        # Placeholder encoder; swap in your policy backbone.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.action_head = nn.Linear(hidden_dim, action_dim)
        self.subtask_head = nn.Linear(hidden_dim, num_subtasks)

    def forward(self, observations):
        features = self.encoder(observations)
        actions = self.action_head(features)
        subtask_logits = self.subtask_head(features)
        return actions, subtask_logits
```
### 2. Stage-Aware Reward Modeling (SARM)
Build reward models that understand task progression:
```python
import torch.nn as nn

# SARM predicts:
# - Stage: Which subtask is being executed (discrete)
# - Progress: How far along the subtask (continuous 0-1)
class SARMRewardModel(nn.Module):
    def __init__(self, obs_dim, hidden_dim, num_stages):
        super().__init__()
        # Placeholder encoder; swap in your observation backbone.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.stage_classifier = nn.Linear(hidden_dim, num_stages)
        self.progress_regressor = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, observations):
        features = self.encoder(observations)
        stage_logits = self.stage_classifier(features)
        progress = self.progress_regressor(features)
        return stage_logits, progress
```
### 3. Progress Visualization
Monitor robot execution by tracking subtask progression:
```python
def visualize_execution(model, observations, subtask_names):
    """Print the predicted subtask at every timestep of a rollout."""
    for t, obs in enumerate(observations):
        action, subtask_logits = model(obs)
        predicted_subtask = subtask_names[subtask_logits.argmax()]
        print(f"t={t}: Executing '{predicted_subtask}'")
```
## API Reference
### LeRobotDataset Properties
| Property | Type | Description |
| --------------------------- | ---------------------- | ------------------------------------------ |
| `meta.subtasks` | `pd.DataFrame \| None` | DataFrame mapping subtask names to indices |
| `features["subtask_index"]` | `dict` | Feature spec for subtask_index if present |
### Sample Keys
When subtasks are available, each sample includes:
| Key | Type | Description |
| --------------- | -------------- | ------------------------------------ |
| `subtask_index` | `torch.Tensor` | Integer index of the current subtask |
| `subtask` | `str` | Natural language subtask description |
## Related Resources
- [SARM Paper](https://arxiv.org/pdf/2509.25358) - Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
- [LeRobot Annotate Space](https://huggingface.co/spaces/lerobot/annotate) - Interactive annotation tool
- [LeRobotDataset v3.0](./lerobot-dataset-v3) - Dataset format documentation
+6
@@ -32,6 +32,12 @@ Once you've gathered enough trajectories, you'll train a neural network to i
If you run into any issues at any point, jump into our [Discord community](https://discord.com/invite/s3KuuzsPFb) for support.
<Tip>
Want to quickly get the right commands for your setup? The [quickstart notebook](https://github.com/huggingface/lerobot/blob/main/examples/notebooks/quickstart.ipynb) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/lerobot/blob/main/examples/notebooks/quickstart.ipynb) lets you configure your robot once and generates all the commands below ready to paste.
</Tip>
## Set up and Calibrate
If you haven't yet set up and calibrated your robot and teleop device, please do so by following the robot-specific tutorial.
+109
@@ -0,0 +1,109 @@
# Language columns and recipes
LeRobot stores reusable language annotations directly next to frame data in `data/chunk-*/file-*.parquet`.
The two optional columns are:
- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
Both columns share the same row shape (event rows omit `timestamp` because the
frame the row sits on already provides it):
```text
role: string
content: string | null
style: string | null
timestamp: float64 # persistent rows only
camera: string | null # observation.images.* feature key, view-dependent rows only
tool_calls: list[Json] | null
```
The `camera` field tags rows whose `content` is grounded in a specific camera
view. Rows of view-dependent styles (`vqa`, and the reserved `motion` /
`trace`) MUST set `camera` to the matching `observation.images.*` feature key.
Rows of every other style MUST leave `camera` as `null`. Pipeline writers and
the validator enforce this via `validate_camera_field(style, camera)`.
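For concreteness, a `language_persistent` row following this schema might look like the dict below (the values are illustrative, not taken from a real dataset):

```python
row = {
    "role": "assistant",
    "content": "move the gripper above the apple, keeping the wrist level",
    "style": "subtask",   # persistent style -> language_persistent column
    "timestamp": 4.2,     # persistent rows carry their emission time
    "camera": None,       # only view-dependent styles (vqa/motion/trace) set this
    "tool_calls": None,
}
```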
`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.
## Architecture
The language stack has three layers:
1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`.
2. `lerobot.datasets.language_render` resolves rows and renders messages.
3. `RenderMessagesStep` turns dataset samples into `messages`, `message_streams`, and `target_message_indices`.
`LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior.
## Temporal semantics
Persistent styles are active after emission until replaced:
- `active_at(t, style=subtask)`
- `nth_prev(style=memory, offset=1)`
- `nth_next(style=subtask, offset=1)`
Event styles only exist on their exact timestamp:
- `emitted_at(t, style=interjection)`
- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)`
- `emitted_at(t, role=assistant, tool_name=say)`
Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data.
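For example, a memory-update sub-recipe might bind these resolvers directly — a sketch in the binding syntax used below, not the literal contents of `pi05_hirobot.yaml`:

```yaml
memory_update:
  bindings:
    memory: "active_at(t, style=memory)"
    subtask: "active_at(t, style=subtask)"
  messages:
    - { role: user, content: "${task}", stream: high_level }
    - { role: assistant, content: "${memory}", stream: high_level, target: true, if_present: memory }
```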
## View-dependent resolution
For view-dependent styles (`vqa`, `motion`, `trace`), the resolver gains a
`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per
camera at the same timestamp; without `camera=`, those resolvers see two
matches and raise an ambiguity error. Recipes consume each camera through its
own binding plus a matching image block, e.g.
```yaml
ask_vqa_top:
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- { type: image, feature: observation.images.top }
- { type: text, text: "${vqa_query}" }
- { role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa }
```
Add one such sub-recipe per camera the dataset records.
## Recipe anatomy
Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`.
```yaml
messages:
- { role: user, content: "${task}", stream: high_level }
- { role: assistant, content: "${subtask}", stream: low_level, target: true }
```
Rendered samples use HF-style chat messages plus LeRobot sidecars:
```python
sample["messages"]
sample["message_streams"]
sample["target_message_indices"]
```
The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone.
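For instance, a processor for a chat-pretrained backbone might serialize a rendered sample roughly as follows — the checkpoint name is a placeholder, and the real processors live with each policy:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")  # placeholder backbone

prompt = tokenizer.apply_chat_template(
    sample["messages"],
    tokenize=False,
    add_generation_prompt=False,
)
# Only the turns listed in target_message_indices are supervised.
target_turns = [sample["messages"][i] for i in sample["target_message_indices"]]
```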
## Blends
Blend recipes select one weighted sub-recipe deterministically from the sample index.
The canonical `recipes/pi05_hirobot.yaml` combines memory updates, interjection responses, high-level subtask prediction, low-level execution, and VQA.
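A deterministic weighted pick can be as simple as bucketing the sample index by cumulative weight — a sketch with made-up weights, not the blend actually encoded in `pi05_hirobot.yaml`:

```python
def select_sub_recipe(sample_index: int, weights: dict[str, int]) -> str:
    """Deterministically map a sample index to a weighted sub-recipe."""
    slot = sample_index % sum(weights.values())
    for name, weight in weights.items():
        if slot < weight:
            return name
        slot -= weight
    raise AssertionError("unreachable")

select_sub_recipe(7, {"low_level_execution": 5, "high_level_subtask": 2, "ask_vqa": 1})
# -> 'ask_vqa' (slot 7 falls in the last bucket)
```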
## Graceful absence
If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
If an event-scoped branch is selected on a frame without the required event row, rendering returns `None`, allowing a loader to retry another sample.
+188
@@ -0,0 +1,188 @@
# LIBERO-plus
LIBERO-plus is a **robustness benchmark** for Vision-Language-Action (VLA) models built on top of [LIBERO](./libero). It systematically stress-tests policies by applying **seven independent perturbation dimensions** to the original LIBERO task set, exposing failure modes that standard benchmarks miss.
- Paper: [In-depth Robustness Analysis of Vision-Language-Action Models](https://arxiv.org/abs/2510.13626)
- GitHub: [sylvestf/LIBERO-plus](https://github.com/sylvestf/LIBERO-plus)
- Dataset: [lerobot/libero_plus](https://huggingface.co/datasets/lerobot/libero_plus)
![An overview of the LIBERO-plus benchmark perturbation dimensions](https://github.com/sylvestf/LIBERO-plus/raw/main/static/images/libero-plus.jpg)
## Perturbation dimensions
LIBERO-plus creates ~10 000 task variants by perturbing each original LIBERO task along these axes:
| Dimension | What changes |
| --------------------- | ----------------------------------------------------- |
| Objects layout | Target position, presence of confounding objects |
| Camera viewpoints | Camera position, orientation, field-of-view |
| Robot initial states | Manipulator start pose |
| Language instructions | LLM-rewritten task description (paraphrase / synonym) |
| Light conditions | Intensity, direction, color, shadow |
| Background textures | Scene surface and object appearance |
| Sensor noise | Photometric distortions and image degradation |
## Available task suites
LIBERO-plus covers the same five suites as LIBERO:
| Suite | CLI name | Tasks | Max steps | Description |
| -------------- | ---------------- | ----- | --------- | -------------------------------------------------- |
| LIBERO-Spatial | `libero_spatial` | 10 | 280 | Tasks requiring reasoning about spatial relations |
| LIBERO-Object | `libero_object` | 10 | 280 | Tasks centered on manipulating different objects |
| LIBERO-Goal | `libero_goal` | 10 | 300 | Goal-conditioned tasks with changing targets |
| LIBERO-90 | `libero_90` | 90 | 400 | Short-horizon tasks from the LIBERO-100 collection |
| LIBERO-Long | `libero_10` | 10 | 520 | Long-horizon tasks from the LIBERO-100 collection |
<Tip warning={true}>
Installing LIBERO-plus **replaces** vanilla LIBERO — it uninstalls `hf-libero`
so that `import libero` resolves to the LIBERO-plus fork. You cannot have both
installed at the same time. To switch back to vanilla LIBERO, uninstall the
fork and reinstall with `pip install -e ".[libero]"`.
</Tip>
## Installation
### System dependencies (Linux only)
```bash
sudo apt install libexpat1 libfontconfig1-dev libmagickwand-dev
```
### Python package
```bash
pip install -e ".[libero]" "robosuite==1.4.1" bddl easydict mujoco wand scikit-image gym
git clone https://github.com/sylvestf/LIBERO-plus.git
cd LIBERO-plus && pip install --no-deps -e .
pip uninstall -y hf-libero # so `import libero` resolves to the fork
```
LIBERO-plus is installed from its GitHub fork rather than a pyproject extra — the fork ships as a namespace package that pip can't handle, so it must be cloned and added to `PYTHONPATH`. See `docker/Dockerfile.benchmark.libero_plus` for the canonical install. MuJoCo is required, so only Linux is supported.
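If `import libero` still picks up a stale install after the steps above, putting the clone first on `PYTHONPATH` usually resolves it (the clone path below is an assumption — use wherever you cloned the fork):

```bash
export PYTHONPATH="$HOME/LIBERO-plus:$PYTHONPATH"
python -c "import libero; print(libero.__file__)"   # should point inside LIBERO-plus
```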
<Tip>
Set the MuJoCo rendering backend before running evaluation:
```bash
export MUJOCO_GL=egl # headless / HPC / cloud
```
</Tip>
### Download LIBERO-plus assets
LIBERO-plus ships its extended asset pack separately. Download `assets.zip` from the [Hugging Face dataset](https://huggingface.co/datasets/Sylvest/LIBERO-plus/tree/main) and extract it into the LIBERO-plus package directory:
```bash
# After installing the package, find where it was installed:
python -c "import libero; print(libero.__file__)"
# Then extract assets.zip into <package_root>/libero/assets/
```
## Evaluation
### Default evaluation (recommended)
Evaluate across the four standard suites (10 episodes per task):
```bash
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=libero_plus \
--env.task=libero_spatial,libero_object,libero_goal,libero_10 \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--env.max_parallel_tasks=1
```
### Single-suite evaluation
Evaluate on one LIBERO-plus suite:
```bash
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=libero_plus \
--env.task=libero_spatial \
--eval.batch_size=1 \
--eval.n_episodes=10
```
- `--env.task` picks the suite (`libero_spatial`, `libero_object`, etc.).
- `--env.task_ids` restricts to specific task indices (`[0]`, `[1,2,3]`, etc.). Omit to run all tasks in the suite.
- `--eval.batch_size` controls how many environments run in parallel.
- `--eval.n_episodes` sets how many episodes to run per task.
### Multi-suite evaluation
Benchmark a policy across multiple suites at once by passing a comma-separated list:
```bash
lerobot-eval \
--policy.path="your-policy-id" \
--env.type=libero_plus \
--env.task=libero_spatial,libero_object \
--eval.batch_size=1 \
--eval.n_episodes=10
```
### Control mode
LIBERO-plus supports two control modes — `relative` (default) and `absolute`. Different VLA checkpoints are trained with different action parameterizations, so make sure the mode matches your policy:
```bash
--env.control_mode=relative # or "absolute"
```
### Policy inputs and outputs
**Observations:**
- `observation.state` — 8-dim proprioceptive features (eef position, axis-angle orientation, gripper qpos)
- `observation.images.image` — main camera view (`agentview_image`), HWC uint8
- `observation.images.image2` — wrist camera view (`robot0_eye_in_hand_image`), HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(7,))` — 6D end-effector delta + 1D gripper
### Recommended evaluation episodes
For reproducible benchmarking, use **10 episodes per task** across all four standard suites (Spatial, Object, Goal, Long). This gives 400 total episodes and matches the protocol used for published results.
## Training
### Dataset
A LeRobot-format training dataset for LIBERO-plus is available at:
- [lerobot/libero_plus](https://huggingface.co/datasets/lerobot/libero_plus)
### Example training command
```bash
lerobot-train \
--policy.type=smolvla \
--policy.repo_id=${HF_USER}/smolvla_libero_plus \
--policy.load_vlm_weights=true \
--dataset.repo_id=lerobot/libero_plus \
--env.type=libero_plus \
--env.task=libero_spatial \
--output_dir=./outputs/ \
--steps=100000 \
--batch_size=4 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval_freq=1000
```
## Relationship to LIBERO
LIBERO-plus is a drop-in extension of LIBERO:
- Same Python gym interface (`LiberoEnv`, `LiberoProcessorStep`)
- Same camera names and observation/action format
- Same task suite names
- Installs under the same `libero` Python package name (different GitHub repo)
To use the original LIBERO benchmark, see [LIBERO](./libero) and use `--env.type=libero`.
+188
@@ -0,0 +1,188 @@
# RoboCasa365
[RoboCasa365](https://robocasa.ai) is a large-scale simulation framework for training and benchmarking **generalist robots** in everyday kitchen tasks. It ships 365 diverse manipulation tasks across 2,500 kitchen environments, 3,200+ object assets and 600+ hours of human demonstration data, on a PandaOmron 12-DOF mobile manipulator (Franka arm on a holonomic base).
- Paper: [RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots](https://arxiv.org/abs/2406.02523)
- GitHub: [robocasa/robocasa](https://github.com/robocasa/robocasa)
- Project website: [robocasa.ai](https://robocasa.ai)
- Pretrained policy: [`lerobot/smolvla_robocasa`](https://huggingface.co/lerobot/smolvla_robocasa)
- Single-task dataset (CloseFridge): [`pepijn223/robocasa_CloseFridge`](https://huggingface.co/datasets/pepijn223/robocasa_CloseFridge)
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/robocasa-banner.webp"
alt="RoboCasa365 benchmark overview"
width="85%"
/>
## Available tasks
RoboCasa365 organizes its 365 tasks into two families and three upstream benchmark groups that LeRobot exposes as first-class `--env.task` shortcuts:
| Family | Tasks | Description |
| --------- | ----- | ------------------------------------------------------------------------------- |
| Atomic | ~65 | Single-skill tasks: pick-and-place, door/drawer manipulation, appliance control |
| Composite | ~300 | Multi-step tasks across 60+ categories: cooking, cleaning, organizing, etc. |
**Atomic task examples:** `CloseFridge`, `OpenDrawer`, `OpenCabinet`, `TurnOnMicrowave`, `TurnOffStove`, `NavigateKitchen`, `PickPlaceCounterToStove`.
**Composite task categories:** baking, boiling, brewing, chopping, clearing table, defrosting food, loading dishwasher, making tea, microwaving food, washing dishes, and more.
`--env.task` accepts three forms:
- a single task name (`CloseFridge`)
- a comma-separated list (`CloseFridge,OpenBlenderLid,PickPlaceCoffee`)
- a benchmark-group shortcut — `atomic_seen`, `composite_seen`, `composite_unseen`, `pretrain50`, `pretrain100`, `pretrain200`, `pretrain300` — which auto-expands to the upstream task list and auto-sets the dataset `split` (`target` or `pretrain`).
## Installation
RoboCasa and its dependency `robosuite` are not published on PyPI, and RoboCasa's own `setup.py` hardcodes `lerobot==0.3.3`, which conflicts with this repo's `lerobot`. LeRobot therefore does **not** expose a `robocasa` extra — install the two packages manually as editable clones (using `--no-deps` on `robocasa` to skip its shadowed `lerobot` pin):
```bash
# After following the standard LeRobot installation instructions.
git clone https://github.com/robocasa/robocasa.git ~/robocasa
git clone https://github.com/ARISE-Initiative/robosuite.git ~/robosuite
pip install -e ~/robocasa --no-deps
pip install -e ~/robosuite
# Robocasa's runtime deps (the ones its setup.py would have pulled, minus
# the bad lerobot pin).
pip install numpy numba scipy mujoco pygame Pillow opencv-python \
pyyaml pynput tqdm termcolor imageio h5py lxml hidapi \
tianshou gymnasium
python -m robocasa.scripts.setup_macros
# Lightweight assets (lightwheel object meshes + textures). Enough for
# the default env out of the box.
python -m robocasa.scripts.download_kitchen_assets \
--type tex tex_generative fixtures_lw objs_lw
# Optional: full objaverse/aigen registries (~30GB) for richer object
# variety. Enable at eval time via --env.obj_registries (see below).
# python -m robocasa.scripts.download_kitchen_assets --type objs_objaverse
```
<Tip>
RoboCasa requires MuJoCo. Set the rendering backend before training or evaluation:
```bash
export MUJOCO_GL=egl # for headless servers (HPC, cloud)
```
</Tip>
### Object registries
By default the env samples objects only from the `lightwheel` registry (what `--type objs_lw` ships), which avoids a `Probabilities contain NaN` crash when the objaverse / aigen packs aren't on disk. If you've downloaded the full asset set, enable the full registry at runtime:
```bash
--env.obj_registries='[objaverse,lightwheel]'
```
## Evaluation
All eval snippets below mirror the CI command (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps RoboCasa's native camera keys (`robot0_agentview_left` / `robot0_eye_in_hand` / `robot0_agentview_right`) onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_robocasa` policy was trained on.
### Single-task evaluation (recommended for quick iteration)
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=CloseFridge \
--eval.batch_size=1 \
--eval.n_episodes=20 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.robot0_agentview_left": "observation.images.camera1", "observation.images.robot0_eye_in_hand": "observation.images.camera2", "observation.images.robot0_agentview_right": "observation.images.camera3"}'
```
### Multi-task evaluation
Pass a comma-separated list of tasks:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove \
--eval.batch_size=1 \
--eval.n_episodes=20 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.robot0_agentview_left": "observation.images.camera1", "observation.images.robot0_eye_in_hand": "observation.images.camera2", "observation.images.robot0_agentview_right": "observation.images.camera3"}'
```
### Benchmark-group evaluation
Run an entire upstream group (e.g. all 18 `atomic_seen` tasks with `split=target`):
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=atomic_seen \
--eval.batch_size=1 \
--eval.n_episodes=20 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.robot0_agentview_left": "observation.images.camera1", "observation.images.robot0_eye_in_hand": "observation.images.camera2", "observation.images.robot0_agentview_right": "observation.images.camera3"}'
```
### Recommended evaluation episodes
**20 episodes per task** for reproducible benchmarking. Matches the protocol used in published results.
## Policy inputs and outputs
**Observations** (raw RoboCasa camera names are preserved verbatim):
- `observation.state` — 16-dim proprioceptive state (base position, base quaternion, relative end-effector position, relative end-effector quaternion, gripper qpos)
- `observation.images.robot0_agentview_left` — left agent view, 256×256 HWC uint8
- `observation.images.robot0_eye_in_hand` — wrist camera view, 256×256 HWC uint8
- `observation.images.robot0_agentview_right` — right agent view, 256×256 HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(12,))` — base motion (4D) + control mode (1D) + end-effector position (3D) + end-effector rotation (3D) + gripper (1D).
## Training
### Single-task example
A ready-to-use single-task dataset is on the Hub:
[`pepijn223/robocasa_CloseFridge`](https://huggingface.co/datasets/pepijn223/robocasa_CloseFridge).
Fine-tune a SmolVLA base on `CloseFridge`:
```bash
lerobot-train \
--policy.type=smolvla \
--policy.repo_id=${HF_USER}/smolvla_robocasa_CloseFridge \
--policy.load_vlm_weights=true \
--policy.push_to_hub=true \
--dataset.repo_id=pepijn223/robocasa_CloseFridge \
--env.type=robocasa \
--env.task=CloseFridge \
--output_dir=./outputs/smolvla_robocasa_CloseFridge \
--steps=100000 \
--batch_size=4 \
--eval_freq=5000 \
--eval.batch_size=1 \
--eval.n_episodes=5 \
--save_freq=10000
```
Evaluate the resulting checkpoint:
```bash
lerobot-eval \
--policy.path=${HF_USER}/smolvla_robocasa_CloseFridge \
--env.type=robocasa \
--env.task=CloseFridge \
--eval.batch_size=1 \
--eval.n_episodes=20
```
## Reproducing published results
The released checkpoint [`lerobot/smolvla_robocasa`](https://huggingface.co/lerobot/smolvla_robocasa) is evaluated with the commands in the [Evaluation](#evaluation) section. CI runs a 10-atomic-task smoke eval (one episode each) on every PR touching the benchmark, picking fixture-centric tasks that don't require the objaverse asset pack.
+99
@@ -0,0 +1,99 @@
# RoboCerebra
[RoboCerebra](https://robocerebra-project.github.io/) is a long-horizon manipulation benchmark that evaluates **high-level reasoning, planning, and memory** in VLAs. Episodes chain multiple sub-goals with language-grounded intermediate instructions, built on top of LIBERO's simulator stack (MuJoCo + robosuite, Franka Panda 7-DOF).
- Paper: [RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation](https://arxiv.org/abs/2506.06677)
- Project website: [robocerebra-project.github.io](https://robocerebra-project.github.io/)
- Dataset: [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) — LeRobot v3.0, 6,660 episodes / 571,116 frames at 20 fps, 1,728 language-grounded sub-tasks.
- Pretrained policy: [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra)
## Available tasks
RoboCerebra reuses LIBERO's simulator, so evaluation runs against the LIBERO `libero_10` long-horizon suite:
| Suite | CLI name | Tasks | Description |
| --------- | ----------- | ----- | ------------------------------------------------------------- |
| LIBERO-10 | `libero_10` | 10 | Long-horizon kitchen/living room tasks chaining 36 sub-goals |
Each RoboCerebra episode in the dataset is segmented into multiple sub-tasks with natural-language instructions, which the unified dataset exposes as independent supervision signals.
## Installation
RoboCerebra piggybacks on LIBERO, so the `libero` extra is all you need:
```bash
pip install -e ".[libero]"
```
<Tip>
RoboCerebra requires Linux (MuJoCo / robosuite). Set the rendering backend before training or evaluation:
```bash
export MUJOCO_GL=egl # for headless servers (HPC, cloud)
```
</Tip>
## Evaluation
RoboCerebra eval runs against LIBERO's `libero_10` suite with RoboCerebra's camera naming (`image` + `wrist_image`) and an extra empty-camera slot so a three-view-trained policy receives the expected input layout:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_robocerebra \
--env.type=libero \
--env.task=libero_10 \
--env.fps=20 \
--env.obs_type=pixels_agent_pos \
--env.observation_height=256 \
--env.observation_width=256 \
'--env.camera_name_mapping={"agentview_image": "image", "robot0_eye_in_hand_image": "wrist_image"}' \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.wrist_image": "observation.images.camera2"}' \
--policy.empty_cameras=1
```
### Recommended evaluation episodes
**10 episodes per task** across the `libero_10` suite (100 total) for reproducible benchmarking. Matches the protocol used in the RoboCerebra paper.
## Policy inputs and outputs
**Observations:**
- `observation.state` — 8-dim proprioceptive state (7 joint positions + gripper)
- `observation.images.image` — third-person view, 256×256 HWC uint8
- `observation.images.wrist_image` — wrist-mounted camera view, 256×256 HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(7,))` — end-effector delta (6D) + gripper (1D)
## Training
The unified dataset at [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) exposes two RGB streams and language-grounded sub-task annotations:
| Feature | Shape | Description |
| -------------------------------- | ------------- | -------------------- |
| `observation.images.image` | (256, 256, 3) | Third-person view |
| `observation.images.wrist_image` | (256, 256, 3) | Wrist-mounted camera |
| `observation.state` | (8,) | Joint pos + gripper |
| `action` | (7,) | EEF delta + gripper |
Fine-tune a SmolVLA base on it:
```bash
lerobot-train \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=lerobot/robocerebra_unified \
--env.type=libero \
--env.task=libero_10 \
--output_dir=outputs/smolvla_robocerebra
```
## Reproducing published results
The released checkpoint [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra) was trained on `lerobot/robocerebra_unified` and evaluated with the command in the [Evaluation](#evaluation) section. CI runs the same command with `--eval.n_episodes=1` as a smoke test on every PR touching the benchmark.
+130
@@ -0,0 +1,130 @@
# RoboMME
[RoboMME](https://robomme.github.io) is a memory-augmented manipulation benchmark built on ManiSkill (SAPIEN). It evaluates a robot's ability to retain and use information across an episode — counting, object permanence, reference, and imitation.
- **16 tasks** across 4 memory-skill suites
- **1,600 training demos** (100 per task, 50 val, 50 test)
- **Dataset**: [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) — LeRobot v3.0, 768K frames at 10 fps
- **Simulator**: ManiSkill / SAPIEN, Panda arm, Linux only
![RoboMME benchmark tasks overview](https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2603.04639/gradient.png)
## Tasks
| Suite | Tasks |
| --------------------------------- | ------------------------------------------------------------- |
| **Counting** (temporal memory) | BinFill, PickXtimes, SwingXtimes, StopCube |
| **Permanence** (spatial memory) | VideoUnmask, VideoUnmaskSwap, ButtonUnmask, ButtonUnmaskSwap |
| **Reference** (object memory) | PickHighlight, VideoRepick, VideoPlaceButton, VideoPlaceOrder |
| **Imitation** (procedural memory) | MoveCube, InsertPeg, PatternLock, RouteStick |
## Installation
> RoboMME requires **Linux** (ManiSkill/SAPIEN uses Vulkan rendering). Docker is recommended to isolate dependency conflicts.
### Native (Linux)
```bash
pip install --override <(printf 'gymnasium==0.29.1\nnumpy==1.26.4\n') \
-e '.[smolvla,av-dep]' \
'robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main'
```
> **Dependency note**: `mani-skill` (pulled by `robomme`) pins `gymnasium==0.29.1` and `numpy<2.0.0`, which conflict with lerobot's base `numpy>=2.0.0`. That's why `robomme` is not a pyproject extra — use the override install above, or the Docker approach below to avoid conflicts entirely.
### Docker (recommended)
```bash
# Build base image first (from repo root)
docker build -f docker/Dockerfile.eval-base -t lerobot-eval-base .
# Build RoboMME eval image (applies gymnasium + numpy pin overrides)
docker build -f docker/Dockerfile.benchmark.robomme -t lerobot-robomme .
```
The `docker/Dockerfile.benchmark.robomme` image overrides `gymnasium==0.29.1` and `numpy==1.26.4` after lerobot's install. Both versions are runtime-safe for lerobot's actual API usage.
## Running Evaluation
### Default (single task, single episode)
```bash
lerobot-eval \
--policy.path=<your_policy_repo> \
--env.type=robomme \
--env.task=PickXtimes \
--env.dataset_split=test \
--env.task_ids=[0] \
--eval.batch_size=1 \
--eval.n_episodes=1
```
### Multi-task evaluation
Evaluate multiple tasks in one run by comma-separating task names. Use `task_ids` to control which episodes are evaluated per task. Recommended: 50 episodes per task for the test split.
```bash
lerobot-eval \
--policy.path=<your_policy_repo> \
--env.type=robomme \
--env.task=PickXtimes,BinFill,StopCube,MoveCube,InsertPeg \
--env.dataset_split=test \
--env.task_ids=[0,1,2,3,4,5,6,7,8,9] \
--eval.batch_size=1 \
--eval.n_episodes=50
```
### Key CLI options for `env.type=robomme`
| Option | Default | Description |
| -------------------- | ------------- | -------------------------------------------------- |
| `env.task` | `PickXtimes` | Any of the 16 task names above (comma-separated) |
| `env.dataset_split` | `test` | `train`, `val`, or `test` |
| `env.action_space` | `joint_angle` | `joint_angle` (8-D) or `ee_pose` (7-D) |
| `env.episode_length` | `300` | Max steps per episode |
| `env.task_ids` | `null` | List of episode indices to evaluate (null = `[0]`) |
## Dataset
The dataset [`lerobot/robomme`](https://huggingface.co/datasets/lerobot/robomme) is in **LeRobot v3.0 format** and can be loaded directly:
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
dataset = LeRobotDataset("lerobot/robomme")
```
### Dataset features
| Feature | Shape | Description |
| ------------------ | ------------- | ------------------------------- |
| `image` | (256, 256, 3) | Front camera RGB |
| `wrist_image` | (256, 256, 3) | Wrist camera RGB |
| `actions` | (8,) | Joint angles + gripper |
| `state` | (8,) | Joint positions + gripper state |
| `simple_subgoal` | str | High-level language annotation |
| `grounded_subgoal` | str | Grounded language annotation |
| `episode_index` | int | Episode ID |
| `frame_index` | int | Frame within episode |
### Feature key alignment (training)
The env wrapper exposes `pixels/image` and `pixels/wrist_image` as observation keys. The `features_map` in `RoboMMEEnv` maps these to `observation.images.image` and `observation.images.wrist_image` for the policy. State is exposed as `agent_pos` and maps to `observation.state`.
The dataset's `image` and `wrist_image` columns already align with the policy input keys, so no renaming is needed when fine-tuning.
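The mapping described above, written out as a plain dict for reference (illustrative; the authoritative version is the `features_map` inside `RoboMMEEnv`):

```python
features_map = {
    "pixels/image": "observation.images.image",
    "pixels/wrist_image": "observation.images.wrist_image",
    "agent_pos": "observation.state",
}
```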
## Action Spaces
| Type | Dim | Description |
| ------------- | --- | --------------------------------------------------------- |
| `joint_angle` | 8   | 7 joint angles + 1 gripper (-1 closed, +1 open, absolute)  |
| `ee_pose` | 7 | xyz + roll/pitch/yaw + gripper |
Set via `--env.action_space=joint_angle` (default) or `--env.action_space=ee_pose`.
## Platform Notes
- **Linux only**: ManiSkill requires SAPIEN/Vulkan. macOS and Windows are not supported.
- **GPU recommended**: Rendering is CPU-capable but slow; CUDA + Vulkan gives full speed.
- **gymnasium / numpy conflict**: See installation note above. Docker image handles this automatically.
- **ManiSkill fork**: `robomme` depends on a specific ManiSkill fork (`YinpeiDai/ManiSkill`), pulled in automatically via the `robomme` package.
+223
@@ -0,0 +1,223 @@
# RoboTwin 2.0
RoboTwin 2.0 is a **large-scale dual-arm manipulation benchmark** built on the SAPIEN physics engine. It provides a standardized evaluation protocol for bimanual robotic policies across 50 tasks (as of upstream `main`) with strong domain randomization (clutter, lighting, background, tabletop height, and language instructions).
- Paper: [RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation](https://arxiv.org/abs/2506.18088)
- GitHub: [RoboTwin-Platform/RoboTwin](https://github.com/RoboTwin-Platform/RoboTwin)
- Leaderboard: [robotwin-platform.github.io/leaderboard](https://robotwin-platform.github.io/leaderboard)
- Dataset: [lerobot/robotwin_unified](https://huggingface.co/datasets/lerobot/robotwin_unified)
![RoboTwin 2.0 benchmark overview](https://www.aitntnews.com/pictures/2025/7/8/9a7f79cb-5ba9-11f0-8581-fa163e47d677.png)
## Overview
| Property | Value |
| ------------- | -------------------------------------------------------- |
| Tasks | 50 dual-arm manipulation tasks |
| Robot | Aloha-AgileX bimanual (14 DOF, 7 per arm) |
| Action space | 14-dim joint-space, continuous in `[-1, 1]` |
| Cameras | `head_camera`, `left_camera`, `right_camera` |
| Simulator | SAPIEN (not MuJoCo) |
| Eval protocol | 100 episodes/task, 50 demo_clean demonstrations |
| Eval settings | **Easy** (`demo_clean`) and **Hard** (`demo_randomized`) |
## Available tasks
RoboTwin 2.0 ships 50 dual-arm manipulation tasks in its upstream `envs/` directory. The canonical list is the `ROBOTWIN_TASKS` tuple in `src/lerobot/envs/robotwin.py`, mirrored verbatim from the upstream repo. Example tasks:
| Task | CLI name | Category |
| ------------------------ | ------------------------ | ----------------- |
| Beat block with hammer | `beat_block_hammer` | Tool use |
| Click bell / alarm clock | `click_bell` | Precision press |
| Stack blocks (2 / 3) | `stack_blocks_two/three` | Stacking |
| Stack bowls (2 / 3) | `stack_bowls_two/three` | Stacking |
| Handover block / mic | `handover_block` | Bimanual coord. |
| Lift pot | `lift_pot` | Bimanual lift |
| Shake bottle | `shake_bottle` | Continuous motion |
| Turn switch | `turn_switch` | Articulated obj |
| Stamp seal | `stamp_seal` | Precision place |
| Scan object | `scan_object` | Mobile manip. |
Pass a comma-separated list to `--env.task` to run multiple tasks in a single eval sweep.
<Tip warning={true}>
`open_laptop` is currently broken upstream (its `check_success()` uses
`self.arm_tag`, which is only set inside the scripted-expert `play_once()`
path and therefore unavailable during normal policy eval). Avoid it until the
upstream bug is fixed, or patch the task to default `self.arm_tag = "left"` in
`load_actors()`.
</Tip>
## Dataset
The RoboTwin 2.0 dataset is available in **LeRobot v3.0 format** on the Hugging Face Hub:
```
lerobot/robotwin_unified
```
It contains over 100,000 pre-collected trajectories across all 50 tasks (79.6 GB, Apache 2.0 license). No format conversion is needed — it is already in the correct LeRobot v3.0 schema with video observations and action labels.
You can load it directly with the HF Datasets library:
```python
from datasets import load_dataset
ds = load_dataset("lerobot/robotwin_unified", split="train")
```
## Installation
RoboTwin 2.0 requires **Linux** with an NVIDIA GPU (CUDA 12.1 recommended). Installation takes approximately 20 minutes.
### 1. Create a conda environment
```bash
conda create -n robotwin python=3.10 -y
conda activate robotwin
```
### 2. Install LeRobot
```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e "."
```
### 3. Install RoboTwin 2.0
```bash
git clone https://github.com/RoboTwin-Platform/RoboTwin.git
cd RoboTwin
bash script/_install.sh
bash script/_download_assets.sh
```
The install script handles all Python dependencies including SAPIEN, CuRobo, mplib, and pytorch3d.
<Tip warning={true}>
If the automated install fails, install manually:
```bash
pip install -r requirements.txt
pip install "git+https://github.com/facebookresearch/pytorch3d.git@stable"
cd envs && git clone https://github.com/NVlabs/curobo.git && cd curobo
pip install -e . --no-build-isolation
```
Then apply the required mplib fix: in `mplib/planner.py` line 807, remove `or collide` from the conditional.
</Tip>
### 4. Add RoboTwin to PYTHONPATH
The RoboTwin task modules must be importable by LeRobot. From within the `RoboTwin/` directory:
```bash
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
```
Add this to your shell profile to make it permanent.
## Evaluation
### Standard evaluation (recommended)
Evaluate a policy on a single task with the official protocol (100 episodes):
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer \
--eval.batch_size=1 \
--eval.n_episodes=100
```
### Single-task quick check
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer \
--eval.batch_size=1 \
--eval.n_episodes=5
```
### Multi-task sweep
Evaluate on several tasks in one run:
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer,click_bell,handover_block,stack_blocks_two \
--eval.batch_size=1 \
--eval.n_episodes=100
```
### Full benchmark (all 50 tasks)
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=adjust_bottle,beat_block_hammer,blocks_ranking_rgb,blocks_ranking_size,click_alarmclock,click_bell,dump_bin_bigbin,grab_roller,handover_block,handover_mic,hanging_mug,lift_pot,move_can_pot,move_pillbottle_pad,move_playingcard_away,move_stapler_pad,open_microwave,pick_diverse_bottles,pick_dual_bottles,place_a2b_left,place_a2b_right,place_bread_basket,place_bread_skillet,place_burger_fries,place_can_basket,place_cans_plasticbox,place_container_plate,place_dual_shoes,place_empty_cup,place_fan,place_mouse_pad,place_object_basket,place_object_scale,place_object_stand,place_phone_stand,place_shoe,press_stapler,put_bottles_dustbin,put_object_cabinet,rotate_qrcode,scan_object,shake_bottle,shake_bottle_horizontally,stack_blocks_three,stack_blocks_two,stack_bowls_three,stack_bowls_two,stamp_seal,turn_switch \
--eval.batch_size=1 \
--eval.n_episodes=100
```
<Tip>
`open_laptop` is intentionally omitted above because of the upstream
`self.arm_tag` bug (see the **Available tasks** section). Re-add it once the
upstream fix lands.
</Tip>
## Camera configuration
By default, all three cameras are included:
| Camera key | Description |
| -------------- | ------------------------------ |
| `head_camera` | Torso-mounted overhead view |
| `left_camera` | Left arm wrist-mounted camera |
| `right_camera` | Right arm wrist-mounted camera |
To use a subset of cameras, override `--env.camera_names`:
```bash
lerobot-eval \
--policy.path="your-hf-policy-id" \
--env.type=robotwin \
--env.task=beat_block_hammer \
--env.camera_names="head_camera,left_camera" \
--eval.batch_size=1 \
--eval.n_episodes=10
```
## Environment config reference
Key parameters for `RoboTwinEnvConfig`:
| Parameter | Default | Description |
| -------------------- | ---------------------------------------- | ---------------------------------- |
| `task` | `"beat_block_hammer"` | Comma-separated task name(s) |
| `fps` | `25` | Simulation FPS |
| `episode_length` | `300` | Max steps per episode |
| `obs_type` | `"pixels_agent_pos"` | `"pixels"` or `"pixels_agent_pos"` |
| `camera_names` | `"head_camera,left_camera,right_camera"` | Comma-separated active cameras |
| `observation_height` | `240` | Camera pixel height |
| `observation_width` | `320` | Camera pixel width |
## Leaderboard submission
Results can be submitted to the [RoboTwin 2.0 leaderboard](https://robotwin-platform.github.io/leaderboard). The official protocol requires:
- Training on 50 `demo_clean` demonstrations per task
- Evaluating 100 episodes per task
- Reporting success rate separately for **Easy** (`demo_clean`) and **Hard** (`demo_randomized`) settings
For submission instructions, refer to the [RoboTwin 2.0 documentation](https://robotwin-platform.github.io/doc/).
+198
@@ -0,0 +1,198 @@
# Tools
LeRobot v3.1 supports **tool calls** in policies — assistant messages can
emit structured invocations like `say(text="OK, starting now")` that the
runtime dispatches to a real implementation (TTS, controller, logger, …).
This page covers:
1. Where the tool catalog lives (PR 1).
2. How the annotation pipeline produces tool-call atoms (PR 2).
3. How to add your own tool (PR 3).
## Where tools are declared
Two layers.
**The catalog** — a list of OpenAI-style function schemas — lives at
`meta/info.json["tools"]` on each dataset. Example:
```json
{
"features": { "...": "..." },
"tools": [
{
"type": "function",
"function": {
"name": "say",
"description": "Speak a short utterance to the user via the TTS executor.",
"parameters": {
"type": "object",
"properties": {
"text": { "type": "string", "description": "The verbatim text to speak." }
},
"required": ["text"]
}
}
}
]
}
```
Read it via the dataset metadata accessor:
```python
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations")
tools = meta.tools # list[dict] — OpenAI tool schemas
```
If the dataset's `info.json` doesn't declare any tools, `meta.tools`
returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a
single-entry list with the canonical `say` schema. So unannotated
datasets and chat-template consumers keep working without any
configuration:
```python
prompt_str = tokenizer.apply_chat_template(
sample["messages"],
tools=meta.tools, # works either way
add_generation_prompt=False,
tokenize=False,
)
```
**The implementations** — runnable Python — live under
`src/lerobot/tools/`, one file per tool. The `say` implementation
arrives in PR 3 and wraps Kyutai's pocket-tts model.
## Per-row tool *invocations*
The catalog above describes *what can be called*. The actual *call* — the
function name plus the argument values — is stored per-row, on the
assistant atoms in `language_events`:
```python
{
"role": "assistant",
"content": null,
"style": null,
"timestamp": 12.4,
"camera": null,
"tool_calls": [
{ "type": "function",
"function": { "name": "say", "arguments": { "text": "On it." } } }
]
}
```
Recipes splice these into rendered messages via `tool_calls_from`:
```yaml
user_interjection_response:
bindings:
speech: "emitted_at(t, role=assistant, tool_name=say)"
messages:
- { role: user, content: "${task}", stream: high_level }
- { role: assistant, content: "${current_plan}", stream: high_level,
target: true, tool_calls_from: speech }
```
The model's training target is one assistant turn that carries both the
plan text *and* the `say` tool call. At inference, the runtime parses
the generated text back into structured `tool_calls` and dispatches to
the matching implementation.
## How to add your own tool
Three steps. Concrete example: a `record_observation` tool the policy
can call to capture an extra observation outside the regular control
loop.
### Step 1 — declare the schema
Add an entry under `meta/info.json["tools"]`. Either edit the file
directly on disk *before* running the annotation pipeline (it'll be
preserved) or hand it to `lerobot-annotate` via a config flag (PR 2 —
exact CLI lands with the pipeline change).
```json
{
"tools": [
{ "type": "function", "function": { "name": "say", "...": "..." } },
{
"type": "function",
"function": {
"name": "record_observation",
"description": "Capture a high-resolution still image for the user.",
"parameters": {
"type": "object",
"properties": {
"label": { "type": "string", "description": "Short label for the saved image." }
},
"required": ["label"]
}
}
}
]
}
```
The schema follows OpenAI's function-calling convention exactly, so the
chat template can render it natively.
### Step 2 — implement the call
Create `src/lerobot/tools/record_observation.py`:
```python
from typing import Any

from .base import Tool  # structural protocol every tool implementation satisfies

RECORD_OBSERVATION_SCHEMA: dict[str, Any] = {"...": "..."}  # mirrors the JSON above


class RecordObservationTool:
    name = "record_observation"
    schema = RECORD_OBSERVATION_SCHEMA

    def __init__(self, schema: dict | None = None, output_dir: str = "."):
        # A dataset-declared schema may override the bundled default.
        self.schema = schema or RECORD_OBSERVATION_SCHEMA
        self.output_dir = output_dir

    def call(self, arguments: dict) -> str:
        label = arguments["label"]
        # ... save the latest camera frame to <output_dir>/<label>.png ...
        return f"saved {label}.png"
```
One file per tool keeps dependencies isolated — `record_observation`
might pull `pillow`, while `say` (PR 3) pulls `pocket-tts`. Users
installing only the tools they need avoid heavy transitive deps.
### Step 3 — register it
Add to `src/lerobot/tools/registry.py` (PR 3):
```python
from .record_observation import RecordObservationTool
TOOL_REGISTRY["record_observation"] = RecordObservationTool
```
That's it. At runtime `get_tools(meta)` looks up each schema in
`meta.tools`, instantiates the matching registered class, and returns
a name → instance dict the dispatcher can route into.
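A minimal sketch of that dispatch path — the `get_tools` import path and the exact call signatures are assumptions until PR 3 lands:

```python
from lerobot.tools import get_tools  # import path assumed; this page only names get_tools(meta)

tools_by_name = get_tools(meta)  # name -> instantiated tool, built from meta.tools

assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"type": "function", "function": {"name": "say", "arguments": {"text": "On it."}}}
    ],
}
for call in assistant_turn["tool_calls"] or []:
    fn = call["function"]
    print(tools_by_name[fn["name"]].call(fn["arguments"]))
```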
## Where this fits in the three-PR stack
| Layer | PR | What lands |
|---|---|---|
| Catalog storage in `meta/info.json` + `meta.tools` accessor | PR 1 | This page; `SAY_TOOL_SCHEMA`, `DEFAULT_TOOLS` constants in `lerobot.datasets.language`; `LeRobotDatasetMetadata.tools` property |
| Annotation pipeline writes `tools` to meta after a run; honors anything users pre-populated | PR 2 | `lerobot-annotate` ensures `meta/info.json["tools"]` includes the canonical `say` and merges any user-declared tools |
| Runnable implementations under `src/lerobot/tools/`; runtime dispatcher; `say.py` wired to Kyutai's pocket-tts | PR 3 | One file per tool; `Tool` protocol; `TOOL_REGISTRY`; optional `[tools]` extra in `pyproject.toml` |
If you want to use a tool *without* writing an implementation (e.g. for
training-time chat-template formatting only), step 1 alone is enough —
the model still learns to *generate* the call. Steps 2 and 3 are only
needed to actually *execute* it at inference.
+176
@@ -0,0 +1,176 @@
# VLABench
[VLABench](https://github.com/OpenMOSS/VLABench) is a large-scale benchmark for **language-conditioned robotic manipulation with long-horizon reasoning**. The upstream suite covers 100 task categories across 2,000+ objects and evaluates six dimensions of robot intelligence: mesh & texture understanding, spatial reasoning, world-knowledge transfer, semantic instruction comprehension, physical-law understanding, and long-horizon planning. Built on MuJoCo / dm_control with a Franka Panda 7-DOF arm. LeRobot exposes **43 of these tasks** through `--env.task` (21 primitives + 22 composites, see [Available tasks](#available-tasks) below).
- Paper: [VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning](https://arxiv.org/abs/2412.18194)
- GitHub: [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench)
- Project website: [vlabench.github.io](https://vlabench.github.io)
- Pretrained policy: [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench)
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/vlabench.png"
alt="VLABench benchmark overview"
width="85%"
/>
## Available tasks
VLABench's LeRobot integration ships two task suites covering **43 task categories**, all selectable through `--env.task`:
| Suite | CLI name | Tasks | Description |
| --------- | ----------- | ----- | ---------------------------------------------------------------- |
| Primitive | `primitive` | 21 | Single / few-skill combinations (select, insert, physics QA) |
| Composite | `composite` | 22 | Multi-step reasoning and long-horizon planning (cook, rearrange) |
**Primitive tasks:** `select_fruit`, `select_toy`, `select_chemistry_tube`, `add_condiment`, `select_book`, `select_painting`, `select_drink`, `insert_flower`, `select_billiards`, `select_ingredient`, `select_mahjong`, `select_poker`, and physical-reasoning tasks (`density_qa`, `friction_qa`, `magnetism_qa`, `reflection_qa`, `simple_cuestick_usage`, `simple_seesaw_usage`, `sound_speed_qa`, `thermal_expansion_qa`, `weight_qa`).
**Composite tasks:** `cluster_billiards`, `cluster_book`, `cluster_drink`, `cluster_toy`, `cook_dishes`, `cool_drink`, `find_unseen_object`, `get_coffee`, `hammer_nail`, `heat_food`, `make_juice`, `play_mahjong`, `play_math_game`, `play_poker`, `play_snooker`, `rearrange_book`, `rearrange_chemistry_tube`, `set_dining_table`, `set_study_table`, `store_food`, `take_chemistry_experiment`, `use_seesaw_complex`.
`--env.task` accepts three forms:
- a single task name (`select_fruit`)
- a comma-separated list (`select_fruit,heat_food`)
- a suite shortcut (`primitive`, `composite`, or `primitive,composite`)
## Installation
VLABench is **not on PyPI** — its only distribution is the [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench) GitHub repo — so LeRobot does not expose a `vlabench` extra. Install it manually as an editable clone, alongside the MuJoCo / dm_control pins VLABench needs, then fetch the mesh assets:
```bash
# After following the standard LeRobot installation instructions.
git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench
git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms
pip install -e ~/VLABench -e ~/rrt-algorithms
pip install "mujoco==3.2.2" "dm-control==1.0.22" \
open3d colorlog scikit-learn openai gdown
python ~/VLABench/scripts/download_assets.py
```
<Tip>
VLABench requires Linux (`sys_platform == 'linux'`) and Python 3.10+. Set the MuJoCo rendering backend before running:
```bash
export MUJOCO_GL=egl # for headless servers (HPC, cloud)
```
</Tip>
## Evaluation
All eval snippets below mirror the command CI runs (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps VLABench's `image` / `second_image` / `wrist_image` camera keys onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_vlabench` policy was trained on.
### Single-task evaluation (recommended for quick iteration)
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=select_fruit \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Multi-task evaluation
Pass a comma-separated list of tasks:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=select_fruit,select_toy,add_condiment,heat_food \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Suite-wide evaluation
Run an entire suite (all 21 primitives or all 22 composites):
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=primitive \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
--env.max_parallel_tasks=1 \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
Or both suites:
```bash
lerobot-eval \
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=primitive,composite \
--eval.batch_size=1 \
--eval.n_episodes=10 \
--eval.use_async_envs=false \
--policy.device=cuda \
--env.max_parallel_tasks=1 \
'--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Recommended evaluation episodes
Use **10 episodes per task** for reproducible benchmarking (210 episodes for the full primitive suite, 220 for the composite suite). This matches the protocol in the VLABench paper.
## Policy inputs and outputs
**Observations:**
- `observation.state` — 7-dim end-effector state (position xyz + Euler xyz + gripper)
- `observation.images.image` — front camera, 480×480 HWC uint8
- `observation.images.second_image` — second camera, 480×480 HWC uint8
- `observation.images.wrist_image` — wrist camera, 480×480 HWC uint8
**Actions:**
- Continuous control in `Box(-1, 1, shape=(7,))` — 3D position + 3D Euler orientation + 1D gripper.
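As a quick shape check, a single environment step under this spec looks like the following sketch (dict keys mirror the feature names above; values are dummy zeros):
```python
import numpy as np

# Raw per-step observation as the VLABench env returns it, before policy-side preprocessing.
obs = {
    "observation.state": np.zeros(7, dtype=np.float32),                   # xyz + Euler xyz + gripper
    "observation.images.image": np.zeros((480, 480, 3), dtype=np.uint8),  # front camera, HWC
    "observation.images.second_image": np.zeros((480, 480, 3), dtype=np.uint8),
    "observation.images.wrist_image": np.zeros((480, 480, 3), dtype=np.uint8),
}

# Action the policy must emit: continuous, clipped to [-1, 1].
action = np.zeros(7, dtype=np.float32)  # position xyz + Euler xyz + gripper
```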
## Training
### Datasets
Pre-collected VLABench datasets in LeRobot format on the Hub:
- [`VLABench/vlabench_primitive_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_primitive_ft_lerobot_video) — 5,000 episodes, 128 tasks, 480×480 images.
- [`VLABench/vlabench_composite_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_composite_ft_lerobot_video) — 5,977 episodes, 167 tasks, 224×224 images.
### Example training command
Fine-tune a SmolVLA base on the primitive suite:
```bash
lerobot-train \
--policy.type=smolvla \
--policy.repo_id=${HF_USER}/smolvla_vlabench_primitive \
--policy.load_vlm_weights=true \
--policy.push_to_hub=true \
--dataset.repo_id=VLABench/vlabench_primitive_ft_lerobot_video \
--env.type=vlabench \
--env.task=select_fruit \
--output_dir=./outputs/smolvla_vlabench_primitive \
--steps=100000 \
--batch_size=4 \
--eval_freq=5000 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--save_freq=10000
```
## Reproducing published results
The released checkpoint [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench) was trained on the primitive-suite dataset above and is evaluated with the [Single-task](#single-task-evaluation-recommended-for-quick-iteration) / [Suite-wide](#suite-wide-evaluation) commands. CI runs a 10-primitive-task smoke eval (one episode each) on every PR touching the benchmark.
+69
View File
@@ -0,0 +1,69 @@
#!/usr/bin/env python
"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6 MoE).
Spawns one ``h200x2`` job that:
1. installs this branch of ``lerobot`` plus the annotation extras,
2. boots two vllm servers (one per GPU) with Qwen3.6-35B-A3B-FP8,
3. runs Module 1/2/3 across the dataset (per-camera VQA via PR 3471),
4. uploads the annotated dataset to ``--push_to_hub``.
Usage:
HF_TOKEN=hf_... uv run python examples/annotation/run_hf_job.py
Adjust ``CMD`` below to point at your own dataset / target hub repo.
"""
import os
from huggingface_hub import get_token, run_job
token = os.environ.get("HF_TOKEN") or get_token()
if not token:
raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
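# The lerobot install below uses --no-deps so pip leaves the torch / vllm stack already
# baked into the vllm-openai image untouched; the explicit package list that follows
# backfills only what lerobot itself needs at runtime.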
CMD = (
"apt-get update -qq && apt-get install -y -qq git ffmpeg && "
"pip install --no-deps "
"'lerobot @ git+https://github.com/huggingface/lerobot.git@feat/language-annotation-pipeline' && "
"pip install --upgrade-strategy only-if-needed "
"datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect && "
"export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
"export VLLM_VIDEO_BACKEND=pyav && "
"lerobot-annotate "
"--repo_id=imstevenpmwork/super_poulain_draft "
"--vlm.backend=openai "
"--vlm.model_id=Qwen/Qwen3.6-35B-A3B-FP8 "
"--vlm.parallel_servers=2 "
"--vlm.num_gpus=2 "
'--vlm.serve_command="vllm serve Qwen/Qwen3.6-35B-A3B-FP8 '
"--tensor-parallel-size 1 --max-model-len 32768 "
'--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
"--vlm.serve_ready_timeout_s=1800 "
"--vlm.client_concurrency=128 "
"--vlm.max_new_tokens=512 "
"--vlm.temperature=0.7 "
"--executor.episode_parallelism=16 "
"--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
"--vlm.camera_key=observation.images.wrist "
"--module_1.frames_per_second=1.0 "
"--module_1.use_video_url=true "
"--module_1.use_video_url_fps=1.0 "
"--module_1.derive_task_from_video=always "
"--module_1.n_task_rephrasings=30 "
"--module_2.max_interjections_per_episode=6 "
"--module_3.K=3 "
"--module_3.vqa_emission_hz=1.0 "
"--push_to_hub=pepijn223/super_poulain_full_tool3"
)
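# flavor="h200x2" gives the job two H200 GPUs, matching --vlm.parallel_servers=2 and
# --vlm.num_gpus=2 in CMD above (one single-GPU vllm replica per device).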
job = run_job(
image="vllm/vllm-openai:latest",
command=["bash", "-c", CMD],
flavor="h200x2",
secrets={"HF_TOKEN": token},
timeout="2h",
)
print(f"Job URL: {job.url}")
print(f"Job ID: {job.id}")
+342
View File
@@ -0,0 +1,342 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 🤗 LeRobot Quickstart\n",
"\n",
"Calibration → teleoperation → data collection → training → evaluation.\n",
"\n",
"Install the required dependencies: `pip install -e .[notebook,dataset,training,viz,hardware]`.\n",
"\n",
"**How to use:**\n",
"1. Edit the **Configuration** cell with your settings.\n",
"2. Run all cells (`Run All`).\n",
"3. Each section prints a ready-to-paste terminal command - copy it and run it.\n",
"\n",
"Each setup is different, please refer to the [LeRobot documentation](https://huggingface.co/docs/lerobot/il_robots) for more details on each step and available options. <br>\n",
"Feel free to make this notebook your own and adapt it to your needs!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Utils"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def _cameras_arg(cameras: dict) -> str:\n",
" if not cameras:\n",
" return \"\"\n",
" entries = [f\"{n}: {{{', '.join(f'{k}: {v}' for k, v in cfg.items())}}}\" for n, cfg in cameras.items()]\n",
" return \"{ \" + \", \".join(entries) + \" }\"\n",
"\n",
"\n",
"def print_cmd(*parts: str) -> None:\n",
" \"\"\"Print a shell command with line continuations, skipping empty parts.\"\"\"\n",
" non_empty = [p for p in parts if p]\n",
" print(\" \\\\\\n \".join(non_empty))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Configuration\n",
"\n",
"Edit this cell, then **Run All** to generate all commands below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Robot (follower) - run `lerobot-find-port` to discover the port\n",
"ROBOT_TYPE = \"so101_follower\"\n",
"ROBOT_PORT = \"/dev/ttyACM0\"\n",
"ROBOT_ID = \"my_follower_arm\"\n",
"\n",
"# Teleop (leader) - run `lerobot-find-port` to discover the port\n",
"TELEOP_TYPE = \"so101_leader\"\n",
"TELEOP_PORT = \"/dev/ttyACM1\"\n",
"TELEOP_ID = \"my_leader_arm\"\n",
"\n",
"# Cameras - set to {} to disable\n",
"# Run `lerobot-find-cameras opencv` to list available cameras and their indices\n",
"CAMERAS = {\n",
" \"top\": {\"type\": \"opencv\", \"index_or_path\": 2, \"width\": 640, \"height\": 480, \"fps\": 30},\n",
" \"wrist\": {\"type\": \"opencv\", \"index_or_path\": 4, \"width\": 640, \"height\": 480, \"fps\": 30},\n",
"}\n",
"\n",
"# Dataset\n",
"HF_USER = \"your_hf_username\" # `huggingface-cli whoami` to find your username\n",
"DATASET_NAME = \"my_so101_dataset\"\n",
"TASK_DESCRIPTION = \"pick and place the block\"\n",
"NUM_EPISODES = 10\n",
"\n",
"# Training\n",
"POLICY_TYPE = \"act\" # act, diffusion, smolvla, ...\n",
"POLICY_DEVICE = \"cuda\" # cuda / cpu / mps\n",
"TRAIN_STEPS = 10_000\n",
"SAVE_FREQ = 2_000\n",
"OUTPUT_DIR = f\"outputs/train/{DATASET_NAME}\"\n",
"\n",
"# Inference - Hub repo ID or local checkpoint path\n",
"# e.g. set to f\"{OUTPUT_DIR}/checkpoints/last\" to use a local checkpoint\n",
"POLICY_PATH = f\"{HF_USER}/{DATASET_NAME}_{POLICY_TYPE}\"\n",
"LAST_CHECKPOINT_PATH = f\"{OUTPUT_DIR}/checkpoints/last\"\n",
"\n",
"# Derived\n",
"DATASET_REPO_ID = f\"{HF_USER}/{DATASET_NAME}\"\n",
"DATASET_ROOT = f\"data/{DATASET_NAME}\"\n",
"POLICY_REPO_ID = f\"{HF_USER}/{DATASET_NAME}_{POLICY_TYPE}\"\n",
"EVAL_REPO_ID = f\"{HF_USER}/eval_{DATASET_NAME}\"\n",
"CAMERAS_ARG = _cameras_arg(CAMERAS)\n",
"CAMERAS_FLAG = f'--robot.cameras=\"{CAMERAS_ARG}\"' if CAMERAS_ARG else \"\"\n",
"\n",
"print(f\"Robot : {ROBOT_TYPE} @ {ROBOT_PORT}\")\n",
"print(f\"Teleop : {TELEOP_TYPE} @ {TELEOP_PORT}\")\n",
"print(f\"Cameras: {list(CAMERAS) or 'none'}\")\n",
"print(f\"Dataset: {DATASET_REPO_ID} ({NUM_EPISODES} episodes) saved to {DATASET_ROOT}\")\n",
"print(f\"Policy : {POLICY_TYPE} -> {POLICY_REPO_ID}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 1. Calibration\n",
"\n",
"Run once per arm before first use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Follower\n",
"print_cmd(\n",
" \"lerobot-calibrate\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Leader\n",
"print_cmd(\n",
" \"lerobot-calibrate\",\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 2. Teleoperation\n",
"\n",
"See the [teleoperation docs](https://huggingface.co/docs/lerobot/il_robots#teleoperate) and the [cameras guide](https://huggingface.co/docs/lerobot/cameras) for more options."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-teleoperate\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" \"--display_data=true\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 3. Record Dataset\n",
"\n",
"See the [recording docs](https://huggingface.co/docs/lerobot/il_robots#record-a-dataset) for tips on gathering good data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-record\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" f\"--dataset.repo_id={DATASET_REPO_ID}\",\n",
" f\"--dataset.num_episodes={NUM_EPISODES}\",\n",
" f'--dataset.single_task=\"{TASK_DESCRIPTION}\"',\n",
" \"--dataset.streaming_encoding=true\",\n",
" \"--display_data=true\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Resume a previously interrupted recording session\n",
"print_cmd(\n",
" \"lerobot-record\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" f\"--dataset.repo_id={DATASET_REPO_ID}\",\n",
" f\"--dataset.root={DATASET_ROOT}\",\n",
" f\"--dataset.num_episodes={NUM_EPISODES}\",\n",
" f'--dataset.single_task=\"{TASK_DESCRIPTION}\"',\n",
" \"--dataset.streaming_encoding=true\",\n",
" \"--display_data=true\",\n",
" \"--resume=true\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 4. Train Policy\n",
"\n",
"See the [training docs](https://huggingface.co/docs/lerobot/il_robots#train-a-policy) for configuration options and tips."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-train\",\n",
" f\"--dataset.repo_id={DATASET_REPO_ID}\",\n",
" f\"--policy.type={POLICY_TYPE}\",\n",
" f\"--policy.device={POLICY_DEVICE}\",\n",
" f\"--policy.repo_id={POLICY_REPO_ID}\",\n",
" f\"--output_dir={OUTPUT_DIR}\",\n",
" f\"--steps={TRAIN_STEPS}\",\n",
" f\"--save_freq={SAVE_FREQ}\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Resume a previously interrupted training session\n",
"print_cmd(\n",
" \"lerobot-train\",\n",
" f\"--config_path={LAST_CHECKPOINT_PATH}/pretrained_model/train_config.json\",\n",
" \"--resume=true\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## 5. Inference\n",
"\n",
"Uses `POLICY_PATH` from the Configuration cell (defaults to the Hub repo ID). You can also put there the `LAST_CHECKPOINT_PATH`.\n",
"\n",
"See the [inference docs](https://huggingface.co/docs/lerobot/il_robots#run-inference-and-evaluate-your-policy) for details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_cmd(\n",
" \"lerobot-record\",\n",
" f\"--policy.path={POLICY_PATH}\",\n",
" f\"--robot.type={ROBOT_TYPE}\",\n",
" f\"--robot.port={ROBOT_PORT}\",\n",
" f\"--robot.id={ROBOT_ID}\",\n",
" CAMERAS_FLAG,\n",
" f\"--teleop.type={TELEOP_TYPE}\",\n",
" f\"--teleop.port={TELEOP_PORT}\",\n",
" f\"--teleop.id={TELEOP_ID}\",\n",
" f\"--dataset.repo_id={EVAL_REPO_ID}\",\n",
" f\"--dataset.num_episodes={NUM_EPISODES}\",\n",
" f'--dataset.single_task=\"{TASK_DESCRIPTION}\"',\n",
" \"--dataset.streaming_encoding=true\",\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "lerobot (3.12.3)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
+51 -11
View File
@@ -95,7 +95,7 @@ dependencies = [
# ── Feature-scoped extras ──────────────────────────────────
dataset = [
"datasets>=4.0.0,<5.0.0",
"datasets>=4.7.0,<5.0.0",
"pandas>=2.0.0,<3.0.0", # NOTE: Transitive dependency of datasets
"pyarrow>=21.0.0,<30.0.0", # NOTE: Transitive dependency of datasets
"lerobot[av-dep]",
@@ -108,9 +108,9 @@ training = [
"wandb>=0.24.0,<0.25.0",
]
hardware = [
"pynput>=1.7.8,<1.9.0",
"pyserial>=3.5,<4.0",
"deepdiff>=7.0.1,<9.0.0",
"lerobot[pynput-dep]",
"lerobot[pyserial-dep]",
"lerobot[deepdiff-dep]",
]
viz = [
"rerun-sdk>=0.24.0,<0.27.0",
@@ -129,6 +129,7 @@ av-dep = ["av>=15.0.0,<16.0.0"]
pygame-dep = ["pygame>=2.5.1,<2.7.0"]
placo-dep = ["placo>=0.9.6,<0.9.17"]
transformers-dep = ["transformers==5.3.0"] # TODO(Steven): https://github.com/huggingface/lerobot/pull/3249
sentencepiece-dep = ["sentencepiece>=0.2.0,<0.3.0"] # FAST action tokenizer backend (pi052, pi0_fast)
grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
can-dep = ["python-can>=4.2.0,<5.0.0"]
peft-dep = ["peft>=0.18.0,<1.0.0"]
@@ -136,10 +137,14 @@ scipy-dep = ["scipy>=1.14.0,<2.0.0"]
diffusers-dep = ["diffusers>=0.27.2,<0.36.0"]
qwen-vl-utils-dep = ["qwen-vl-utils>=0.0.11,<0.1.0"]
matplotlib-dep = ["matplotlib>=3.10.3,<4.0.0", "contourpy>=1.3.0,<2.0.0"] # NOTE: Explicitly listing contourpy helps the resolver converge faster.
pyserial-dep = ["pyserial>=3.5,<4.0"]
deepdiff-dep = ["deepdiff>=7.0.1,<9.0.0"]
pynput-dep = ["pynput>=1.7.8,<1.9.0"]
pyzmq-dep = ["pyzmq>=26.2.1,<28.0.0"]
# Motors
feetech = ["feetech-servo-sdk>=1.0.0,<2.0.0"]
dynamixel = ["dynamixel-sdk>=3.7.31,<3.9.0"]
feetech = ["feetech-servo-sdk>=1.0.0,<2.0.0", "lerobot[pyserial-dep]", "lerobot[deepdiff-dep]"]
dynamixel = ["dynamixel-sdk>=3.7.31,<3.9.0", "lerobot[pyserial-dep]", "lerobot[deepdiff-dep]"]
damiao = ["lerobot[can-dep]"]
robstride = ["lerobot[can-dep]"]
@@ -147,10 +152,11 @@ robstride = ["lerobot[can-dep]"]
openarms = ["lerobot[damiao]"]
gamepad = ["lerobot[pygame-dep]", "hidapi>=0.14.0,<0.15.0"]
hopejr = ["lerobot[feetech]", "lerobot[pygame-dep]"]
lekiwi = ["lerobot[feetech]", "pyzmq>=26.2.1,<28.0.0"]
lekiwi = ["lerobot[feetech]", "lerobot[pyzmq-dep]"]
unitree_g1 = [
# "unitree-sdk2==1.0.1",
"pyzmq>=26.2.1,<28.0.0",
"lerobot[pyzmq-dep]",
"lerobot[pyserial-dep]",
"onnxruntime>=1.16.0,<2.0.0",
"onnx>=1.16.0,<2.0.0",
"meshcat>=0.3.0,<0.4.0",
@@ -174,7 +180,7 @@ wallx = [
"torchdiffeq>=0.2.4,<0.3.0",
"lerobot[qwen-vl-utils-dep]",
]
pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]", "lerobot[sentencepiece-dep]"]
smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0"]
multi_task_dit = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]"]
groot = [
@@ -195,8 +201,26 @@ hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpci
async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]
# Annotation pipeline (lerobot-annotate). datatrove is mandatory; vllm is
# the preferred backend on Linux, with a transformers fallback elsewhere.
annotations = [
"lerobot[dataset]",
"lerobot[transformers-dep]",
"datatrove>=0.4.0,<2.0.0",
"vllm>=0.6.0,<1.0.0; sys_platform == 'linux'",
]
# Tool implementations under src/lerobot/tools/. Each tool's dependencies
# are isolated so adding a new tool doesn't bloat the base install.
# Currently only `say` (Kyutai pocket-tts; CPU-only, ~100M params).
tools = [
"pocket-tts>=1.0.0,<3.0.0",
"scipy>=1.11.0,<2.0.0", # SayTool.output_dir uses scipy.io.wavfile
]
# Development
dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1"]
dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
test = ["pytest>=8.1.0,<9.0.0", "pytest-timeout>=2.4.0,<3.0.0", "pytest-cov>=5.0.0,<8.0.0", "mock-serial>=0.0.1,<0.1.0 ; sys_platform != 'win32'"]
video_benchmark = ["scikit-image>=0.23.2,<0.26.0", "pandas>=2.2.2,<2.4.0"]
@@ -206,6 +230,20 @@ aloha = ["lerobot[dataset]", "gym-aloha>=0.1.2,<0.2.0", "lerobot[scipy-dep]"]
pusht = ["lerobot[dataset]", "gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
libero = ["lerobot[dataset]", "lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
metaworld = ["lerobot[dataset]", "metaworld==3.0.0", "lerobot[scipy-dep]"]
# NOTE: vlabench is NOT exposed as a `lerobot` extra. Its only distribution
# is the OpenMOSS/VLABench GitHub repo (package name `VLABench`, no PyPI
# release), so any `vlabench>=X` pip spec is unresolvable. Install it
# manually alongside MuJoCo / dm-control — see docs/source/vlabench.mdx
# for the recipe.
# NOTE: robomme is NOT a pyproject extra — mani-skill hard-pins numpy<2
# which conflicts with lerobot's numpy>=2 base pin, so the two trees can't
# resolve into a single env. Install it only in the RoboMME Docker image
# via `uv pip install --override` (see docker/Dockerfile.benchmark.robomme).
# NOTE: robocasa is NOT exposed as a `lerobot` extra. Its setup.py pins
# `lerobot==0.3.3` in install_requires, which cyclically shadows our own
# workspace `lerobot` and makes the graph unsolvable under any resolver
# (uv, pip). Install it manually alongside robosuite — see
# docs/source/robocasa.mdx for the recipe.
# All
all = [
@@ -269,10 +307,12 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
lerobot-smolvla2-runtime="lerobot.scripts.lerobot_smolvla2_runtime:main"
# ---------------- Tool Configurations ----------------
[tool.setuptools.package-data]
lerobot = ["envs/*.json"]
lerobot = ["envs/*.json", "annotations/steerable_pipeline/prompts/*.txt"]
[tool.setuptools.packages.find]
where = ["src"]
+120 -2
View File
@@ -31,9 +31,23 @@ from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
# LIBERO-plus derives task.language by space-joining the perturbation-variant
# filename (grab_language_from_filename in libero/libero/benchmark/__init__.py),
# so non-_language_ variants inherit a trailing metadata blob like
# "view 0 0 100 0 0 initstate 0 noise 45" or "add 16". Strip those tokens so
# the description matches the base instruction used in the training dataset.
_LIBERO_PERTURBATION_TAIL_RE = re.compile(
r"(?:\s(?:view|initstate|noise|add|tb|table|light|level)(?:\s\d+)+)+$"
)
def _strip_libero_perturbation_tail(instruction: str) -> str:
return _LIBERO_PERTURBATION_TAIL_RE.sub("", instruction).strip()
def _libero_descriptions(task_suite: str) -> dict[str, str]:
from libero.libero import benchmark # type: ignore[import-untyped]
@@ -47,7 +61,10 @@ def _libero_descriptions(task_suite: str) -> dict[str, str]:
)
return {}
suite = suite_dict[task_suite]()
return {f"{task_suite}_{i}": suite.get_task(i).language for i in range(suite.n_tasks)}
return {
f"{task_suite}_{i}": _strip_libero_perturbation_tail(suite.get_task(i).language)
for i in range(suite.n_tasks)
}
def _metaworld_descriptions(task_name: str) -> dict[str, str]:
@@ -57,19 +74,120 @@ def _metaworld_descriptions(task_name: str) -> dict[str, str]:
return {f"{task_name}_0": label}
def _robotwin_descriptions(task_names: str) -> dict[str, str]:
"""Return descriptions for each requested RoboTwin task. Reads
`description/task_instruction/<task>.json` from the RoboTwin clone
(cwd is /opt/robotwin in CI). Falls back to the task name if missing."""
out: dict[str, str] = {}
root = Path("description/task_instruction")
for name in (t.strip() for t in task_names.split(",") if t.strip()):
desc_file = root / f"{name}.json"
desc = name.replace("_", " ")
if desc_file.is_file():
data = json.loads(desc_file.read_text())
full = data.get("full_description") or desc
# Strip the schema placeholders ({A}, {a}) — keep the sentence readable.
desc = full.replace("<", "").replace(">", "")
out[f"{name}_0"] = desc
return out
def _robocasa_descriptions(task_spec: str) -> dict[str, str]:
"""For each task in the comma-separated list, emit a cleaned-name label.
RoboCasa episodes carry their language instruction in the env's
`ep_meta['lang']`, populated per reset. Pulling it requires spinning
up the full kitchen env per task (~seconds each); we use the task
name as the key here and let the eval's episode info carry the
actual instruction.
"""
out: dict[str, str] = {}
for task in (t.strip() for t in task_spec.split(",") if t.strip()):
# Split CamelCase into words: "CloseFridge" → "close fridge".
label = "".join(f" {c.lower()}" if c.isupper() else c for c in task).strip()
out[f"{task}_0"] = label or task
return out
_ROBOMME_DESCRIPTIONS = {
"BinFill": "Fill the target bin with the correct number of cubes",
"PickXtimes": "Pick the indicated cube the specified number of times",
"SwingXtimes": "Swing the object the specified number of times",
"StopCube": "Grasp and stop the moving cube",
"VideoUnmask": "Pick the cube shown in the reference video",
"VideoUnmaskSwap": "Pick the cube matching the reference video after a swap",
"ButtonUnmask": "Press the button indicated by the reference",
"ButtonUnmaskSwap": "Press the correct button after objects are swapped",
"PickHighlight": "Pick the highlighted cube",
"VideoRepick": "Repick the cube shown in the reference video",
"VideoPlaceButton": "Place the cube on the button shown in the video",
"VideoPlaceOrder": "Place cubes in the order shown in the video",
"MoveCube": "Move the cube to the target location",
"InsertPeg": "Insert the peg into the target hole",
"PatternLock": "Unlock the pattern by pressing buttons in sequence",
"RouteStick": "Route the stick through the required waypoints",
}
def _robomme_descriptions(task_names: str, task_ids: list[int] | None = None) -> dict[str, str]:
"""Return descriptions for each requested RoboMME task. Keys match the
video filename pattern `<task>_<task_id>` used by the eval script."""
if task_ids is None:
task_ids = [0]
out: dict[str, str] = {}
for name in (t.strip() for t in task_names.split(",") if t.strip()):
desc = _ROBOMME_DESCRIPTIONS.get(name, name)
for tid in task_ids:
out[f"{name}_{tid}"] = desc
return out
def _vlabench_descriptions(task_spec: str) -> dict[str, str]:
"""For each task in the comma-separated list, emit a cleaned-name label.
VLABench tasks carry language instructions on their dm_control task
object, but pulling them requires loading the full env per task
(~seconds each). The CI smoke-eval already captures the instruction
inside its episode info; this mapping is just enough to key
`metrics.json` by `<task>_0`.
"""
out: dict[str, str] = {}
for task in (t.strip() for t in task_spec.split(",") if t.strip()):
out[f"{task}_0"] = task.replace("_", " ").strip()
return out
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--env", required=True, help="Environment family (libero, metaworld, ...)")
parser.add_argument("--task", required=True, help="Task/suite name (e.g. libero_spatial)")
parser.add_argument(
"--task-ids",
type=str,
default=None,
help="Comma-separated task IDs (e.g. '0,1,2'). Default: [0]",
)
parser.add_argument("--output", required=True, help="Path to write task_descriptions.json")
args = parser.parse_args()
task_ids: list[int] | None = None
if args.task_ids:
task_ids = [int(x.strip()) for x in args.task_ids.split(",")]
descriptions: dict[str, str] = {}
try:
if args.env == "libero":
if args.env == ("libero", "libero_plus"):
descriptions = _libero_descriptions(args.task)
elif args.env == "metaworld":
descriptions = _metaworld_descriptions(args.task)
elif args.env == "robotwin":
descriptions = _robotwin_descriptions(args.task)
elif args.env == "robocasa":
descriptions = _robocasa_descriptions(args.task)
elif args.env == "robomme":
descriptions = _robomme_descriptions(args.task, task_ids=task_ids)
elif args.env == "vlabench":
descriptions = _vlabench_descriptions(args.task)
else:
print(
f"[extract_task_descriptions] No description extractor for env '{args.env}'.",
+15
View File
@@ -0,0 +1,15 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,36 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Steerable annotation pipeline producing ``language_persistent`` and
``language_events`` columns for LeRobot datasets.
The pipeline is decomposed into three independently runnable modules whose
outputs are staged per-episode before a final parquet rewrite:
- :mod:`.modules.plan_subtasks_memory` (Module 1) — persistent styles
- :mod:`.modules.interjections_and_speech` (Module 2) — event styles + speech
- :mod:`.modules.general_vqa` (Module 3) — event-style VQA pairs
"""
from .config import AnnotationPipelineConfig
from .validator import StagingValidator, ValidationReport
from .writer import LanguageColumnsWriter
__all__ = [
"AnnotationPipelineConfig",
"LanguageColumnsWriter",
"StagingValidator",
"ValidationReport",
]
@@ -0,0 +1,260 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@dataclass
class Module1Config:
"""Module 1 hyperparameters: plan + subtasks + memory + task augmentation.
Subtask decomposition sees the **whole episode** as one Qwen-VL video
block — no keyframe stride or count: the model handles temporal pooling
itself and decides where to cut. ``max_video_frames`` only caps the
number of frames packed into the video block (a model-capacity bound,
not an annotation-logic knob).
"""
enabled: bool = True
n_task_rephrasings: int = 10
"""Number of task rephrasings to generate at ``t=0`` as ``task_aug``
persistent rows (PR 1 ``CORE_STYLES``). The renderer's ``${task}``
binding rotates among them deterministically per ``sample_idx``,
realizing Xiao 2022 / CAST-style task-prompt diversity without
touching ``meta/tasks.parquet``. Set to 0 to disable."""
derive_task_from_video: str = "if_short"
"""When to bypass the user-provided ``record.episode_task`` and
derive a fresh task description from the episode video alone:
- ``off`` never; always use the canonical task as the basis.
- ``if_short`` derive when the canonical task is empty, has fewer
than ``derive_task_min_words`` words, or matches a
placeholder string (``debug``, ``unnamed``, ``tbd``,
...). Default — fixes noisy / placeholder tasks
without forcing derivation everywhere.
- ``always`` ignore the canonical task entirely; always derive
from the video. Useful when the dataset's task
labels are uniformly bad.
The video-derived task replaces the canonical task as the basis for
subtask decomposition, plan, memory, AND the ``task_aug`` rephrasings,
so every downstream annotation is grounded in what's actually visible.
``meta/tasks.parquet`` is NOT modified — the Module-1-derived task
only lives in ``language_persistent`` rows."""
derive_task_min_words: int = 3
"""Word-count threshold for ``derive_task_from_video=if_short``."""
frames_per_second: float = 1.0
"""Sample one image-frame per ``1/fps`` seconds across the episode for
Module 1's subtask-decomposition prompt. ``1.0`` = 1 fps. Capped by
``max_video_frames`` to avoid blowing up the request payload."""
max_video_frames: int = 128
"""Hard cap on the number of frames Module 1 sends. With ``fps=1`` and
a 30 s episode this yields 30 frames. Bumped from 32 since each frame
is small (~30-100 KB PNG when base64'd)."""
min_subtask_seconds: float = 1.5
plan_max_steps: int = 8
use_video_url: bool = False
"""When True (and backend supports it, e.g. ``openai``), Module 1
sends a ``video_url`` content block pointing at the episode's mp4
file instead of pre-decoded frames. Lets the server sample frames at
its own ``fps`` — no in-process conv3d cost. The video file is
extracted as a per-episode subclip to ``staging/.video_clips/`` so
the model sees only this episode's frames."""
use_video_url_fps: float = 1.0
"""Frame-rate hint to send to the server (mm_processor_kwargs.fps).
Only used when ``use_video_url=True``. ``1.0`` = sample 1 frame per
second, which is plenty for subtask-boundary detection on most
manipulation episodes."""
@dataclass
class Module2Config:
"""Module 2 hyperparameters: interjections + paired speech."""
enabled: bool = True
max_interjections_per_episode: int = 3
"""Number of mid-episode interjections to generate per episode. Each
creates a paired ``(interjection, speech)`` event row plus triggers a
``plan`` refresh at the same timestamp via Module 1. Bumped from the
original ``1`` after qwen36moe-10 showed plan/interjection coverage
was too sparse for Hi Robot-style training."""
interjection_min_t: float = 2.0
interjection_window_seconds: float = 2.0
"""How many seconds of video to attach to the interjection prompt as
visual context. Without this the VLM only sees a single frozen frame
and writes generic interjections that aren't grounded in the actual
motion happening at the chosen timestamp."""
interjection_window_frames: int = 4
"""How many frames to sample over ``interjection_window_seconds``.
Default 4 ⇒ ~0.5 fps over the leading 2 seconds — enough for the
model to read the ongoing motion, cheap enough to keep prompt size
bounded for the 32k context."""
@dataclass
class Module3Config:
"""Module 3 hyperparameters: general VQA."""
enabled: bool = True
vqa_emission_hz: float = 1.0
K: int = 3
question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
@dataclass
class VlmConfig:
"""Shared Qwen-VL client configuration."""
backend: str = "openai"
"""One of ``vllm``, ``transformers``, ``openai``, or ``stub`` (tests only).
Default ``openai`` talks to a local OpenAI-compatible server (vllm /
transformers) which the CLI auto-spawns when ``auto_serve=True``."""
model_id: str = "Qwen/Qwen2.5-VL-7B-Instruct"
api_base: str = "http://localhost:8000/v1"
"""Base URL for the ``openai`` backend."""
api_key: str = "EMPTY"
"""API key for the ``openai`` backend; ``EMPTY`` works for local servers."""
auto_serve: bool = True
"""When True with ``backend=openai``, the CLI probes ``api_base``
first; if no server answers, it spawns one (default:
``transformers serve``), waits for it to be ready, runs the
pipeline, and tears it down on exit. Default ``True`` so a single
``lerobot-annotate`` call can drive the whole flow. Set to ``False``
if you want to fail fast when no server is reachable (e.g. you're
pointing at a remote endpoint that should already be up)."""
serve_port: int = 8000
"""Port the auto-spawned server binds to. Sets ``api_base`` automatically."""
serve_command: str | None = None
"""Override the auto-serve command (full shell command). When ``None``,
we run ``transformers serve <model_id> --port <serve_port> --continuous-batching``.
When ``parallel_servers > 1``, the literal ``{port}`` placeholder in
this command (if present) is substituted per-replica."""
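# Example: serve_command="vllm serve <model> --port {port}" with parallel_servers=2
# launches two replicas listening on serve_port and serve_port + 1.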
parallel_servers: int = 1
"""When >1, spawn this many independent inference servers (each pinned
to a GPU via ``CUDA_VISIBLE_DEVICES`` and listening on
``serve_port + i``) and round-robin client requests across them.
Useful when DP/TP NCCL setup is broken on the node — single-GPU
replicas don't need cross-GPU communication. When
``parallel_servers > num_gpus``, replicas are round-robin-assigned
to GPUs (e.g. 4 replicas on 2 GPUs → 0,1,0,1)."""
num_gpus: int = 0
"""How many physical GPUs are available for round-robin replica
placement. ``0`` means ``parallel_servers`` (one GPU per replica,
backward-compatible default). Set this to ``2`` with
``parallel_servers=4`` to pack 2 replicas per GPU."""
client_concurrency: int = 16
"""Maximum number of in-flight chat requests the client issues in
parallel. vllm batches them internally for free, so bumping this
typically gives big throughput wins on a single TP=1 server. Set to
``1`` for strict serial calls."""
serve_ready_timeout_s: float = 600.0
"""Max seconds to wait for the server to start serving requests."""
max_new_tokens: int = 512
temperature: float = 0.2
json_mode: bool = True
batch_size: int = 4
tensor_parallel_size: int = 1
gpu_memory_utilization: float = 0.9
"""Fraction of GPU memory vllm allocates for weights + KV cache.
Lower (e.g. 0.7) when the vision encoder needs cuDNN workspace, or to
avoid CUDNN_STATUS_NOT_INITIALIZED on tight VRAM (30B BF16 on 80 GB)."""
max_model_len: int | None = None
"""Cap context length. ``None`` keeps the model's default; on H100 80 GB
a 30B BF16 model often needs ``max_model_len=8192`` or smaller to leave
room for KV cache."""
trust_remote_code: bool = False
"""Pass ``trust_remote_code`` to HF auto-classes. Default ``False`` —
only enable for models that actually ship custom code in their repo
(rare for first-class VL releases). On Qwen3-VL it triggers an
std::bad_alloc post-load even though the official transformers class
is sufficient, so leaving this off is safest."""
camera_key: str | None = None
"""Override the camera stream used for keyframe attachment. ``None`` picks
the first ``observation.images.*`` key the dataset declares."""
chat_template_kwargs: dict[str, Any] | None = None
"""Forwarded as ``extra_body.chat_template_kwargs`` on every chat call.
Use this to pass model-specific template flags such as
``{"enable_thinking": false}`` for Qwen3.5/Qwen3.6 to suppress the
reasoning preamble that otherwise eats the entire ``max_new_tokens``
budget before any JSON is emitted."""
@dataclass
class ExecutorConfig:
"""Executor selection and SLURM hyperparameters."""
auto_threshold: int = 32
force_local: bool = False
slurm_partition: str | None = None
slurm_gpus: int = 1
slurm_time: str = "06:00:00"
workers: int = 1
episode_parallelism: int = 16
"""Number of episodes processed concurrently within each module phase.
Each in-flight episode sends 35 dependent VLM calls; bumping this is
how you actually saturate ``parallel_servers`` and ``client_concurrency``
— without it, the executor loops one episode at a time and the
inference servers sit ~90% idle. Set to ``1`` for strict serial
execution."""
@dataclass
class AnnotationPipelineConfig:
"""Top-level config for ``lerobot-annotate``.
Mirrors the structure of :class:`lerobot.configs.train.TrainPipelineConfig`:
a draccus-parsed dataclass that contains nested per-module sub-configs and
leaves the dataset, executor, and VLM choices independently knobbable.
Output is always in-place: the writer rewrites ``data/chunk-*/file-*.parquet``
in place. Multiple revisions of the same dataset live in separate copies.
"""
repo_id: str | None = None
root: Path | None = None
staging_dir: Path | None = None
"""If unset, defaults to ``<root>/.annotate_staging/``."""
seed: int = 1729
module_1: Module1Config = field(default_factory=Module1Config)
module_2: Module2Config = field(default_factory=Module2Config)
module_3: Module3Config = field(default_factory=Module3Config)
vlm: VlmConfig = field(default_factory=VlmConfig)
executor: ExecutorConfig = field(default_factory=ExecutorConfig)
skip_validation: bool = False
only_episodes: tuple[int, ...] | None = None
push_to_hub: str | None = None
"""If set, after the pipeline completes, upload the annotated dataset
root to the Hugging Face Hub as a dataset repo with this id (e.g.
``pepijn/super_poulain_steerable``). Creates the repo if missing."""
push_private: bool = False
"""When ``push_to_hub`` is set, create the repo as private."""
push_commit_message: str | None = None
"""Override the commit message used for the hub upload."""
def resolved_staging_dir(self, root: Path) -> Path:
return self.staging_dir if self.staging_dir is not None else root / ".annotate_staging"
@@ -0,0 +1,263 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Executor selection: local vs SLURM via datatrove.
The executor plans **six phases**, executed in the dependency order from the plan:
phase 1: Module 1 (plan + subtasks + memory)
phase 2: Module 2 (interjections + speech)
phase 3: Module 1 plan-update pass — re-runs plan emission at every
interjection timestamp produced by phase 2
phase 4: Module 3 (VQA)
phase 5: validator
phase 6: writer
Phase 3 is why ``executor.py`` documents the dependency: Module 1 must be
re-entered after Module 2 to refresh ``plan`` rows at interjection times.
"""
from __future__ import annotations
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Any
from .config import AnnotationPipelineConfig, ExecutorConfig
from .reader import EpisodeRecord, iter_episodes
from .staging import EpisodeStaging
from .validator import StagingValidator
from .writer import LanguageColumnsWriter
logger = logging.getLogger(__name__)
@dataclass
class PhaseResult:
"""Summary of one pipeline phase across all episodes."""
name: str
episodes_processed: int
episodes_skipped: int
@dataclass
class PipelineRunSummary:
"""Aggregated result returned by :meth:`Executor.run`."""
phases: list[PhaseResult]
written_paths: list[Path]
validation_report: Any # ValidationReport, kept Any to avoid import cycle
def select_executor_class(num_episodes: int, config: ExecutorConfig) -> str:
"""Return ``"local"`` or ``"slurm"`` based on the threshold.
The plan's "executor selection threshold" lives in
:class:`ExecutorConfig.auto_threshold`. ``force_local`` always wins.
"""
if config.force_local:
return "local"
return "local" if num_episodes <= config.auto_threshold else "slurm"
@dataclass
class Executor:
"""Run all four phases over a dataset root.
The executor is intentionally framework-agnostic: by default it runs the
phases inline (suitable for tests, small datasets, and the CLI's
``--force-local`` mode). It will optionally hand off to datatrove's
:class:`LocalPipelineExecutor` or :class:`SlurmPipelineExecutor` when those
are installed and the dataset is large enough to benefit from them.
Tests construct the executor directly with stub modules.
"""
config: AnnotationPipelineConfig
module_1: Any # PlanSubtasksMemoryModule
module_2: Any # InterjectionsAndSpeechModule
module_3: Any # GeneralVqaModule
writer: LanguageColumnsWriter
validator: StagingValidator
def run(self, root: Path) -> PipelineRunSummary:
records = list(iter_episodes(root, only_episodes=self.config.only_episodes))
n = len(records)
if n == 0:
raise ValueError(f"No episodes found under {root}/data/")
executor_kind = select_executor_class(n, self.config.executor)
print(f"[annotate] {n} episodes total; executor={executor_kind}", flush=True)
staging_dir = self.config.resolved_staging_dir(root)
staging_dir.mkdir(parents=True, exist_ok=True)
phases: list[PhaseResult] = []
# Phase 1: Module 1 (plan + subtasks + memory)
phases.append(self._run_module_phase("module_1", records, staging_dir, self.module_1))
# Phase 2: Module 2 (interjections + speech). Module 2 reads
# Module 1's subtask rows from the same staging tree to ground
# the interjection prompt in the correct local subtask.
phases.append(self._run_module_phase("module_2", records, staging_dir, self.module_2))
# Phase 3: Module 1 plan-update pass at interjection timestamps.
phases.append(self._run_plan_update_phase(records, staging_dir))
# Phase 4: Module 3 (VQA)
phases.append(self._run_module_phase("module_3", records, staging_dir, self.module_3))
print("[annotate] running validator...", flush=True)
report = self.validator.validate(records, staging_dir)
if not report.ok and not self.config.skip_validation:
raise RuntimeError(f"Staging validation failed: {report.summary()}")
print(f"[annotate] validator: {report.summary()}", flush=True)
print(f"[annotate] writing parquet shards into {root}/data/...", flush=True)
written = self.writer.write_all(records, staging_dir, root)
print(f"[annotate] wrote {len(written)} shard(s); pipeline complete", flush=True)
# Persist the tool catalog to meta/info.json so chat-template
# consumers (PR 3 SmolVLA2 / Pi0.5 / dataset visualizer) can read
# it via ``LeRobotDatasetMetadata.tools`` (PR 1). Idempotent and
# additive: anything the user pre-populated is preserved; we only
# ensure the canonical ``say`` schema is present.
self._ensure_tools_in_info(root)
return PipelineRunSummary(phases=phases, written_paths=written, validation_report=report)
def _ensure_tools_in_info(self, root: Path) -> None:
"""Write ``meta/info.json["tools"]`` if missing the canonical ``say``.
Reads any user-declared tools already in ``info.json`` and merges
the canonical ``SAY_TOOL_SCHEMA`` into the list (deduped by
``function.name``). Writes back to disk only if the list
changed.
"""
import json # noqa: PLC0415
from lerobot.datasets.language import SAY_TOOL_SCHEMA # noqa: PLC0415
info_path = root / "meta" / "info.json"
if not info_path.exists():
return
try:
info = json.loads(info_path.read_text())
except Exception as exc: # noqa: BLE001
print(f"[annotate] could not read {info_path}: {exc}", flush=True)
return
existing = info.get("tools")
if not isinstance(existing, list):
existing = []
names = {
(t.get("function") or {}).get("name")
for t in existing
if isinstance(t, dict)
}
merged = list(existing)
if SAY_TOOL_SCHEMA["function"]["name"] not in names:
merged.append(SAY_TOOL_SCHEMA)
if merged != existing:
info["tools"] = merged
info_path.write_text(json.dumps(info, indent=2))
print(
f"[annotate] meta/info.json: tools={[t['function']['name'] for t in merged]}",
flush=True,
)
def _run_module_phase(
self,
name: str,
records: list[EpisodeRecord],
staging_dir: Path,
module: Any,
) -> PhaseResult:
import time as _time # noqa: PLC0415
from concurrent.futures import ThreadPoolExecutor, as_completed # noqa: PLC0415
if not module.enabled:
print(f"[annotate] phase={name} skipped (module disabled)", flush=True)
return PhaseResult(name=name, episodes_processed=0, episodes_skipped=len(records))
n = len(records)
parallelism = max(1, min(self.config.executor.episode_parallelism, n))
print(
f"[annotate] phase={name} starting on {n} episode(s) "
f"(parallelism={parallelism})",
flush=True,
)
t0 = _time.time()
def _do(idx_record: tuple[int, EpisodeRecord]) -> tuple[int, int, float]:
i, record = idx_record
ep_start = _time.time()
staging = EpisodeStaging(staging_dir, record.episode_index)
module.run_episode(record, staging)
return i, record.episode_index, _time.time() - ep_start
processed = 0
if parallelism == 1:
for i, record in enumerate(records, 1):
_, ep_idx, elapsed = _do((i, record))
processed += 1
print(
f"[annotate] {name} episode {i}/{n} "
f"(idx={ep_idx}) done in {elapsed:.1f}s",
flush=True,
)
else:
with ThreadPoolExecutor(max_workers=parallelism) as pool:
futures = [pool.submit(_do, (i, r)) for i, r in enumerate(records, 1)]
for fut in as_completed(futures):
i, ep_idx, elapsed = fut.result()
processed += 1
print(
f"[annotate] {name} episode {processed}/{n} "
f"(idx={ep_idx}, submit_order={i}) done in {elapsed:.1f}s",
flush=True,
)
total = _time.time() - t0
print(f"[annotate] phase={name} complete: {processed}/{n} in {total:.1f}s", flush=True)
return PhaseResult(name=name, episodes_processed=processed, episodes_skipped=0)
def _run_plan_update_phase( # noqa: PLR0915
self, records: list[EpisodeRecord], staging_dir: Path
) -> PhaseResult:
"""Re-emit ``plan`` rows at each interjection timestamp from Module 2.
Module 1 owns the prompt; Module 2 produced the timestamps. This phase
therefore calls back into Module 1 with the interjection timestamps so
Module 1's existing prompt path is reused.
"""
if not self.module_1.enabled or not self.module_2.enabled:
return PhaseResult(
name="module_1_plan_update", episodes_processed=0, episodes_skipped=len(records)
)
processed = 0
for record in records:
staging = EpisodeStaging(staging_dir, record.episode_index)
interjection_rows = [
row
for row in staging.read("module_2")
if row.get("style") == "interjection"
]
interjection_times = [float(row["timestamp"]) for row in interjection_rows]
interjection_texts = [str(row.get("content") or "") for row in interjection_rows]
if interjection_times:
self.module_1.run_plan_updates(
record, staging, interjection_times, interjection_texts
)
processed += 1
return PhaseResult(name="module_1_plan_update", episodes_processed=processed, episodes_skipped=0)
@@ -0,0 +1,400 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keyframe extraction for the annotation pipeline.
Modules attach decoded camera frames to their VLM prompts so the model can
ground subtask decomposition, interjection scenarios, and VQA in actual
visual content. The pipeline shares one provider across modules and one
episode at a time, with a small per-episode cache so multiple modules
querying the same timestamp pay decode cost once.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Protocol
from .reader import EpisodeRecord
class FrameProvider(Protocol):
"""Decodes camera frames at episode-relative timestamps."""
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` feature keys this provider can decode."""
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
"""Return one PIL.Image per timestamp from ``camera_key`` (or default).
Empty list if the camera is unavailable. ``camera_key=None`` falls back
to the provider's default camera so existing single-camera callers
(Module 1, Module 2) keep working unchanged.
"""
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` PIL images covering the whole episode.
Sampling is uniform across the episode duration. The returned list is
intended to be passed as one ``{"type":"video", "video":<list>}``
block to a Qwen-VL-compatible model that pools temporally itself.
Empty list if no camera available.
"""
@dataclass
class _NullProvider:
"""No-op provider used when the dataset has no video keys or in tests."""
@property
def camera_keys(self) -> list[str]:
return []
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
return []
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
return []
def null_provider() -> FrameProvider:
return _NullProvider()
@dataclass
class VideoFrameProvider:
"""Decodes frames from the dataset's ``observation.images.*`` streams.
By default the *first* camera key is used for Module 1 (subtask
decomposition) and Module 2 (interjection scenarios) — those prompts care
about *what is happening*, not which angle. Module 3 (VQA) instead
iterates over every camera in :attr:`camera_keys` so each frame's
grounded answer (bbox/keypoint/...) is tagged with the camera it was
grounded against.
``camera_key`` overrides the default-camera choice but does not restrict
:attr:`camera_keys`. Pass ``camera_key`` explicitly to ``frames_at`` /
``video_for_episode`` to read a non-default stream.
Caches up to ``cache_size`` decoded frames per process to keep
co-timestamped Module 2 + Module 1 plan-update calls cheap.
"""
root: Path
camera_key: str | None = None
tolerance_s: float = 1e-2
cache_size: int = 256
_meta: Any = field(default=None, init=False, repr=False)
_cache: dict = field(default_factory=dict, init=False, repr=False)
_camera_keys: list[str] = field(default_factory=list, init=False, repr=False)
def __post_init__(self) -> None:
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata # noqa: PLC0415
self._meta = LeRobotDatasetMetadata(repo_id="local", root=self.root)
# ``camera_keys`` covers both image- and video-stored cameras
# (``video_keys`` is video-only). Some datasets declare cameras with
# ``dtype=image``, which would otherwise look empty here and silently
# disable Module 3 even though the videos are there.
keys = list(getattr(self._meta, "camera_keys", None) or self._meta.video_keys or [])
# Last-resort fallback: if metadata didn't surface anything but the
# caller explicitly named a camera (``--vlm.camera_key=...``), trust
# them — the key is by definition known to exist on the dataset.
if not keys and self.camera_key:
keys = [self.camera_key]
self._camera_keys = keys
if self.camera_key is None:
self.camera_key = keys[0] if keys else None
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` keys available on this dataset."""
return list(self._camera_keys)
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
target = camera_key if camera_key is not None else self.camera_key
if not timestamps or target is None:
return []
out: list[Any] = []
misses: list[float] = []
miss_indices: list[int] = []
for i, ts in enumerate(timestamps):
key = (record.episode_index, target, round(float(ts), 6))
cached = self._cache.get(key)
if cached is not None:
out.append(cached)
else:
out.append(None)
misses.append(float(ts))
miss_indices.append(i)
if misses:
decoded = self._decode(record.episode_index, misses, target)
# decoder may return fewer frames than requested when some
# timestamps fall outside the video; pair what we have and
# leave the rest as None to be filtered below.
for i, img in zip(miss_indices, decoded):
out[i] = img
key = (record.episode_index, target, round(float(timestamps[i]), 6))
if len(self._cache) >= self.cache_size:
self._cache.pop(next(iter(self._cache)))
self._cache[key] = img
# filter out any None left over from decode failures
return [img for img in out if img is not None]
def _decode(
self, episode_index: int, timestamps: list[float], camera_key: str
) -> list[Any]:
ep = self._meta.episodes[episode_index]
from_timestamp = ep[f"videos/{camera_key}/from_timestamp"]
shifted = [from_timestamp + ts for ts in timestamps]
video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)
try:
return _decode_pyav_direct(video_path, shifted, self.tolerance_s)
except Exception as exc:
# Log loudly the first time decoding fails so silent
# Module-3-no-op (every prompt skipped because frames_at returned
# []) is debuggable from the job log instead of post-hoc parquet
# inspection. Subsequent failures stay quiet.
if not getattr(self, "_warned_decode_fail", False):
import logging # noqa: PLC0415
logging.getLogger(__name__).warning(
"VideoFrameProvider._decode failed for episode=%s camera=%s "
"video_path=%s: %s",
episode_index,
camera_key,
video_path,
exc,
exc_info=True,
)
self._warned_decode_fail = True
return []
def _decode_pyav_direct(
video_path: Any, timestamps: list[float], tolerance_s: float
) -> list[Any]:
"""Decode the requested timestamps from ``video_path`` using PyAV directly.
Bypasses ``lerobot.datasets.video_utils.decode_video_frames`` entirely
because its "pyav" path actually goes through
``decode_video_frames_torchvision`` → ``torchvision.io.VideoReader``,
which was removed in torchvision >= 0.22 (the vllm/vllm-openai:latest
container ships with torchvision 0.25). The annotation pipeline only
needs a handful of PIL images per (episode, ts), so we can decode them
with PyAV without any torch dependency at all.
Returns one ``PIL.Image`` per requested timestamp, in the same order.
Any timestamp the decoder couldn't reach is silently dropped (mirrors
the previous behaviour); callers filter ``None``/missing entries.
"""
import av # noqa: PLC0415
from PIL import Image # noqa: PLC0415
if not timestamps:
return []
targets = sorted(set(timestamps))
seek_to = max(0.0, min(targets) - max(0.5, tolerance_s))
container = av.open(str(video_path))
try:
stream = container.streams.video[0]
# PyAV needs the seek target in stream timebase ticks.
if stream.time_base is None:
seek_pts = 0
else:
seek_pts = int(seek_to / float(stream.time_base))
try:
container.seek(seek_pts, any_frame=False, backward=True, stream=stream)
except av.AVError:
# Some streams reject the explicit seek; fall back to decoding from start.
container.seek(0)
results: dict[float, Any] = {}
target_iter = iter(targets)
next_target = next(target_iter, None)
for frame in container.decode(stream):
if next_target is None:
break
ts = float(frame.pts * frame.time_base) if frame.pts is not None else None
if ts is None:
continue
# Walk past targets we've already overshot — we keep the closest
# frame within tolerance.
while next_target is not None and ts >= next_target - tolerance_s:
if abs(ts - next_target) <= tolerance_s or ts >= next_target:
img = frame.to_image() # PIL.Image.Image (RGB)
results.setdefault(next_target, img)
next_target = next(target_iter, None)
else:
break
finally:
container.close()
return [results[ts] for ts in timestamps if ts in results]
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` images uniformly sampled across the episode.
The whole episode duration is covered; the model picks subtask
boundaries from the temporal pooling it does internally.
"""
target = camera_key if camera_key is not None else self.camera_key
if max_frames <= 0 or target is None or not record.frame_timestamps:
return []
n_frames = min(max_frames, len(record.frame_timestamps))
if n_frames == len(record.frame_timestamps):
timestamps = list(record.frame_timestamps)
else:
t0 = record.frame_timestamps[0]
t_last = record.frame_timestamps[-1]
if t_last <= t0:
timestamps = [float(t0)] * n_frames
else:
step = (t_last - t0) / (n_frames - 1) if n_frames > 1 else 0.0
timestamps = [float(t0 + i * step) for i in range(n_frames)]
return self.frames_at(record, timestamps, camera_key=target)
def make_frame_provider(root: Path, camera_key: str | None = None) -> FrameProvider:
"""Build a :class:`VideoFrameProvider` if videos are present, else null."""
try:
provider = VideoFrameProvider(root=root, camera_key=camera_key)
except Exception:
return null_provider()
if provider.camera_key is None:
return null_provider()
return provider
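# Illustrative sketch only (not part of the module surface): how a module
# typically obtains a provider and turns single-frame grabs into prompt
# blocks. The timestamps below are placeholder values.
def _example_frames_usage(root: Path, record: EpisodeRecord) -> list[dict[str, Any]]:
    provider = make_frame_provider(root)
    # Two single-frame grabs from the default camera; repeat lookups at the
    # same (episode, camera, timestamp) key hit the per-process cache.
    frames = provider.frames_at(record, [1.5, 3.0])
    return to_image_blocks(frames)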
def to_image_blocks(images: list[Any]) -> list[dict[str, Any]]:
"""Convert PIL images to Qwen-VL-compatible content blocks."""
return [{"type": "image", "image": img} for img in images]
def to_video_block(images: list[Any]) -> list[dict[str, Any]]:
"""Wrap a list of PIL images as one Qwen-VL video block.
Returns ``[]`` when the list is empty, so the caller can splat the result
into a content array without a separate emptiness check.
"""
if not images:
return []
return [{"type": "video", "video": list(images)}]
def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
"""Wrap a video file URL as one ``video_url`` block.
Used by the ``openai`` backend (transformers serve / vllm serve /
ktransformers serve), where the server handles frame sampling.
Returns ``[]`` when ``url`` is ``None`` so the caller can splat.
"""
if not url:
return []
return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]
def episode_clip_path(
record: EpisodeRecord,
provider: "VideoFrameProvider",
cache_dir: Path,
) -> Path | None:
"""Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.
Returns ``None`` if the dataset has no video tracks. Skips re-extract
when the cached clip already exists. Re-encodes to H.264
(libx264) so the resulting mp4 is decodable by every downstream
video processor — stream-copy would inherit the source codec
(often AV1 in modern LeRobot datasets), which vllm's libav build
cannot decode.
"""
import subprocess # noqa: PLC0415
if provider.camera_key is None:
return None
cache_dir.mkdir(parents=True, exist_ok=True)
out_path = cache_dir / f"ep_{record.episode_index:06d}.mp4"
if out_path.exists() and out_path.stat().st_size > 0:
return out_path
ep = provider._meta.episodes[record.episode_index]
from_timestamp = float(ep[f"videos/{provider.camera_key}/from_timestamp"])
to_timestamp = float(ep[f"videos/{provider.camera_key}/to_timestamp"])
src = provider.root / provider._meta.get_video_file_path(
record.episode_index, provider.camera_key
)
cmd = [
"ffmpeg",
"-y",
"-loglevel",
"error",
"-ss",
f"{from_timestamp:.3f}",
"-to",
f"{to_timestamp:.3f}",
"-i",
str(src),
"-c:v",
"libx264",
"-preset",
"ultrafast",
"-crf",
"23",
"-pix_fmt",
"yuv420p",
"-an",
str(out_path),
]
try:
subprocess.run(cmd, check=True, timeout=300)
except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
return None
return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
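# Illustrative sketch only: pairing ``episode_clip_path`` with
# ``to_video_url_block`` for the openai-compatible backend, where the server
# samples frames from the clip itself. ``fps=2.0`` is just an example value.
def _example_clip_block(
    record: EpisodeRecord, provider: "VideoFrameProvider", cache_dir: Path
) -> list[dict[str, Any]]:
    clip = episode_clip_path(record, provider, cache_dir)
    if clip is None:
        # No video tracks or ffmpeg failed — the caller falls back to images.
        return []
    return to_video_url_block(f"file://{clip}", fps=2.0)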
@@ -0,0 +1,25 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .general_vqa import GeneralVqaModule
from .interjections_and_speech import InterjectionsAndSpeechModule
from .plan_subtasks_memory import PlanSubtasksMemoryModule
__all__ = [
"GeneralVqaModule",
"InterjectionsAndSpeechModule",
"PlanSubtasksMemoryModule",
]
@@ -0,0 +1,238 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module 3: general VQA at a timed cadence.
Anchors ``K`` (question, answer) pairs to ``K`` consecutive frames per
emission. For datasets with multiple cameras, every emission tick produces
one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
generated against that camera's frame and stamped with the matching
``camera`` field on the emitted rows. The resolver disambiguates via
``camera=...``; recipes that consume VQA do so through one sub-recipe
per camera (see ``recipes/pi05_hirobot.yaml``).
Within a single (frame, camera) we still emit at most one ``(vqa, user)``
and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
Question types covered (per the plan's Module 3 table): bbox, keypoint,
count, attribute, spatial. The assistant's ``content`` is a JSON string
whose schema depends on the question type. Malformed JSON triggers one
retry inside :meth:`VlmClient.generate_json`.
"""
from __future__ import annotations
import json
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import Module3Config
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord
from ..staging import EpisodeStaging
from ..validator import classify_vqa_answer
from ..vlm_client import VlmClient
def _emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
"""Return the relative frame indices to anchor VQA emissions to.
For each emission tick (every ``1/hz`` seconds), we anchor ``k``
consecutive frames starting at the tick. Ticks fall on the nearest
available source frame timestamp.
"""
if hz <= 0 or k <= 0 or not frame_timestamps:
return []
t0 = frame_timestamps[0]
t_last = frame_timestamps[-1]
period = 1.0 / hz
indices: list[int] = []
t = t0
while t <= t_last + 1e-9:
# find the index of the nearest frame to t
nearest_i = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
for offset in range(k):
j = nearest_i + offset
if j >= len(frame_timestamps):
break
if not indices or indices[-1] != j:
indices.append(j)
t += period
# dedupe while preserving order
seen: set[int] = set()
deduped: list[int] = []
for i in indices:
if i in seen:
continue
seen.add(i)
deduped.append(i)
return deduped
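# Worked example (illustrative values): 31 source frames at 10 Hz
# (timestamps 0.0 .. 3.0), ``hz=0.5`` and ``k=2`` — ticks land at t=0.0 s
# and t=2.0 s, anchoring two consecutive frames per tick:
#
#     _emission_anchor_indices([i * 0.1 for i in range(31)], hz=0.5, k=2)
#     # -> [0, 1, 20, 21]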
@dataclass
class GeneralVqaModule:
"""Emit grounded VQA pairs at a timed cadence."""
vlm: VlmClient
config: Module3Config
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
if not record.frame_timestamps:
staging.write("module_3", [])
return
rng = random.Random(f"{self.seed}:{record.episode_index}:vqa")
anchor_idx = _emission_anchor_indices(
record.frame_timestamps, self.config.vqa_emission_hz, self.config.K
)
cameras = self._target_cameras()
if not cameras:
# No camera available — emit nothing rather than producing
# untagged rows that would fail validation. Surface a loud one-
# time warning so this is never silently a no-op.
if not getattr(self, "_warned_no_camera", False):
import logging # noqa: PLC0415
logging.getLogger(__name__).warning(
"Module 3 (VQA) found no cameras on the frame provider — "
"every episode will emit zero VQA rows. Check that the "
"dataset declares observation.images.* features in "
"meta/info.json; passing --vlm.camera_key=<key> at the "
"CLI now also seeds the cameras list as a fallback."
)
self._warned_no_camera = True
staging.write("module_3", [])
return
# Build all messages first (one per (frame, camera)), then issue them
# as a single batched generate_json call so the client can fan them
# out concurrently.
per_call: list[tuple[float, str, str, list[dict[str, Any]]]] = []
for idx in anchor_idx:
ts = float(record.frame_timestamps[idx])
qtype = rng.choice(self.config.question_types)
for camera in cameras:
messages = self._build_messages(record, qtype, ts, camera)
# Skip cameras that decoded to zero frames at this ts: no point
# asking the VLM to ground a bbox without an image.
if not _has_image_block(messages):
continue
per_call.append((ts, camera, qtype, messages))
if not per_call:
staging.write("module_3", [])
return
results = self.vlm.generate_json([m for _, _, _, m in per_call])
rows: list[dict[str, Any]] = []
for (ts, camera, _qtype, _messages), result in zip(per_call, results):
qa = self._postprocess(result)
if qa is None:
continue
question, answer = qa
rows.append(
{
"role": "user",
"content": question,
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
rows.append(
{
"role": "assistant",
"content": json.dumps(answer, sort_keys=True),
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
staging.write("module_3", rows)
def _target_cameras(self) -> list[str]:
"""Return the cameras Module 3 should iterate per emission tick.
Defaults to every camera the provider exposes. Datasets with no
cameras (or test/null providers) yield an empty list, which makes
``run_episode`` a no-op.
"""
return list(getattr(self.frame_provider, "camera_keys", []) or [])
def _build_messages(
self,
record: EpisodeRecord,
question_type: str,
frame_timestamp: float,
camera_key: str,
) -> list[dict[str, Any]]:
prompt = load_prompt("module_3_vqa").format(
episode_task=record.episode_task,
question_type=question_type,
)
images = self.frame_provider.frames_at(
record, [frame_timestamp], camera_key=camera_key
)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
return [{"role": "user", "content": content}]
def _postprocess(self, result: Any) -> tuple[str, dict[str, Any]] | None:
if not isinstance(result, dict):
return None
question = result.get("question")
answer = result.get("answer")
if not isinstance(question, str) or not question.strip():
return None
if not isinstance(answer, dict):
return None
# The validator will enforce shape; here we just sanity-check that the
# answer matches *some* known shape so we can drop garbage early.
if classify_vqa_answer(answer) is None:
return None
return question.strip(), answer
def _generate_one(
self,
record: EpisodeRecord,
question_type: str,
frame_timestamp: float,
camera_key: str,
) -> tuple[str, dict[str, Any]] | None:
messages = self._build_messages(record, question_type, frame_timestamp, camera_key)
result = self.vlm.generate_json([messages])[0]
return self._postprocess(result)
def _has_image_block(messages: list[dict[str, Any]]) -> bool:
"""Return True if any user content block is a populated image block."""
for msg in messages:
content = msg.get("content")
if not isinstance(content, list):
continue
for block in content:
if isinstance(block, dict) and block.get("type") == "image":
return True
return False
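# Illustrative sketch of one emitted (user, assistant) VQA pair; every value
# below is made up, and the assistant content is ``json.dumps(answer,
# sort_keys=True)`` of a spatial-type answer:
#
#     {"role": "user", "content": "Where is the cup relative to the tray?",
#      "style": "vqa", "timestamp": 4.2, "camera": "observation.images.top",
#      "tool_calls": None}
#     {"role": "assistant",
#      "content": '{"object": "tray", "relation": "on", "subject": "cup"}',
#      "style": "vqa", "timestamp": 4.2, "camera": "observation.images.top",
#      "tool_calls": None}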
@@ -0,0 +1,231 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module 2: interjections + paired speech (EVENT styles + speech atoms).
Two sub-passes:
1. At ``t=0``, emit ONLY a speech tool-call atom (acknowledgement of the
canonical task). No interjection row — the canonical task is already the
user utterance from ``meta/tasks.parquet``.
2. For mid-episode interruptions, emit a co-timestamped pair:
{role:user, style:interjection, content:<text>}
speech atom (role:assistant, style:None, tool_calls=[say(...)])
Both rows go in ``language_events`` at the same timestamp.
Module 1's :meth:`run_plan_updates` reuses Module 2's interjection
timestamps to refresh the ``plan`` row at the same instant.
"""
from __future__ import annotations
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import Module2Config
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
from ..writer import speech_atom
def _snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
if not frame_timestamps:
return float(t)
return float(min(frame_timestamps, key=lambda f: abs(f - t)))
@dataclass
class InterjectionsAndSpeechModule:
"""Generate task-start speech and mid-episode interjection/speech pairs."""
vlm: VlmClient
config: Module2Config
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
if record.frame_timestamps:
t0 = float(record.frame_timestamps[0])
initial = self._initial_speech(record)
if initial:
rows.append(speech_atom(t0, initial))
# Pull Module 1's subtask spans for this episode so the
# interjection prompt can ground itself in the actual current
# subtask at each chosen timestamp. Module 1 ran first.
subtask_spans = self._read_subtask_spans(staging)
rows.extend(self._mid_episode_interjections(record, subtask_spans))
staging.write("module_2", rows)
@staticmethod
def _read_subtask_spans(staging: EpisodeStaging) -> list[dict[str, Any]]:
rows = [r for r in staging.read("module_1") if r.get("style") == "subtask"]
rows.sort(key=lambda r: float(r["timestamp"]))
spans: list[dict[str, Any]] = []
last_t: float | None = None
for r in rows:
t = float(r["timestamp"])
if last_t is not None and spans:
spans[-1]["end"] = t
spans.append({"text": r.get("content") or "", "start": t, "end": t})
last_t = t
return spans
@staticmethod
def _subtask_at(spans: Sequence[dict[str, Any]], t: float) -> str | None:
current: str | None = None
for span in spans:
if float(span["start"]) <= t:
current = span.get("text")
else:
break
return current
def _initial_speech(self, record: EpisodeRecord) -> str | None:
prompt = load_prompt("module_2_initial_speech").format(
episode_task=record.episode_task,
)
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict) and isinstance(result.get("text"), str):
text = result["text"].strip()
if text:
return text
return None
def _mid_episode_interjections(
self,
record: EpisodeRecord,
subtask_spans: Sequence[dict[str, Any]],
) -> list[dict[str, Any]]:
"""Generate interjections aligned with the actual demo trajectory.
Teleop data is frozen — the robot already executed every step in
the video. A *counterfactual* interjection like "actually skip
the wipe" contradicts what then happens in the video, which is
what qwen36moe-10/11 surfaced as low-quality interjections.
Instead, anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask.
The robot's visible next behavior IS the interjection's effect,
so the training signal stays consistent: interjection text →
plan refresh → action stream all line up.
"""
if self.config.max_interjections_per_episode <= 0:
return []
if len(subtask_spans) < 2:
# Need at least one transition (subtask 0 → subtask 1).
return []
# Deterministic per-episode RNG so reruns are stable across SLURM jobs.
rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
# Boundaries: the start time of every subtask except the first
# (which is just t0 and is covered by the initial-task speech atom).
boundaries: list[tuple[float, str, str]] = []
for i in range(1, len(subtask_spans)):
ts = float(subtask_spans[i]["start"])
if ts < self.config.interjection_min_t:
continue
prev_text = (subtask_spans[i - 1].get("text") or "").strip()
next_text = (subtask_spans[i].get("text") or "").strip()
if not next_text:
continue
boundaries.append((ts, prev_text, next_text))
if not boundaries:
return []
n = min(self.config.max_interjections_per_episode, len(boundaries))
chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])
out: list[dict[str, Any]] = []
for t, prev_subtask, next_subtask in chosen:
t_snap = _snap_to_frame(t, record.frame_timestamps)
# Window straddles the boundary so the VLM sees the end of the
# previous subtask and the start of the next one — same
# conditioning the policy will see at training time.
window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
prompt = load_prompt("module_2_interjection").format(
episode_task=record.episode_task,
prev_subtask=prev_subtask or "(starting from initial state)",
next_subtask=next_subtask,
timestamp=t_snap,
window_seconds=self.config.interjection_window_seconds,
)
images = self.frame_provider.frames_at(record, window_ts)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
messages = [{"role": "user", "content": content}]
result = self.vlm.generate_json([messages])[0]
if not isinstance(result, dict):
continue
interjection_text = result.get("interjection")
speech_text = result.get("speech")
if not isinstance(interjection_text, str) or not interjection_text.strip():
continue
if not isinstance(speech_text, str) or not speech_text.strip():
continue
out.append(
{
"role": "user",
"content": interjection_text.strip(),
"style": "interjection",
"timestamp": t_snap,
"tool_calls": None,
}
)
out.append(speech_atom(t_snap, speech_text.strip()))
return out
def _window_timestamps(
self, t_anchor: float, frame_timestamps: Sequence[float]
) -> list[float]:
"""Return a small set of frame timestamps centered on ``t_anchor``.
The window straddles the subtask boundary the interjection sits
on: roughly half the frames cover the end of the previous
subtask, half cover the start of the next one. The VLM therefore
sees BOTH what just finished AND what's about to start, which is
the conditioning we need to write a natural "now please do X"
request that matches the visible upcoming behavior.
"""
if not frame_timestamps:
return [t_anchor]
n = max(1, int(self.config.interjection_window_frames))
if n == 1:
return [t_anchor]
window = float(self.config.interjection_window_seconds)
step = window / max(1, n - 1)
# Center the window on the anchor so half lands before, half after.
start_offset = -window / 2.0
targets = [t_anchor + start_offset + step * i for i in range(n)]
last_ts = float(frame_timestamps[-1])
snapped: list[float] = []
seen: set[float] = set()
for tgt in targets:
clamped = min(last_ts, max(0.0, tgt))
t = _snap_to_frame(clamped, frame_timestamps)
if t not in seen:
seen.add(t)
snapped.append(t)
return snapped or [t_anchor]
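# Worked example (illustrative config values): with 10 Hz source frames,
# ``interjection_window_seconds=2.0`` and ``interjection_window_frames=4``,
# an anchor at t=5.0 s produces targets [4.0, 4.67, 5.33, 6.0], which snap
# to source frames of approximately [4.0, 4.7, 5.3, 6.0] — two frames
# before the boundary and two after it.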
@@ -0,0 +1,450 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module 1: subtask decomposition + plan + memory (PERSISTENT styles)."""
from __future__ import annotations
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from pathlib import Path
from ..config import Module1Config
from ..frames import (
FrameProvider,
VideoFrameProvider,
episode_clip_path,
null_provider,
to_video_block,
to_video_url_block,
)
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
def _snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
"""Snap an arbitrary float to the nearest exact source frame timestamp."""
if not frame_timestamps:
return float(t)
nearest = min(frame_timestamps, key=lambda f: abs(f - t))
return float(nearest)
@dataclass
class PlanSubtasksMemoryModule:
"""Generate subtask spans, plan, and memory rows.
All output is persistent (lives in ``language_persistent``):
- ``subtask`` rows: one per span, stamped at the span's *start* timestamp
(snapped to an exact frame).
- ``plan`` rows: emitted at ``t=0``; refreshed at every interjection
timestamp via :meth:`run_plan_updates` (called by the executor after
Module 2 completes).
- ``memory`` rows: emitted at each subtask boundary (= subtask start
timestamp from the second subtask onward).
"""
vlm: VlmClient
config: Module1Config
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
# Resolve the task that drives every other Module-1 prompt. May be
# the canonical ``record.episode_task`` (default), or a fresh
# description derived from the video when the canonical task is
# empty / placeholder / forced-off (see Module1Config.derive_task_*).
effective_task = self._resolve_effective_task(record)
# ``task_aug`` rows at t=0 (role=user), one per rephrasing — the
# PR 1 renderer rotates ``${task}`` deterministically through them
# so the policy sees diverse phrasings during training.
t0 = float(record.frame_timestamps[0]) if record.frame_timestamps else 0.0
if self.config.n_task_rephrasings > 0 and effective_task:
rephrasings = self._generate_task_rephrasings(
effective_task, n=self.config.n_task_rephrasings
)
# Always include the effective task itself as the first variant
# so the rotation is guaranteed to cover the source-of-truth
# phrasing, not just synthetic alternatives.
seen: set[str] = set()
ordered = [effective_task, *rephrasings]
for phrasing in ordered:
key = phrasing.strip()
if not key or key in seen:
continue
seen.add(key)
rows.append(
{
"role": "user",
"content": key,
"style": "task_aug",
"timestamp": t0,
"tool_calls": None,
}
)
subtask_spans = self._generate_subtasks(record, task=effective_task)
# subtask rows
for span in subtask_spans:
rows.append(
{
"role": "assistant",
"content": span["text"],
"style": "subtask",
"timestamp": _snap_to_frame(span["start"], record.frame_timestamps),
"tool_calls": None,
}
)
# Plan rows at every subtask boundary — including t=0 (start of
# the first subtask). Because the plan is just a numbered list
# of *still-todo* subtasks, re-emitting at each boundary makes
# the active plan shrink as work progresses: at frame t the
# rendered ``${plan}`` is the most recent emission, which
# contains exactly the subtasks that started at or after the
# current span. Saves the runtime from having to derive
# "what's still left" at inference time.
for span in subtask_spans:
boundary_t = _snap_to_frame(span["start"], record.frame_timestamps)
plan_text = self._generate_plan(
record, subtask_spans, refresh_t=boundary_t, task=effective_task
)
if plan_text is not None:
rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": float(boundary_t),
"tool_calls": None,
}
)
# memory rows at every subtask boundary except the very first start
prior_memory = ""
for i, span in enumerate(subtask_spans[1:], start=1):
completed = subtask_spans[i - 1]["text"]
remaining = [s["text"] for s in subtask_spans[i:]]
mem_text = self._generate_memory(
record, prior_memory, completed, remaining, task=effective_task
)
if mem_text:
ts = _snap_to_frame(span["start"], record.frame_timestamps)
rows.append(
{
"role": "assistant",
"content": mem_text,
"style": "memory",
"timestamp": ts,
"tool_calls": None,
}
)
prior_memory = mem_text
staging.write("module_1", rows)
# ------------------------------------------------------------------
# Task derivation + rephrasings
# ------------------------------------------------------------------
_PLACEHOLDER_TASKS: frozenset[str] = frozenset(
{
"debug",
"test",
"tbd",
"todo",
"n/a",
"na",
"untitled",
"unnamed",
"default",
"placeholder",
}
)
def _resolve_effective_task(self, record: EpisodeRecord) -> str:
"""Decide which task string drives Module 1 for this episode.
Returns the user-supplied ``record.episode_task`` unless
``derive_task_from_video`` says otherwise (see config docstring).
Falls back gracefully to the canonical task if video derivation
fails.
"""
canonical = (record.episode_task or "").strip()
mode = (self.config.derive_task_from_video or "off").strip().lower()
if mode == "always":
derived = self._derive_task_from_video(record)
return derived or canonical
if mode == "if_short" and self._task_seems_bad(canonical):
derived = self._derive_task_from_video(record)
if derived:
return derived
return canonical
def _task_seems_bad(self, task: str) -> bool:
if not task:
return True
if len(task.split()) < int(self.config.derive_task_min_words):
return True
if task.lower() in self._PLACEHOLDER_TASKS:
return True
return False
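# Worked examples (assuming a small ``derive_task_min_words`` such as 3):
#
#     self._task_seems_bad("debug")                      # -> True (placeholder)
#     self._task_seems_bad("")                           # -> True (empty)
#     self._task_seems_bad("place the cup on the tray")  # -> False (kept)
#
# With ``derive_task_from_video="if_short"``, the first two fall through to
# video derivation; the third keeps the canonical task.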
def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
"""Ask the VLM "what is this video about" with no task hint at all."""
prompt = load_prompt("module_1_video_task")
video_block = self._episode_video_block(record)
content = [*video_block, {"type": "text", "text": prompt}]
messages = [{"role": "user", "content": content}]
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict) and isinstance(result.get("task"), str):
text = result["task"].strip()
if text:
return text
return None
def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
"""Generate ``n`` text-only paraphrases of ``base_task``."""
if n <= 0 or not base_task:
return []
prompt = load_prompt("module_1_task_rephrasings").format(
base_task=base_task, n=n
)
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
result = self.vlm.generate_json([messages])[0]
if not isinstance(result, dict):
return []
raw = result.get("rephrasings")
if not isinstance(raw, list):
return []
out: list[str] = []
for item in raw:
if isinstance(item, str):
cleaned = item.strip().strip('"').strip("'")
if cleaned:
out.append(cleaned)
return out[:n]
def _episode_video_block(self, record: EpisodeRecord) -> list[dict[str, Any]]:
"""Same video block ``_generate_subtasks`` builds — extracted helper."""
if not record.frame_timestamps:
return []
if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
clip = episode_clip_path(record, self.frame_provider, cache_dir)
return (
to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
if clip is not None
else []
)
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
target_count = max(
1, int(round(episode_duration * self.config.frames_per_second))
)
target_count = min(target_count, self.config.max_video_frames)
video_frames = self.frame_provider.video_for_episode(record, target_count)
return to_video_block(video_frames)
def run_plan_updates(
self,
record: EpisodeRecord,
staging: EpisodeStaging,
interjection_times: Sequence[float],
interjection_texts: Sequence[str] | None = None,
) -> None:
"""Append additional ``plan`` rows at every interjection timestamp.
Plans refresh ONLY on user interjections — subtask generation
runs ~1 Hz at inference, but plan re-emission is event-driven.
Now also forwards the interjection's own text into the prompt so
the refreshed plan can actually reflect the user's correction
(the previous version told the model "an interjection happened"
without telling it what the user said).
"""
existing = staging.read("module_1")
spans = self._reconstruct_subtasks_from_rows(existing)
already_planned: set[float] = {
float(r["timestamp"]) for r in existing if r.get("style") == "plan"
}
new_rows = list(existing)
texts: list[str | None] = (
[None] * len(interjection_times)
if interjection_texts is None
else [str(t) if t else None for t in interjection_texts]
)
for raw_t, inter_text in zip(interjection_times, texts):
t = _snap_to_frame(raw_t, record.frame_timestamps)
if t in already_planned:
continue
already_planned.add(t)
plan_text = self._generate_plan(
record, spans, refresh_t=t, interjection=inter_text
)
if plan_text is not None:
new_rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": t,
"tool_calls": None,
}
)
staging.write("module_1", new_rows)
@staticmethod
def _reconstruct_subtasks_from_rows(rows: Sequence[dict[str, Any]]) -> list[dict[str, Any]]:
out = []
last_t: float | None = None
for row in sorted(
(r for r in rows if r.get("style") == "subtask"),
key=lambda r: float(r["timestamp"]),
):
t = float(row["timestamp"])
if last_t is not None:
out[-1]["end"] = t
out.append({"text": row.get("content") or "", "start": t, "end": t})
last_t = t
return out
def _generate_subtasks(
self, record: EpisodeRecord, *, task: str | None = None
) -> list[dict[str, Any]]:
if record.row_count == 0 or not record.frame_timestamps:
return []
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
prompt = load_prompt("module_1_subtasks").format(
episode_task=(task if task is not None else record.episode_task),
min_subtask_seconds=self.config.min_subtask_seconds,
max_steps=self.config.plan_max_steps,
episode_duration=f"{episode_duration:.3f}",
)
if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
clip = episode_clip_path(record, self.frame_provider, cache_dir)
video_block = (
to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
if clip is not None
else []
)
else:
target_count = max(
1,
int(round(episode_duration * self.config.frames_per_second)),
)
target_count = min(target_count, self.config.max_video_frames)
video_frames = self.frame_provider.video_for_episode(record, target_count)
video_block = to_video_block(video_frames)
content = [*video_block, {"type": "text", "text": prompt}]
messages = [{"role": "user", "content": content}]
result = self.vlm.generate_json([messages])[0]
spans = result.get("subtasks") if isinstance(result, dict) else None
if not spans:
return []
# clamp to [t0, t_last] and sort
t0 = record.frame_timestamps[0]
t_last = record.frame_timestamps[-1]
cleaned: list[dict[str, Any]] = []
for span in spans:
try:
start = float(span["start"])
end = float(span["end"])
text = str(span["text"]).strip()
except (KeyError, ValueError, TypeError):
continue
start = max(t0, min(start, t_last))
end = max(t0, min(end, t_last))
if end < start:
start, end = end, start
if not text:
continue
cleaned.append({"text": text, "start": start, "end": end})
cleaned.sort(key=lambda s: s["start"])
return cleaned
def _generate_plan(
self,
record: EpisodeRecord, # noqa: ARG002 (kept for signature stability)
subtask_spans: Sequence[dict[str, Any]],
*,
refresh_t: float | None = None,
interjection: str | None = None, # noqa: ARG002
task: str | None = None, # noqa: ARG002
) -> str | None:
"""Deterministic plan = numbered list of *still-todo* subtasks.
Previously this called the VLM with a prompt that asked it to
compress the subtasks into a "compact hierarchical plan". That
produced longer-than-necessary plans, cost an extra VLM round-trip
per episode (plus one per interjection on refresh), and could
diverge from the actual subtask sequence the model is going to
execute. Replacing it with a plain summarisation keeps the plan
tightly aligned with the upcoming subtasks and removes the VLM
call entirely.
Layout (matches the v2 plan style — short imperative fragments
prefixed by "N. "):
1. <subtask 1>
2. <subtask 2>
...
On a refresh at ``refresh_t`` (called from ``run_plan_updates``
on interjection events), only subtasks whose start is at or
after ``refresh_t`` are included — the plan shrinks as work
progresses, so it always describes what's left.
"""
if not subtask_spans:
return None
remaining = [
s for s in subtask_spans
if refresh_t is None or float(s.get("start", 0.0)) >= float(refresh_t)
]
if not remaining:
# Past the last subtask boundary on a late refresh — nothing
# left to plan; emit None so the caller skips the row.
return None
return "\n".join(
f"{i}. {span.get('text', '').strip()}"
for i, span in enumerate(remaining, start=1)
)
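# Worked example (illustrative span values): three subtasks starting at
# 0.0 s, 4.2 s, and 9.8 s, refreshed at the second boundary:
#
#     spans = [
#         {"text": "pick lettuce", "start": 0.0, "end": 4.2},
#         {"text": "place lettuce in bowl", "start": 4.2, "end": 9.8},
#         {"text": "close lid", "start": 9.8, "end": 12.0},
#     ]
#     self._generate_plan(record, spans, refresh_t=4.2)
#     # -> "1. place lettuce in bowl\n2. close lid"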
def _generate_memory(
self,
record: EpisodeRecord,
prior_memory: str,
completed: str,
remaining: Sequence[str],
*,
task: str | None = None,
) -> str:
prompt = load_prompt("module_1_memory").format(
episode_task=(task if task is not None else record.episode_task),
prior_memory=prior_memory or "(none)",
completed_subtask=completed,
remaining_subtasks=", ".join(remaining) if remaining else "(none)",
)
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict) and isinstance(result.get("memory"), str):
return result["memory"].strip()
return ""
@@ -0,0 +1,33 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prompt templates loaded as plain text.
One file per use site. Templates use ``str.format(**vars)`` substitution; we
intentionally avoid jinja2 here so the templates remain inspectable in
plain editors and roundtrip cleanly through ``ruff format``.
"""
from __future__ import annotations
from pathlib import Path
_DIR = Path(__file__).parent
def load(name: str) -> str:
"""Read prompt template ``name.txt`` from the ``prompts/`` directory."""
path = _DIR / f"{name}.txt"
return path.read_text(encoding="utf-8")
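# Illustrative usage (field names taken from ``module_1_memory.txt``; the
# concrete values are placeholders):
#
#     prompt = load("module_1_memory").format(
#         episode_task="place the cup on the tray",
#         prior_memory="(none)",
#         completed_subtask="pick cup",
#         remaining_subtasks="place cup on tray",
#     )
#
# Doubled braces such as ``{{ "memory": ... }}`` in the templates come out
# of ``str.format`` as literal single braces in the rendered prompt.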
@@ -0,0 +1,35 @@
You are updating the robot's compressed semantic memory at the boundary of
a completed subtask.
Reference (MEM, Torne 2026):
"Remove or compress information in the language memory whenever
appropriate. Keep ONLY the minimal set of relevant information for future
task execution. Specific object attributes (colors, precise quantities of
each item) get discarded when their details won't affect subsequent
actions. Functional outcomes (where items went, how many) are preserved."
Episode task: "{episode_task}"
Previous memory: {prior_memory}
Just-completed subtask: "{completed_subtask}"
Remaining subtasks (for relevance judgement only): {remaining_subtasks}
Write the **shortest possible** state note that future subtasks could
need. Telegraphic style.
**Hard caps**
- ≤ 10 words total.
- No articles. No verbs in past tense ("placed", "moved"). Use
comma-separated noun→location fragments.
- Drop colors/sizes/counts unless a later subtask depends on them.
- If nothing material changed for downstream subtasks, emit "" (empty
string).
Examples
- Good: "bowl in box, lid open"
- Good: "3 bowls in cabinet"
- Good: "cup on tray, drawer closed"
- Bad: "The bowl is now in the box and the lid is still open."
- Bad: "I placed the green bowl carefully into the cardboard box."
Output strictly valid JSON:
{{ "memory": "<≤10-word telegraphic state, or empty>" }}
@@ -0,0 +1,33 @@
You are labeling a teleoperated robot demonstration.
The user originally asked: "{episode_task}"
You are shown the entire demonstration as a single video. Watch the
whole clip, then segment it into a list of consecutive atomic subtasks
the robot performs. Write **ultra-compact** action labels.
Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
- Each subtask = one atomic skill the low-level policy can execute.
- **Hard length cap: ≤ 5 words per subtask.** Drop articles, modifiers,
adverbs. Verb + object (+ optional short qualifier) only.
- Prefer: "pick lettuce", "place bowl in box", "open drawer",
"grasp sponge left hand".
- Avoid: "pick up one piece of lettuce", "carefully place the bowl",
"the robot moves its arm to the left".
- Subtasks are non-overlapping and cover the full episode in order.
Choose the cut points yourself based on what you see in the video
(gripper open/close events, contact, regrasps, transitions).
- Each subtask spans at least {min_subtask_seconds} seconds.
- Do not exceed {max_steps} subtasks total.
- Every subtask's [start_time, end_time] must lie within
[0.0, {episode_duration}] seconds.
Output strictly valid JSON of shape:
{{
"subtasks": [
{{"text": "<≤5-word verb phrase>", "start": <float>, "end": <float>}},
...
]
}}
@@ -0,0 +1,32 @@
You are generating training data for a Hi Robot-style policy. We need
{n} alternative phrasings of the same robot task so the policy sees
diverse user prompts during training instead of the same canonical
string repeated every frame.
Original task:
"{base_task}"
Generate exactly {n} alternative phrasings of the same task. Vary:
- formality (casual / polite / curt)
- verbosity (mostly short imperative; occasional polite request)
- word choice (synonyms, different verbs)
- sentence structure (imperative / question / suggestion)
Hard rules:
- Each phrasing MUST preserve the exact meaning of the original task.
Do not change which object is involved, the destination, or the
action. Do not add extra steps. Do not invent new objects.
- Each phrasing must be a short phrase or sentence, plain prose, no
markdown, no quotes, no list numbers.
- Phrasings must be distinct — no near-duplicates.
- Output exactly {n} entries.
Output strictly valid JSON:
{{
"rephrasings": [
"<phrasing 1>",
"<phrasing 2>",
...
]
}}
@@ -0,0 +1,17 @@
The video above shows a robot manipulation episode in full. Look at
the entire video and describe in ONE concise sentence what the robot
is doing.
Rules:
- One sentence, in natural English, like a user instruction.
- Capture the goal of the demonstration, not low-level motions.
Example: "place the yellow cube into the red bin" — not "move the
end-effector down 5cm and close the gripper".
- 4 to 15 words. Plain prose, no markdown, no bullets, no quotes.
- Do not invent objects or actions that aren't visible.
- Do not output anything other than the JSON object below.
Output strictly valid JSON:
{{
"task": "<single concise sentence describing what the robot does in this video>"
}}
@@ -0,0 +1,12 @@
The user just asked the robot: "{episode_task}".
Generate a short verbal acknowledgement the robot would speak back before
beginning the task. Style: compact, confident, friendly.
Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
"OK, starting with the sponge.", "Got it.".
Prefer very short replies: "Got it.", "On it.", "OK."
Output strictly valid JSON:
{{ "text": "<the spoken acknowledgement>" }}
@@ -0,0 +1,46 @@
You are generating training data for a Hi Robot-style hierarchical
robot policy. The robot in this demonstration has ALREADY executed
every step shown in the video — we cannot retroactively change the
action stream. To keep training data consistent with the video, the
"interjection" must align with what the robot is *about to do next* in
the demonstration, framed as a natural mid-task user request.
The episode's overall task: "{episode_task}".
The images above show roughly {window_seconds:.1f} seconds straddling a
subtask boundary in the demonstration:
- Subtask the robot just finished: "{prev_subtask}"
- Subtask the robot is about to start: "{next_subtask}"
- Time into episode: {timestamp:.2f}s
Write ONE compact interjection the user would naturally say at this
moment to prompt / confirm / encourage the robot to do "{next_subtask}".
Keep it like a mid-task coaching cue, not a full instruction paragraph.
Also write the robot's compact verbal acknowledgement.
Hard rules:
- The interjection MUST be consistent with the next subtask. The user
cannot ask for something different from what the robot then does in
the video. If you're tempted to say "actually skip X" or "do Y
instead", DO NOT — those would contradict the demonstration.
- The interjection must reference an object, location, or action that
is plausible given the visible scene and the next subtask text.
- One short phrase or sentence each. Conversational, not robotic.
- Prefer direct cues: "{next_subtask}, please."; "Now {next_subtask}."
- Keep robot speech very short: "OK.", "On it.", "Doing that."
Style examples (vary the phrasing — don't reuse these verbatim):
- "Now go ahead and {next_subtask}."
- "Great, can you {next_subtask} next?"
- "{next_subtask}, please."
- "Before you continue, please {next_subtask}."
- "Looking good — {next_subtask} now."
- "Okay, {next_subtask}."
Output strictly valid JSON:
{{
"interjection": "<short cue from the user, asking for the next subtask>",
"speech": "<short robot acknowledgement>"
}}
@@ -0,0 +1,32 @@
You are generating a frame-grounded visual question/answer pair for
chain-of-thought training. Reference: ECoT (Zawalski 2024) and Steerable
Policies — both train policies on grounded features such as bounding box
pixel coordinates, keypoints, counts, attributes, and spatial relations.
The frame shows a robot working on: "{episode_task}".
Question types and the EXACT answer JSON shape required for each:
bbox => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
"bbox": [x1, y1, x2, y2]}}, ...]}}
bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
ECoT example: "a white cup [124, 25, 176, 113]".
keypoint => {{"label": "<point>", "point_format": "xy",
"point": [x, y]}}
count => {{"label": "<obj>", "count": <int>,
"note": "<optional short note>"}}
attribute => {{"label": "<obj>", "attribute": "<color|shape|state|...>",
"value": "<observed value>"}}
spatial => {{"subject": "<obj>",
"relation": "<left_of|right_of|on|in|above|below|near>", "object": "<obj>"}}
Generate a question of type "{question_type}". Output strictly valid JSON:
{{
"question": "<short, frame-grounded question>",
"answer": <object whose shape matches the schema above>
}}
@@ -0,0 +1,219 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Datatrove-shaped reader.
The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
episode containing:
- ``episode_index``: int
- ``frame_timestamps``: tuple[float, ...]
- ``frame_indices``: tuple[int, ...]
- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
- ``data_path``: pathlib.Path of the source parquet shard
- ``frames_df()``: pandas.DataFrame slice for the episode (loaded lazily on call)
This shape lets each module operate per-episode without loading all parquet
rows into memory at once. It deliberately does not depend on datatrove —
datatrove integration wraps this generator inside a ``PipelineStep`` in
:mod:`.executor`.
"""
from __future__ import annotations
from collections.abc import Iterator
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import pyarrow.parquet as pq
from lerobot.datasets.utils import DEFAULT_TASKS_PATH
@dataclass
class EpisodeRecord:
"""Per-episode record yielded by the reader."""
episode_index: int
episode_task: str
frame_timestamps: tuple[float, ...]
frame_indices: tuple[int, ...]
data_path: Path
row_offset: int # row offset within the parquet file where this episode starts
row_count: int # number of rows for this episode
def frames_df(self): # type: ignore[no-untyped-def]
"""Lazy-load the pandas slice for this episode."""
import pandas as pd # noqa: PLC0415 - deferred for optional dataset extra
table = pq.read_table(self.data_path)
df: pd.DataFrame = table.to_pandas()
slice_ = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(drop=True)
return slice_
def _load_tasks_lookup(root: Path) -> dict[int, str]:
tasks_path = root / DEFAULT_TASKS_PATH
if not tasks_path.exists():
return {}
table = pq.read_table(tasks_path)
cols = {name: table.column(name).to_pylist() for name in table.column_names}
if "task_index" in cols and "task" in cols:
return dict(zip(cols["task_index"], cols["task"], strict=True))
raise ValueError(f"meta/tasks.parquet at {tasks_path} missing 'task_index' or 'task'")
def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
"""Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
Episodes are yielded in ascending ``episode_index`` order. The reader does
not assume a specific chunk/file layout: it scans every ``*.parquet``
under ``data/`` and groups by ``episode_index``.
"""
tasks = _load_tasks_lookup(root)
data_dir = root / "data"
parquet_files = sorted(data_dir.rglob("*.parquet"))
only_set = set(only_episodes) if only_episodes is not None else None
for path in parquet_files:
yield from _iter_one_path(path, tasks, only_set)
def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
table = pq.read_table(path)
names = table.column_names
if "episode_index" not in names:
return
episode_col = table.column("episode_index").to_pylist()
timestamp_col = (
table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
)
frame_col = (
table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
)
task_col = table.column("task_index").to_pylist() if "task_index" in names else None
def _build(
ep: int,
start: int,
end: int,
task_idx: int | None,
ts_buf: list[float],
fi_buf: list[int],
) -> EpisodeRecord | None:
if only_set is not None and ep not in only_set:
return None
task = tasks.get(task_idx, "") if task_idx is not None else ""
return EpisodeRecord(
episode_index=ep,
episode_task=task,
frame_timestamps=tuple(ts_buf),
frame_indices=tuple(fi_buf),
data_path=path,
row_offset=start,
row_count=end - start,
)
cur_ep: int | None = None
start_offset = 0
ts_buf: list[float] = []
fi_buf: list[int] = []
cur_task_idx: int | None = None
for i, ep in enumerate(episode_col):
if cur_ep is None:
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
continue
if ep != cur_ep:
rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
else:
ts_buf.append(timestamp_col[i])
fi_buf.append(frame_col[i])
if cur_ep is not None:
rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
def gather_data_paths(root: Path) -> list[Path]:
"""Return every ``data/chunk-*/file-*.parquet`` path under ``root``."""
return sorted((root / "data").rglob("*.parquet"))
def episode_offsets_per_path(path: Path) -> dict[int, tuple[int, int]]:
"""Return ``{episode_index: (row_offset, row_count)}`` for one parquet."""
table = pq.read_table(path, columns=["episode_index"])
episode_col = table.column("episode_index").to_pylist()
out: dict[int, tuple[int, int]] = {}
cur_ep: int | None = None
start = 0
for i, ep in enumerate(episode_col):
if cur_ep is None:
cur_ep = ep
start = i
continue
if ep != cur_ep:
out[cur_ep] = (start, i - start)
cur_ep = ep
start = i
if cur_ep is not None:
out[cur_ep] = (start, len(episode_col) - start)
return out
def keyframe_indices(record: EpisodeRecord, k: int) -> list[int]:
"""Return ``k`` evenly spaced row indices into the episode (relative)."""
n = record.row_count
if k <= 0 or n == 0:
return []
if k >= n:
return list(range(n))
step = (n - 1) / (k - 1) if k > 1 else 0.0
return [int(round(i * step)) for i in range(k)] if k > 1 else [n // 2]
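# Worked example: for an episode record with ``row_count == 10``, ``k=4``
# gives evenly spaced relative indices:
#
#     keyframe_indices(record, 4)   # -> [0, 3, 6, 9]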
def lookup_data_path(root: Path, episode_index: int) -> tuple[Path, int, int] | None:
"""Find the parquet file containing ``episode_index`` and its slice bounds."""
for path in gather_data_paths(root):
offsets = episode_offsets_per_path(path)
if episode_index in offsets:
start, count = offsets[episode_index]
return path, start, count
return None
def episode_frame_timestamps(root: Path, episode_index: int) -> tuple[Any, list[float]]:
"""Return the parquet path and per-frame timestamps for ``episode_index``."""
found = lookup_data_path(root, episode_index)
if found is None:
raise ValueError(f"Episode {episode_index} not found under {root}/data/")
path, start, count = found
table = pq.read_table(path, columns=["timestamp"])
timestamps = table.column("timestamp").to_pylist()[start : start + count]
return path, [float(t) for t in timestamps]
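# Illustrative sketch: walking a dataset episode by episode. The root path is
# a placeholder; ``frames_df()`` loads the pandas slice lazily per episode.
def _example_iterate(root: Path) -> None:
    for rec in iter_episodes(root, only_episodes=(0, 1)):
        df = rec.frames_df()  # pandas rows for just this episode
        print(rec.episode_index, rec.episode_task, len(df), len(rec.frame_timestamps))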
@@ -0,0 +1,98 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Per-episode staging.
Each module writes its raw output as a JSONL file under
``<staging_dir>/episode_{ep:06d}/<module>.jsonl``. The writer reads back this
staging tree and partitions rows into the two language columns.
JSONL is preferred over parquet here because the staging artifact is meant to
be human-inspectable, easy to diff between prompt iterations, and trivially
appended to. The final dataset format is parquet; staging is just an
intermediate.
"""
from __future__ import annotations
import json
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from pathlib import Path
from typing import Any
ModuleName = str
_MODULES: tuple[ModuleName, ...] = (
"module_1",
"module_2",
"module_3",
)
@dataclass
class EpisodeStaging:
"""Filesystem layout for a single episode's staged module outputs."""
root: Path
episode_index: int
@property
def episode_dir(self) -> Path:
return self.root / f"episode_{self.episode_index:06d}"
def path_for(self, module: ModuleName) -> Path:
if module not in _MODULES:
raise ValueError(f"Unknown module {module!r}; expected one of {_MODULES}")
return self.episode_dir / f"{module}.jsonl"
def write(self, module: ModuleName, rows: Iterable[dict[str, Any]]) -> Path:
path = self.path_for(module)
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as f:
for row in rows:
f.write(json.dumps(row, ensure_ascii=False, sort_keys=True))
f.write("\n")
return path
def read(self, module: ModuleName) -> list[dict[str, Any]]:
path = self.path_for(module)
if not path.exists():
return []
out: list[dict[str, Any]] = []
with path.open(encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
out.append(json.loads(line))
return out
def read_all(self) -> dict[ModuleName, list[dict[str, Any]]]:
return {m: self.read(m) for m in _MODULES}
def has(self, module: ModuleName) -> bool:
return self.path_for(module).exists()
def iter_staged_episodes(root: Path) -> Iterator[int]:
"""Yield episode indices for which any staging artifact exists."""
if not root.exists():
return
for child in sorted(root.iterdir()):
if child.is_dir() and child.name.startswith("episode_"):
try:
yield int(child.name.removeprefix("episode_"))
except ValueError:
continue
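# Illustrative usage sketch, not part of the change set: stage rows for one episode in a
# throwaway directory and read them back through the same EpisodeStaging layout.
if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as _tmp:
        _staging = EpisodeStaging(root=Path(_tmp), episode_index=0)
        _staging.write("module_1", [{"role": "assistant", "style": "plan", "timestamp": 0.0}])
        assert _staging.has("module_1")
        assert _staging.read("module_1")[0]["style"] == "plan"
        assert list(iter_staged_episodes(Path(_tmp))) == [0]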
@@ -0,0 +1,334 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Pre-write validation against staged outputs.
Runs after Modules 1-3 have all written their per-episode artifacts but
*before* the writer rewrites parquet shards. The validator never touches
parquet; it only inspects the staging tree and the source frame timestamps
exposed by :class:`EpisodeRecord`.
Checks (per the plan's "Intermediate staging and validation" section):
- exact timestamp alignment against source frame timestamps
- no orphan speech / interjection pairs
- plan / memory emission consistency (events have a paired persistent row)
- VQA assistant ``content`` is valid JSON (one of bbox / keypoint / count /
attribute / spatial)
- every row maps to its correct column under :func:`column_for_style`
"""
from __future__ import annotations
import json
import logging
from collections.abc import Iterable, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from lerobot.datasets.language import (
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
column_for_style,
is_view_dependent_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
@dataclass
class ValidationReport:
"""Outcome of one validation pass across all episodes."""
errors: list[str] = field(default_factory=list)
warnings: list[str] = field(default_factory=list)
episodes_checked: int = 0
@property
def ok(self) -> bool:
return not self.errors
def add_error(self, message: str) -> None:
self.errors.append(message)
def add_warning(self, message: str) -> None:
self.warnings.append(message)
def summary(self) -> str:
return f"checked={self.episodes_checked} errors={len(self.errors)} warnings={len(self.warnings)}"
VQA_ANSWER_SHAPES: dict[str, set[str]] = {
"bbox": {"detections"},
"keypoint": {"label", "point_format", "point"},
"count": {"label", "count"},
"attribute": {"label", "attribute", "value"},
"spatial": {"subject", "relation", "object"},
}
def classify_vqa_answer(payload: Any) -> str | None:
"""Best-effort classification of a VQA answer payload to a question type."""
if not isinstance(payload, dict):
return None
keys = set(payload.keys())
for kind, required in VQA_ANSWER_SHAPES.items():
if required.issubset(keys):
return kind
return None
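# Illustrative behaviour, not a normative spec: classification is purely key-set based,
# so extra keys are tolerated and incomplete payloads fall through to None. For example:
#
#   classify_vqa_answer({"label": "mug", "count": 2})                             -> "count"
#   classify_vqa_answer({"subject": "mug", "relation": "on", "object": "table"})  -> "spatial"
#   classify_vqa_answer({"label": "mug"})                                         -> None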
@dataclass
class StagingValidator:
"""Walks the staging tree and produces a :class:`ValidationReport`."""
timestamp_atol: float = 0.0 # exact-match by default
dataset_camera_keys: tuple[str, ...] | None = None
"""Known ``observation.images.*`` keys on the dataset. When set, the
validator additionally enforces that every view-dependent row's
``camera`` field references one of these keys. Pass ``None`` (default)
to skip that cross-check (e.g. in unit tests with no real dataset)."""
def validate(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
) -> ValidationReport:
report = ValidationReport()
for record in records:
self._validate_episode(record, staging_dir, report)
report.episodes_checked += 1
return report
def _validate_episode(
self,
record: EpisodeRecord,
staging_dir: Path,
report: ValidationReport,
) -> None:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged = staging.read_all()
all_rows: list[dict[str, Any]] = []
for module_name, rows in staged.items():
for row in rows:
row = {**row, "_module": module_name}
all_rows.append(row)
frame_ts = set(record.frame_timestamps)
events: list[dict[str, Any]] = []
persistent: list[dict[str, Any]] = []
for row in all_rows:
self._check_column_routing(row, report, record.episode_index)
self._check_camera_field(
row, report, record.episode_index, self.dataset_camera_keys
)
            try:
                target_col = column_for_style(row.get("style"))
            except ValueError:
                continue  # unknown style was already reported by _check_column_routing
            if target_col == LANGUAGE_PERSISTENT:
                persistent.append(row)
            else:
                events.append(row)
for row in events:
self._check_event_timestamp_alignment(row, frame_ts, report, record.episode_index)
self._check_speech_interjection_pairs(events, report, record.episode_index)
self._check_plan_memory_consistency(persistent, events, report, record.episode_index)
self._check_vqa_json(events, report, record.episode_index)
self._check_vqa_uniqueness_per_frame_camera(events, report, record.episode_index)
def _check_camera_field(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
dataset_camera_keys: Sequence[str] | None,
) -> None:
"""Enforce the camera invariant + that the key matches the dataset's cameras."""
style = row.get("style")
camera = row.get("camera")
try:
validate_camera_field(style, camera)
except ValueError as exc:
report.add_error(
f"ep={episode_index} module={row.get('_module')}: {exc}"
)
return
if (
is_view_dependent_style(style)
and dataset_camera_keys
and camera not in dataset_camera_keys
):
report.add_error(
f"ep={episode_index} module={row.get('_module')}: camera {camera!r} on style "
f"{style!r} is not one of the dataset's video keys {sorted(dataset_camera_keys)!r}"
)
def _check_vqa_uniqueness_per_frame_camera(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
"""Ensure at most one (vqa, user) and one (vqa, assistant) per (t, camera)."""
counts: dict[tuple[float, str, str], int] = {}
for row in events:
if row.get("style") != "vqa":
continue
ts = row.get("timestamp")
camera = row.get("camera")
role = row.get("role")
if ts is None or camera is None or role is None:
continue # other validators flag these
key = (float(ts), str(camera), str(role))
counts[key] = counts.get(key, 0) + 1
for (ts, camera, role), n in counts.items():
if n > 1:
report.add_error(
f"ep={episode_index}: {n} duplicate vqa rows at t={ts} "
f"camera={camera!r} role={role!r}; expected at most one per (t, camera, role)"
)
def _check_column_routing(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
) -> None:
style = row.get("style")
module = row.get("_module")
try:
target_col = column_for_style(style)
except ValueError:
report.add_error(f"ep={episode_index} module={module}: unknown style {style!r}")
return
if module == "module_1" and target_col != LANGUAGE_PERSISTENT:
report.add_error(
f"ep={episode_index} module=module_1 emitted style {style!r} that routes to {target_col} (must be persistent)"
)
if module in {"module_2", "module_3"} and target_col != LANGUAGE_EVENTS:
report.add_error(
f"ep={episode_index} module={module} emitted style {style!r} that routes to {target_col} (must be events)"
)
def _check_event_timestamp_alignment(
self,
row: dict[str, Any],
frame_ts: set[float],
report: ValidationReport,
episode_index: int,
) -> None:
ts = row.get("timestamp")
if ts is None:
report.add_error(f"ep={episode_index}: event row missing timestamp: {row!r}")
return
if self.timestamp_atol == 0.0:
if float(ts) not in frame_ts:
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} does not match any source frame timestamp"
)
else:
if not any(abs(float(ts) - f) <= self.timestamp_atol for f in frame_ts):
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} not within {self.timestamp_atol}s of any frame"
)
def _check_speech_interjection_pairs(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
speech_ts: dict[float, int] = {}
interjection_ts: dict[float, int] = {}
for row in events:
ts = row.get("timestamp")
if ts is None:
continue
ts_f = float(ts)
if row.get("style") is None and row.get("role") == "assistant":
speech_ts[ts_f] = speech_ts.get(ts_f, 0) + 1
if row.get("style") == "interjection":
interjection_ts[ts_f] = interjection_ts.get(ts_f, 0) + 1
for ts in interjection_ts:
if ts not in speech_ts:
report.add_error(f"ep={episode_index}: interjection at t={ts} has no paired speech atom")
def _check_plan_memory_consistency(
self,
persistent: Sequence[dict[str, Any]],
events: Sequence[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
plan_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "plan"})
memory_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "memory"})
subtask_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "subtask"})
interjection_ts = sorted(
{
float(r["timestamp"])
for r in events
if r.get("style") == "interjection" and r.get("timestamp") is not None
}
)
if persistent and not plan_ts:
report.add_warning(f"ep={episode_index}: persistent rows present but no plan emitted")
# every interjection should have a same-timestamp plan refresh
for ts in interjection_ts:
if ts not in set(plan_ts):
report.add_error(
f"ep={episode_index}: interjection at t={ts} has no co-timestamped plan update"
)
# memory should be emitted at subtask boundaries (subset relation)
if memory_ts and subtask_ts:
mem_set = set(memory_ts)
sub_set = set(subtask_ts)
stray = sorted(mem_set - sub_set)
if stray:
report.add_warning(f"ep={episode_index}: memory rows at {stray} not at any subtask boundary")
def _check_vqa_json(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
for row in events:
if row.get("style") != "vqa" or row.get("role") != "assistant":
continue
content = row.get("content")
if content is None:
report.add_error(
f"ep={episode_index}: VQA assistant row at t={row.get('timestamp')} has null content"
)
continue
try:
payload = json.loads(content)
except (TypeError, ValueError) as exc:
report.add_error(
f"ep={episode_index}: VQA assistant content not valid JSON at t={row.get('timestamp')}: {exc}"
)
continue
shape = classify_vqa_answer(payload)
if shape is None:
report.add_error(
f"ep={episode_index}: VQA assistant payload at t={row.get('timestamp')} does not match any known shape: keys={list(payload) if isinstance(payload, dict) else type(payload).__name__}"
)
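# Illustrative usage sketch, not part of the change set. It stages a single VQA row and
# validates it against a hand-built EpisodeRecord; the camera key and task string are
# placeholders, and the parquet path is never read by the validator.
if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as _tmp:
        _staging_dir = Path(_tmp)
        EpisodeStaging(_staging_dir, 0).write(
            "module_2",
            [
                {
                    "role": "assistant",
                    "style": "vqa",
                    "camera": "observation.images.top",
                    "timestamp": 0.0,
                    "content": json.dumps({"label": "mug", "count": 1}),
                }
            ],
        )
        _record = EpisodeRecord(
            episode_index=0,
            episode_task="pick the mug",
            frame_timestamps=(0.0, 0.1),
            frame_indices=(0, 1),
            data_path=_staging_dir / "unused.parquet",
            row_offset=0,
            row_count=2,
        )
        _report = StagingValidator().validate([_record], _staging_dir)
        print(_report.summary())  # expected: zero errors, assuming vqa rows carry a camera key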
@@ -0,0 +1,741 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared Qwen-VL client.
The pipeline uses a single shared VLM across modules. vLLM is preferred when
available (high throughput, JSON-guided decoding); transformers is the
fallback. A ``stub`` backend is used for unit tests so fixtures never call
into a real model.
The client speaks one method, :meth:`VlmClient.generate_json`, which:
- accepts a list of OpenAI/HF-style multimodal messages,
- requests JSON output (``json_mode=True`` enables guided decoding when the
backend supports it),
- batches requests transparently,
- and reprompts once on a JSON parse failure with an inline correction
message before raising.
"""
from __future__ import annotations
import json
import os
import threading
from collections.abc import Callable, Sequence
from dataclasses import dataclass
from typing import Any, Protocol
from .config import VlmConfig
class VlmClient(Protocol):
"""Protocol every backend must implement."""
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
"""Generate one JSON-decoded response per messages list."""
@dataclass
class StubVlmClient:
"""Deterministic stub used in unit tests.
A test passes a callable that maps the *last user message text* (or, if
that is empty, the full message list) to a JSON-serializable response.
"""
responder: Callable[[Sequence[dict[str, Any]]], Any]
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
return [self.responder(list(messages)) for messages in messages_batch]
def _strip_to_json(text: str) -> Any:
text = text.strip()
# Strip <think>...</think> blocks (Qwen3 Thinking style)
while "<think>" in text and "</think>" in text:
start = text.find("<think>")
end = text.find("</think>", start) + len("</think>")
text = (text[:start] + text[end:]).strip()
# Strip ```json ... ``` fences from chat-tuned backbones
if text.startswith("```"):
first = text.find("\n")
last = text.rfind("```")
if first != -1 and last != -1 and last > first:
text = text[first + 1 : last].strip()
try:
return json.loads(text)
except (ValueError, json.JSONDecodeError):
pass
# Fall back to extracting the first balanced {...} block.
obj_text = _extract_first_json_object(text)
if obj_text is None:
raise json.JSONDecodeError("No JSON object found", text, 0)
return json.loads(obj_text)
def _extract_first_json_object(text: str) -> str | None:
"""Return the first balanced ``{...}`` substring, ignoring braces in
string literals. Returns ``None`` if no balanced block is found."""
start = text.find("{")
if start < 0:
return None
depth = 0
in_string = False
escape = False
for i in range(start, len(text)):
ch = text[i]
if escape:
escape = False
continue
if ch == "\\":
escape = True
continue
if ch == '"' and not escape:
in_string = not in_string
continue
if in_string:
continue
if ch == "{":
depth += 1
elif ch == "}":
depth -= 1
if depth == 0:
return text[start : i + 1]
return None
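# Illustrative behaviour, not a normative spec: fenced and <think>-wrapped replies still
# parse, and trailing prose around the object is ignored via the balanced-brace scan. E.g.
#
#   _strip_to_json('<think>reasoning</think>```json\n{"a": 1}\n```')  -> {"a": 1}
#   _strip_to_json('Sure! {"a": {"b": 2}} hope that helps')           -> {"a": {"b": 2}}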
@dataclass
class _GenericTextClient:
"""Wraps any text-generation callable in JSON-mode + one-retry semantics."""
generate_text: Callable[[Sequence[Sequence[dict[str, Any]]], int, float], list[str]]
config: VlmConfig
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
max_tok = max_new_tokens if max_new_tokens is not None else self.config.max_new_tokens
temp = temperature if temperature is not None else self.config.temperature
raw = self.generate_text(messages_batch, max_tok, temp)
out: list[Any] = []
for messages, text in zip(messages_batch, raw, strict=True):
try:
out.append(_strip_to_json(text))
continue
except (ValueError, json.JSONDecodeError):
pass
retry = list(messages) + [
{"role": "assistant", "content": text},
{
"role": "user",
"content": (
"Your previous reply was not valid JSON. "
"Reply with strictly valid JSON, no prose, no fences."
),
},
]
retry_text = self.generate_text([retry], max_tok, temp)[0]
try:
out.append(_strip_to_json(retry_text))
except (ValueError, json.JSONDecodeError):
# After retry: log preview and return None instead of crashing
# the whole pipeline. Modules treat None as "skip".
preview = retry_text.strip().replace("\n", " ")[:200]
print(
f"[vlm] WARNING: failed to parse JSON after retry; preview: {preview!r}",
flush=True,
)
out.append(None)
return out
def make_vlm_client(config: VlmConfig) -> VlmClient:
"""Build the shared VLM client per the configured backend.
For ``stub``, callers should construct :class:`StubVlmClient` directly with
a responder callable. ``stub`` here is rejected to make accidental misuse
obvious.
"""
if config.backend == "stub":
raise ValueError(
"Use StubVlmClient(...) directly for the stub backend; make_vlm_client builds real clients."
)
if config.backend == "vllm":
return _make_vllm_client(config)
if config.backend == "transformers":
return _make_transformers_client(config)
if config.backend == "openai":
return _make_openai_client(config)
raise ValueError(f"Unknown VLM backend: {config.backend!r}")
def _make_vllm_client(config: VlmConfig) -> VlmClient:
try:
from vllm import LLM, SamplingParams # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError(
"vllm is required for backend='vllm'. Install with `pip install lerobot[annotations]`."
) from exc
# Workaround for cuDNN 9.x + torch 2.8 conv3d regression that surfaces
# as CUDNN_STATUS_NOT_INITIALIZED in Qwen-VL vision-tower patch
# embedders. Setting LEROBOT_DISABLE_CUDNN=1 forces native PyTorch
# convolution kernels — slower but functional.
import os as _os # noqa: PLC0415
if _os.environ.get("LEROBOT_DISABLE_CUDNN", "").lower() in {"1", "true", "yes"}:
import torch as _torch # noqa: PLC0415
_torch.backends.cudnn.enabled = False
llm_kwargs: dict[str, Any] = {
"model": config.model_id,
"tensor_parallel_size": config.tensor_parallel_size,
"gpu_memory_utilization": config.gpu_memory_utilization,
"trust_remote_code": config.trust_remote_code,
}
if config.max_model_len is not None:
llm_kwargs["max_model_len"] = config.max_model_len
llm = LLM(**llm_kwargs)
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
# ``guided_decoding`` would speed up parsing but its API differs across
# vllm releases (dict vs GuidedDecodingParams). The _GenericTextClient
# wrapper already has a one-retry JSON-recovery path, so we skip it.
params = SamplingParams(max_tokens=max_tok, temperature=temp)
# ``llm.chat`` handles chat-template application + multimodal input
# extraction (image/video blocks) internally, which ``llm.generate``
# does not.
outputs = llm.chat([list(m) for m in batch], params)
return [o.outputs[0].text for o in outputs]
return _GenericTextClient(_gen, config)
def _make_transformers_client(config: VlmConfig) -> VlmClient:
try:
import torch # type: ignore[import-not-found]
import transformers # type: ignore[import-not-found]
from transformers import AutoProcessor # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError("transformers + torch are required for backend='transformers'.") from exc
auto_cls = (
getattr(transformers, "AutoModelForImageTextToText", None)
or getattr(transformers, "AutoModelForVision2Seq", None)
)
if auto_cls is None:
raise ImportError(
"Neither AutoModelForImageTextToText nor AutoModelForVision2Seq is available in this "
"transformers version. Install transformers>=4.45 (which has AutoModelForImageTextToText) "
"for VL models."
)
processor = AutoProcessor.from_pretrained(
config.model_id, trust_remote_code=config.trust_remote_code
)
import os as _os # noqa: PLC0415
use_accelerate = _os.environ.get("LEROBOT_TRANSFORMERS_DEVICE_MAP", "manual") != "manual"
# ``device_map='auto'`` triggers a known std::bad_alloc on the Qwen3-VL
# post-load dispatch path (the alloc fails in accelerate's hook setup
# even with TBs of host RAM). Default to manual: load on CPU with
# ``low_cpu_mem_usage=True``, then ``.to("cuda")``. Set
# ``LEROBOT_TRANSFORMERS_DEVICE_MAP=auto`` to opt back into the old path.
if use_accelerate:
model = auto_cls.from_pretrained(
config.model_id,
torch_dtype="auto",
device_map="auto",
low_cpu_mem_usage=True,
trust_remote_code=config.trust_remote_code,
)
else:
import torch as _torch # noqa: PLC0415
model = auto_cls.from_pretrained(
config.model_id,
torch_dtype=_torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=config.trust_remote_code,
)
model = model.to("cuda")
model.eval()
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
outs: list[str] = []
for messages in batch:
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], return_tensors="pt").to(model.device)
with torch.no_grad():
gen = model.generate(
**inputs,
max_new_tokens=max_tok,
temperature=temp,
do_sample=temp > 0.0,
)
decoded = processor.batch_decode(
gen[:, inputs["input_ids"].shape[-1] :], skip_special_tokens=True
)[0]
outs.append(decoded)
return outs
return _GenericTextClient(_gen, config)
def _make_openai_client(config: VlmConfig) -> VlmClient:
"""Backend that talks to any OpenAI-compatible server.
Compatible with ``vllm serve``, ``transformers serve``,
``ktransformers serve``, and hosted endpoints. By default the server
is expected to be already running. Set ``auto_serve=True`` to have
this client spawn one (default: ``transformers serve``), wait until
it's ready, and tear it down on process exit.
Image blocks ``{"type":"image", "image":<PIL.Image>}`` are
auto-converted to ``image_url`` data-URLs. Video blocks
``{"type":"video", "video":[<PIL>...]}`` are forwarded as
multi-frame ``video_url`` items where supported.
"""
try:
from openai import OpenAI # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError(
"openai package is required for backend='openai'. "
"Install with `pip install openai`."
) from exc
api_base = config.api_base
api_key = config.api_key
auto_serve = config.auto_serve
api_bases: list[str] = [api_base]
print(
f"[lerobot-annotate] backend=openai model={config.model_id} "
f"api_base={api_base} auto_serve={auto_serve}",
flush=True,
)
if auto_serve:
if config.parallel_servers > 1:
print(
f"[lerobot-annotate] spawning {config.parallel_servers} parallel servers",
flush=True,
)
api_bases = _spawn_parallel_inference_servers(config)
elif _server_is_up(api_base):
print(f"[lerobot-annotate] reusing server already up at {api_base}", flush=True)
else:
print("[lerobot-annotate] no server reachable; spawning one", flush=True)
api_base = _spawn_inference_server(config)
api_bases = [api_base]
print(f"[lerobot-annotate] server ready at {api_base}", flush=True)
clients = [OpenAI(base_url=base, api_key=api_key) for base in api_bases]
client = clients[0]
# round-robin counter for parallel mode
rr_counter = {"i": 0}
# ``mm_processor_kwargs`` is a vllm-specific extra; transformers serve
# rejects it with HTTP 422. Send it only when explicitly opted in via
# an env var (e.g. ``LEROBOT_OPENAI_SEND_MM_KWARGS=1`` for vllm).
send_mm_kwargs = os.environ.get(
"LEROBOT_OPENAI_SEND_MM_KWARGS", ""
).lower() in {"1", "true", "yes"}
rr_lock = threading.Lock()
def _one_call(
messages: Sequence[dict[str, Any]], max_tok: int, temp: float
) -> str:
api_messages, mm_kwargs = _to_openai_messages(messages)
kwargs: dict[str, Any] = {
"model": config.model_id,
"messages": api_messages,
"max_tokens": max_tok,
"temperature": temp,
}
extra_body: dict[str, Any] = {}
if send_mm_kwargs and mm_kwargs:
extra_body["mm_processor_kwargs"] = {**mm_kwargs, "do_sample_frames": True}
if config.chat_template_kwargs:
extra_body["chat_template_kwargs"] = config.chat_template_kwargs
if extra_body:
kwargs["extra_body"] = extra_body
with rr_lock:
chosen = clients[rr_counter["i"] % len(clients)]
rr_counter["i"] += 1
response = chosen.chat.completions.create(**kwargs)
return response.choices[0].message.content or ""
def _gen(
batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float
) -> list[str]:
if len(batch) <= 1 or config.client_concurrency <= 1:
return [_one_call(messages, max_tok, temp) for messages in batch]
# Parallel fan-out — vllm batches these on the server side.
from concurrent.futures import ThreadPoolExecutor # noqa: PLC0415
max_workers = min(config.client_concurrency, len(batch))
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = [
pool.submit(_one_call, messages, max_tok, temp) for messages in batch
]
return [f.result() for f in futures]
return _GenericTextClient(_gen, config)
def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
"""Spawn ``config.parallel_servers`` independent vllm replicas.
Each replica:
- is pinned to a single GPU via ``CUDA_VISIBLE_DEVICES``
- listens on ``serve_port + i``
- is shut down via the same atexit hook as the single-server path
Returns the list of ``api_base`` URLs the client should round-robin
across.
"""
import atexit # noqa: PLC0415
import os as _os # noqa: PLC0415
import shlex # noqa: PLC0415
import signal # noqa: PLC0415
import subprocess # noqa: PLC0415
import sys # noqa: PLC0415
import threading # noqa: PLC0415
import time # noqa: PLC0415
n = config.parallel_servers
api_bases: list[str] = []
procs: list[subprocess.Popen] = []
ready_events: list[threading.Event] = []
# Multiple readiness signals — uvicorn's own banner is suppressed at
# ``--uvicorn-log-level warning``, so we also accept vllm's own
# "Starting vLLM API server" line and the route-listing line. The
# HTTP probe below is the ultimate fallback.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
# Single lock for all server-stream threads so multibyte chars from
# different servers don't interleave and tear UTF-8 sequences.
print_lock = threading.Lock()
base_cmd = config.serve_command or (
f"vllm serve {shlex.quote(config.model_id)} "
f"--tensor-parallel-size 1 "
f"--max-model-len {config.max_model_len or 32768} "
f"--uvicorn-log-level warning"
)
num_gpus = config.num_gpus if config.num_gpus > 0 else n
for i in range(n):
port = config.serve_port + i
gpu = i % num_gpus
env = _os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(gpu)
cmd = base_cmd
if "{port}" in cmd:
cmd = cmd.replace("{port}", str(port))
else:
cmd = f"{cmd} --port {port}"
api_base = f"http://localhost:{port}/v1"
api_bases.append(api_base)
print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
env=env,
)
procs.append(proc)
ready = threading.Event()
ready_events.append(ready)
def _stream(idx: int, p: subprocess.Popen, ev: threading.Event) -> None:
# Read whole lines and emit each line atomically under the
# shared print_lock so output from N servers stays readable.
assert p.stdout is not None
for line in iter(p.stdout.readline, ""):
with print_lock:
sys.stdout.write(f"[server-{idx}] {line}")
if not line.endswith(("\n", "\r")):
sys.stdout.write("\n")
sys.stdout.flush()
if any(m in line for m in ready_markers):
ev.set()
threading.Thread(target=_stream, args=(i, proc, ready), daemon=True).start()
def _probe(idx: int, base: str, ev: threading.Event, p: subprocess.Popen) -> None:
while not ev.is_set() and p.poll() is None:
if _server_is_up(base):
print(f"[server-{idx}] ready (http probe)", flush=True)
ev.set()
return
time.sleep(2)
threading.Thread(target=_probe, args=(i, api_base, ready, proc), daemon=True).start()
def _shutdown() -> None:
for i, p in enumerate(procs):
if p.poll() is None:
print(f"[server-{i}] stopping pid={p.pid}", flush=True)
p.send_signal(signal.SIGINT)
for p in procs:
try:
p.wait(timeout=15)
except subprocess.TimeoutExpired:
p.kill()
p.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while any(not ev.is_set() for ev in ready_events) and time.monotonic() < deadline:
for i, p in enumerate(procs):
if p.poll() is not None:
raise RuntimeError(
f"[server-{i}] inference server exited unexpectedly with rc={p.returncode}"
)
time.sleep(2)
if any(not ev.is_set() for ev in ready_events):
raise RuntimeError(
f"[server] not all replicas became ready within {config.serve_ready_timeout_s}s"
)
print(f"[lerobot-annotate] all {n} servers ready: {api_bases}", flush=True)
return api_bases
def _server_is_up(api_base: str) -> bool:
"""Return True if ``api_base/models`` answers 200 within 2 seconds."""
import urllib.request # noqa: PLC0415
url = api_base.rstrip("/") + "/models"
try:
with urllib.request.urlopen(url, timeout=2) as resp:
return resp.status == 200
except Exception: # noqa: BLE001
return False
def _spawn_inference_server(config: VlmConfig) -> str:
"""Spawn ``transformers serve`` (or ``serve_command``), wait until it
accepts ``/v1/models``, and register a shutdown hook.
Streams the server's stdout/stderr to the parent terminal in
real-time on a background thread so users can see model-load
progress and errors as they happen.
Returns the full ``api_base`` URL the OpenAI client should use.
"""
import atexit # noqa: PLC0415
import shlex # noqa: PLC0415
import signal # noqa: PLC0415
import subprocess # noqa: PLC0415
import sys # noqa: PLC0415
import threading # noqa: PLC0415
import time # noqa: PLC0415
import urllib.request # noqa: PLC0415
cmd = config.serve_command
if not cmd:
cmd = (
f"transformers serve {shlex.quote(config.model_id)} "
f"--port {config.serve_port} --continuous-batching"
)
api_base = f"http://localhost:{config.serve_port}/v1"
print(f"[server] launching: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
)
# Watch the server output for the uvicorn readiness banner. This is
# more reliable than polling /v1/models because transformers serve
# rescans its cache on every model-list request, which can exceed
# the urllib timeout and trigger an infinite probe loop.
ready_event = threading.Event()
# See _spawn_parallel_inference_servers for why we accept these.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
def _probe() -> None:
while not ready_event.is_set() and proc.poll() is None:
if _server_is_up(api_base):
print("[server] ready (http probe)", flush=True)
ready_event.set()
return
time.sleep(2)
threading.Thread(target=_probe, daemon=True).start()
def _stream_output() -> None:
# Read raw chunks instead of iterating lines so tqdm progress
# bars (which overwrite using \r) flush in real time.
assert proc.stdout is not None
buf = ""
prefix_started = False
while True:
ch = proc.stdout.read(1)
if ch == "":
# process exited; flush any tail
if buf:
sys.stdout.write(buf)
sys.stdout.flush()
return
if not prefix_started:
sys.stdout.write("[server] ")
prefix_started = True
sys.stdout.write(ch)
sys.stdout.flush()
buf += ch
if ch in ("\n", "\r"):
if any(marker in buf for marker in ready_markers):
ready_event.set()
buf = ""
prefix_started = False
threading.Thread(target=_stream_output, daemon=True).start()
def _shutdown() -> None:
if proc.poll() is None:
print(f"[server] stopping pid={proc.pid}", flush=True)
proc.send_signal(signal.SIGINT)
try:
proc.wait(timeout=15)
except subprocess.TimeoutExpired:
proc.kill()
proc.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while time.monotonic() < deadline:
if proc.poll() is not None:
raise RuntimeError(
f"[server] inference server exited unexpectedly with rc={proc.returncode}. "
f"See [server] log lines above for the cause."
)
if ready_event.wait(timeout=2):
return api_base
proc.terminate()
raise RuntimeError(
f"[server] did not become ready within {config.serve_ready_timeout_s}s"
)
def _to_openai_messages(
messages: Sequence[dict[str, Any]],
) -> tuple[list[dict[str, Any]], dict[str, Any]]:
"""Convert internal messages to OpenAI chat format.
Returns ``(api_messages, mm_kwargs)``. Multimodal-processor kwargs
(``fps`` from ``video_url`` blocks) are extracted out so the caller
can pass them via ``extra_body.mm_processor_kwargs`` rather than
inside the content blocks (which transformers serve rejects).
File-URL video blocks are inlined as base64 data URLs.
"""
out_messages: list[dict[str, Any]] = []
mm_kwargs: dict[str, Any] = {}
for message in messages:
content = message.get("content")
if not isinstance(content, list):
out_messages.append({"role": message["role"], "content": content})
continue
out_blocks: list[dict[str, Any]] = []
for block in content:
block_type = block.get("type") if isinstance(block, dict) else None
if block_type == "text":
out_blocks.append({"type": "text", "text": block.get("text", "")})
elif block_type == "image":
out_blocks.append(
{"type": "image_url", "image_url": {"url": _pil_to_data_url(block["image"])}}
)
elif block_type == "video":
frames = block.get("video", [])
for img in frames:
out_blocks.append(
{"type": "image_url", "image_url": {"url": _pil_to_data_url(img)}}
)
elif block_type == "video_url":
video_url = dict(block["video_url"])
url = video_url.get("url", "")
if url.startswith("file://"):
video_url["url"] = _file_to_data_url(url[len("file://") :])
out_blocks.append({"type": "video_url", "video_url": video_url})
fps = block.get("fps")
if fps is not None:
mm_kwargs["fps"] = fps
else:
out_blocks.append(block)
out_messages.append({"role": message["role"], "content": out_blocks})
return out_messages, mm_kwargs
def _file_to_data_url(path: str) -> str:
"""Read a local video file and return a base64 ``data:video/mp4`` URL."""
import base64 # noqa: PLC0415
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("ascii")
return f"data:video/mp4;base64,{b64}"
def _pil_to_data_url(image: Any) -> str:
"""Encode a PIL.Image as a base64 data URL."""
import base64 # noqa: PLC0415
import io # noqa: PLC0415
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode("ascii")
return f"data:image/png;base64,{b64}"
def _messages_to_prompt(messages: Sequence[dict[str, Any]]) -> Any:
"""Pass-through hook used by the vllm backend.
vllm exposes its own multimodal entry points that vary by version; for the
base flow we simply forward the raw message list and let the caller's
custom backend handle templating. Real deployments override this.
"""
return list(messages)
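# Illustrative usage sketch, not part of the change set: the stub backend satisfies the
# same generate_json contract as the real backends, so module code can be exercised offline.
if __name__ == "__main__":
    _stub = StubVlmClient(responder=lambda messages: {"echo": len(messages)})
    _batch = [[{"role": "user", "content": "describe the scene as JSON"}]]
    print(_stub.generate_json(_batch))  # -> [{'echo': 1}]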
@@ -0,0 +1,341 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Final parquet rewrite.
For every episode the writer:
1. reads the staged module outputs,
2. partitions them into a persistent slice (PERSISTENT_STYLES) and an event
slice (EVENT_ONLY_STYLES + style=None tool-call atoms),
3. sorts each slice deterministically,
4. broadcasts the persistent slice across every frame in the episode,
5. for each frame, materializes the sublist of event rows whose timestamp
exactly equals that frame's timestamp,
6. drops the legacy ``subtask_index`` column,
7. writes the parquet shard back in place.
The writer does NOT add a dataset-level ``tools`` column. Tool *calls* are
emitted per-row via the existing ``tool_calls`` field on the v3.1 row
struct (PR 1) for every speech atom. The tool *schema* (the description
of the ``say`` function and its parameters) is a fixed code constant —
``SAY_TOOL_SCHEMA`` below — and downstream chat-template consumers import
it directly rather than reading a redundant per-row column.
Invariants enforced here (and re-checked by the validator):
- per-episode persistent slice is byte-identical across every frame;
- ``language_events`` rows on a frame all have ``timestamp == frame_ts``
(timestamps come straight from the source parquet — never recomputed);
- every row passes ``column_for_style(style)``.
"""
from __future__ import annotations
import logging
from collections import defaultdict
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import pyarrow as pa
import pyarrow.parquet as pq
from lerobot.datasets.language import (
EVENT_ONLY_STYLES,
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
column_for_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
# Tool schema constants moved to lerobot.datasets.language in PR 1 — single
# source of truth. Re-exported here so existing imports
# (``from lerobot.annotations.steerable_pipeline.writer import SAY_TOOL_SCHEMA``)
# keep working.
from lerobot.datasets.language import DEFAULT_TOOLS, SAY_TOOL_SCHEMA # noqa: F401, E402
def _row_persistent_sort_key(row: dict[str, Any]) -> tuple:
return (float(row["timestamp"]), row.get("style") or "", row.get("role") or "")
def _row_event_sort_key(row: dict[str, Any]) -> tuple:
# events are bucketed per-frame, but within a frame we still want determinism
return (
row.get("style") or "",
row.get("role") or "",
row.get("camera") or "",
)
def _normalize_persistent_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the persistent column's struct shape."""
style = row.get("style")
if style not in PERSISTENT_STYLES:
raise ValueError(
f"persistent slice contains row with non-persistent style {style!r}; "
"row would be misrouted under column_for_style()"
)
if "timestamp" not in row:
raise ValueError(f"persistent row missing timestamp: {row!r}")
camera = row.get("camera")
validate_camera_field(style, camera)
return {
"role": str(row["role"]),
"content": None if row.get("content") is None else str(row["content"]),
"style": style,
"timestamp": float(row["timestamp"]),
"camera": None if camera is None else str(camera),
"tool_calls": _normalize_tool_calls(row.get("tool_calls")),
}
def _normalize_event_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the event column's struct shape (no timestamp)."""
style = row.get("style")
if style is not None and style not in EVENT_ONLY_STYLES:
raise ValueError(
f"event slice contains row with style {style!r}; expected None or one of {EVENT_ONLY_STYLES}"
)
if column_for_style(style) != LANGUAGE_EVENTS:
raise ValueError(f"event row with style {style!r} would not route to language_events")
camera = row.get("camera")
validate_camera_field(style, camera)
return {
"role": str(row["role"]),
"content": None if row.get("content") is None else str(row["content"]),
"style": style,
"camera": None if camera is None else str(camera),
"tool_calls": _normalize_tool_calls(row.get("tool_calls")),
}
def _normalize_tool_calls(value: Any) -> list[Any] | None:
if value is None:
return None
if not isinstance(value, list):
raise ValueError(f"tool_calls must be a list or None, got {type(value).__name__}")
return list(value)
def _validate_atom_invariants(row: dict[str, Any]) -> None:
"""At-least-one of content/tool_calls; style=None implies tool_calls."""
has_content = row.get("content") is not None
has_tools = row.get("tool_calls") is not None
if not (has_content or has_tools):
raise ValueError(f"row has neither content nor tool_calls: {row!r}")
if row.get("style") is None and not has_tools:
raise ValueError(f"style=None requires tool_calls: {row!r}")
def _validate_speech_atom(row: dict[str, Any]) -> None:
"""Speech atoms: role=assistant, style=None, content=None, say tool call."""
if row.get("style") is not None:
return # not a speech atom
if row.get("role") != "assistant":
raise ValueError(f"speech atom must have role=assistant: {row!r}")
if row.get("content") is not None:
raise ValueError(f"speech atom must have content=null: {row!r}")
tool_calls = row.get("tool_calls")
if not tool_calls or not isinstance(tool_calls, list):
raise ValueError(f"speech atom must have non-empty tool_calls list: {row!r}")
first = tool_calls[0]
if not isinstance(first, dict):
raise ValueError(f"speech atom tool_calls[0] must be a dict: {row!r}")
if first.get("type") != "function":
raise ValueError(f"speech atom tool_calls[0].type must be 'function': {row!r}")
fn = first.get("function") or {}
if fn.get("name") != "say":
raise ValueError(f"speech atom tool_calls[0].function.name must be 'say': {row!r}")
args = fn.get("arguments") or {}
if not isinstance(args, dict) or "text" not in args or not isinstance(args["text"], str):
raise ValueError(f"speech atom must carry 'text' string in arguments: {row!r}")
@dataclass
class LanguageColumnsWriter:
"""Rewrite ``data/chunk-*/file-*.parquet`` with the two language columns."""
drop_existing_subtask_index: bool = True
def write_all(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> list[Path]:
episodes_by_path: dict[Path, list[EpisodeRecord]] = defaultdict(list)
for record in records:
episodes_by_path[record.data_path].append(record)
written: list[Path] = []
for path, eps in episodes_by_path.items():
self._rewrite_one(path, eps, staging_dir, root)
written.append(path)
return written
def _rewrite_one(
self,
path: Path,
episodes: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> None:
table = pq.read_table(path)
n_rows = table.num_rows
# Ensure we cover every episode in the file. Episodes that don't have
# staging artifacts are passed through with empty annotation lists —
# this keeps the writer idempotent and safe for partial reruns.
staged_per_ep: dict[int, dict[str, list[dict[str, Any]]]] = {}
for record in episodes:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged_per_ep[record.episode_index] = staging.read_all()
persistent_by_ep: dict[int, list[dict[str, Any]]] = {}
events_by_ep_ts: dict[int, dict[float, list[dict[str, Any]]]] = {}
for ep_index, ep_staged in staged_per_ep.items():
persistent_rows: list[dict[str, Any]] = []
event_rows: list[dict[str, Any]] = [] # carry timestamp until bucketed
for _module_name, rows in ep_staged.items():
for row in rows:
style = row.get("style")
if column_for_style(style) == LANGUAGE_PERSISTENT:
persistent_rows.append(row)
else:
event_rows.append(row)
persistent_rows.sort(key=_row_persistent_sort_key)
normalized_persistent = []
for r in persistent_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
normalized_persistent.append(_normalize_persistent_row(r))
persistent_by_ep[ep_index] = normalized_persistent
buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
for r in event_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
ts = float(r["timestamp"])
buckets[ts].append(_normalize_event_row(r))
for ts in list(buckets.keys()):
buckets[ts].sort(key=_row_event_sort_key)
events_by_ep_ts[ep_index] = buckets
episode_col = (
table.column("episode_index").to_pylist() if "episode_index" in table.column_names else None
)
ts_col = table.column("timestamp").to_pylist() if "timestamp" in table.column_names else None
if episode_col is None or ts_col is None:
raise ValueError(f"{path} is missing 'episode_index' or 'timestamp' — required by the writer.")
per_row_persistent: list[list[dict[str, Any]]] = []
per_row_events: list[list[dict[str, Any]]] = []
for i in range(n_rows):
ep = episode_col[i]
ts = float(ts_col[i])
per_row_persistent.append(persistent_by_ep.get(ep, []))
buckets = events_by_ep_ts.get(ep, {})
per_row_events.append(buckets.get(ts, []))
new_table = self._materialize_table(
table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
)
pq.write_table(new_table, path)
def _materialize_table(
self,
table: pa.Table,
persistent: list[list[dict[str, Any]]],
events: list[list[dict[str, Any]]],
*,
drop_old: bool,
) -> pa.Table:
cols = []
names = []
for name in table.column_names:
if drop_old and name == "subtask_index":
continue
if name in (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS):
continue # we'll re-add canonical versions
# Strip any legacy ``tools`` column previously emitted by older
# writers — the schema no longer uses it (constant lives in
# SAY_TOOL_SCHEMA / DEFAULT_TOOLS).
if name == "tools":
continue
cols.append(table.column(name))
names.append(name)
# We let pyarrow infer struct/list schema rather than passing the
# canonical type from `lerobot.datasets.language` directly: that type
# uses `pa.json_()` for the `tool_calls` element type, which
# `pa.array(..., type=...)` cannot materialize from Python lists on
# current pyarrow versions. The inferred schema round-trips through
# parquet and `LeRobotDataset` correctly — see PR 1's
# `tests/datasets/test_language.py` which exercises the same flow.
persistent_arr = pa.array(persistent)
events_arr = pa.array(events)
cols.extend([persistent_arr, events_arr])
names.extend([LANGUAGE_PERSISTENT, LANGUAGE_EVENTS])
return pa.Table.from_arrays(cols, names=names)
def speech_atom(timestamp: float, text: str) -> dict[str, Any]:
"""Build a canonical speech tool-call atom for the events column."""
return {
"role": "assistant",
"content": None,
"style": None,
"timestamp": float(timestamp),
"camera": None,
"tool_calls": [
{
"type": "function",
"function": {
"name": "say",
"arguments": {"text": text},
},
}
],
}
def normalize_rows_for_writer(
rows: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
"""Helper used by tests/validators to partition a flat row list into
(persistent_rows, event_rows) using ``column_for_style``.
"""
persistent: list[dict[str, Any]] = []
events: list[dict[str, Any]] = []
for row in rows:
if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
persistent.append(row)
else:
events.append(row)
return persistent, events
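# Illustrative usage sketch, not part of the change set: a say tool-call atom routes to
# the events column (style=None), while a plan row routes to the persistent column.
if __name__ == "__main__":
    _rows = [
        speech_atom(1.5, "I am picking up the mug now."),
        {
            "role": "assistant",
            "style": "plan",
            "content": "1. pick up the mug\n2. place it on the shelf",
            "timestamp": 0.0,
            "camera": None,
            "tool_calls": None,
        },
    ]
    _persistent, _events = normalize_rows_for_writer(_rows)
    assert len(_persistent) == 1 and len(_events) == 1
    _validate_speech_atom(_events[0])  # passes: say call carrying a 'text' argument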
@@ -33,7 +33,7 @@ import cv2 # type: ignore # TODO: add type stubs for OpenCV
import numpy as np # type: ignore # TODO: add type stubs for numpy
from lerobot.utils.decorators import check_if_not_connected
from lerobot.utils.import_utils import _reachy2_sdk_available
from lerobot.utils.import_utils import _reachy2_sdk_available, require_package
if TYPE_CHECKING or _reachy2_sdk_available:
from reachy2_sdk.media.camera import CameraView
@@ -76,6 +76,7 @@ class Reachy2Camera(Camera):
Args:
config: The configuration settings for the camera.
"""
require_package("reachy2_sdk", extra="reachy2")
super().__init__(config)
self.config = config
@@ -19,16 +19,18 @@ Provides the RealSenseCamera class for capturing frames from Intel RealSense cam
import logging
import time
from threading import Event, Lock, Thread
from typing import Any
from typing import TYPE_CHECKING, Any
import cv2 # type: ignore # TODO: add type stubs for OpenCV
import numpy as np # type: ignore # TODO: add type stubs for numpy
from numpy.typing import NDArray # type: ignore # TODO: add type stubs for numpy.typing
try:
import pyrealsense2 as rs # type: ignore # TODO: add type stubs for pyrealsense2
except Exception as e:
logging.info(f"Could not import realsense: {e}")
from lerobot.utils.import_utils import _pyrealsense2_available, require_package
if TYPE_CHECKING or _pyrealsense2_available:
import pyrealsense2 as rs
else:
rs = None
from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
from lerobot.utils.errors import DeviceNotConnectedError
@@ -112,7 +114,7 @@ class RealSenseCamera(Camera):
Args:
config: The configuration settings for the camera.
"""
require_package("pyrealsense2", extra="intelrealsense")
super().__init__(config)
self.config = config
+11 -9
@@ -28,12 +28,19 @@ import json
import logging
import time
from threading import Event, Lock, Thread
from typing import Any
from typing import TYPE_CHECKING, Any
import cv2
import numpy as np
from numpy.typing import NDArray
from lerobot.utils.import_utils import _zmq_available, require_package
if TYPE_CHECKING or _zmq_available:
import zmq
else:
zmq = None
from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
from lerobot.utils.errors import DeviceNotConnectedError
@@ -74,8 +81,8 @@ class ZMQCamera(Camera):
"""
def __init__(self, config: ZMQCameraConfig):
require_package("pyzmq", extra="pyzmq-dep", import_name="zmq")
super().__init__(config)
import zmq
self.config = config
self.server_address = config.server_address
@@ -117,8 +124,6 @@ class ZMQCamera(Camera):
logger.info(f"Connecting to {self}...")
try:
import zmq
self.context = zmq.Context()
self.socket = self.context.socket(zmq.SUB)
self.socket.setsockopt_string(zmq.SUBSCRIBE, "")
@@ -180,11 +185,8 @@ class ZMQCamera(Camera):
try:
message = self.socket.recv_string()
except Exception as e:
# zmq is lazy-imported in connect(), so check by name to avoid a top-level import
if type(e).__name__ == "Again":
raise TimeoutError(f"{self} timeout after {self.timeout_ms}ms") from e
raise
except zmq.Again as e:
raise TimeoutError(f"{self} timeout after {self.timeout_ms}ms") from e
# Decode JSON message
data = json.loads(message)
+7 -4
@@ -28,6 +28,12 @@ import numpy as np
import torch
from lerobot.policies import PreTrainedPolicy, prepare_observation_for_inference
from lerobot.utils.import_utils import _deepdiff_available, require_package
if TYPE_CHECKING or _deepdiff_available:
from deepdiff import DeepDiff
else:
DeepDiff = None
if TYPE_CHECKING:
from lerobot.datasets import LeRobotDataset
@@ -217,10 +223,7 @@ def sanity_check_dataset_robot_compatibility(
Raises:
ValueError: If any of the checked metadata fields do not match.
"""
from lerobot.utils.import_utils import require_package
require_package("deepdiff", extra="hardware")
from deepdiff import DeepDiff
require_package("deepdiff", extra="deepdiff-dep")
from lerobot.utils.constants import DEFAULT_FEATURES
+4
@@ -23,6 +23,7 @@ Import them directly: ``from lerobot.configs.train import TrainPipelineConfig``
from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
from .policies import PreTrainedConfig
from .recipe import MessageTurn, TrainingRecipe, load_recipe
from .types import (
FeatureType,
NormalizationMode,
@@ -41,7 +42,10 @@ __all__ = [
# Config classes
"DatasetConfig",
"EvalConfig",
"MessageTurn",
"PeftConfig",
"PreTrainedConfig",
"TrainingRecipe",
"WandBConfig",
"load_recipe",
]
+3
@@ -35,6 +35,9 @@ class DatasetConfig:
revision: str | None = None
use_imagenet_stats: bool = True
video_backend: str = field(default_factory=get_safe_default_codec)
# When True, video frames are returned as uint8 tensors (0-255) instead of float32 (0.0-1.0).
# This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
return_uint8: bool = False
streaming: bool = False
def __post_init__(self) -> None:
+208
@@ -0,0 +1,208 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Literal, get_args
MessageRole = Literal["user", "assistant", "system", "tool"]
MessageStream = Literal["high_level", "low_level"]
DEFAULT_BINDINGS = {
"subtask": "active_at(t, style=subtask)",
"memory": "active_at(t, style=memory)",
"plan": "active_at(t, style=plan)",
"speech": "emitted_at(t, role=assistant, tool_name=say)",
"interjection": "emitted_at(t, style=interjection)",
"vqa": "emitted_at(t, style=vqa, role=assistant)",
"vqa_query": "emitted_at(t, style=vqa, role=user)",
}
_PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
_VALID_ROLES = frozenset(get_args(MessageRole))
_VALID_STREAMS = frozenset(get_args(MessageStream))
@dataclass
class MessageTurn:
"""A single chat-style turn in a recipe template.
``content`` may be a plain string, a list of HF-style multimodal blocks, or
``None`` when ``tool_calls_from`` supplies tool-call payloads instead.
``stream`` tags the turn for downstream filtering, ``target`` flags it as a
training target, and ``if_present`` skips the turn when the named binding
resolves to ``None``.
"""
role: MessageRole
content: str | list[dict[str, Any]] | None = None
stream: MessageStream | None = None
target: bool = False
if_present: str | None = None
tool_calls_from: str | None = None
def __post_init__(self) -> None:
"""Validate role, stream, and content after dataclass construction."""
if self.role not in _VALID_ROLES:
raise ValueError(f"Unsupported message role: {self.role!r}")
if self.stream is not None and self.stream not in _VALID_STREAMS:
raise ValueError(f"Unsupported message stream: {self.stream!r}")
if self.content is None and self.tool_calls_from is None:
raise ValueError("MessageTurn.content is required unless tool_calls_from is set.")
if self.content is not None and not isinstance(self.content, (str, list)):
raise TypeError("MessageTurn.content must be a string, a list of HF-style blocks, or None.")
if isinstance(self.content, list):
for block in self.content:
if not isinstance(block, dict) or "type" not in block:
raise ValueError(
"Multimodal content blocks must be HF-style dictionaries with a type key."
)
@classmethod
def from_dict(cls, data: dict[str, Any]) -> MessageTurn:
"""Construct a :class:`MessageTurn` from a plain dictionary."""
return cls(**data)
@dataclass
class TrainingRecipe:
"""A recipe describing how to render training samples from language rows.
A recipe is either a *message recipe* (``messages`` plus optional
``bindings``) or a *blend recipe* (``blend`` mapping names to weighted
sub-recipes). ``weight`` is only meaningful inside a blend.
"""
messages: list[MessageTurn] | None = None
bindings: dict[str, str] | None = None
blend: dict[str, TrainingRecipe] | None = None
weight: float | None = None
def __post_init__(self) -> None:
"""Validate that exactly one of ``messages`` or ``blend`` is set."""
if self.messages is not None and self.blend is not None:
raise ValueError("TrainingRecipe must set only one of messages or blend.")
if self.messages is None and self.blend is None:
raise ValueError("TrainingRecipe must set one of messages or blend.")
if self.messages is not None:
self._validate_message_recipe()
if self.blend is not None:
self._validate_blend_recipe()
@classmethod
def from_dict(cls, data: dict[str, Any]) -> TrainingRecipe:
"""Construct a :class:`TrainingRecipe` from a nested dictionary."""
data = dict(data)
if data.get("messages") is not None:
data["messages"] = [
turn if isinstance(turn, MessageTurn) else MessageTurn.from_dict(turn)
for turn in data["messages"]
]
if data.get("blend") is not None:
data["blend"] = {
name: recipe if isinstance(recipe, TrainingRecipe) else cls.from_dict(recipe)
for name, recipe in data["blend"].items()
}
return cls(**data)
@classmethod
def from_yaml(cls, path: str | Path) -> TrainingRecipe:
"""Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
import yaml # type: ignore[import-untyped]
with open(path) as f:
data = yaml.safe_load(f)
if not isinstance(data, dict):
raise ValueError(f"Recipe YAML must contain a mapping at the top level: {path}")
return cls.from_dict(data)
def _validate_message_recipe(self) -> None:
"""Ensure every templated binding is known and the recipe supervises something.
A recipe is valid if it has at least one of:
* a ``target: true`` assistant turn (drives text-CE supervision), or
* a ``stream: low_level`` turn (drives flow / action supervision via
``predict_actions=True``, even when no assistant turn is targeted,
e.g. π0.5-style ``low_level_execution`` where the action expert
conditions on a user-only ``${subtask}`` prompt).
"""
assert self.messages is not None
known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}
for turn in self.messages:
missing = self._referenced_bindings(turn) - known_bindings
if missing:
raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")
has_target = any(turn.target for turn in self.messages)
has_low_level = any(turn.stream == "low_level" for turn in self.messages)
if not (has_target or has_low_level):
raise ValueError(
"Message recipes must contain at least one supervised turn — "
"either ``target: true`` (text CE) or ``stream: low_level`` "
"(flow/action loss)."
)
def _validate_blend_recipe(self) -> None:
"""Ensure each blend component is a non-empty, weighted message recipe."""
assert self.blend is not None
if not self.blend:
raise ValueError("Blend recipes must contain at least one component.")
for name, recipe in self.blend.items():
if recipe.blend is not None:
raise ValueError(f"Blend component {name!r} cannot itself define a blend.")
if recipe.messages is None:
raise ValueError(f"Blend component {name!r} must define messages.")
if recipe.weight is None:
raise ValueError(f"Blend component {name!r} must define weight.")
if recipe.weight <= 0:
raise ValueError(f"Blend component {name!r} must have a positive weight.")
def _referenced_bindings(self, turn: MessageTurn) -> set[str]:
"""Return the binding names that ``turn`` references via placeholders or attributes."""
names: set[str] = set()
if turn.if_present is not None:
names.add(turn.if_present)
if turn.tool_calls_from is not None:
names.add(turn.tool_calls_from)
names.update(_placeholders_in_content(turn.content))
return names
def _placeholders_in_content(content: str | list[dict[str, Any]] | None) -> set[str]:
"""Return the set of ``${name}`` placeholders found anywhere in ``content``."""
if content is None:
return set()
if isinstance(content, str):
return set(_PLACEHOLDER_RE.findall(content))
names: set[str] = set()
for block in content:
for value in block.values():
if isinstance(value, str):
names.update(_PLACEHOLDER_RE.findall(value))
return names
def load_recipe(path: str | Path) -> TrainingRecipe:
"""Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
return TrainingRecipe.from_yaml(path)
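For orientation, a message recipe can also be built directly from a plain dict; the YAML path above simply feeds ``yaml.safe_load`` output into ``from_dict``. A minimal sketch — the binding expression and content strings are illustrative, not taken from any shipped recipe:

from lerobot.configs.recipe import TrainingRecipe

recipe = TrainingRecipe.from_dict(
    {
        "bindings": {"subtask": "active_at(t, style=subtask)"},
        "messages": [
            {"role": "user", "content": "${task}", "stream": "high_level"},
            {
                "role": "assistant",
                "content": "${subtask}",
                "stream": "high_level",
                "target": True,
                "if_present": "subtask",
            },
        ],
    }
)
# Validation in __post_init__ runs on construction: unknown bindings or a
# recipe with no supervised turn would raise ValueError here.
assert recipe.messages is not None and recipe.blend is None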
+77
View File
@@ -0,0 +1,77 @@
# Hi-Robot blend — shared between SmolVLA2 (SmolVLM2 backbone) and
# PI052 (PaliGemma backbone). π0.5-style split:
#
# The action expert is conditioned on (images, state, subtask) only.
# Hierarchical context (task + plan + memory) only flows into the
# high-level text head.
#
# high_level_subtask — predict subtask from (task+plan+memory),
# and the new memory at boundary frames.
# low_level_execution — flow loss with [images, subtask, state].
# plan_generation — task → plan.
# ask_vqa_{top,wrist} — camera-grounded VQA.
#
# Each backbone's text tokenizer renders these messages differently
# (SmolVLA2 uses the chat template; PI052 concatenates as plain
# ``Role: content`` text), but the recipe spec is identical.
blend:
high_level_subtask:
weight: 0.50
bindings:
new_memory: "emitted_at(t, style=memory)"
messages:
- role: user
stream: high_level
content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
- {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
# Boundary-frame tail: at a subtask transition, also predict
# the new memory in the same forward pass.
- {role: assistant, content: "${new_memory}", stream: high_level, target: true, if_present: new_memory}
low_level_execution:
weight: 0.30
messages:
# π0.5-style action conditioning. The action expert sees only
# [images, this user turn (= bare subtask), state]. No text-CE
# target — subtask prediction is owned by ``high_level_subtask``.
# ``stream: low_level`` flips ``predict_actions=True`` so the
# flow loss fires.
- {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
plan_generation:
weight: 0.10
bindings:
current_plan: "active_at(t, style=plan)"
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan}
ask_vqa_top:
weight: 0.05
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.front}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_wrist:
weight: 0.05
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.wrist}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
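Blend weights are relative; nothing requires them to sum to 1. A small sketch of loading the file above and inspecting the normalized sampling probabilities its weights imply (the filename is an illustrative assumption):

from lerobot.configs.recipe import load_recipe

recipe = load_recipe("hirobot_blend.yaml")  # hypothetical path to the recipe above
total = sum(component.weight for component in recipe.blend.values())
for name, component in recipe.blend.items():
    print(f"{name}: {component.weight / total:.0%}")
# high_level_subtask: 50%, low_level_execution: 30%, plan_generation: 10%, ...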
@@ -0,0 +1,74 @@
blend:
memory_update:
weight: 0.10
bindings:
prior_memory: "nth_prev(style=memory, offset=1)"
current_memory: "emitted_at(t, style=memory)"
completed_subtask: "nth_prev(style=subtask, offset=1)"
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
- {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
- {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
user_interjection_response:
weight: 0.16
bindings:
prior_plan: "nth_prev(style=plan, offset=1)"
current_plan: "emitted_at(t, style=plan)"
interjection: "emitted_at(t, style=interjection)"
speech: "emitted_at(t, role=assistant, tool_name=say)"
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "Previous plan:\n${prior_plan}", stream: high_level, if_present: prior_plan}
- {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
- {role: assistant, content: "${current_plan}", stream: high_level, target: true, if_present: current_plan, tool_calls_from: speech}
high_level_subtask:
weight: 0.15
bindings:
next_subtask: "nth_next(style=subtask, offset=1)"
messages:
- {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
- {role: user, content: "Current subtask: ${subtask}", stream: high_level, if_present: subtask}
- {role: assistant, content: "${next_subtask}", stream: high_level, target: true}
low_level_execution:
weight: 0.35
messages:
- {role: user, content: "${task}\nPlan: ${plan}\nMemory: ${memory}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: low_level, target: true}
# VQA is view-dependent: bbox / keypoint / count answers only make sense for
# the camera they were grounded against. Each camera gets its own sub-recipe
# so the resolver can disambiguate via `camera=...` and the user turn carries
# the matching image block. Adjust the camera keys (and add more sub-recipes)
# to match the cameras present in your dataset.
ask_vqa_top:
weight: 0.10
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.top}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_wrist:
weight: 0.10
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.wrist}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
+2
View File
@@ -56,6 +56,8 @@ class TrainPipelineConfig(HubMixin):
# Number of workers for the dataloader.
num_workers: int = 4
batch_size: int = 8
prefetch_factor: int = 4
persistent_workers: bool = True
steps: int = 100_000
eval_freq: int = 20_000
log_freq: int = 200
+27 -1
View File
@@ -34,9 +34,16 @@ from .dataset_tools import (
remove_feature,
split_dataset,
)
from .factory import make_dataset, resolve_delta_timestamps
from .image_writer import safe_stop_image_writer
from .io_utils import load_episodes, write_stats
from .language import (
EVENT_ONLY_STYLES,
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
STYLE_REGISTRY,
column_for_style,
)
from .lerobot_dataset import LeRobotDataset
from .multi_dataset import MultiLeRobotDataset
from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
@@ -45,6 +52,19 @@ from .streaming_dataset import StreamingLeRobotDataset
from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
from .video_utils import VideoEncodingManager
def make_dataset(*args, **kwargs):
"""Lazy proxy: import and call :func:`.factory.make_dataset` at call time."""
from .factory import make_dataset as _make_dataset
return _make_dataset(*args, **kwargs)
def resolve_delta_timestamps(*args, **kwargs):
"""Lazy proxy: import and call :func:`.factory.resolve_delta_timestamps` at call time."""
from .factory import resolve_delta_timestamps as _resolve_delta_timestamps
return _resolve_delta_timestamps(*args, **kwargs)
# NOTE: Low-level I/O functions (cast_stats_to_numpy, get_parquet_file_size_in_mb, etc.)
# and legacy migration constants are intentionally NOT re-exported here.
# Import directly: ``from lerobot.datasets.io_utils import ...``
@@ -53,10 +73,15 @@ __all__ = [
"CODEBASE_VERSION",
"DEFAULT_EPISODES_PATH",
"DEFAULT_QUANTILES",
"EVENT_ONLY_STYLES",
"EpisodeAwareSampler",
"LANGUAGE_EVENTS",
"LANGUAGE_PERSISTENT",
"LeRobotDataset",
"LeRobotDatasetMetadata",
"MultiLeRobotDataset",
"PERSISTENT_STYLES",
"STYLE_REGISTRY",
"StreamingLeRobotDataset",
"VideoEncodingManager",
"add_features",
@@ -66,6 +91,7 @@ __all__ = [
"convert_image_to_video_dataset",
"create_initial_features",
"create_lerobot_dataset_card",
"column_for_style",
"delete_episodes",
"get_feature_stats",
"load_episodes",
+1 -1
View File
@@ -512,7 +512,7 @@ def compute_episode_stats(
ep_stats = {}
for key, data in episode_data.items():
if features[key]["dtype"] == "string":
if features[key]["dtype"] in {"string", "language"}:
continue
if features[key]["dtype"] in ["image", "video"]:
+22 -3
View File
@@ -34,7 +34,6 @@ from .io_utils import (
load_episodes,
load_info,
load_stats,
load_subtasks,
load_tasks,
write_info,
write_json,
@@ -177,7 +176,6 @@ class LeRobotDatasetMetadata:
self.info = load_info(self.root)
check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
self.tasks = load_tasks(self.root)
self.subtasks = load_subtasks(self.root)
self.episodes = load_episodes(self.root)
self.stats = load_stats(self.root)
@@ -320,6 +318,28 @@ class LeRobotDatasetMetadata:
"""Keys to access visual modalities (regardless of their storage method)."""
return [key for key, ft in self.features.items() if ft["dtype"] in ["video", "image"]]
@property
def tools(self) -> list[dict]:
"""OpenAI-style tool schemas declared by this dataset.
Read from ``meta/info.json["tools"]``. Returns a copy, so callers
can mutate the result safely. Falls back to
:data:`lerobot.datasets.language.DEFAULT_TOOLS` (the canonical
``say`` schema) when the dataset doesn't declare any — that way
unannotated datasets and chat-template consumers
(``apply_chat_template(messages, tools=meta.tools)``) keep
working out of the box.
Implementations live under :mod:`lerobot.tools` (one file per
tool); see ``docs/source/tools.mdx`` for the authoring guide.
"""
from .language import DEFAULT_TOOLS # noqa: PLC0415 (avoid circular import)
declared = self.info.get("tools")
if isinstance(declared, list) and declared:
return [dict(t) for t in declared]
return [dict(t) for t in DEFAULT_TOOLS]
@property
def names(self) -> dict[str, list | dict]:
"""Names of the various dimensions of vector modalities."""
@@ -635,7 +655,6 @@ class LeRobotDatasetMetadata:
_validate_feature_names(features)
obj.tasks = None
obj.subtasks = None
obj.episodes = None
obj.stats = None
obj.info = create_empty_dataset_info(
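The ``tools`` property added above is meant to be handed straight to a chat template. A usage sketch, where the repo id and the processor checkpoint are placeholder assumptions:

from lerobot.datasets import LeRobotDatasetMetadata
from transformers import AutoProcessor

meta = LeRobotDatasetMetadata("user/my_annotated_dataset")  # hypothetical repo id
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")  # hypothetical checkpoint

messages = [{"role": "user", "content": "pick up the cup"}]
# Falls back to the canonical ``say`` schema when the dataset declares no tools.
prompt = processor.apply_chat_template(messages, tools=meta.tools, tokenize=False)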
+68 -15
View File
@@ -16,6 +16,7 @@
"""Private reader component for LeRobotDataset. Handles random-access reading (HF dataset, delta indices, video decoding)."""
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import datasets
@@ -49,6 +50,7 @@ class DatasetReader:
video_backend: str,
delta_timestamps: dict[str, list[float]] | None,
image_transforms: Callable | None,
return_uint8: bool = False,
):
"""Initialize the reader with metadata, filtering, and transform config.
@@ -73,6 +75,7 @@ class DatasetReader:
self._tolerance_s = tolerance_s
self._video_backend = video_backend
self._image_transforms = image_transforms
self._return_uint8 = return_uint8
self.hf_dataset: datasets.Dataset | None = None
self._absolute_to_relative_idx: dict[int, int] | None = None
@@ -105,10 +108,8 @@ class DatasetReader:
"""Build absolute-to-relative index mapping from loaded hf_dataset."""
self._absolute_to_relative_idx = None
if self.episodes is not None and self.hf_dataset is not None:
self._absolute_to_relative_idx = {
abs_idx.item() if isinstance(abs_idx, torch.Tensor) else abs_idx: rel_idx
for rel_idx, abs_idx in enumerate(self.hf_dataset["index"])
}
indices = self.hf_dataset.data.column("index").to_numpy()
self._absolute_to_relative_idx = dict(zip(indices.tolist(), range(len(indices)), strict=True))
@property
def num_frames(self) -> int:
@@ -125,10 +126,53 @@ class DatasetReader:
def _load_hf_dataset(self) -> datasets.Dataset:
"""hf_dataset contains all the observations, states, actions, rewards, etc."""
features = get_hf_features_from_features(self._meta.features)
# Datasets annotated with the PR1 language columns may have been
# written without registering those columns in ``meta/info.json``
# (e.g. they predate ``CODEBASE_VERSION="v3.1"`` and were
# back-filled by ``lerobot-annotate``). Probe a single parquet
# shard and graft the column features on so the strict
# ``Dataset.from_parquet`` cast doesn't fail with
# ``column names don't match``.
features = self._extend_features_with_language_columns(features)
hf_dataset = load_nested_dataset(self.root / "data", features=features, episodes=self.episodes)
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
def _extend_features_with_language_columns(
self, features: datasets.Features
) -> datasets.Features:
"""Add ``language_persistent`` / ``language_events`` to ``features``
when the underlying parquet shards declare them but the metadata
doesn't. No-op when neither column is present or both are
already registered.
"""
# Find any one parquet to peek at; bail if there are none yet
# (the dataset will fail later for an unrelated reason and we
# want that error to surface as-is).
try:
sample = next((self.root / "data").glob("*/*.parquet"))
except StopIteration:
return features
from pyarrow import parquet as _pq # noqa: PLC0415
schema_names = set(_pq.read_schema(sample).names)
from .language import ( # noqa: PLC0415
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
language_events_column_feature,
language_persistent_column_feature,
)
extra: dict[str, object] = {}
if LANGUAGE_PERSISTENT in schema_names and LANGUAGE_PERSISTENT not in features:
extra[LANGUAGE_PERSISTENT] = language_persistent_column_feature()
if LANGUAGE_EVENTS in schema_names and LANGUAGE_EVENTS not in features:
extra[LANGUAGE_EVENTS] = language_events_column_feature()
if not extra:
return features
return datasets.Features({**features, **extra})
def _check_cached_episodes_sufficient(self) -> bool:
"""Check if the cached dataset contains all requested episodes and their video files."""
if self.hf_dataset is None or len(self.hf_dataset) == 0:
@@ -235,16 +279,30 @@ class DatasetReader:
Segmentation Fault.
"""
ep = self._meta.episodes[ep_idx]
item = {}
for vid_key, query_ts in query_timestamps.items():
def _decode_single(vid_key: str, query_ts: list[float]) -> tuple[str, torch.Tensor]:
from_timestamp = ep[f"videos/{vid_key}/from_timestamp"]
shifted_query_ts = [from_timestamp + ts for ts in query_ts]
video_path = self.root / self._meta.get_video_file_path(ep_idx, vid_key)
frames = decode_video_frames(video_path, shifted_query_ts, self._tolerance_s, self._video_backend)
item[vid_key] = frames.squeeze(0)
frames = decode_video_frames(
video_path,
shifted_query_ts,
self._tolerance_s,
self._video_backend,
return_uint8=self._return_uint8,
)
return vid_key, frames.squeeze(0)
return item
items = list(query_timestamps.items())
# Single camera: no threading overhead
if len(items) <= 1:
return {vid_key: _decode_single(vid_key, query_ts)[1] for vid_key, query_ts in items}
# Multi-camera: decode in parallel (video decoding releases the GIL)
with ThreadPoolExecutor(max_workers=len(items)) as pool:
futures = [pool.submit(_decode_single, k, ts) for k, ts in items]
return dict(f.result() for f in futures)
def get_item(self, idx) -> dict:
"""Core __getitem__ logic. Assumes hf_dataset is loaded.
@@ -280,9 +338,4 @@ class DatasetReader:
task_idx = item["task_index"].item()
item["task"] = self._meta.tasks.iloc[task_idx].name
# add subtask information if available
if "subtask_index" in self._meta.features and self._meta.subtasks is not None:
subtask_idx = item["subtask_index"].item()
item["subtask"] = self._meta.subtasks.iloc[subtask_idx].name
return item
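A small debugging sketch related to the schema probe above: checking whether a local dataset's parquet shards already carry the two language columns before loading it. The local root is an assumption; the glob pattern mirrors the one used by ``_extend_features_with_language_columns``:

from pathlib import Path
from pyarrow import parquet as pq

root = Path("~/.cache/huggingface/lerobot/user/my_dataset").expanduser()  # hypothetical local root
sample = next((root / "data").glob("*/*.parquet"))
names = set(pq.read_schema(sample).names)
print("language_persistent" in names, "language_events" in names)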
+1 -1
View File
@@ -597,7 +597,7 @@ class DatasetWriter:
def cleanup_interrupted_episode(self, episode_index: int) -> None:
"""Remove temporary image directories for an interrupted episode."""
for key in self._meta.video_keys:
for key in self._meta.camera_keys:
img_dir = self._get_image_file_path(
episode_index=episode_index, image_key=key, frame_index=0
).parent
+2
View File
@@ -92,6 +92,7 @@ def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | MultiLeRobotDatas
image_transforms=image_transforms,
revision=cfg.dataset.revision,
video_backend=cfg.dataset.video_backend,
return_uint8=True,
tolerance_s=cfg.tolerance_s,
)
else:
@@ -104,6 +105,7 @@ def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | MultiLeRobotDatas
revision=cfg.dataset.revision,
max_num_shards=cfg.num_workers,
tolerance_s=cfg.tolerance_s,
return_uint8=True,
)
else:
raise NotImplementedError("The MultiLeRobotDataset isn't supported for now.")
+15 -1
View File
@@ -22,6 +22,12 @@ from PIL import Image as PILImage
from lerobot.utils.constants import DEFAULT_FEATURES
from lerobot.utils.utils import is_valid_numpy_dtype_string
from .language import (
LANGUAGE_PERSISTENT,
is_language_column,
language_events_column_feature,
language_persistent_column_feature,
)
from .utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
@@ -45,7 +51,13 @@ def get_hf_features_from_features(features: dict) -> datasets.Features:
"""
hf_features = {}
for key, ft in features.items():
if ft["dtype"] == "video":
if is_language_column(key):
hf_features[key] = (
language_persistent_column_feature()
if key == LANGUAGE_PERSISTENT
else language_events_column_feature()
)
elif ft["dtype"] == "video":
continue
elif ft["dtype"] == "image":
hf_features[key] = datasets.Image()
@@ -242,6 +254,8 @@ def validate_feature_dtype_and_shape(
return validate_feature_image_or_video(name, expected_shape, value)
elif expected_dtype == "string":
return validate_feature_string(name, value)
elif expected_dtype == "language":
return ""
else:
raise NotImplementedError(f"The feature dtype '{expected_dtype}' is not implemented yet.")
+2 -2
View File
@@ -30,13 +30,13 @@ def safe_stop_image_writer(func):
def wrapper(*args, **kwargs):
try:
return func(*args, **kwargs)
except Exception as e:
except BaseException:
dataset = kwargs.get("dataset")
writer = getattr(dataset, "writer", None) if dataset else None
if writer is not None and writer.image_writer is not None:
logger.warning("Waiting for image writer to terminate...")
writer.image_writer.stop()
raise e
raise
return wrapper
+8 -11
View File
@@ -34,7 +34,6 @@ from lerobot.utils.utils import SuppressProgressBars, flatten_dict, unflatten_di
from .utils import (
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_EPISODES_PATH,
DEFAULT_SUBTASKS_PATH,
DEFAULT_TASKS_PATH,
EPISODES_DIR,
INFO_PATH,
@@ -189,14 +188,6 @@ def load_tasks(local_dir: Path) -> pandas.DataFrame:
return tasks
def load_subtasks(local_dir: Path) -> pandas.DataFrame | None:
"""Load subtasks from subtasks.parquet if it exists."""
subtasks_path = local_dir / DEFAULT_SUBTASKS_PATH
if subtasks_path.exists():
return pd.read_parquet(subtasks_path)
return None
def write_episodes(episodes: Dataset, local_dir: Path) -> None:
"""Write episode metadata to a parquet file in the LeRobot v3.0 format.
This function writes episode-level metadata to a single parquet file.
@@ -268,11 +259,13 @@ def hf_transform_to_torch(items_dict: dict[str, list[Any]]) -> dict[str, list[to
dict: The batch with items converted to torch tensors.
"""
for key in items_dict:
if key in {"language_persistent", "language_events"}:
continue
first_item = items_dict[key][0]
if isinstance(first_item, PILImage.Image):
to_tensor = transforms.ToTensor()
items_dict[key] = [to_tensor(img) for img in items_dict[key]]
elif first_item is None:
elif first_item is None or isinstance(first_item, dict):
pass
else:
items_dict[key] = [x if isinstance(x, str) else torch.tensor(x) for x in items_dict[key]]
@@ -308,7 +301,11 @@ def item_to_torch(item: dict) -> dict:
dict: Dictionary with all tensor-like items converted to torch.Tensor.
"""
for key, val in item.items():
if isinstance(val, (np.ndarray | list)) and key not in ["task"]:
if isinstance(val, (np.ndarray | list)) and key not in [
"task",
"language_persistent",
"language_events",
]:
# Convert numpy arrays and lists to torch tensors
item[key] = torch.tensor(val)
return item
+250
View File
@@ -0,0 +1,250 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from typing import Literal
import datasets
import pyarrow as pa
LANGUAGE_PERSISTENT = "language_persistent"
LANGUAGE_EVENTS = "language_events"
LANGUAGE_COLUMNS = (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS)
PERSISTENT_ROW_FIELDS = ("role", "content", "style", "timestamp", "camera", "tool_calls")
EVENT_ROW_FIELDS = ("role", "content", "style", "camera", "tool_calls")
CORE_STYLES = {
"subtask",
"plan",
"memory",
"motion",
"interjection",
"vqa",
"trace",
"task_aug",
}
EXTENDED_STYLES = set()
STYLE_REGISTRY = CORE_STYLES | EXTENDED_STYLES
PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug"}
EVENT_ONLY_STYLES = {"interjection", "vqa", "trace"}
# Styles whose ``content`` is grounded in a specific camera view. Rows of these
# styles MUST carry a non-null ``camera`` referencing an ``observation.images.*``
# feature key. Rows of every other style MUST have ``camera=None``. ``motion``
# is intentionally NOT in this set: motion primitives are described in
# robot-frame (joint / Cartesian) terms, not pixel space, so they are
# camera-agnostic. ``trace`` is the pixel-trajectory event style and IS
# view-dependent. The ``camera`` field nevertheless lives on
# ``PERSISTENT_ROW_FIELDS`` too so the schema, validator, and resolver
# behave symmetrically across the two columns; persistent rows simply
# always have ``camera=None`` in practice today.
VIEW_DEPENDENT_STYLES = {"vqa", "trace"}
LanguageColumn = Literal["language_persistent", "language_events"]
def _json_arrow_type() -> pa.DataType:
"""Return the Arrow JSON type, falling back to ``string`` on older pyarrow."""
return pa.json_() if hasattr(pa, "json_") else pa.string()
def _json_feature() -> object:
"""Return the HF feature used for tool-call payloads.
Older ``datasets`` versions do not expose ``datasets.Json``. The
annotation pipeline currently emits the canonical ``say`` tool call
shape, so use that explicit struct instead of falling back to a string
that cannot cast structured parquet values.
"""
if hasattr(datasets, "Json"):
return datasets.Json()
return {
"type": datasets.Value("string"),
"function": {
"name": datasets.Value("string"),
"arguments": {"text": datasets.Value("string")},
},
}
def language_persistent_row_arrow_type() -> pa.StructType:
"""Return the Arrow struct type for a single persistent language row.
Persistent rows carry their own ``timestamp`` because they represent a state
that became active at a specific moment and remains active until superseded.
"""
return pa.struct(
[
pa.field("role", pa.string(), nullable=False),
pa.field("content", pa.string(), nullable=True),
pa.field("style", pa.string(), nullable=True),
pa.field("timestamp", pa.float64(), nullable=False),
pa.field("camera", pa.string(), nullable=True),
pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
]
)
def language_event_row_arrow_type() -> pa.StructType:
"""Return the Arrow struct type for a single event language row.
Event rows have no ``timestamp`` field: each event is stored on the dataset
row whose frame timestamp is the event's firing time.
"""
return pa.struct(
[
pa.field("role", pa.string(), nullable=False),
pa.field("content", pa.string(), nullable=True),
pa.field("style", pa.string(), nullable=True),
pa.field("camera", pa.string(), nullable=True),
pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
]
)
def language_persistent_arrow_type() -> pa.ListType:
"""Return the Arrow list type for the ``language_persistent`` column."""
return pa.list_(language_persistent_row_arrow_type())
def language_events_arrow_type() -> pa.ListType:
"""Return the Arrow list type for the ``language_events`` column."""
return pa.list_(language_event_row_arrow_type())
def language_persistent_row_feature() -> dict[str, object]:
"""Return the HF ``datasets`` feature mapping for a persistent language row."""
return {
"role": datasets.Value("string"),
"content": datasets.Value("string"),
"style": datasets.Value("string"),
"timestamp": datasets.Value("float64"),
"camera": datasets.Value("string"),
"tool_calls": datasets.List(_json_feature()),
}
def language_event_row_feature() -> dict[str, object]:
"""Return the HF ``datasets`` feature mapping for an event language row."""
return {
"role": datasets.Value("string"),
"content": datasets.Value("string"),
"style": datasets.Value("string"),
"camera": datasets.Value("string"),
"tool_calls": datasets.List(_json_feature()),
}
def language_persistent_column_feature() -> datasets.List:
"""Return the HF ``datasets`` feature for the ``language_persistent`` column."""
return datasets.List(language_persistent_row_feature())
def language_events_column_feature() -> datasets.List:
"""Return the HF ``datasets`` feature for the ``language_events`` column."""
return datasets.List(language_event_row_feature())
def language_feature_info() -> dict[str, dict]:
"""Return the ``info["features"]`` entries for both language columns."""
return {
LANGUAGE_PERSISTENT: {"dtype": "language", "shape": (1,), "names": None},
LANGUAGE_EVENTS: {"dtype": "language", "shape": (1,), "names": None},
}
def is_language_column(key: str) -> bool:
"""Return ``True`` if ``key`` is one of the dataset's language column names."""
return key in LANGUAGE_COLUMNS
def is_view_dependent_style(style: str | None) -> bool:
"""Return ``True`` if rows of ``style`` must be tagged with a ``camera`` key."""
return style in VIEW_DEPENDENT_STYLES
def validate_camera_field(style: str | None, camera: str | None) -> None:
"""Enforce the ``camera`` invariant: required iff ``style`` is view-dependent.
Raises ``ValueError`` if a view-dependent style is missing ``camera`` or if
a non-view-dependent style carries one. Pipeline writers and the validator
should call this on every emitted row.
"""
if is_view_dependent_style(style):
if not camera:
raise ValueError(
f"Rows of view-dependent style {style!r} require a non-empty 'camera' "
f"field referencing an 'observation.images.*' feature key."
)
elif camera is not None:
raise ValueError(
f"Rows of style {style!r} must have camera=None; got camera={camera!r}."
)
# --- Tool registry --------------------------------------------------------
# Tools declared on a dataset live in ``meta/info.json["tools"]`` as a list
# of OpenAI-style function schemas. The runtime / training stack reads them
# through :class:`LeRobotDatasetMetadata.tools` (with these constants as
# fallback when the dataset doesn't declare any). Implementations live
# under :mod:`lerobot.tools` (one file per tool); see
# ``docs/source/tools.mdx`` for the authoring guide.
SAY_TOOL_SCHEMA: dict = {
"type": "function",
"function": {
"name": "say",
"description": "Speak a short utterance to the user via the TTS executor.",
"parameters": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "The verbatim text to speak.",
}
},
"required": ["text"],
},
},
}
"""Canonical schema for the ``say`` tool emitted by the steerable
annotation pipeline (PR 2 Module 2). Single source of truth: PR 2's
writer, PR 3's runtime tool registry, and the dataset visualizer all
import this constant rather than duplicating the dict."""
DEFAULT_TOOLS: list[dict] = [SAY_TOOL_SCHEMA]
"""Fallback tools list. Returned by ``LeRobotDatasetMetadata.tools``
when ``meta/info.json["tools"]`` is unset, so unannotated datasets and
chat-template consumers (``apply_chat_template(messages, tools=...)``)
keep working out of the box."""
def column_for_style(style: str | None) -> LanguageColumn:
"""Map a language style to the column where rows of that style are stored.
Styles in :data:`PERSISTENT_STYLES` route to :data:`LANGUAGE_PERSISTENT`.
Styles in :data:`EVENT_ONLY_STYLES` and the implicit ``None`` style route
to :data:`LANGUAGE_EVENTS`.
"""
if style is None:
return LANGUAGE_EVENTS
if style in PERSISTENT_STYLES:
return LANGUAGE_PERSISTENT
if style in EVENT_ONLY_STYLES:
return LANGUAGE_EVENTS
raise ValueError(f"Unknown language style: {style!r}")
+593
View File
@@ -0,0 +1,593 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import copy
import hashlib
import re
from collections.abc import Sequence
from typing import Any
from lerobot.configs.recipe import DEFAULT_BINDINGS, TrainingRecipe
from .language import (
EVENT_ONLY_STYLES,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
column_for_style,
)
LanguageRow = dict[str, Any]
RenderedMessages = dict[str, list[Any]]
_RESOLVER_RE = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)\((?P<args>.*)\)$")
_PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
def active_at(
t: float,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow] | None = None,
style: str | None = None,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the persistent row of ``style`` that is active at time ``t``.
A persistent row is "active" at ``t`` when its own ``timestamp`` is the
most recent one ``<= t`` for the given ``style``/``role``/``tool_name``/
``camera`` selector. ``events`` is accepted for resolver-signature
uniformity but is not consulted: only persistent styles are valid here.
"""
_validate_persistent_resolver("active_at", style)
matches = _matching_rows(
persistent, style=style, role=role, tool_name=tool_name, camera=camera
)
matches = [row for row in matches if _timestamp(row) <= t]
return _select_latest(
matches, style=style, role=role, tool_name=tool_name, camera=camera
)
def emitted_at(
t: float,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow],
style: str | None = None,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the row of ``style`` emitted at exactly time ``t``.
For persistent styles, this matches persistent rows whose own ``timestamp``
equals ``t``. For event styles, the ``events`` list is assumed to come from
the dataset row at frame ``t`` (event rows carry no timestamp of their own),
so all matching event rows are considered emitted at ``t``. ``camera``
filters by the row's ``camera`` field — required to disambiguate when
multiple view-dependent rows share ``(t, role)`` across cameras.
"""
column = column_for_style(style)
if column == LANGUAGE_PERSISTENT:
matches = [
row
for row in _matching_rows(
persistent, style=style, role=role, tool_name=tool_name, camera=camera
)
if _timestamp(row) == t
]
return _select_one(
matches,
style=style,
role=role,
tool_name=tool_name,
camera=camera,
sort_key=_persistent_sort_key,
)
matches = _matching_rows(
events, style=style, role=role, tool_name=tool_name, camera=camera
)
return _select_one(
matches,
style=style,
role=role,
tool_name=tool_name,
camera=camera,
sort_key=_event_sort_key,
)
def nth_prev(
t: float,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow] | None = None,
style: str | None = None,
offset: int = 1,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the persistent row that was active ``offset`` steps before ``t``.
Walks back through chronologically sorted persistent rows of ``style``
(filtered by optional ``role``/``tool_name``/``camera``) and returns the
one ``offset`` positions before the row active at ``t``. Only valid for
persistent styles.
"""
return _nth_relative(
t,
persistent=persistent,
style=style,
offset=-offset,
role=role,
tool_name=tool_name,
camera=camera,
resolver_name="nth_prev",
)
def nth_next(
t: float,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow] | None = None,
style: str | None = None,
offset: int = 1,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the persistent row that becomes active ``offset`` steps after ``t``.
Walks forward through chronologically sorted persistent rows of ``style``
(filtered by optional ``role``/``tool_name``/``camera``) and returns the
one ``offset`` positions after the row active at ``t``. Only valid for
persistent styles.
"""
return _nth_relative(
t,
persistent=persistent,
style=style,
offset=offset,
role=role,
tool_name=tool_name,
camera=camera,
resolver_name="nth_next",
)
def render_sample(
*,
recipe: TrainingRecipe,
persistent: Sequence[LanguageRow] | None,
events: Sequence[LanguageRow] | None,
t: float,
sample_idx: int,
task: str | None = None,
dataset_ctx: Any | None = None,
) -> RenderedMessages | None:
"""Render the chat-style messages for a single dataset sample.
Resolves the recipe's bindings against ``persistent`` and ``events`` rows
at frame timestamp ``t``, then expands the recipe's message templates.
Returns ``None`` if the resolved sample contains no target message.
"""
persistent_rows = _normalize_rows(persistent or [])
event_rows = _normalize_rows(events or [])
selected_recipe = _select_recipe(recipe, sample_idx)
bindings = _resolve_bindings(
selected_recipe,
persistent=persistent_rows,
events=event_rows,
t=t,
sample_idx=sample_idx,
task=task,
dataset_ctx=dataset_ctx,
)
return _render_message_recipe(selected_recipe, bindings)
def _select_recipe(recipe: TrainingRecipe, sample_idx: int) -> TrainingRecipe:
"""Pick a deterministic blend component for ``sample_idx`` (or return ``recipe``)."""
if recipe.blend is None:
return recipe
total_weight = sum(component.weight or 0.0 for component in recipe.blend.values())
if total_weight <= 0:
raise ValueError("Blend weights must sum to a positive value.")
digest = hashlib.blake2b(str(sample_idx).encode(), digest_size=8).digest()
draw = int.from_bytes(digest, "big") / 2**64 * total_weight
cumulative = 0.0
last_component: TrainingRecipe | None = None
for component in recipe.blend.values():
last_component = component
cumulative += component.weight or 0.0
if draw < cumulative:
return component
assert last_component is not None
return last_component
def _resolve_bindings(
recipe: TrainingRecipe,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow],
t: float,
sample_idx: int,
task: str | None,
dataset_ctx: Any | None,
) -> dict[str, LanguageRow | str | None]:
"""Resolve every binding in ``recipe`` (plus ``task``) at time ``t``."""
bindings: dict[str, LanguageRow | str | None] = {
"task": _resolve_task(
task, dataset_ctx, persistent=persistent, sample_idx=sample_idx
),
}
specs = {**DEFAULT_BINDINGS, **(recipe.bindings or {})}
for name, spec in specs.items():
bindings[name] = _resolve_spec(spec, persistent=persistent, events=events, t=t)
return bindings
def _resolve_task(
task: str | None,
dataset_ctx: Any | None,
*,
persistent: Sequence[LanguageRow] = (),
sample_idx: int = 0,
) -> str | None:
"""Return the task string for ``sample_idx``.
Resolution order:
1. Explicit ``task`` override (caller-supplied) wins.
2. If ``persistent`` contains rows of style ``task_aug`` (role=user),
deterministically pick one by ``sample_idx`` so each frame of an
episode rotates through the available rephrasings across an epoch.
This realizes Xiao 2022 / CAST-style task-prompt diversity without
changing ``meta/tasks.parquet`` and without forcing recipes to opt
in: ``${task}`` automatically picks a rephrasing when one exists,
and falls back to the canonical task otherwise. Recipes that want
the literal canonical task can override the binding.
3. Otherwise read the canonical task from ``dataset_ctx`` (which is
backed by ``meta/tasks.parquet``).
"""
if task is not None:
return task
aug_rows = [
r
for r in persistent
if r.get("style") == "task_aug" and r.get("role") == "user"
]
if aug_rows:
# Deterministic, blake2b-based pick keyed on sample_idx so the
# rotation is reproducible across runs (Python's built-in ``hash``
# is process-randomized).
digest = hashlib.blake2b(
f"task_aug:{sample_idx}".encode(), digest_size=8
).digest()
idx = int.from_bytes(digest, "big") % len(aug_rows)
chosen = aug_rows[idx].get("content")
if chosen:
return str(chosen)
if dataset_ctx is None:
return None
if isinstance(dataset_ctx, dict):
return dataset_ctx.get("task")
return getattr(dataset_ctx, "task", None)
def _resolve_spec(
spec: str,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow],
t: float,
) -> LanguageRow | None:
"""Parse a single binding's resolver expression and dispatch to its function."""
match = _RESOLVER_RE.match(spec.strip())
if match is None:
raise ValueError(f"Invalid resolver expression: {spec!r}")
name = match.group("name")
kwargs = _parse_resolver_args(match.group("args"))
kwargs.pop("t_arg", None)
resolvers = {
"active_at": active_at,
"emitted_at": emitted_at,
"nth_prev": nth_prev,
"nth_next": nth_next,
}
if name not in resolvers:
raise ValueError(f"Unknown language resolver: {name!r}")
return resolvers[name](t, persistent=persistent, events=events, **kwargs)
def _parse_resolver_args(args: str) -> dict[str, Any]:
"""Parse a comma-separated resolver argument list into a kwargs dict."""
kwargs: dict[str, Any] = {}
if not args.strip():
return kwargs
parts = [part.strip() for part in args.split(",") if part.strip()]
for part in parts:
if part == "t":
kwargs["t_arg"] = True
continue
if "=" not in part:
raise ValueError(f"Invalid resolver argument: {part!r}")
key, value = (item.strip() for item in part.split("=", 1))
if key == "offset":
kwargs[key] = int(value)
else:
kwargs[key] = value.strip("\"'")
return kwargs
def _render_message_recipe(
recipe: TrainingRecipe,
bindings: dict[str, LanguageRow | str | None],
) -> RenderedMessages | None:
"""Expand ``recipe.messages`` into rendered chat messages using ``bindings``."""
assert recipe.messages is not None
messages: list[dict[str, Any]] = []
streams: list[str | None] = []
target_indices: list[int] = []
for turn in recipe.messages:
if turn.if_present is not None and bindings.get(turn.if_present) is None:
continue
message = {"role": turn.role}
if turn.content is not None:
message["content"] = _render_content(turn.content, bindings)
if turn.tool_calls_from is not None:
row = bindings.get(turn.tool_calls_from)
tool_calls = row.get("tool_calls") if isinstance(row, dict) else None
if tool_calls:
message["tool_calls"] = copy.deepcopy(tool_calls)
message_idx = len(messages)
messages.append(message)
streams.append(turn.stream)
if turn.target:
target_indices.append(message_idx)
if not target_indices:
return None
rendered = {
"messages": messages,
"message_streams": streams,
"target_message_indices": target_indices,
}
_validate_rendered(rendered)
return rendered
def _render_content(
content: str | list[dict[str, Any]],
bindings: dict[str, LanguageRow | str | None],
) -> str | list[dict[str, Any]]:
"""Substitute bindings into a string or each string field of multimodal blocks."""
if isinstance(content, str):
return _substitute(content, bindings)
rendered_blocks = []
for block in content:
rendered_block = copy.deepcopy(block)
for key, value in rendered_block.items():
if isinstance(value, str):
rendered_block[key] = _substitute(value, bindings)
rendered_blocks.append(rendered_block)
return rendered_blocks
def _substitute(template: str, bindings: dict[str, LanguageRow | str | None]) -> str:
"""Replace ``${name}`` placeholders in ``template`` with their bound values."""
def replace(match: re.Match[str]) -> str:
"""Resolve a single ``${name}`` match to its bound string value."""
name = match.group(1)
if name not in bindings:
raise ValueError(f"Unknown template binding: {name!r}")
value = bindings[name]
if value is None:
return ""
if isinstance(value, dict):
content = value.get("content")
return "" if content is None else str(content)
return str(value)
return _PLACEHOLDER_RE.sub(replace, template)
def _validate_rendered(rendered: RenderedMessages) -> None:
"""Sanity-check the rendered output for stream/target alignment."""
messages = rendered["messages"]
streams = rendered["message_streams"]
target_indices = rendered["target_message_indices"]
if len(streams) != len(messages):
raise ValueError("message_streams must be aligned with messages.")
if not target_indices:
raise ValueError("Rendered samples must contain at least one target message.")
for idx in target_indices:
if idx < 0 or idx >= len(messages):
raise ValueError(f"Target message index {idx} is out of bounds.")
for idx, stream in enumerate(streams):
if stream is None:
raise ValueError(f"Rendered message {idx} has no stream.")
def _nth_relative(
t: float,
*,
persistent: Sequence[LanguageRow],
style: str | None,
offset: int,
role: str | None,
tool_name: str | None,
camera: str | None,
resolver_name: str,
) -> LanguageRow | None:
"""Shared body for ``nth_prev`` / ``nth_next`` with signed ``offset``."""
_validate_persistent_resolver(resolver_name, style)
if abs(offset) < 1:
raise ValueError(f"{resolver_name} offset must be non-zero.")
rows = sorted(
_matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera),
key=_persistent_sort_key,
)
if not rows:
return None
anchor_idx = None
for idx, row in enumerate(rows):
if _timestamp(row) <= t:
anchor_idx = idx
else:
break
target_idx = (offset - 1 if offset > 0 else None) if anchor_idx is None else anchor_idx + offset
if target_idx is None or target_idx < 0 or target_idx >= len(rows):
return None
return rows[target_idx]
def _validate_persistent_resolver(resolver_name: str, style: str | None) -> None:
"""Reject calls with missing or event-only ``style`` for persistent resolvers."""
if style is None:
raise ValueError(f"{resolver_name} requires a persistent style.")
if style in EVENT_ONLY_STYLES:
raise ValueError(f"{resolver_name} cannot be used with event-only style {style!r}.")
if style not in PERSISTENT_STYLES:
column_for_style(style)
def _matching_rows(
rows: Sequence[LanguageRow],
*,
style: str | None,
role: str | None,
tool_name: str | None,
camera: str | None,
) -> list[LanguageRow]:
"""Return ``rows`` filtered by optional ``style``/``role``/``tool_name``/``camera`` selectors."""
return [
row
for row in rows
if (style is None or row.get("style") == style)
and (role is None or row.get("role") == role)
and (tool_name is None or _row_has_tool_name(row, tool_name))
and (camera is None or row.get("camera") == camera)
]
def _select_latest(
rows: Sequence[LanguageRow],
*,
style: str | None,
role: str | None,
tool_name: str | None,
camera: str | None,
) -> LanguageRow | None:
"""Return the row tied for the latest ``timestamp`` (disambiguated by selectors)."""
if not rows:
return None
rows = sorted(rows, key=_persistent_sort_key)
latest_ts = _timestamp(rows[-1])
return _select_one(
[row for row in rows if _timestamp(row) == latest_ts],
style=style,
role=role,
tool_name=tool_name,
camera=camera,
sort_key=_persistent_sort_key,
)
def _select_one(
rows: Sequence[LanguageRow],
*,
style: str | None,
role: str | None,
tool_name: str | None,
camera: str | None,
sort_key: Any,
) -> LanguageRow | None:
"""Return the single matching row, or raise if the selectors are ambiguous."""
if not rows:
return None
if len(rows) > 1 and role is None and tool_name is None and camera is None:
raise ValueError(
f"Ambiguous resolver for style={style!r}; add role=..., tool_name=..., "
f"or camera=... to disambiguate."
)
return sorted(rows, key=sort_key)[0]
def _persistent_sort_key(row: LanguageRow) -> tuple[float, str, str]:
"""Sort key for persistent rows: ``(timestamp, style, role)``."""
return (_timestamp(row), row.get("style") or "", row.get("role") or "")
def _event_sort_key(row: LanguageRow) -> tuple[str, str]:
"""Sort key for event rows: ``(style, role)`` (timestamp is implicit in the frame)."""
return (row.get("style") or "", row.get("role") or "")
def _timestamp(row: LanguageRow) -> float:
"""Extract a row's ``timestamp`` as a Python float (unwrapping numpy scalars)."""
value = row["timestamp"]
return float(value.item() if hasattr(value, "item") else value)
def _row_has_tool_name(row: LanguageRow, tool_name: str) -> bool:
"""Return ``True`` if any of the row's tool calls invokes ``tool_name``."""
for tool_call in row.get("tool_calls") or []:
if isinstance(tool_call, str):
continue
function = tool_call.get("function") if isinstance(tool_call, dict) else None
if isinstance(function, dict) and function.get("name") == tool_name:
return True
return False
def _normalize_rows(rows: Sequence[Any]) -> list[LanguageRow]:
"""Convert pyarrow scalars / mappings into a fresh list of plain dict rows."""
normalized = []
for row in rows:
if row is None:
continue
if hasattr(row, "as_py"):
row = row.as_py()
if not isinstance(row, dict):
raise TypeError(f"Language rows must be dictionaries, got {type(row).__name__}.")
normalized.append(dict(row))
return normalized
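An end-to-end toy example of ``render_sample``: two persistent subtask rows and a two-turn message recipe. The import path of this resolver module and all row contents are assumptions made for illustration:

from lerobot.configs.recipe import TrainingRecipe
from lerobot.datasets.language_resolver import render_sample  # module path assumed

recipe = TrainingRecipe.from_dict(
    {
        "bindings": {"subtask": "active_at(t, style=subtask)"},
        "messages": [
            {"role": "user", "content": "${task}", "stream": "high_level"},
            {"role": "assistant", "content": "${subtask}", "stream": "high_level",
             "target": True, "if_present": "subtask"},
        ],
    }
)
persistent = [
    {"role": "assistant", "content": "grasp the cup", "style": "subtask",
     "timestamp": 0.0, "camera": None, "tool_calls": None},
    {"role": "assistant", "content": "place the cup on the tray", "style": "subtask",
     "timestamp": 4.2, "camera": None, "tool_calls": None},
]
rendered = render_sample(
    recipe=recipe, persistent=persistent, events=[], t=5.0, sample_idx=0,
    task="clear the table",
)
# The subtask active at t=5.0 is the row with timestamp 4.2:
# rendered["messages"][-1]["content"] == "place the cup on the tray"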
+6
View File
@@ -56,6 +56,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
force_cache_sync: bool = False,
download_videos: bool = True,
video_backend: str | None = None,
return_uint8: bool = False,
batch_encoding_size: int = 1,
vcodec: str = "libsvtav1",
streaming_encoding: bool = False,
@@ -202,6 +203,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
self.tolerance_s = tolerance_s
self.revision = revision if revision else CODEBASE_VERSION
self._video_backend = video_backend if video_backend else get_safe_default_codec()
self._return_uint8 = return_uint8
self._batch_encoding_size = batch_encoding_size
self._vcodec = resolve_vcodec(vcodec)
self._encoder_threads = encoder_threads
@@ -225,6 +227,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend=self._video_backend,
delta_timestamps=delta_timestamps,
image_transforms=image_transforms,
return_uint8=self._return_uint8,
)
# Load actual data
@@ -288,6 +291,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend=self._video_backend,
delta_timestamps=self.delta_timestamps,
image_transforms=self.image_transforms,
return_uint8=self._return_uint8,
)
return self.reader
@@ -683,6 +687,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.delta_timestamps = None
obj.episodes = None
obj._video_backend = video_backend if video_backend is not None else get_safe_default_codec()
obj._return_uint8 = False
obj._batch_encoding_size = batch_encoding_size
obj._vcodec = vcodec
obj._encoder_threads = encoder_threads
@@ -775,6 +780,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.delta_timestamps = None
obj.episodes = None
obj._video_backend = video_backend if video_backend else get_safe_default_codec()
obj._return_uint8 = False
obj._batch_encoding_size = batch_encoding_size
obj._vcodec = vcodec
obj._encoder_threads = encoder_threads
+7 -1
View File
@@ -251,6 +251,7 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
seed: int = 42,
rng: np.random.Generator | None = None,
shuffle: bool = True,
return_uint8: bool = False,
):
"""Initialize a StreamingLeRobotDataset.
@@ -288,6 +289,7 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
self.streaming = streaming
self.buffer_size = buffer_size
self._return_uint8 = return_uint8
# We cache the video decoders to avoid re-initializing them at each frame (avoiding a ~10x slowdown)
self.video_decoder_cache = None
@@ -553,7 +555,11 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
root = self.meta.url_root if self.streaming and not self.streaming_from_local else self.root
video_path = f"{root}/{self.meta.get_video_file_path(ep_idx, video_key)}"
frames = decode_video_frames_torchcodec(
video_path, query_ts, self.tolerance_s, decoder_cache=self.video_decoder_cache
video_path,
query_ts,
self.tolerance_s,
decoder_cache=self.video_decoder_cache,
return_uint8=self._return_uint8,
)
item[video_key] = frames.squeeze(0) if len(query_ts) == 1 else frames
+17 -11
View File
@@ -83,7 +83,6 @@ VIDEO_DIR = "videos"
CHUNK_FILE_PATTERN = "chunk-{chunk_index:03d}/file-{file_index:03d}"
DEFAULT_TASKS_PATH = "meta/tasks.parquet"
DEFAULT_SUBTASKS_PATH = "meta/subtasks.parquet"
DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
@@ -238,17 +237,24 @@ def get_safe_version(repo_id: str, version: str | packaging.version.Version) ->
hub_versions = get_repo_versions(repo_id)
if not hub_versions:
raise RevisionNotFoundError(
f"""Your dataset must be tagged with a codebase version.
Assuming _version_ is the codebase_version value in the info.json, you can run this:
```python
from huggingface_hub import HfApi
hub_api = HfApi()
hub_api.create_tag("{repo_id}", tag="_version_", repo_type="dataset")
```
"""
msg = (
f"Repo {repo_id!r} has no codebase-version tags. The dataset "
f"either doesn't exist on the Hub yet, or it was uploaded "
f"without a ``v3.x``-style tag. To tag an existing dataset run:\n"
f" from huggingface_hub import HfApi\n"
f" HfApi().create_tag({repo_id!r}, tag='v3.0', repo_type='dataset', exist_ok=True)"
)
# ``RevisionNotFoundError`` extends ``HfHubHTTPError`` whose
# ``__init__`` indexes ``response.headers`` unconditionally on
# current ``huggingface_hub`` versions. Constructing it without
# a real ``Response`` object crashes with either
# ``TypeError: missing 1 required keyword-only argument`` (old
# builds) or ``AttributeError: 'NoneType' object has no attribute
# 'headers'`` (new builds). Skip that path entirely — this isn't
# really an HTTP error, it's a configuration issue — and raise a
# plain ``RuntimeError`` so the message actually reaches the
# caller.
raise RuntimeError(msg)
if target_version in hub_versions:
return f"v{target_version}"
+22 -10
View File
@@ -123,6 +123,7 @@ def decode_video_frames(
timestamps: list[float],
tolerance_s: float,
backend: str | None = None,
return_uint8: bool = False,
) -> torch.Tensor:
"""
Decodes video frames using the specified backend.
@@ -131,19 +132,23 @@ def decode_video_frames(
video_path (Path): Path to the video file.
timestamps (list[float]): List of timestamps to extract frames.
tolerance_s (float): Allowed deviation in seconds for frame retrieval.
backend (str, optional): Backend to use for decoding. Defaults to "torchcodec" when available in the platform; otherwise, defaults to "pyav"..
backend (str, optional): Backend to use for decoding. Defaults to "torchcodec" when available in the platform; otherwise, defaults to "pyav".
return_uint8 (bool): If True, return raw uint8 frames without float32 normalization.
This reduces memory for DataLoader IPC; normalization can be done on GPU afterward.
Returns:
torch.Tensor: Decoded frames.
torch.Tensor: Decoded frames (float32 in [0,1] by default, or uint8 if return_uint8=True).
Currently supports torchcodec on cpu and pyav.
"""
if backend is None:
backend = get_safe_default_codec()
if backend == "torchcodec":
return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s, return_uint8=return_uint8)
elif backend in ["pyav", "video_reader"]:
return decode_video_frames_torchvision(video_path, timestamps, tolerance_s, backend)
return decode_video_frames_torchvision(
video_path, timestamps, tolerance_s, backend, return_uint8=return_uint8
)
else:
raise ValueError(f"Unsupported video backend: {backend}")
@@ -154,6 +159,7 @@ def decode_video_frames_torchvision(
tolerance_s: float,
backend: str = "pyav",
log_loaded_timestamps: bool = False,
return_uint8: bool = False,
) -> torch.Tensor:
"""Loads frames associated to the requested timestamps of a video
@@ -240,14 +246,17 @@ def decode_video_frames_torchvision(
if log_loaded_timestamps:
logger.info(f"{closest_ts=}")
# convert to the pytorch format which is float32 in [0,1] range (and channel first)
closest_frames = closest_frames.type(torch.float32) / 255
if len(timestamps) != len(closest_frames):
raise FrameTimestampError(
f"Number of retrieved frames ({len(closest_frames)}) does not match "
f"number of queried timestamps ({len(timestamps)})"
)
if return_uint8:
return closest_frames
# convert to the pytorch format which is float32 in [0,1] range (and channel first)
closest_frames = closest_frames.type(torch.float32) / 255
return closest_frames
@@ -306,6 +315,7 @@ def decode_video_frames_torchcodec(
tolerance_s: float,
log_loaded_timestamps: bool = False,
decoder_cache: VideoDecoderCache | None = None,
return_uint8: bool = False,
) -> torch.Tensor:
"""Loads frames associated with the requested timestamps of a video using torchcodec.
@@ -373,14 +383,16 @@ def decode_video_frames_torchcodec(
if log_loaded_timestamps:
logger.info(f"{closest_ts=}")
# convert to float32 in [0,1] range
closest_frames = (closest_frames / 255.0).type(torch.float32)
if not len(timestamps) == len(closest_frames):
raise FrameTimestampError(
f"Retrieved timestamps differ from queried {set(closest_frames) - set(timestamps)}"
)
if return_uint8:
return closest_frames
# convert to float32 in [0,1] range
closest_frames = (closest_frames / 255.0).type(torch.float32)
return closest_frames
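The intended consumer pattern for ``return_uint8``: keep frames as uint8 through DataLoader worker IPC, then normalize on the training device. Repo id and camera key below are illustrative:

import torch
from torch.utils.data import DataLoader
from lerobot.datasets import LeRobotDataset

ds = LeRobotDataset("user/my_dataset", return_uint8=True)  # hypothetical repo id
loader = DataLoader(ds, batch_size=8, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    frames = batch["observation.images.top"].to(device, non_blocking=True)
    frames = frames.float() / 255.0  # deferred float32 [0,1] normalization on the device
    break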
+310
View File
@@ -331,6 +331,7 @@ class LiberoEnv(EnvConfig):
camera_name_mapping: dict[str, str] | None = None
observation_height: int = 360
observation_width: int = 360
is_libero_plus: bool = False
features: dict[str, PolicyFeature] = field(
default_factory=lambda: {
ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
@@ -432,6 +433,7 @@ class LiberoEnv(EnvConfig):
control_mode=self.control_mode,
episode_length=self.episode_length,
camera_name_mapping=self.camera_name_mapping,
is_libero_plus=self.is_libero_plus,
)
def get_env_processors(self):
@@ -496,6 +498,146 @@ class MetaworldEnv(EnvConfig):
)
@EnvConfig.register_subclass("robocasa")
@dataclass
class RoboCasaEnv(EnvConfig):
task: str = "CloseFridge"
fps: int = 20
episode_length: int = 1000
obs_type: str = "pixels_agent_pos"
render_mode: str = "rgb_array"
camera_name: str = "robot0_agentview_left,robot0_eye_in_hand,robot0_agentview_right"
observation_height: int = 256
observation_width: int = 256
visualization_height: int = 512
visualization_width: int = 512
split: str | None = None
# Object-mesh registries to sample from. Upstream default is
# ("objaverse", "lightwheel"), but objaverse is ~30GB and the CI image
# only ships the lightwheel pack. Override to include objaverse once
# you've run `python -m robocasa.scripts.download_kitchen_assets
# --type objaverse` locally.
obj_registries: list[str] = field(default_factory=lambda: ["lightwheel"])
features: dict[str, PolicyFeature] = field(
default_factory=lambda: {ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(12,))}
)
features_map: dict[str, str] = field(default_factory=lambda: {ACTION: ACTION, "agent_pos": OBS_STATE})
def __post_init__(self):
if self.obs_type not in ("pixels", "pixels_agent_pos"):
raise ValueError(f"Unsupported obs_type: {self.obs_type}")
# Preserve raw RoboCasa camera names end-to-end (e.g.
# `observation.images.robot0_agentview_left`). This matches the
# naming convention used by the RoboCasa datasets on the Hub, so
# trained policies don't need a `--rename_map` at eval time.
cams = [c.strip() for c in self.camera_name.split(",") if c.strip()]
for cam in cams:
self.features[f"pixels/{cam}"] = PolicyFeature(
type=FeatureType.VISUAL,
shape=(self.observation_height, self.observation_width, 3),
)
self.features_map[f"pixels/{cam}"] = f"{OBS_IMAGES}.{cam}"
if self.obs_type == "pixels_agent_pos":
self.features["agent_pos"] = PolicyFeature(type=FeatureType.STATE, shape=(16,))
@property
def gym_kwargs(self) -> dict:
kwargs: dict[str, Any] = {
"obs_type": self.obs_type,
"render_mode": self.render_mode,
"observation_height": self.observation_height,
"observation_width": self.observation_width,
"visualization_height": self.visualization_height,
"visualization_width": self.visualization_width,
}
if self.split is not None:
kwargs["split"] = self.split
return kwargs
def create_envs(self, n_envs: int, use_async_envs: bool = False):
from .robocasa import create_robocasa_envs
if self.task is None:
raise ValueError("RoboCasaEnv requires a task to be specified")
env_cls = _make_vec_env_cls(use_async_envs, n_envs)
return create_robocasa_envs(
task=self.task,
n_envs=n_envs,
camera_name=self.camera_name,
gym_kwargs=self.gym_kwargs,
env_cls=env_cls,
episode_length=self.episode_length,
obj_registries=tuple(self.obj_registries),
)
@EnvConfig.register_subclass("vlabench")
@dataclass
class VLABenchEnv(EnvConfig):
task: str = "select_fruit"
fps: int = 10
episode_length: int = 500
obs_type: str = "pixels_agent_pos"
render_mode: str = "rgb_array"
render_resolution: tuple[int, int] = (480, 480)
robot: str = "franka"
action_mode: str = "eef"
features: dict[str, PolicyFeature] = field(
default_factory=lambda: {
ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
}
)
features_map: dict[str, str] = field(
default_factory=lambda: {
ACTION: ACTION,
"agent_pos": OBS_STATE,
"pixels/image": f"{OBS_IMAGES}.image",
"pixels/second_image": f"{OBS_IMAGES}.second_image",
"pixels/wrist_image": f"{OBS_IMAGES}.wrist_image",
}
)
def __post_init__(self):
h, w = self.render_resolution
if self.obs_type == "pixels":
self.features["pixels/image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
self.features["pixels/second_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
self.features["pixels/wrist_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
elif self.obs_type == "pixels_agent_pos":
self.features["pixels/image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
self.features["pixels/second_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
self.features["pixels/wrist_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
self.features["agent_pos"] = PolicyFeature(type=FeatureType.STATE, shape=(7,))
else:
raise ValueError(f"Unsupported obs_type: {self.obs_type}")
@property
def gym_kwargs(self) -> dict:
return {
"obs_type": self.obs_type,
"render_mode": self.render_mode,
"render_resolution": self.render_resolution,
"robot": self.robot,
"max_episode_steps": self.episode_length,
"action_mode": self.action_mode,
}
def create_envs(self, n_envs: int, use_async_envs: bool = False):
from .vlabench import create_vlabench_envs
if self.task is None:
raise ValueError("VLABenchEnv requires a task to be specified")
env_cls = _make_vec_env_cls(use_async_envs, n_envs)
return create_vlabench_envs(
task=self.task,
n_envs=n_envs,
gym_kwargs=self.gym_kwargs,
env_cls=env_cls,
)
@EnvConfig.register_subclass("isaaclab_arena")
@dataclass
class IsaaclabArenaEnv(HubEnvConfig):
@@ -574,3 +716,171 @@ class IsaaclabArenaEnv(HubEnvConfig):
),
PolicyProcessorPipeline(steps=[]),
)
@EnvConfig.register_subclass("libero_plus")
@dataclass
class LiberoPlusEnv(LiberoEnv):
"""Config for LIBERO-plus robustness benchmark evaluation.
LIBERO-plus extends LIBERO with 7 perturbation dimensions (camera viewpoints,
object layouts, robot initial states, language instructions, lighting, background
textures, sensor noise) producing ~10k task variants.
The gym interface is identical to LIBERO, so this class reuses ``LiberoEnv``
entirely; only the registered name and the default task suite differ.
Install: see docker/Dockerfile.benchmark.libero_plus. LIBERO-plus ships
as a namespace package from a git fork and must be cloned and added to
PYTHONPATH rather than installed as a pyproject extra.
See Also:
https://github.com/sylvestf/LIBERO-plus
"""
task: str = "libero_spatial"
is_libero_plus: bool = True
@EnvConfig.register_subclass("robotwin")
@dataclass
class RoboTwinEnvConfig(EnvConfig):
"""Configuration for RoboTwin 2.0 benchmark environments.
RoboTwin 2.0 is a dual-arm manipulation benchmark with 50 tasks built on the
SAPIEN simulator. The robot is an Aloha-AgileX bimanual platform with 14 DOF
(7 per arm). All three cameras are enabled by default.
See: https://robotwin-platform.github.io
Dataset: https://huggingface.co/datasets/lerobot/robotwin_unified
"""
task: str = "beat_block_hammer" # single task or comma-separated list
fps: int = 25
episode_length: int = 300
obs_type: str = "pixels_agent_pos"
render_mode: str = "rgb_array"
# Available cameras from RoboTwin's aloha-agilex embodiment: head_camera
# (torso-mounted) + left_camera / right_camera (wrists).
camera_names: str = "head_camera,left_camera,right_camera"
# Match the D435 dims in task_config/demo_clean.yml (_camera_config.yml).
# Gym's vector-env concatenate pre-allocates buffers of this shape, so it
# must equal what SAPIEN actually renders.
observation_height: int = 240
observation_width: int = 320
features: dict[str, PolicyFeature] = field(
default_factory=lambda: {
ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(14,)),
}
)
features_map: dict[str, str] = field(
default_factory=lambda: {
ACTION: ACTION,
"pixels/head_camera": f"{OBS_IMAGES}.head_camera",
"pixels/left_camera": f"{OBS_IMAGES}.left_camera",
"pixels/right_camera": f"{OBS_IMAGES}.right_camera",
"agent_pos": OBS_STATE,
}
)
def __post_init__(self):
cam_list = [c.strip() for c in self.camera_names.split(",") if c.strip()]
for cam in cam_list:
self.features[f"pixels/{cam}"] = PolicyFeature(
type=FeatureType.VISUAL,
shape=(self.observation_height, self.observation_width, 3),
)
# Keep features_map entry if already set (default_factory); add if missing.
key = f"pixels/{cam}"
if key not in self.features_map:
self.features_map[key] = f"{OBS_IMAGES}.{cam}"
if self.obs_type == "pixels_agent_pos":
self.features["agent_pos"] = PolicyFeature(
type=FeatureType.STATE,
shape=(14,), # 14 DOF: 7 per arm
)
elif self.obs_type != "pixels":
raise ValueError(
f"Unsupported obs_type '{self.obs_type}'. "
"RoboTwinEnvConfig supports 'pixels' and 'pixels_agent_pos'."
)
@property
def gym_kwargs(self) -> dict:
return {}
def create_envs(self, n_envs: int, use_async_envs: bool = True):
from lerobot.envs.robotwin import create_robotwin_envs
if not self.task:
raise ValueError("RoboTwinEnvConfig requires `task` to be specified.")
env_cls = _make_vec_env_cls(use_async_envs, n_envs)
cam_list = [c.strip() for c in self.camera_names.split(",") if c.strip()]
return create_robotwin_envs(
task=self.task,
n_envs=n_envs,
env_cls=env_cls,
camera_names=cam_list,
observation_height=self.observation_height,
observation_width=self.observation_width,
episode_length=self.episode_length,
)
@EnvConfig.register_subclass("robomme")
@dataclass
class RoboMMEEnv(EnvConfig):
"""RoboMME memory-augmented manipulation benchmark (ManiSkill/SAPIEN).
16 tasks across 4 suites: Counting, Permanence, Reference, Imitation.
Dataset: lerobot/robomme (LeRobot v3.0, 1,600 episodes).
Benchmark: https://github.com/RoboMME/robomme_benchmark
Requires the `robomme` git package installed separately (Linux only);
see docker/Dockerfile.benchmark.robomme for the canonical install.
"""
task: str = "PickXtimes"
fps: int = 10
episode_length: int = 300
action_space: str = "joint_angle" # or "ee_pose" (7-D)
dataset_split: str = "test" # "train" | "val" | "test"
task_ids: list[int] | None = None
features: dict[str, PolicyFeature] = field(default_factory=dict)
features_map: dict[str, str] = field(
default_factory=lambda: {
ACTION: ACTION,
"pixels/image": f"{OBS_IMAGES}.image",
"pixels/wrist_image": f"{OBS_IMAGES}.wrist_image",
"agent_pos": OBS_STATE,
}
)
def __post_init__(self):
action_dim = 8 if self.action_space == "joint_angle" else 7
self.features = {
ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(action_dim,)),
"pixels/image": PolicyFeature(type=FeatureType.VISUAL, shape=(256, 256, 3)),
"pixels/wrist_image": PolicyFeature(type=FeatureType.VISUAL, shape=(256, 256, 3)),
"agent_pos": PolicyFeature(type=FeatureType.STATE, shape=(8,)),
}
@property
def gym_kwargs(self) -> dict:
return {}
def create_envs(self, n_envs: int, use_async_envs: bool = True):
from lerobot.envs.robomme import create_robomme_envs
env_cls = _make_vec_env_cls(use_async_envs, n_envs)
return create_robomme_envs(
task=self.task,
n_envs=n_envs,
action_space_type=self.action_space,
dataset=self.dataset_split,
episode_length=self.episode_length,
task_ids=self.task_ids,
env_cls=env_cls,
)
+46 -26
@@ -16,6 +16,7 @@
from __future__ import annotations
import os
import re
from collections import defaultdict
from collections.abc import Callable, Iterable, Mapping, Sequence
from functools import partial
@@ -31,20 +32,7 @@ from libero.libero.envs import OffScreenRenderEnv
from lerobot.types import RobotObservation
from .utils import _LazyAsyncVectorEnv
def _parse_camera_names(camera_name: str | Sequence[str]) -> list[str]:
"""Normalize camera_name into a non-empty list of strings."""
if isinstance(camera_name, str):
cams = [c.strip() for c in camera_name.split(",") if c.strip()]
elif isinstance(camera_name, (list | tuple)):
cams = [str(c).strip() for c in camera_name if str(c).strip()]
else:
raise TypeError(f"camera_name must be str or sequence[str], got {type(camera_name).__name__}")
if not cams:
raise ValueError("camera_name resolved to an empty list.")
return cams
from .utils import _LazyAsyncVectorEnv, parse_camera_names
def _get_suite(name: str) -> benchmark.Benchmark:
@@ -69,14 +57,34 @@ def _select_task_ids(total_tasks: int, task_ids: Iterable[int] | None) -> list[i
return ids
def get_task_init_states(task_suite: Any, i: int) -> np.ndarray:
init_states_path = (
Path(get_libero_path("init_states"))
/ task_suite.tasks[i].problem_folder
/ task_suite.tasks[i].init_states_file
)
init_states = torch.load(init_states_path, weights_only=False) # nosec B614
return init_states
# LIBERO-plus perturbation variants encode the perturbation in the filename
# but on disk only the base `.pruned_init` exists — strip the suffix to match
# LIBERO-plus's own suite.get_task_init_states() (we reimplement it here so we
# can pass weights_only=False for PyTorch 2.6+ numpy pickles).
_LIBERO_PERTURBATION_SUFFIX_RE = re.compile(r"_(?:language|view|light)_[^.]*|_(?:table|tb)_\d+")
def get_task_init_states(task_suite: Any, i: int, is_libero_plus: bool = False) -> np.ndarray:
task = task_suite.tasks[i]
filename = Path(task.init_states_file)
root = Path(get_libero_path("init_states"))
if not is_libero_plus:
init_states_path = root / task.problem_folder / filename.name
return torch.load(init_states_path, weights_only=False) # nosec B614
# LIBERO-plus: `_add_` / `_level` variants store extra-object layouts under
# libero_newobj/ as a flat array that must be reshaped to (1, -1).
if "_add_" in filename.name or "_level" in filename.name:
init_states_path = root / "libero_newobj" / task.problem_folder / filename.name
init_states = torch.load(init_states_path, weights_only=False) # nosec B614
return init_states.reshape(1, -1)
# LIBERO-plus perturbation variants encode the perturbation in the filename
# but on disk only the base `.pruned_init` exists — strip the suffix to match.
stripped = _LIBERO_PERTURBATION_SUFFIX_RE.sub("", filename.stem) + filename.suffix
init_states_path = root / task.problem_folder / stripped
return torch.load(init_states_path, weights_only=False) # nosec B614
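# Worked example of the suffix stripping above. The filename is hypothetical,
# chosen purely to exercise the regex; real LIBERO-plus names may differ.
example = Path("pick_up_the_black_bowl_view_3.pruned_init")
stripped_example = _LIBERO_PERTURBATION_SUFFIX_RE.sub("", example.stem) + example.suffix
assert stripped_example == "pick_up_the_black_bowl.pruned_init"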
def get_libero_dummy_action():
@@ -118,9 +126,11 @@ class LiberoEnv(gym.Env):
camera_name_mapping: dict[str, str] | None = None,
num_steps_wait: int = 10,
control_mode: str = "relative",
is_libero_plus: bool = False,
):
super().__init__()
self.task_id = task_id
self.is_libero_plus = is_libero_plus
self.obs_type = obs_type
self.render_mode = render_mode
self.observation_width = observation_width
@@ -128,7 +138,7 @@ class LiberoEnv(gym.Env):
self.visualization_width = visualization_width
self.visualization_height = visualization_height
self.init_states = init_states
self.camera_name = _parse_camera_names(
self.camera_name = parse_camera_names(
camera_name
) # agentview_image (main) or robot0_eye_in_hand_image (wrist)
@@ -147,7 +157,11 @@ class LiberoEnv(gym.Env):
self.episode_index = episode_index
self.episode_length = episode_length
# Load once and keep
self._init_states = get_task_init_states(task_suite, self.task_id) if self.init_states else None
self._init_states = (
get_task_init_states(task_suite, self.task_id, is_libero_plus=self.is_libero_plus)
if self.init_states
else None
)
self._reset_stride = n_envs # when performing a reset, append `_reset_stride` to `init_state_id`.
self.init_state_id = self.episode_index # tie each sub-env to a fixed init state
@@ -380,6 +394,7 @@ def _make_env_fns(
gym_kwargs: Mapping[str, Any],
control_mode: str,
camera_name_mapping: dict[str, str] | None = None,
is_libero_plus: bool = False,
) -> list[Callable[[], LiberoEnv]]:
"""Build n_envs factory callables for a single (suite, task_id)."""
@@ -396,6 +411,7 @@ def _make_env_fns(
n_envs=n_envs,
control_mode=control_mode,
camera_name_mapping=camera_name_mapping,
is_libero_plus=is_libero_plus,
**local_kwargs,
)
@@ -418,6 +434,7 @@ def create_libero_envs(
control_mode: str = "relative",
episode_length: int | None = None,
camera_name_mapping: dict[str, str] | None = None,
is_libero_plus: bool = False,
) -> dict[str, dict[int, Any]]:
"""
Create vectorized LIBERO environments with a consistent return shape.
@@ -437,7 +454,7 @@ def create_libero_envs(
gym_kwargs = dict(gym_kwargs or {})
task_ids_filter = gym_kwargs.pop("task_ids", None) # optional: limit to specific tasks
camera_names = _parse_camera_names(camera_name)
camera_names = parse_camera_names(camera_name)
suite_names = [s.strip() for s in str(task).split(",") if s.strip()]
if not suite_names:
raise ValueError("`task` must contain at least one LIBERO suite name.")
@@ -462,6 +479,7 @@ def create_libero_envs(
# Probe once and reuse to avoid creating a temp env per task.
cached_obs_space: spaces.Space | None = None
cached_act_space: spaces.Space | None = None
cached_metadata: dict[str, Any] | None = None
for tid in selected:
fns = _make_env_fns(
@@ -475,12 +493,14 @@ def create_libero_envs(
gym_kwargs=gym_kwargs,
control_mode=control_mode,
camera_name_mapping=camera_name_mapping,
is_libero_plus=is_libero_plus,
)
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space)
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
cached_metadata = lazy.metadata
out[suite_name][tid] = lazy
else:
out[suite_name][tid] = env_cls(fns)
+3 -1
@@ -311,6 +311,7 @@ def create_metaworld_envs(
is_async = env_cls is gym.vector.AsyncVectorEnv
cached_obs_space = None
cached_act_space = None
cached_metadata = None
out: dict[str, dict[int, Any]] = defaultdict(dict)
for group in task_groups:
@@ -324,10 +325,11 @@ def create_metaworld_envs(
fns = [(lambda tn=task_name: MetaworldEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)]
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space)
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
cached_metadata = lazy.metadata
out[group][tid] = lazy
else:
out[group][tid] = env_cls(fns)
+425
@@ -0,0 +1,425 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import logging
from collections import defaultdict
from collections.abc import Callable, Sequence
from functools import partial
from typing import Any
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from lerobot.types import RobotObservation
from .utils import _LazyAsyncVectorEnv, parse_camera_names
logger = logging.getLogger(__name__)
# Dimensions for the flat action/state vectors used by the LeRobot wrapper.
# These correspond to the PandaOmron robot in RoboCasa365.
OBS_STATE_DIM = 16 # base_pos(3) + base_quat(4) + ee_pos_rel(3) + ee_quat_rel(4) + gripper_qpos(2)
ACTION_DIM = 12 # base_motion(4) + control_mode(1) + ee_pos(3) + ee_rot(3) + gripper(1)
ACTION_LOW = -1.0
ACTION_HIGH = 1.0
# Default PandaOmron cameras. We surface these raw names directly as
# `observation.images.<name>` so the LeRobot dataset/policy keys match
# RoboCasa's native convention (no implicit renaming).
DEFAULT_CAMERAS = [
"robot0_agentview_left",
"robot0_eye_in_hand",
"robot0_agentview_right",
]
# Object-mesh registries to sample from. RoboCasa's upstream default is
# ("objaverse", "lightwheel"), but the objaverse pack is huge (~30GB) and
# most users — including our CI image — only download the lightwheel pack
# (`--type objs_lw` in `download_kitchen_assets`). When a sampled object
# category has zero candidates in every registry, robocasa crashes with
# `ValueError: Probabilities contain NaN` (0/0 divide in the probability
# normalization). Restricting to registries that are actually on disk
# avoids the NaN and matches what the asset download provides.
DEFAULT_OBJ_REGISTRIES: tuple[str, ...] = ("lightwheel",)
# Task-group shortcuts accepted as `--env.task`. When the user passes one of
# these names, we expand it to the upstream RoboCasa task list and auto-set
# the dataset split. Individual task names (optionally comma-separated) still
# take precedence; this only triggers on an exact group-name match.
_TASK_GROUP_SPLITS = {
"atomic_seen": "target",
"composite_seen": "target",
"composite_unseen": "target",
"pretrain50": "pretrain",
"pretrain100": "pretrain",
"pretrain200": "pretrain",
"pretrain300": "pretrain",
}
def _resolve_tasks(task: str) -> tuple[list[str], str | None]:
"""Resolve a `--env.task` value to (task_names, split_override).
If `task` is a known task-group name (e.g. `atomic_seen`, `pretrain100`),
expand it via `robocasa.utils.dataset_registry.{TARGET,PRETRAINING}_TASKS`
and return the matching split. Otherwise treat `task` as a single task or
comma-separated list and leave the split untouched (None).
"""
key = task.strip()
if key in _TASK_GROUP_SPLITS:
from robocasa.utils.dataset_registry import PRETRAINING_TASKS, TARGET_TASKS
combined = {**TARGET_TASKS, **PRETRAINING_TASKS}
if key not in combined:
raise ValueError(
f"Task group '{key}' is not available in this version of robocasa. "
f"Known groups: {sorted(combined.keys())}."
)
return list(combined[key]), _TASK_GROUP_SPLITS[key]
names = [t.strip() for t in task.split(",") if t.strip()]
if not names:
raise ValueError("`task` must contain at least one RoboCasa task name.")
return names, None
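# Illustrative resolutions (the group case imports robocasa's dataset registry,
# so the expanded list below is described, not enumerated):
#   _resolve_tasks("CloseFridge,PickPlaceCoffee") -> (["CloseFridge", "PickPlaceCoffee"], None)
#   _resolve_tasks("pretrain100")                 -> (upstream pretraining task list, "pretrain")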
def convert_action(flat_action: np.ndarray) -> dict[str, Any]:
"""Split a flat (12,) action vector into a RoboCasa action dict.
Layout: base_motion(4) + control_mode(1) + ee_pos(3) + ee_rot(3) + gripper(1)
"""
return {
"action.base_motion": flat_action[0:4],
"action.control_mode": flat_action[4:5],
"action.end_effector_position": flat_action[5:8],
"action.end_effector_rotation": flat_action[8:11],
"action.gripper_close": flat_action[11:12],
}
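# Minimal sketch of the layout above (the zero action is a placeholder):
parts = convert_action(np.zeros(ACTION_DIM, dtype=np.float32))
# -> shapes: "action.base_motion" (4,), "action.control_mode" (1,),
#    "action.end_effector_position" (3,), "action.end_effector_rotation" (3,),
#    "action.gripper_close" (1,)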
class RoboCasaEnv(gym.Env):
"""LeRobot gym.Env wrapper for RoboCasa365 kitchen environments.
Wraps RoboCasaGymEnv from the robocasa package and converts its
dict-based observations and actions into the flat arrays LeRobot expects.
Raw RoboCasa camera names are preserved verbatim under `pixels/<cam>`.
"""
metadata = {"render_modes": ["rgb_array"], "render_fps": 20}
def __init__(
self,
task: str,
camera_name: str | Sequence[str] = ",".join(DEFAULT_CAMERAS),
obs_type: str = "pixels_agent_pos",
render_mode: str = "rgb_array",
observation_width: int = 256,
observation_height: int = 256,
visualization_width: int = 512,
visualization_height: int = 512,
split: str | None = None,
episode_length: int | None = None,
obj_registries: Sequence[str] = DEFAULT_OBJ_REGISTRIES,
episode_index: int = 0,
):
super().__init__()
self.task = task
self.obs_type = obs_type
self.render_mode = render_mode
self.observation_width = observation_width
self.observation_height = observation_height
self.visualization_width = visualization_width
self.visualization_height = visualization_height
self.split = split
self.obj_registries = tuple(obj_registries)
# Per-worker index (0..n_envs-1) used to spread the user-provided
# seed across factories so each sub-env explores a distinct layout
# even when the same seed is passed to `reset()`.
self.episode_index = int(episode_index)
self.camera_name = parse_camera_names(camera_name)
self._max_episode_steps = episode_length if episode_length is not None else 1000
# Deferred — created on first reset() inside the worker subprocess
# to avoid inheriting stale GPU/EGL contexts across fork().
self._env: Any = None
self.task_description = ""
images = {
cam: spaces.Box(
low=0,
high=255,
shape=(self.observation_height, self.observation_width, 3),
dtype=np.uint8,
)
for cam in self.camera_name
}
if self.obs_type == "pixels":
self.observation_space = spaces.Dict({"pixels": spaces.Dict(images)})
elif self.obs_type == "pixels_agent_pos":
self.observation_space = spaces.Dict(
{
"pixels": spaces.Dict(images),
"agent_pos": spaces.Box(
low=-np.inf,
high=np.inf,
shape=(OBS_STATE_DIM,),
dtype=np.float32,
),
}
)
else:
raise ValueError(f"Unsupported obs_type '{self.obs_type}'. Use 'pixels' or 'pixels_agent_pos'.")
self.action_space = spaces.Box(
low=ACTION_LOW,
high=ACTION_HIGH,
shape=(ACTION_DIM,),
dtype=np.float32,
)
def _ensure_env(self) -> None:
"""Create the underlying RoboCasaGymEnv on first use.
Called inside the worker subprocess after fork(), so each worker gets
its own clean rendering context rather than inheriting a stale one from
the parent process (which causes crashes with AsyncVectorEnv).
"""
if self._env is not None:
return
from robocasa.wrappers.gym_wrapper import RoboCasaGymEnv
# RoboCasaGymEnv defaults split="test", which create_env rejects
# (only None/"all"/"pretrain"/"target" are valid). Always pass a
# valid value so we don't hit that default. Extra kwargs are
# forwarded to the underlying kitchen env via create_env/robosuite.make.
self._env = RoboCasaGymEnv(
env_name=self.task,
camera_widths=self.observation_width,
camera_heights=self.observation_height,
split=self.split if self.split is not None else "all",
obj_registries=self.obj_registries,
)
ep_meta = self._env.env.get_ep_meta()
self.task_description = ep_meta.get("lang", self.task)
def _format_raw_obs(self, raw_obs: dict) -> RobotObservation:
"""Convert RoboCasaGymEnv observation dict to LeRobot format."""
# RoboCasaGymEnv emits camera frames under "video.<cam>".
images = {cam: raw_obs[f"video.{cam}"] for cam in self.camera_name if f"video.{cam}" in raw_obs}
if self.obs_type == "pixels":
return {"pixels": images}
# `state.*` keys come from PandaOmronKeyConverter inside the wrapper.
agent_pos = np.concatenate(
[
raw_obs.get("state.base_position", np.zeros(3)),
raw_obs.get("state.base_rotation", np.zeros(4)),
raw_obs.get("state.end_effector_position_relative", np.zeros(3)),
raw_obs.get("state.end_effector_rotation_relative", np.zeros(4)),
raw_obs.get("state.gripper_qpos", np.zeros(2)),
],
axis=-1,
).astype(np.float32)
return {"pixels": images, "agent_pos": agent_pos}
def render(self) -> np.ndarray:
self._ensure_env()
assert self._env is not None
return self._env.render()
def reset(self, seed=None, **kwargs):
self._ensure_env()
assert self._env is not None
super().reset(seed=seed)
# Spread the seed across workers so n_envs factories don't all
# roll the same scene. With an explicit user seed we shift it by
# episode_index; with no seed we fall back to episode_index so
# each worker is still distinct rather than inheriting the same
# global RNG state.
worker_seed = seed + self.episode_index if seed is not None else self.episode_index
raw_obs, info = self._env.reset(seed=worker_seed)
ep_meta = self._env.env.get_ep_meta()
self.task_description = ep_meta.get("lang", self.task)
observation = self._format_raw_obs(raw_obs)
info = {"is_success": False}
return observation, info
def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
self._ensure_env()
assert self._env is not None
if action.ndim != 1:
raise ValueError(
f"Expected action to be 1-D (shape (action_dim,)), "
f"but got shape {action.shape} with ndim={action.ndim}"
)
action_dict = convert_action(action)
raw_obs, reward, done, truncated, info = self._env.step(action_dict)
is_success = bool(info.get("success", False))
terminated = done or is_success
info.update({"task": self.task, "done": done, "is_success": is_success})
observation = self._format_raw_obs(raw_obs)
if terminated:
info["final_info"] = {
"task": self.task,
"done": bool(done),
"is_success": bool(is_success),
}
self.reset()
return observation, reward, terminated, truncated, info
def close(self):
if self._env is not None:
self._env.close()
def _make_env_fns(
*,
task: str,
n_envs: int,
camera_names: list[str],
obs_type: str,
render_mode: str,
observation_width: int,
observation_height: int,
visualization_width: int,
visualization_height: int,
split: str | None,
episode_length: int | None,
obj_registries: Sequence[str],
) -> list[Callable[[], RoboCasaEnv]]:
"""Build n_envs factory callables for a single task.
Each factory carries a distinct ``episode_index`` (``0..n_envs-1``) so
``RoboCasaEnv.reset()`` can derive a per-worker seed series from the
user-provided seed.
"""
def _make_env(episode_index: int) -> RoboCasaEnv:
return RoboCasaEnv(
task=task,
camera_name=camera_names,
obs_type=obs_type,
render_mode=render_mode,
observation_width=observation_width,
observation_height=observation_height,
visualization_width=visualization_width,
visualization_height=visualization_height,
split=split,
episode_length=episode_length,
obj_registries=obj_registries,
episode_index=episode_index,
)
return [partial(_make_env, i) for i in range(n_envs)]
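# Illustration of the per-worker seed series produced by these factories
# (numbers are placeholders): with n_envs=4 the factories carry episode_index
# 0..3, so reset(seed=100) uses worker seeds 100, 101, 102, 103, and
# reset(seed=None) falls back to 0, 1, 2, 3.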
def create_robocasa_envs(
task: str,
n_envs: int,
gym_kwargs: dict[str, Any] | None = None,
camera_name: str | Sequence[str] = ",".join(DEFAULT_CAMERAS),
env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
episode_length: int | None = None,
obj_registries: Sequence[str] = DEFAULT_OBJ_REGISTRIES,
) -> dict[str, dict[int, Any]]:
"""Create vectorized RoboCasa365 environments with a consistent return shape.
Returns:
dict[task_name][task_id] -> vec_env (env_cls([...]) with exactly n_envs factories)
`task` can be:
- a single task name (e.g. `CloseFridge`)
- a comma-separated list of task names (e.g. `CloseFridge,PickPlaceCoffee`)
- a benchmark-group shortcut (`atomic_seen`, `composite_seen`,
`composite_unseen`, `pretrain50`, `pretrain100`, `pretrain200`,
`pretrain300`), which auto-expands to the upstream task list and
auto-sets the dataset `split` ("target" or "pretrain").
"""
if env_cls is None or not callable(env_cls):
raise ValueError("env_cls must be a callable that wraps a list of environment factory callables.")
if not isinstance(n_envs, int) or n_envs <= 0:
raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
gym_kwargs = dict(gym_kwargs or {})
obs_type = gym_kwargs.pop("obs_type", "pixels_agent_pos")
render_mode = gym_kwargs.pop("render_mode", "rgb_array")
observation_width = gym_kwargs.pop("observation_width", 256)
observation_height = gym_kwargs.pop("observation_height", 256)
visualization_width = gym_kwargs.pop("visualization_width", 512)
visualization_height = gym_kwargs.pop("visualization_height", 512)
split = gym_kwargs.pop("split", None)
camera_names = parse_camera_names(camera_name)
task_names, group_split = _resolve_tasks(str(task))
if group_split is not None and split is None:
split = group_split
logger.info(
"Creating RoboCasa envs | tasks=%s | split=%s | n_envs(per task)=%d",
task_names,
split,
n_envs,
)
is_async = env_cls is gym.vector.AsyncVectorEnv
cached_obs_space: spaces.Space | None = None
cached_act_space: spaces.Space | None = None
cached_metadata: dict[str, Any] | None = None
out: dict[str, dict[int, Any]] = defaultdict(dict)
for task_name in task_names:
fns = _make_env_fns(
task=task_name,
n_envs=n_envs,
camera_names=camera_names,
obs_type=obs_type,
render_mode=render_mode,
observation_width=observation_width,
observation_height=observation_height,
visualization_width=visualization_width,
visualization_height=visualization_height,
split=split,
episode_length=episode_length,
obj_registries=obj_registries,
)
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
cached_metadata = lazy.metadata
out[task_name][0] = lazy
else:
out[task_name][0] = env_cls(fns)
logger.info("Built vec env | task=%s | n_envs=%d", task_name, n_envs)
return {name: dict(task_map) for name, task_map in out.items()}
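# Illustrative return shape (SyncVectorEnv shown; any callable that wraps a
# list of env factories works as env_cls):
#   create_robocasa_envs("CloseFridge,PickPlaceCoffee", n_envs=2,
#                        env_cls=gym.vector.SyncVectorEnv)
#   -> {"CloseFridge": {0: <vec env, 2 workers>},
#       "PickPlaceCoffee": {0: <vec env, 2 workers>}}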
+245
@@ -0,0 +1,245 @@
"""RoboMME environment wrapper for LeRobot evaluation.
Wraps the RoboMME ``BenchmarkEnvBuilder`` into a Gymnasium-compatible
``VectorEnv`` suitable for ``lerobot_eval``.
RoboMME tasks:
Counting: BinFill, PickXtimes, SwingXtimes, StopCube
Permanence: VideoUnmask, VideoUnmaskSwap, ButtonUnmask, ButtonUnmaskSwap
Reference: PickHighlight, VideoRepick, VideoPlaceButton, VideoPlaceOrder
Imitation: MoveCube, InsertPeg, PatternLock, RouteStick
Dataset: lerobot/robomme (LeRobot v3.0, 1,600 episodes)
Install: see docker/Dockerfile.benchmark.robomme (Linux only; mani-skill vs numpy pin conflict)
Benchmark: https://github.com/RoboMME/robomme_benchmark
"""
from __future__ import annotations
from collections.abc import Callable, Sequence
from functools import partial
from typing import Any
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from .utils import _LazyAsyncVectorEnv
ROBOMME_TASKS = [
"BinFill",
"PickXtimes",
"SwingXtimes",
"StopCube",
"VideoUnmask",
"VideoUnmaskSwap",
"ButtonUnmask",
"ButtonUnmaskSwap",
"PickHighlight",
"VideoRepick",
"VideoPlaceButton",
"VideoPlaceOrder",
"MoveCube",
"InsertPeg",
"PatternLock",
"RouteStick",
]
class RoboMMEGymEnv(gym.Env):
"""Thin Gymnasium wrapper around a single RoboMME episode env."""
metadata = {"render_modes": ["rgb_array"], "render_fps": 10}
def __init__(
self,
task: str = "PickXtimes",
action_space_type: str = "joint_angle",
dataset: str = "test",
episode_idx: int = 0,
max_steps: int = 300,
):
super().__init__()
from robomme.env_record_wrapper import BenchmarkEnvBuilder
self._task = task
self._action_space_type = action_space_type
self._dataset = dataset
self._episode_idx = episode_idx
self._max_steps = max_steps
self._max_episode_steps = max_steps
self._builder = BenchmarkEnvBuilder(
env_id=task,
dataset=dataset,
action_space=action_space_type,
gui_render=False,
max_steps=max_steps,
)
self._env = None
self._last_raw_obs: dict | None = None
action_dim = 8 if action_space_type == "joint_angle" else 7
self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(action_dim,), dtype=np.float32)
# `pixels` must be a nested Dict so `preprocess_observation()` in
# envs/utils.py picks it up and maps each camera to
# `observation.images.<cam>`. A flat layout (`pixels/image`,
# `pixels/wrist_image`) silently drops every image from the batch.
self.observation_space = spaces.Dict(
{
"pixels": spaces.Dict(
{
"image": spaces.Box(0, 255, shape=(256, 256, 3), dtype=np.uint8),
"wrist_image": spaces.Box(0, 255, shape=(256, 256, 3), dtype=np.uint8),
}
),
"agent_pos": spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32),
}
)
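# With this nested layout, preprocess_observation() emits batch keys such as
# "observation.images.image" and "observation.images.wrist_image", while
# "agent_pos" maps to the state key via the config's features_map (OBS_STATE).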
def reset(self, *, seed=None, options=None):
super().reset(seed=seed)
self._env = self._builder.make_env_for_episode(
episode_idx=self._episode_idx,
max_steps=self._max_steps,
)
obs, info = self._env.reset()
self._last_raw_obs = obs
return self._convert_obs(obs), self._convert_info(info)
def step(self, action):
obs, reward, terminated, truncated, info = self._env.step(action)
self._last_raw_obs = obs
terminated_bool = bool(terminated.item()) if hasattr(terminated, "item") else bool(terminated)
truncated_bool = bool(truncated.item()) if hasattr(truncated, "item") else bool(truncated)
status = info.get("status", "ongoing")
is_success = status == "success"
conv_info = self._convert_info(info)
conv_info["is_success"] = is_success
return self._convert_obs(obs), float(reward), terminated_bool, truncated_bool, conv_info
def render(self) -> np.ndarray | None:
"""Return the front camera image from the last observation for video recording."""
if self._last_raw_obs is None:
return np.zeros((256, 256, 3), dtype=np.uint8)
front = self._last_raw_obs.get("front_rgb_list")
if front is None:
return np.zeros((256, 256, 3), dtype=np.uint8)
frame = front[-1] if isinstance(front, list) else front
return np.asarray(frame, dtype=np.uint8)
def _convert_obs(self, obs: dict) -> dict:
front_rgb = (
obs["front_rgb_list"][-1] if isinstance(obs["front_rgb_list"], list) else obs["front_rgb_list"]
)
wrist_rgb = (
obs["wrist_rgb_list"][-1] if isinstance(obs["wrist_rgb_list"], list) else obs["wrist_rgb_list"]
)
joint_state = (
obs["joint_state_list"][-1]
if isinstance(obs["joint_state_list"], list)
else obs["joint_state_list"]
)
gripper_state = (
obs["gripper_state_list"][-1]
if isinstance(obs["gripper_state_list"], list)
else obs["gripper_state_list"]
)
front_rgb = np.asarray(front_rgb, dtype=np.uint8)
wrist_rgb = np.asarray(wrist_rgb, dtype=np.uint8)
joint = np.asarray(joint_state, dtype=np.float32).flatten()[:7]
gripper = np.asarray(gripper_state, dtype=np.float32).flatten()[:1]
state = np.concatenate([joint, gripper])
return {
"pixels": {"image": front_rgb, "wrist_image": wrist_rgb},
"agent_pos": state,
}
def _convert_info(self, info: dict) -> dict:
return {
"status": info.get("status", "ongoing"),
"task_goal": info.get("task_goal", ""),
}
def _make_env_fns(
*,
task: str,
n_envs: int,
action_space_type: str,
dataset: str,
episode_length: int,
task_id: int,
) -> list[Callable[[], RoboMMEGymEnv]]:
"""Build n_envs factory callables for one RoboMME task id."""
def _make_one(episode_index: int) -> RoboMMEGymEnv:
return RoboMMEGymEnv(
task=task,
action_space_type=action_space_type,
dataset=dataset,
episode_idx=episode_index,
max_steps=episode_length,
)
return [partial(_make_one, task_id + i) for i in range(n_envs)]
def create_robomme_envs(
task: str,
n_envs: int = 1,
action_space_type: str = "joint_angle",
dataset: str = "test",
episode_length: int = 300,
task_ids: list[int] | None = None,
env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
) -> dict[str, dict[int, gym.vector.VectorEnv]]:
"""Create vectorized RoboMME environments for evaluation.
`task` may be a single RoboMME task name (e.g. "PickXtimes") or a
comma-separated list (e.g. "PickXtimes,BinFill,StopCube"). Each task
becomes its own suite in the returned mapping.
Returns {suite_name: {task_id: VectorEnv}} matching lerobot's expected format.
"""
if env_cls is None or not callable(env_cls):
raise ValueError("env_cls must be a callable that wraps a list of env factory callables.")
if not isinstance(n_envs, int) or n_envs <= 0:
raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
if task_ids is None:
task_ids = [0]
task_names = [t.strip() for t in task.split(",") if t.strip()]
is_async = env_cls is gym.vector.AsyncVectorEnv
cached_obs_space: spaces.Space | None = None
cached_act_space: spaces.Space | None = None
cached_metadata: dict[str, Any] | None = None
out: dict[str, dict[int, gym.vector.VectorEnv]] = {}
for task_name in task_names:
envs_by_task: dict[int, gym.vector.VectorEnv] = {}
for task_id in task_ids:
fns = _make_env_fns(
task=task_name,
n_envs=n_envs,
action_space_type=action_space_type,
dataset=dataset,
episode_length=episode_length,
task_id=task_id,
)
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
cached_metadata = lazy.metadata
envs_by_task[task_id] = lazy
else:
envs_by_task[task_id] = env_cls(fns)
out[task_name] = envs_by_task
return out
+488
@@ -0,0 +1,488 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import importlib
import logging
from collections import defaultdict
from collections.abc import Callable, Sequence
from functools import partial
from typing import Any
import gymnasium as gym
import numpy as np
import torch
from gymnasium import spaces
from lerobot.types import RobotObservation
from .utils import _LazyAsyncVectorEnv
logger = logging.getLogger(__name__)
# Camera names as used by RoboTwin 2.0. The wrapper appends "_rgb" when looking
# up keys in get_obs() output (e.g. "head_camera" → "head_camera_rgb").
ROBOTWIN_CAMERA_NAMES: tuple[str, ...] = (
"head_camera",
"left_camera",
"right_camera",
)
ACTION_DIM = 14 # 7 DOF × 2 arms
ACTION_LOW = -1.0
ACTION_HIGH = 1.0
DEFAULT_EPISODE_LENGTH = 300
# D435 dims from task_config/_camera_config.yml (what demo_clean.yml selects).
DEFAULT_CAMERA_H = 240
DEFAULT_CAMERA_W = 320
# Task list from RoboTwin 2.0's `envs/` directory — mirrors upstream exactly
# (50 tasks as of main; earlier revisions had 60 with a different split).
# Keep this in sync with:
# gh api /repos/RoboTwin-Platform/RoboTwin/contents/envs --paginate \
# | jq -r '.[].name' | grep -E '\.py$' | grep -v '^_' | sed 's/\.py$//'
ROBOTWIN_TASKS: tuple[str, ...] = (
"adjust_bottle",
"beat_block_hammer",
"blocks_ranking_rgb",
"blocks_ranking_size",
"click_alarmclock",
"click_bell",
"dump_bin_bigbin",
"grab_roller",
"handover_block",
"handover_mic",
"hanging_mug",
"lift_pot",
"move_can_pot",
"move_pillbottle_pad",
"move_playingcard_away",
"move_stapler_pad",
"open_laptop",
"open_microwave",
"pick_diverse_bottles",
"pick_dual_bottles",
"place_a2b_left",
"place_a2b_right",
"place_bread_basket",
"place_bread_skillet",
"place_burger_fries",
"place_can_basket",
"place_cans_plasticbox",
"place_container_plate",
"place_dual_shoes",
"place_empty_cup",
"place_fan",
"place_mouse_pad",
"place_object_basket",
"place_object_scale",
"place_object_stand",
"place_phone_stand",
"place_shoe",
"press_stapler",
"put_bottles_dustbin",
"put_object_cabinet",
"rotate_qrcode",
"scan_object",
"shake_bottle",
"shake_bottle_horizontally",
"stack_blocks_three",
"stack_blocks_two",
"stack_bowls_three",
"stack_bowls_two",
"stamp_seal",
"turn_switch",
)
_ROBOTWIN_SETUP_CACHE: dict[str, dict[str, Any]] = {}
def _load_robotwin_setup_kwargs(task_name: str) -> dict[str, Any]:
"""Build the kwargs dict RoboTwin's setup_demo expects.
Mirrors the config loading done by RoboTwin's ``script/eval_policy.py``:
reads ``task_config/demo_clean.yml``, resolves the embodiment file from
``_embodiment_config.yml``, loads the robot's own ``config.yml``, and
reads camera dimensions from ``_camera_config.yml``.
Uses ``aloha-agilex`` single-robot dual-arm by default (the only embodiment
used by beat_block_hammer and most smoke-test tasks).
"""
if task_name in _ROBOTWIN_SETUP_CACHE:
return dict(_ROBOTWIN_SETUP_CACHE[task_name])
import os
import yaml # type: ignore[import-untyped]
from envs import CONFIGS_PATH # type: ignore[import-not-found]
task_config = "demo_clean"
with open(os.path.join(CONFIGS_PATH, f"{task_config}.yml"), encoding="utf-8") as f:
args = yaml.safe_load(f)
# Resolve embodiment — demo_clean.yml uses [aloha-agilex] (dual-arm single robot)
with open(os.path.join(CONFIGS_PATH, "_embodiment_config.yml"), encoding="utf-8") as f:
embodiment_types = yaml.safe_load(f)
embodiment = args.get("embodiment", ["aloha-agilex"])
if len(embodiment) == 1:
robot_file = embodiment_types[embodiment[0]]["file_path"]
args["left_robot_file"] = robot_file
args["right_robot_file"] = robot_file
args["dual_arm_embodied"] = True
elif len(embodiment) == 3:
args["left_robot_file"] = embodiment_types[embodiment[0]]["file_path"]
args["right_robot_file"] = embodiment_types[embodiment[1]]["file_path"]
args["embodiment_dis"] = embodiment[2]
args["dual_arm_embodied"] = False
else:
raise ValueError(f"embodiment must have 1 or 3 items, got {len(embodiment)}")
with open(os.path.join(args["left_robot_file"], "config.yml"), encoding="utf-8") as f:
args["left_embodiment_config"] = yaml.safe_load(f)
with open(os.path.join(args["right_robot_file"], "config.yml"), encoding="utf-8") as f:
args["right_embodiment_config"] = yaml.safe_load(f)
# Camera dimensions
with open(os.path.join(CONFIGS_PATH, "_camera_config.yml"), encoding="utf-8") as f:
camera_config = yaml.safe_load(f)
head_cam = args["camera"]["head_camera_type"]
args["head_camera_h"] = camera_config[head_cam]["h"]
args["head_camera_w"] = camera_config[head_cam]["w"]
# Headless overrides
args["render_freq"] = 0
args["task_name"] = task_name
args["task_config"] = task_config
_ROBOTWIN_SETUP_CACHE[task_name] = args
return dict(args)
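# Rough shape of the returned kwargs (values are illustrative; camera dims are
# the D435 defaults read from _camera_config.yml):
#   {"left_robot_file": ..., "right_robot_file": ..., "dual_arm_embodied": True,
#    "left_embodiment_config": {...}, "right_embodiment_config": {...},
#    "head_camera_h": 240, "head_camera_w": 320,
#    "render_freq": 0, "task_name": "beat_block_hammer", "task_config": "demo_clean", ...}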
def _load_robotwin_task(task_name: str) -> type:
"""Dynamically import and return a RoboTwin 2.0 task class.
RoboTwin tasks live in ``envs/<task_name>.py`` relative to the repository
root and are expected to be on ``sys.path`` after installation.
"""
try:
module = importlib.import_module(f"envs.{task_name}")
except ModuleNotFoundError as e:
raise ModuleNotFoundError(
f"Could not import RoboTwin task '{task_name}'. "
"Ensure RoboTwin 2.0 is installed and its 'envs/' directory is on PYTHONPATH. "
"See the RoboTwin installation guide: https://robotwin-platform.github.io/doc/usage/robotwin-install.html"
) from e
task_cls = getattr(module, task_name, None)
if task_cls is None:
raise AttributeError(f"Task class '{task_name}' not found in envs/{task_name}.py")
return task_cls
class RoboTwinEnv(gym.Env):
"""Gymnasium wrapper around a single RoboTwin 2.0 task.
RoboTwin uses a custom SAPIEN-based API (``setup_demo`` / ``get_obs`` /
``take_action`` / ``check_success``) rather than the standard gym interface.
This class bridges that API to Gymnasium so that ``lerobot-eval`` can drive
RoboTwin exactly like LIBERO or Meta-World.
The underlying SAPIEN environment is created lazily on the first ``reset()``
call *inside the worker process*. This is required for
``gym.vector.AsyncVectorEnv`` compatibility: SAPIEN allocates EGL/GPU
contexts that must not be forked from the parent process.
Observations
------------
The ``pixels`` dict uses the raw RoboTwin camera names as keys (e.g.
``"head_camera"``, ``"left_camera"``). ``preprocess_observation`` in
``envs/utils.py`` then converts these to ``observation.images.<cam>``.
Actions
-------
14-dim float32 array in ``[-1, 1]`` (joint-space, 7 DOF per arm).
Autograd
--------
``setup_demo`` and ``take_action`` drive CuRobo's Newton trajectory
optimizer, which calls ``cost.backward()`` internally. lerobot_eval wraps
the rollout in ``torch.no_grad()``, so both call sites re-enable grad.
"""
metadata = {"render_modes": ["rgb_array"], "render_fps": 25}
def __init__(
self,
task_name: str,
episode_index: int = 0,
n_envs: int = 1,
camera_names: Sequence[str] = ROBOTWIN_CAMERA_NAMES,
observation_height: int | None = None,
observation_width: int | None = None,
episode_length: int = DEFAULT_EPISODE_LENGTH,
render_mode: str = "rgb_array",
):
super().__init__()
self.task_name = task_name
self.task = task_name # used by add_envs_task() in utils.py
self.task_description = task_name.replace("_", " ")
self.episode_index = episode_index
self._reset_stride = n_envs
self.camera_names = list(camera_names)
# Default to D435 dims (the camera type baked into task_config/demo_clean.yml).
# The YAML-driven lookup is deferred to reset() so construction doesn't
# import RoboTwin's `envs` module — fast-tests run without RoboTwin installed.
self.observation_height = observation_height or DEFAULT_CAMERA_H
self.observation_width = observation_width or DEFAULT_CAMERA_W
self.episode_length = episode_length
self._max_episode_steps = episode_length # lerobot_eval.rollout reads this
self.render_mode = render_mode
self._env: Any | None = None # deferred — created on first reset() inside worker
self._step_count: int = 0
self._black_frame = np.zeros((self.observation_height, self.observation_width, 3), dtype=np.uint8)
image_spaces = {
cam: spaces.Box(
low=0,
high=255,
shape=(self.observation_height, self.observation_width, 3),
dtype=np.uint8,
)
for cam in self.camera_names
}
self.observation_space = spaces.Dict(
{
"pixels": spaces.Dict(image_spaces),
"agent_pos": spaces.Box(low=-np.inf, high=np.inf, shape=(ACTION_DIM,), dtype=np.float32),
}
)
self.action_space = spaces.Box(
low=ACTION_LOW, high=ACTION_HIGH, shape=(ACTION_DIM,), dtype=np.float32
)
def _ensure_env(self) -> None:
"""Create the SAPIEN environment on first use.
Called inside the worker subprocess after fork(), so each worker gets
its own EGL/GPU context rather than inheriting a stale one from the
parent process (which causes crashes with AsyncVectorEnv).
"""
if self._env is not None:
return
task_cls = _load_robotwin_task(self.task_name)
self._env = task_cls()
def _get_obs(self) -> RobotObservation:
assert self._env is not None, "_get_obs called before _ensure_env()"
raw = self._env.get_obs()
cameras_raw = raw.get("observation", {})
images: dict[str, np.ndarray] = {}
for cam in self.camera_names:
cam_data = cameras_raw.get(cam)
img = cam_data.get("rgb") if cam_data else None
if img is None:
images[cam] = self._black_frame
continue
img = np.asarray(img, dtype=np.uint8)
if img.ndim == 2:
img = np.stack([img, img, img], axis=-1)
elif img.shape[-1] != 3:
img = img[..., :3]
images[cam] = img
ja = raw.get("joint_action") or {}
vec = ja.get("vector")
if vec is not None:
arr = np.asarray(vec, dtype=np.float32).ravel()
joint_state = (
arr[:ACTION_DIM] if arr.size >= ACTION_DIM else np.zeros(ACTION_DIM, dtype=np.float32)
)
else:
joint_state = np.zeros(ACTION_DIM, dtype=np.float32)
return {"pixels": images, "agent_pos": joint_state}
def reset(self, seed: int | None = None, **kwargs) -> tuple[RobotObservation, dict]:
self._ensure_env()
super().reset(seed=seed)
assert self._env is not None # set by _ensure_env() above
actual_seed = self.episode_index if seed is None else seed
setup_kwargs = _load_robotwin_setup_kwargs(self.task_name)
setup_kwargs.update(seed=actual_seed, is_test=True)
with torch.enable_grad():
self._env.setup_demo(**setup_kwargs)
self.episode_index += self._reset_stride
self._step_count = 0
obs = self._get_obs()
return obs, {"is_success": False, "task": self.task_name}
def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
assert self._env is not None, "step() called before reset()"
if action.ndim != 1 or action.shape[0] != ACTION_DIM:
raise ValueError(f"Expected 1-D action of shape ({ACTION_DIM},), got {action.shape}")
with torch.enable_grad():
if hasattr(self._env, "take_action"):
self._env.take_action(action)
else:
self._env.step(action)
self._step_count += 1
is_success = bool(getattr(self._env, "eval_success", False))
if not is_success and hasattr(self._env, "check_success"):
is_success = bool(self._env.check_success())
obs = self._get_obs()
reward = float(is_success)
terminated = is_success
truncated = self._step_count >= self.episode_length
info: dict[str, Any] = {
"task": self.task_name,
"is_success": is_success,
"step": self._step_count,
}
if terminated or truncated:
info["final_info"] = {
"task": self.task_name,
"is_success": is_success,
}
self.reset()
return obs, reward, terminated, truncated, info
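# Minimal illustration of the grad re-enable pattern used in reset()/step():
# torch.enable_grad() locally overrides the torch.no_grad() that lerobot_eval
# wraps rollouts in, so CuRobo's internal cost.backward() still works.
#
#     with torch.no_grad():          # what the eval loop does
#         with torch.enable_grad():  # what this wrapper does around setup_demo/take_action
#             x = torch.ones(1, requires_grad=True)
#             (x * 2).sum().backward()  # succeeds: grad mode is re-enabled here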
def render(self) -> np.ndarray:
self._ensure_env()
obs = self._get_obs()
# Prefer head camera for rendering; fall back to first available.
if "head_camera" in obs["pixels"]:
return obs["pixels"]["head_camera"]
return next(iter(obs["pixels"].values()))
def close(self) -> None:
if self._env is not None:
if hasattr(self._env, "close_env"):
import contextlib
with contextlib.suppress(TypeError):
self._env.close_env()
self._env = None
# ---- Multi-task factory --------------------------------------------------------
def _make_env_fns(
*,
task_name: str,
n_envs: int,
camera_names: list[str],
observation_height: int,
observation_width: int,
episode_length: int,
) -> list[Callable[[], RoboTwinEnv]]:
"""Return n_envs factory callables for a single task."""
def _make_one(episode_index: int) -> RoboTwinEnv:
return RoboTwinEnv(
task_name=task_name,
episode_index=episode_index,
n_envs=n_envs,
camera_names=camera_names,
observation_height=observation_height,
observation_width=observation_width,
episode_length=episode_length,
)
return [partial(_make_one, i) for i in range(n_envs)]
def create_robotwin_envs(
task: str,
n_envs: int,
env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
camera_names: Sequence[str] = ROBOTWIN_CAMERA_NAMES,
observation_height: int = DEFAULT_CAMERA_H,
observation_width: int = DEFAULT_CAMERA_W,
episode_length: int = DEFAULT_EPISODE_LENGTH,
) -> dict[str, dict[int, Any]]:
"""Create vectorized RoboTwin 2.0 environments.
Returns:
``dict[task_name][0] -> VectorEnv`` one entry per task, each wrapping
``n_envs`` parallel rollouts.
Args:
task: Comma-separated list of task names (e.g. ``"beat_block_hammer"``
or ``"beat_block_hammer,click_bell"``).
n_envs: Number of parallel rollouts per task.
env_cls: Vector env constructor (e.g. ``gym.vector.AsyncVectorEnv``).
camera_names: Cameras to include in observations.
observation_height: Pixel height for all cameras.
observation_width: Pixel width for all cameras.
episode_length: Max steps before truncation.
"""
if env_cls is None or not callable(env_cls):
raise ValueError("env_cls must be callable (e.g. gym.vector.AsyncVectorEnv).")
if not isinstance(n_envs, int) or n_envs <= 0:
raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
task_names = [t.strip() for t in str(task).split(",") if t.strip()]
if not task_names:
raise ValueError("`task` must contain at least one RoboTwin task name.")
unknown = [t for t in task_names if t not in ROBOTWIN_TASKS]
if unknown:
raise ValueError(f"Unknown RoboTwin tasks: {unknown}. Available tasks: {sorted(ROBOTWIN_TASKS)}")
logger.info(
"Creating RoboTwin envs | tasks=%s | n_envs(per task)=%d",
task_names,
n_envs,
)
is_async = env_cls is gym.vector.AsyncVectorEnv
cached_obs_space: spaces.Space | None = None
cached_act_space: spaces.Space | None = None
cached_metadata: dict[str, Any] | None = None
out: dict[str, dict[int, Any]] = defaultdict(dict)
for task_name in task_names:
fns = _make_env_fns(
task_name=task_name,
n_envs=n_envs,
camera_names=list(camera_names),
observation_height=observation_height,
observation_width=observation_width,
episode_length=episode_length,
)
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
cached_metadata = lazy.metadata
out[task_name][0] = lazy
else:
out[task_name][0] = env_cls(fns)
logger.info("Built vec env | task=%s | n_envs=%d", task_name, n_envs)
return {k: dict(v) for k, v in out.items()}
+27 -1
@@ -34,6 +34,25 @@ from lerobot.utils.utils import get_channel_first_image_shape
from .configs import EnvConfig
def parse_camera_names(camera_name: str | Sequence[str]) -> list[str]:
"""Normalize ``camera_name`` into a non-empty list of strings.
Accepts a comma-separated string (``"cam_a,cam_b"``) or a sequence of
strings (tuples/lists). Whitespace is stripped; empty entries are
dropped. Raises ``TypeError`` for unsupported input types and
``ValueError`` when the normalized list is empty.
"""
if isinstance(camera_name, str):
cams = [c.strip() for c in camera_name.split(",") if c.strip()]
elif isinstance(camera_name, (list | tuple)):
cams = [str(c).strip() for c in camera_name if str(c).strip()]
else:
raise TypeError(f"camera_name must be str or sequence[str], got {type(camera_name).__name__}")
if not cams:
raise ValueError("camera_name resolved to an empty list.")
return cams
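# Illustrative calls:
#   parse_camera_names("head_camera, left_camera") -> ["head_camera", "left_camera"]
#   parse_camera_names(("robot0_eye_in_hand",))    -> ["robot0_eye_in_hand"]
#   parse_camera_names("")                         -> ValueError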
def _convert_nested_dict(d):
result = {}
for k, v in d.items():
@@ -153,17 +172,20 @@ class _LazyAsyncVectorEnv:
env_fns: list[Callable],
observation_space=None,
action_space=None,
metadata=None,
):
self._env_fns = env_fns
self._env: gym.vector.AsyncVectorEnv | None = None
self.num_envs = len(env_fns)
if observation_space is not None and action_space is not None:
if observation_space is not None and action_space is not None and metadata is not None:
self.observation_space = observation_space
self.action_space = action_space
self.metadata = metadata
else:
tmp = env_fns[0]()
self.observation_space = tmp.observation_space
self.action_space = tmp.action_space
self.metadata = tmp.metadata
tmp.close()
self.single_observation_space = self.observation_space
self.single_action_space = self.action_space
@@ -172,6 +194,10 @@ class _LazyAsyncVectorEnv:
if self._env is None:
self._env = gym.vector.AsyncVectorEnv(self._env_fns, context="forkserver", shared_memory=True)
@property
def unwrapped(self):
return self
def reset(self, **kwargs):
self._ensure()
return self._env.reset(**kwargs)
+589
@@ -0,0 +1,589 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""VLABench environment wrapper for LeRobot.
VLABench is a large-scale benchmark for language-conditioned robotic manipulation
with long-horizon reasoning, built on MuJoCo/dm_control.
- Paper: https://arxiv.org/abs/2412.18194
- GitHub: https://github.com/OpenMOSS/VLABench
- Website: https://vlabench.github.io
"""
from __future__ import annotations
import contextlib
import logging
from collections import defaultdict
from collections.abc import Callable, Sequence
from typing import Any
import cv2
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from scipy.spatial.transform import Rotation
from lerobot.types import RobotObservation
from .utils import _LazyAsyncVectorEnv
logger = logging.getLogger(__name__)
ACTION_DIM = 7 # pos(3) + euler(3) + gripper(1)
ACTION_LOW = np.array([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 0.0], dtype=np.float32)
ACTION_HIGH = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], dtype=np.float32)
# Default max episode steps per task type
DEFAULT_MAX_EPISODE_STEPS = 500
# VLABench task suites
PRIMITIVE_TASKS = [
"select_fruit",
"select_toy",
"select_chemistry_tube",
"add_condiment",
"select_book",
"select_painting",
"select_drink",
"insert_flower",
"select_billiards",
"select_ingredient",
"select_mahjong",
"select_poker",
# Physical series
"density_qa",
"friction_qa",
"magnetism_qa",
"reflection_qa",
"simple_cuestick_usage",
"simple_seesaw_usage",
"sound_speed_qa",
"thermal_expansion_qa",
"weight_qa",
]
COMPOSITE_TASKS = [
"cluster_billiards",
"cluster_book",
"cluster_drink",
"cluster_toy",
"cook_dishes",
"cool_drink",
"find_unseen_object",
"get_coffee",
"hammer_nail",
"heat_food",
"make_juice",
"play_mahjong",
"play_math_game",
"play_poker",
"play_snooker",
"rearrange_book",
"rearrange_chemistry_tube",
"set_dining_table",
"set_study_table",
"store_food",
"take_chemistry_experiment",
"use_seesaw_complex",
]
SUITE_TASKS: dict[str, list[str]] = {
"primitive": PRIMITIVE_TASKS,
"composite": COMPOSITE_TASKS,
}
class VLABenchEnv(gym.Env):
"""Gymnasium wrapper for VLABench environments.
Wraps the dm_control-based VLABench simulator behind a standard gym.Env interface.
Supports multiple cameras (front, second, wrist) and end-effector control.
"""
metadata = {"render_modes": ["rgb_array"], "render_fps": 10}
def __init__(
self,
task: str = "select_fruit",
obs_type: str = "pixels_agent_pos",
render_mode: str = "rgb_array",
render_resolution: tuple[int, int] = (480, 480),
robot: str = "franka",
max_episode_steps: int = DEFAULT_MAX_EPISODE_STEPS,
action_mode: str = "eef",
):
super().__init__()
self.task = task
self.obs_type = obs_type
self.render_mode = render_mode
self.render_resolution = render_resolution
self.robot = robot
self._max_episode_steps = max_episode_steps
self.action_mode = action_mode
# Deferred — created on first reset() inside worker subprocess to avoid
# inheriting stale GPU/EGL contexts when AsyncVectorEnv spawns workers.
# We never cache `env.physics`: dm_control exposes it as a weakref
# proxy that goes stale across resets (rebuilds the sim), so we always
# refetch it via `self._env.physics` at the call site.
self._env = None
self.task_description = "" # populated on first reset
# Cached world-frame XYZ of the robot base link. The VLABench datasets
# log both `observation.state` positions and `actions` positions in
# robot-base frame (see VLABench/scripts/convert_to_lerobot.py which
# subtracts `robot_frame_pos` from ee_pos). The robot is attached at a
# fixed offset per task so this is safe to cache once per env build.
self._robot_base_xyz: np.ndarray | None = None
h, w = self.render_resolution
if self.obs_type == "state":
raise NotImplementedError(
"The 'state' observation type is not supported in VLABenchEnv. "
"Please use 'pixels' or 'pixels_agent_pos'."
)
elif self.obs_type == "pixels":
self.observation_space = spaces.Dict(
{
"pixels": spaces.Dict(
{
"image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
"second_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
"wrist_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
}
),
}
)
elif self.obs_type == "pixels_agent_pos":
self.observation_space = spaces.Dict(
{
"pixels": spaces.Dict(
{
"image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
"second_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
"wrist_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
}
),
"agent_pos": spaces.Box(low=-np.inf, high=np.inf, shape=(7,), dtype=np.float64),
}
)
else:
raise ValueError(f"Unsupported obs_type: {self.obs_type}")
self.action_space = spaces.Box(low=ACTION_LOW, high=ACTION_HIGH, dtype=np.float32)
# Max attempts to rebuild the underlying env when MuJoCo throws
# `PhysicsError` (e.g. mjWARN_BADQACC) during VLABench's 20-step
# reset warm-up. Some random task/layout samples land in unstable
# initial configurations; re-sampling the layout almost always
# gives a stable one. A handful of upstream tasks (notably
# `select_mahjong`) have layout samplers that diverge often enough
# to need >>5 retries, so we pick a generous ceiling.
_ENSURE_ENV_MAX_ATTEMPTS = 20
def _ensure_env(self) -> None:
"""Create the underlying VLABench env on first use.
Called inside the worker subprocess after fork(), so each worker gets
its own clean rendering context rather than inheriting a stale one from
the parent process (which causes crashes with AsyncVectorEnv).
Retries on `PhysicsError`: VLABench's `LM4ManipDMEnv.reset()` runs 20
warm-up `step()` calls while toggling gravity/fluids to let the scene
settle; for some random layouts MuJoCo's integrator diverges and
raises `mjWARN_BADQACC`. Re-sampling the layout almost always yields
a stable one, so we retry a number of times before giving up. Between
attempts we reseed NumPy's global RNG from OS entropy so the upstream
task sampler explores fresh initial states; without this, retries
can replay the same diverging configuration when the sampler is
deterministic given the current RNG state.
"""
if self._env is not None:
return
import VLABench.robots # noqa: F401 # type: ignore[import-untyped]
import VLABench.tasks # noqa: F401 # type: ignore[import-untyped]
from dm_control.rl.control import PhysicsError # type: ignore[import-untyped]
from VLABench.envs import load_env # type: ignore[import-untyped]
h, w = self.render_resolution
last_exc: PhysicsError | None = None
for attempt in range(1, self._ENSURE_ENV_MAX_ATTEMPTS + 1):
try:
env = load_env(task=self.task, robot=self.robot, render_resolution=(h, w))
self._env = env
break
except PhysicsError as exc:
last_exc = exc
logger.warning(
"PhysicsError on attempt %d/%d while building task '%s': %s. Retrying with fresh layout…",
attempt,
self._ENSURE_ENV_MAX_ATTEMPTS,
self.task,
exc,
)
np.random.seed(None)
if self._env is None:
assert last_exc is not None
raise RuntimeError(
f"VLABench task '{self.task}' failed to produce a stable "
f"initial layout after {self._ENSURE_ENV_MAX_ATTEMPTS} "
f"attempts. This task's upstream sampler diverges too "
f"often for the configured robot; consider removing it "
f"from the eval set. Last physics error: {last_exc}"
) from last_exc
# Extract task description from the dm_control task
task_obj = self._env.task
if hasattr(task_obj, "task_description"):
self.task_description = task_obj.task_description
elif hasattr(task_obj, "language_instruction"):
self.task_description = task_obj.language_instruction
else:
self.task_description = self.task
# Cache robot base world position so `_build_ctrl_from_action` and
# `_get_obs` can translate between robot-frame (dataset) and
# world-frame (dm_control) without hitting physics every call.
try:
self._robot_base_xyz = np.asarray(self._env.get_robot_frame_position(), dtype=np.float64).reshape(
3
)
except Exception:
# Fallback to VLABench's default Franka base position.
self._robot_base_xyz = np.array([0.0, -0.4, 0.78], dtype=np.float64)
def _get_obs(self) -> dict:
"""Get current observation from the environment."""
assert self._env is not None
obs = self._env.get_observation()
h, w = self.render_resolution
def _to_hwc3(arr: np.ndarray) -> np.ndarray:
"""Coerce any camera array to the declared (h, w, 3) uint8 shape."""
a = np.asarray(arr)
# Drop a leading singleton batch dim if present.
while a.ndim > 3 and a.shape[0] == 1:
a = a[0]
if a.ndim == 3 and a.shape[0] in (1, 3, 4) and a.shape[-1] not in (1, 3, 4):
# CHW → HWC
a = np.transpose(a, (1, 2, 0))
if a.ndim == 2:
a = np.stack([a] * 3, axis=-1)
if a.ndim != 3:
return np.zeros((h, w, 3), dtype=np.uint8)
# Force 3 channels.
if a.shape[-1] == 1:
a = np.repeat(a, 3, axis=-1)
elif a.shape[-1] == 4:
a = a[..., :3]
elif a.shape[-1] != 3:
return np.zeros((h, w, 3), dtype=np.uint8)
if a.shape[:2] != (h, w):
a = cv2.resize(a, (w, h), interpolation=cv2.INTER_AREA)
return a.astype(np.uint8)
# Extract camera images — VLABench returns (n_cameras, C, H, W) or individual arrays
raw_frames: list[np.ndarray] = []
if "rgb" in obs:
rgb = obs["rgb"]
if isinstance(rgb, np.ndarray):
if rgb.ndim == 4:
raw_frames = [rgb[i] for i in range(rgb.shape[0])]
elif rgb.ndim == 3:
raw_frames = [rgb]
image_keys = ["image", "second_image", "wrist_image"]
images: dict[str, np.ndarray] = {}
for i, key in enumerate(image_keys):
if i < len(raw_frames):
images[key] = _to_hwc3(raw_frames[i])
else:
images[key] = np.zeros((h, w, 3), dtype=np.uint8)
# Convert VLABench's raw ee_state `[pos_world(3), quat_wxyz(4), open(1)]`
# to the dataset's observation.state layout `[pos_robot(3), euler_xyz(3),
# gripper(1)]`. See VLABench/scripts/convert_to_lerobot.py — positions
# are stored in robot-base frame and orientations as scipy extrinsic
# 'xyz' euler angles.
raw = np.asarray(obs.get("ee_state", np.zeros(8)), dtype=np.float64).ravel()
pos_world = raw[:3] if raw.size >= 3 else np.zeros(3, dtype=np.float64)
quat_wxyz = raw[3:7] if raw.size >= 7 else np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float64)
gripper = float(raw[7]) if raw.size >= 8 else 0.0
base = self._robot_base_xyz if self._robot_base_xyz is not None else np.zeros(3, dtype=np.float64)
pos_robot = pos_world - base
euler_xyz = Rotation.from_quat([quat_wxyz[1], quat_wxyz[2], quat_wxyz[3], quat_wxyz[0]]).as_euler(
"xyz", degrees=False
)
ee_state = np.concatenate([pos_robot, euler_xyz, [gripper]]).astype(np.float64)
if self.obs_type == "pixels":
return {"pixels": images}
elif self.obs_type == "pixels_agent_pos":
return {
"pixels": images,
"agent_pos": ee_state.astype(np.float64),
}
else:
raise ValueError(f"Unknown obs_type: {self.obs_type}")
# ---- Action adaptation (EEF → joint ctrl) --------------------------------
#
# The HF vlabench datasets log 7D actions
# `[x, y, z (robot frame), rx, ry, rz (scipy extrinsic xyz), gripper]`,
# exactly matching VLABench's own eval pipeline (evaluator.base):
# pos, euler, g = policy(...)
# quat = euler_to_quaternion(*euler) # extrinsic xyz -> wxyz
# _, qpos = robot.get_qpos_from_ee_pos(physics, pos=pos + base, quat=quat)
# env.step(np.concatenate([qpos, [g, g]]))
#
# VLABench's dm_control task writes `data.ctrl[:] = action` directly — for
# Franka that's 9 entries (7 arm joints + 2 gripper fingers). We mirror the
# above conversion so the policy's EEF commands actually drive the robot.
_FRANKA_FINGER_OPEN = 0.04 # qpos when gripper fully open
def _build_ctrl_from_action(self, action: np.ndarray, ctrl_dim: int) -> np.ndarray:
"""Convert a 7D EEF action into the `ctrl_dim`-sized joint command vector.
For the Franka default (ctrl_dim=9): 7 arm joint qposes (via IK) +
2 gripper finger qposes (open/closed based on the gripper scalar).
If the action is already joint-space (shape matches ctrl_dim), pass
through.
"""
if action.shape[0] == ctrl_dim:
return action.astype(np.float64, copy=False)
if action.shape[0] != 7:
# Unknown layout — fall back to zero-pad so the sim doesn't crash.
padded = np.zeros(ctrl_dim, dtype=np.float64)
padded[: min(action.shape[0], ctrl_dim)] = action[:ctrl_dim]
return padded
from dm_control.utils.inverse_kinematics import qpos_from_site_pose
# Action position is in robot-base frame (see convert_to_lerobot.py);
# dm_control's IK expects a world-frame target.
base = self._robot_base_xyz if self._robot_base_xyz is not None else np.zeros(3, dtype=np.float64)
pos_world = np.asarray(action[:3], dtype=np.float64) + base
rx, ry, rz = float(action[3]), float(action[4]), float(action[5])
gripper = float(np.clip(action[6], 0.0, 1.0))
# Dataset euler is scipy extrinsic 'xyz' (same as VLABench's
# `euler_to_quaternion`). scipy emits `[x, y, z, w]`; dm_control's IK
# and MuJoCo use `[w, x, y, z]`, so reorder.
qxyzw = Rotation.from_euler("xyz", [rx, ry, rz], degrees=False).as_quat()
quat = np.array([qxyzw[3], qxyzw[0], qxyzw[1], qxyzw[2]], dtype=np.float64)
assert self._env is not None
robot = self._env.task.robot
site_name = robot.end_effector_site.full_identifier
# inplace=False so IK doesn't mutate physics state mid-step — we only
# want the solved qpos. Fetch a fresh physics handle — caching it can
# yield a stale weakref after a reset.
ik_result = qpos_from_site_pose(
self._env.physics,
site_name=site_name,
target_pos=pos_world,
target_quat=quat,
inplace=False,
max_steps=100,
)
n_dof = robot.n_dof # 7 for Franka
arm_qpos = ik_result.qpos[:n_dof]
# Dataset gripper convention: 1 = open (finger qpos = 0.04),
# 0 = closed (finger qpos = 0.0). See VLABench/scripts/convert_to_lerobot.py
# where `trajectory[i][-1] > 0.03` is encoded as `1`.
finger_qpos = gripper * self._FRANKA_FINGER_OPEN
ctrl = np.zeros(ctrl_dim, dtype=np.float64)
ctrl[:n_dof] = arm_qpos
# Remaining entries are gripper fingers (usually 2 for Franka).
ctrl[n_dof:] = finger_qpos
return ctrl
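# Illustrative ctrl layout for the Franka default (ctrl_dim=9), assuming the IK
# solve succeeds; the qpos values are placeholders for this example:
#   action = [dx, dy, dz, rx, ry, rz, 1.0]               # 7D EEF command, gripper open
#   ctrl   = [q1, q2, q3, q4, q5, q6, q7, 0.04, 0.04]    # 7 arm qpos + 2 finger qpos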
def reset(self, seed=None, **kwargs) -> tuple[RobotObservation, dict[str, Any]]:
self._ensure_env()
assert self._env is not None
super().reset(seed=seed)
if seed is not None:
self._seed_inner_env(int(self.np_random.integers(0, 2**31 - 1)))
self._env.reset()
observation = self._get_obs()
info = {"is_success": False}
return observation, info
def _seed_inner_env(self, seed: int) -> None:
"""Propagate `seed` to the inner dm_control env. `Environment.reset()`
doesn't accept a seed, so we re-seed the task and environment
`RandomState`s directly. Best-effort: silently skipped when the
expected attributes are absent on a given VLABench version.
"""
for owner_attr, rng_attr in (("task", "random"), (None, "_random_state")):
owner = getattr(self._env, owner_attr) if owner_attr else self._env
rng = getattr(owner, rng_attr, None)
rng_seed = getattr(rng, "seed", None)
if callable(rng_seed):
rng_seed(seed)
def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
from dm_control.rl.control import PhysicsError # type: ignore[import-untyped]
self._ensure_env()
assert self._env is not None
if action.ndim != 1:
raise ValueError(
f"Expected action to be 1-D (shape (action_dim,)), "
f"but got shape {action.shape} with ndim={action.ndim}"
)
if self.action_mode not in ("eef", "joint", "delta_eef"):
raise ValueError(f"Unknown action_mode: {self.action_mode}")
# Always refetch physics — dm_control returns a weakref proxy that can
# go stale across resets.
physics = self._env.physics
ctrl_dim = int(physics.data.ctrl.shape[0])
ctrl = self._build_ctrl_from_action(action, ctrl_dim)
try:
timestep = self._env.step(ctrl)
except PhysicsError as exc:
# Physics integrator diverged (e.g. mjWARN_BADQACC). Treat it as
# a graceful failed termination rather than a hard crash — the
# rest of the multi-task eval should still run.
logger.warning(
"PhysicsError during step on task '%s': %s. Terminating episode.",
self.task,
exc,
)
observation = self._get_obs()
info = {"task": self.task, "is_success": False, "physics_error": True}
# Drop the stale env so the next reset() rebuilds it cleanly.
with contextlib.suppress(Exception):
self._env.close()
self._env = None
return observation, 0.0, True, False, info
# Extract reward from dm_control timestep
reward = float(timestep.reward) if timestep.reward is not None else 0.0
# Check success via the task's termination condition
is_success = False
if hasattr(self._env, "task") and hasattr(self._env.task, "should_terminate_episode"):
is_success = bool(self._env.task.should_terminate_episode(self._env.physics))
terminated = is_success
truncated = False
info = {
"task": self.task,
"is_success": is_success,
}
observation = self._get_obs()
if terminated:
self.reset()
return observation, reward, terminated, truncated, info
def render(self) -> np.ndarray:
self._ensure_env()
obs = self._get_obs()
return obs["pixels"]["image"]
def close(self):
if self._env is not None:
self._env.close()
self._env = None
# ---- Main API ----------------------------------------------------------------
def create_vlabench_envs(
task: str,
n_envs: int,
gym_kwargs: dict[str, Any] | None = None,
env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
) -> dict[str, dict[int, Any]]:
"""
Create vectorized VLABench environments with a consistent return shape.
Returns:
dict[suite_name][task_id] -> vec_env (env_cls([...]) with exactly n_envs factories)
Notes:
- n_envs is the number of rollouts *per task*.
- `task` can be a suite name ("primitive", "composite"), a comma-separated list of
suite names, or individual task names (e.g. "select_fruit,heat_food").
"""
if env_cls is None or not callable(env_cls):
raise ValueError("env_cls must be a callable that wraps a list of environment factory callables.")
if not isinstance(n_envs, int) or n_envs <= 0:
raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
gym_kwargs = dict(gym_kwargs or {})
task_groups = [t.strip() for t in task.split(",") if t.strip()]
if not task_groups:
raise ValueError("`task` must contain at least one VLABench task or suite name.")
logger.info(
"Creating VLABench envs | task_groups=%s | n_envs(per task)=%d",
task_groups,
n_envs,
)
is_async = env_cls is gym.vector.AsyncVectorEnv
cached_obs_space = None
cached_act_space = None
cached_metadata = None
out: dict[str, dict[int, Any]] = defaultdict(dict)
for group in task_groups:
# Check if it's a suite name, otherwise treat as individual task
tasks = SUITE_TASKS.get(group, [group])
for tid, task_name in enumerate(tasks):
logger.info(
"Building vec env | group=%s | task_id=%d | task=%s",
group,
tid,
task_name,
)
fns = [(lambda tn=task_name: VLABenchEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)]
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
cached_metadata = lazy.metadata
out[group][tid] = lazy
else:
out[group][tid] = env_cls(fns)
return {group: dict(task_map) for group, task_map in out.items()}
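# Usage sketch (illustrative, not part of this diff) — one explicit task plus the
# whole "primitive" suite, two rollouts per task; obs_type is forwarded into
# VLABenchEnv via gym_kwargs:
#   envs = create_vlabench_envs(
#       task="select_fruit,primitive",
#       n_envs=2,
#       gym_kwargs={"obs_type": "pixels_agent_pos"},
#       env_cls=gym.vector.AsyncVectorEnv,
#   )
#   vec = envs["primitive"][0]   # lazily-built AsyncVectorEnv with 2 workers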
+12 -7
@@ -12,8 +12,19 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from typing import TYPE_CHECKING
import numpy as np
from lerobot.utils.import_utils import _placo_available, require_package
if TYPE_CHECKING or _placo_available:
import placo # type: ignore[import-not-found]
else:
placo = None
class RobotKinematics:
"""Robot kinematics using placo library for forward and inverse kinematics."""
@@ -32,13 +43,7 @@ class RobotKinematics:
target_frame_name (str): Name of the end-effector frame in the URDF
joint_names (list[str] | None): List of joint names to use for the kinematics solver
"""
try:
import placo # type: ignore[import-not-found] # C++ library with Python bindings, no type stubs available. TODO: Create stub file or request upstream typing support.
except ImportError as e:
raise ImportError(
"placo is required for RobotKinematics. "
"Please install the optional dependencies of `kinematics` in the package."
) from e
require_package("placo", extra="placo-dep")
self.robot = placo.RobotWrapper(urdf_path)
self.solver = placo.KinematicsSolver(self.robot)
+2 -1
@@ -24,7 +24,7 @@ from functools import cached_property
from typing import TYPE_CHECKING, Any, TypedDict
from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
from lerobot.utils.import_utils import _can_available
from lerobot.utils.import_utils import _can_available, require_package
if TYPE_CHECKING or _can_available:
import can
@@ -111,6 +111,7 @@ class DamiaoMotorsBus(MotorsBusBase):
bitrate: Nominal bitrate in bps (default: 1000000 = 1 Mbps)
data_bitrate: Data bitrate for CAN FD in bps (default: 5000000 = 5 Mbps), ignored if use_can_fd is False
"""
require_package("python-can", extra="damiao", import_name="can")
super().__init__(port, motors, calibration)
self.port = port
self.can_interface = can_interface
+2 -2
@@ -356,8 +356,8 @@ class SerialMotorsBus(MotorsBusBase):
motors: dict[str, Motor],
calibration: dict[str, MotorCalibration] | None = None,
):
require_package("pyserial", extra="hardware", import_name="serial")
require_package("deepdiff", extra="hardware")
require_package("pyserial", extra="pyserial-dep", import_name="serial")
require_package("deepdiff", extra="deepdiff-dep")
super().__init__(port, motors, calibration)
self.port_handler: PortHandler
+3 -2
@@ -23,12 +23,12 @@ from types import SimpleNamespace
from typing import TYPE_CHECKING, Any, TypedDict
from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
from lerobot.utils.import_utils import _can_available
from lerobot.utils.import_utils import _can_available, require_package
if TYPE_CHECKING or _can_available:
import can
else:
can = SimpleNamespace(Message=object, interface=None)
can = SimpleNamespace(Message=object, interface=None, BusABC=object)
import numpy as np
from lerobot.utils.errors import DeviceNotConnectedError
@@ -106,6 +106,7 @@ class RobstrideMotorsBus(MotorsBusBase):
bitrate: Nominal bitrate in bps (default: 1000000 = 1 Mbps)
data_bitrate: Data bitrate for CAN FD in bps (default: 5000000 = 5 Mbps), ignored if use_can_fd is False
"""
require_package("python-can", extra="robstride", import_name="can")
super().__init__(port, motors, calibration)
self.port = port
self.can_interface = can_interface
+7 -3
@@ -18,14 +18,21 @@ import logging
import math
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import TYPE_CHECKING
import draccus
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR, LRScheduler
from lerobot.utils.constants import SCHEDULER_STATE
from lerobot.utils.import_utils import _diffusers_available, require_package
from lerobot.utils.io_utils import deserialize_json_into_object, write_json
if TYPE_CHECKING or _diffusers_available:
from diffusers.optimization import get_scheduler
else:
get_scheduler = None
@dataclass
class LRSchedulerConfig(draccus.ChoiceRegistry, abc.ABC):
@@ -47,10 +54,7 @@ class DiffuserSchedulerConfig(LRSchedulerConfig):
num_warmup_steps: int | None = None
def build(self, optimizer: Optimizer, num_training_steps: int) -> LambdaLR:
from lerobot.utils.import_utils import require_package
require_package("diffusers", extra="diffusion")
from diffusers.optimization import get_scheduler
kwargs = {**asdict(self), "num_training_steps": num_training_steps, "optimizer": optimizer}
return get_scheduler(**kwargs)
+4
@@ -20,12 +20,14 @@ from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as M
from .pi0.configuration_pi0 import PI0Config as PI0Config
from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
from .pi05.configuration_pi05 import PI05Config as PI05Config
from .pi052.configuration_pi052 import PI052Config as PI052Config
from .pretrained import PreTrainedPolicy as PreTrainedPolicy
from .rtc import ActionInterpolator as ActionInterpolator
from .sac.configuration_sac import SACConfig as SACConfig
from .sac.reward_model.configuration_classifier import RewardClassifierConfig as RewardClassifierConfig
from .sarm.configuration_sarm import SARMConfig as SARMConfig
from .smolvla.configuration_smolvla import SmolVLAConfig as SmolVLAConfig
from .smolvla2.configuration_smolvla2 import SmolVLA2Config as SmolVLA2Config
from .tdmpc.configuration_tdmpc import TDMPCConfig as TDMPCConfig
from .utils import make_robot_action, prepare_observation_for_inference
from .vqbet.configuration_vqbet import VQBeTConfig as VQBeTConfig
@@ -45,10 +47,12 @@ __all__ = [
"PI0Config",
"PI0FastConfig",
"PI05Config",
"PI052Config",
"RewardClassifierConfig",
"SACConfig",
"SARMConfig",
"SmolVLAConfig",
"SmolVLA2Config",
"TDMPCConfig",
"VQBeTConfig",
"WallXConfig",
@@ -23,6 +23,7 @@ TODO(alexander-soare):
import math
from collections import deque
from collections.abc import Callable
from typing import TYPE_CHECKING
import einops
import numpy as np
@@ -32,6 +33,14 @@ import torchvision
from torch import Tensor, nn
from lerobot.utils.constants import ACTION, OBS_ENV_STATE, OBS_IMAGES, OBS_STATE
from lerobot.utils.import_utils import _diffusers_available, require_package
if TYPE_CHECKING or _diffusers_available:
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
else:
DDIMScheduler = None
DDPMScheduler = None
from ..pretrained import PreTrainedPolicy
from ..utils import (
@@ -64,6 +73,7 @@ class DiffusionPolicy(PreTrainedPolicy):
dataset_stats: Dataset statistics to be used for normalization. If not passed here, it is expected
that they will be passed with a call to `load_state_dict` before the policy is used.
"""
require_package("diffusers", extra="diffusion")
super().__init__(config)
config.validate_features()
self.config = config
@@ -155,11 +165,7 @@ def _make_noise_scheduler(name: str, **kwargs: dict):
Factory for noise scheduler instances of the requested type. All kwargs are passed
to the scheduler.
"""
from lerobot.utils.import_utils import require_package
require_package("diffusers", extra="diffusion")
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
if name == "DDPM":
return DDPMScheduler(**kwargs)
+49
@@ -128,6 +128,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from .pi05.modeling_pi05 import PI05Policy
return PI05Policy
elif name == "pi052":
from .pi052.modeling_pi052 import PI052Policy
return PI052Policy
elif name == "sac":
from .sac.modeling_sac import SACPolicy
@@ -140,6 +144,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from .smolvla.modeling_smolvla import SmolVLAPolicy
return SmolVLAPolicy
elif name == "smolvla2":
from .smolvla2.modeling_smolvla2 import SmolVLA2Policy
return SmolVLA2Policy
elif name == "sarm":
from .sarm.modeling_sarm import SARMRewardModel
@@ -196,10 +204,18 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
return PI0Config(**kwargs)
elif policy_type == "pi05":
return PI05Config(**kwargs)
elif policy_type == "pi052":
from .pi052.configuration_pi052 import PI052Config
return PI052Config(**kwargs)
elif policy_type == "sac":
return SACConfig(**kwargs)
elif policy_type == "smolvla":
return SmolVLAConfig(**kwargs)
elif policy_type == "smolvla2":
from .smolvla2.configuration_smolvla2 import SmolVLA2Config
return SmolVLA2Config(**kwargs)
elif policy_type == "reward_classifier":
return RewardClassifierConfig(**kwargs)
elif policy_type == "groot":
@@ -236,6 +252,12 @@ class ProcessorConfigKwargs(TypedDict, total=False):
preprocessor_overrides: dict[str, Any] | None
postprocessor_overrides: dict[str, Any] | None
dataset_stats: dict[str, dict[str, torch.Tensor]] | None
# Optional: HF Hub repo id of the dataset the policy is being
# trained on. Used by policies that auto-fit pieces of their
# preprocessing (e.g. pi052's FAST action tokenizer per
# Pertsch et al. 2025 [64], π0.5 §III.C). When omitted, those
# policies fall back to their universal pre-fitted tokenizers.
dataset_repo_id: str | None
def make_pre_post_processors(
@@ -362,6 +384,22 @@ def make_pre_post_processors(
dataset_stats=kwargs.get("dataset_stats"),
)
elif policy_cfg.type == "pi052":
# NOTE: PI052Config subclasses PI05Config, so this branch MUST
# come before the PI05Config isinstance check below (otherwise
# pi052 would silently pick up π0.5's processor).
from .pi052.processor_pi052 import make_pi052_pre_post_processors
processors = make_pi052_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
# ``dataset_repo_id`` flows in via kwargs when FAST CE is
# enabled — the train loop sets it from ``--dataset.repo_id``.
# When ``None``, ``make_pi052_pre_post_processors`` skips
# the auto-fit and uses the universal tokenizer.
dataset_repo_id=kwargs.get("dataset_repo_id"),
)
elif isinstance(policy_cfg, PI05Config):
from .pi05.processor_pi05 import make_pi05_pre_post_processors
@@ -386,6 +424,17 @@ def make_pre_post_processors(
dataset_stats=kwargs.get("dataset_stats"),
)
elif policy_cfg.type == "smolvla2":
# NOTE: SmolVLA2Config subclasses SmolVLAConfig, so this branch
# MUST come before the SmolVLAConfig isinstance check below
# (otherwise SmolVLA2 would silently pick up SmolVLA's processor).
from .smolvla2.processor_smolvla2 import make_smolvla2_pre_post_processors
processors = make_smolvla2_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, SmolVLAConfig):
from .smolvla.processor_smolvla import make_smolvla_pre_post_processors
@@ -43,6 +43,7 @@ from torch import Tensor
from lerobot.configs import FeatureType, PolicyFeature
from lerobot.utils.constants import ACTION, OBS_IMAGES
from lerobot.utils.import_utils import require_package
from ..pretrained import PreTrainedPolicy
from .configuration_groot import GrootConfig
@@ -59,6 +60,7 @@ class GrootPolicy(PreTrainedPolicy):
def __init__(self, config: GrootConfig, **kwargs):
"""Initialize Groot policy wrapper."""
require_package("transformers", extra="groot")
super().__init__(config)
config.validate_features()
self.config = config
@@ -36,7 +36,7 @@ import torch.nn.functional as F # noqa: N812
import torchvision
from torch import Tensor
from lerobot.utils.import_utils import _transformers_available
from lerobot.utils.import_utils import _diffusers_available, _transformers_available, require_package
from .configuration_multi_task_dit import MultiTaskDiTConfig
@@ -46,6 +46,13 @@ if TYPE_CHECKING or _transformers_available:
else:
CLIPTextModel = None
CLIPVisionModel = None
if TYPE_CHECKING or _diffusers_available:
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
else:
DDIMScheduler = None
DDPMScheduler = None
from lerobot.utils.constants import (
ACTION,
OBS_IMAGES,
@@ -65,6 +72,8 @@ class MultiTaskDiTPolicy(PreTrainedPolicy):
name = "multi_task_dit"
def __init__(self, config: MultiTaskDiTConfig, **kwargs):
require_package("transformers", extra="multi_task_dit")
require_package("diffusers", extra="multi_task_dit")
super().__init__(config)
config.validate_features()
self.config = config
@@ -643,12 +652,6 @@ class DiffusionObjective(nn.Module):
"prediction_type": config.prediction_type,
}
from lerobot.utils.import_utils import require_package
require_package("diffusers", extra="multi_task_dit")
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
from diffusers.schedulers.scheduling_ddpm import DDPMScheduler
if config.noise_scheduler_type == "DDPM":
self.noise_scheduler: DDPMScheduler | DDIMScheduler = DDPMScheduler(**scheduler_kwargs)
elif config.noise_scheduler_type == "DDIM":
+2 -1
@@ -26,7 +26,7 @@ import torch
import torch.nn.functional as F # noqa: N812
from torch import Tensor, nn
from lerobot.utils.import_utils import _transformers_available
from lerobot.utils.import_utils import _transformers_available, require_package
# Conditional import for type checking and lazy loading
if TYPE_CHECKING or _transformers_available:
@@ -947,6 +947,7 @@ class PI0Policy(PreTrainedPolicy):
Args:
config: Policy configuration class instance.
"""
require_package("transformers", extra="pi")
super().__init__(config)
config.validate_features()
self.config = config
+2 -1
@@ -26,7 +26,7 @@ import torch
import torch.nn.functional as F # noqa: N812
from torch import Tensor, nn
from lerobot.utils.import_utils import _transformers_available
from lerobot.utils.import_utils import _transformers_available, require_package
# Conditional import for type checking and lazy loading
if TYPE_CHECKING or _transformers_available:
@@ -918,6 +918,7 @@ class PI05Policy(PreTrainedPolicy):
Args:
config: Policy configuration class instance.
"""
require_package("transformers", extra="pi")
super().__init__(config)
config.validate_features()
self.config = config
+42
@@ -0,0 +1,42 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""π0.5 v2 — full reproduction of the π0.5 paper's hierarchical
inference recipe on lerobot.
Extends :class:`lerobot.policies.pi05.PI05Policy` with:
* recipe-driven training (PR 1's :class:`RenderMessagesStep`),
* PaliGemma ``lm_head`` cross-entropy on supervised subtask spans
(the "high-level subtask prediction" of the paper, §IV.D),
* AR text generation at inference (:meth:`PI052Policy.select_message`),
* per-component prompt dropout (Pi 0.7 §V.E) for regularising the
text head against missing context at inference.
See ``src/lerobot/configs/recipes/hirobot.yaml`` for the
canonical training recipe and
``examples/training/pi052_hirobot.slurm`` for the launcher.
"""
from .configuration_pi052 import PI052Config
from .modeling_pi052 import PI052Policy
from .processor_pi052 import make_pi052_pre_post_processors
from .text_processor_pi052 import PI052TextTokenizerStep
__all__ = [
"PI052Config",
"PI052Policy",
"PI052TextTokenizerStep",
"make_pi052_pre_post_processors",
]
@@ -0,0 +1,172 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""π0.5 v2 (with text head) — reproduction of the π0.5 paper's
hierarchical inference recipe.
Same architecture as the existing ``PI05Policy`` (PaliGemma 2B VLM +
~300M Gemma action expert, joint training with FAST tokens during
pre-train and flow matching during post-train), but with the
PaliGemma ``lm_head`` re-enabled so the same model can be supervised
to predict both:
* **subtask strings** at the high level (cross-entropy on the LM
head), and
* **action chunks** at the low level (flow matching on the
action-expert tokens).
This is the dual-head co-training pattern from the paper:
L = H(x, f_θ_text) + α * ‖ω - a - f_θ_action(a_τ, o, …)‖²
with α = 10.0 per § IV.D of arxiv:2504.16054. The π0.5 model splits
inference into a text-prediction step followed by an action-prediction
step, which mirrors what ``SmolVLA2Runtime`` already does on a
SmolVLM2 backbone.
"""
from dataclasses import dataclass
from lerobot.configs import PreTrainedConfig
from ..pi05.configuration_pi05 import PI05Config
@PreTrainedConfig.register_subclass("pi052")
@dataclass
class PI052Config(PI05Config):
"""π0.5 with the PaliGemma LM head re-enabled for subtask prediction.
See ``SmolVLA2Config`` for the analogous SmolVLM2-backed dual-head
config. Same recipe-driven training surface; the only differences
are which backbone the policy uses (PaliGemma here vs SmolVLM2
there) and the default loss-weight scale (paper §IV.D uses
``α=10`` between the two heads, which we encode as
``flow_loss_weight=10, text_loss_weight=1``).
"""
# Recipe / language stack ---------------------------------------------
recipe_path: str | None = "recipes/hirobot.yaml"
"""Path (absolute or relative to ``src/lerobot/configs/``) to a
``TrainingRecipe`` YAML. Defaults to the canonical Hi-Robot blend
shipped alongside this policy. Set to ``None`` to disable recipe
rendering and fall back to π0.5's single-task ``Task: ... Action:``
prompt path (unannotated datasets keep working that way)."""
apply_chat_template: bool = False
"""PaliGemma is *not* chat-pretrained — its tokenizer doesn't ship a
chat template. So unlike SmolVLA2 we don't apply one. The recipe
renderer's output is concatenated as a plain prefix + assistant
suffix instead, mirroring how the π0.5 paper's high-level inference
samples text auto-regressively after the prefix."""
# Loss weights --------------------------------------------------------
# Paper §IV.D: total = H(text) + α * MSE(flow), α = 10. We split
# the same total into two configurable knobs so individual scaling
# is recoverable.
text_loss_weight: float = 1.0
"""Weight on the LM-head cross-entropy term. Set to ``0`` to disable
text training entirely (reverts to flow-only / π0.5 behaviour)."""
flow_loss_weight: float = 10.0
"""Weight on the action-expert flow-matching term. Default ``10.0``
matches the paper's α."""
# Backbone training ---------------------------------------------------
unfreeze_lm_head: bool = True
"""Whether to keep the PaliGemma ``lm_head`` unfrozen for fine-tuning.
The existing ``PI05Policy`` zeroes / freezes the head on load
because it never reads from it. Must be ``True`` for π0.5-style
hierarchical inference."""
# Per-component prompt dropout (Pi0.7 §V.E) ---------------------------
# Same regulariser surface as SmolVLA2: randomly drop non-target
# context messages so the LM head learns to handle missing /
# stale plan / memory at inference. Defaults to 0.0 so behaviour
# is identical until explicitly enabled.
plan_dropout_prob: float = 0.0
memory_dropout_prob: float = 0.0
subtask_dropout_prob: float = 0.0
# FAST discrete-action supervision — paper §III.B-C ------------------
# When enabled, actions are *also* tokenised via the FAST tokenizer
# ("physical-intelligence/fast") and supervised with cross-entropy
# on the PaliGemma LM head — exactly as in the paper's pre-training
# objective (Eq. 1 mixes FAST CE + flow MSE + subtask CE). The
# ActionTokenizerProcessorStep is wired into the preprocessor
# pipeline when this flag is set; the loss is computed in
# PI052Policy.forward.
enable_fast_action_loss: bool = True
"""If True, tokenise actions with the FAST tokenizer and add a
cross-entropy loss on the LM head. On by default to match the
π0.5 paper's three-loss objective (text CE + FAST CE + flow MSE,
§III.B-C Eq. 1). Set to False if you only want the
post-training-style flow + text recipe."""
action_tokenizer_name: str = "physical-intelligence/fast"
"""HF identifier for the FAST action tokenizer."""
max_action_tokens: int = 256
"""Maximum number of FAST tokens per action chunk."""
fast_skip_tokens: int = 128
"""Number of low-vocab tokens the FAST tokenizer skips to avoid
collisions with PaliGemma's text vocabulary."""
fast_action_loss_weight: float = 1.0
"""Weight on the FAST-action-token CE loss. Paper §III.C uses 1.0."""
auto_fit_fast_tokenizer: bool = False
"""If True, the processor factory checks ``fast_tokenizer_cache_dir``
for a previously-fitted tokenizer keyed on ``(dataset_repo_id,
base_tokenizer_name, fit_samples)``. On cache miss, it loads
``action_tokenizer_name`` as a base, samples
``fast_tokenizer_fit_samples`` action chunks from the dataset, runs
``.fit()``, saves the result, and uses *that* fitted path as the
actual tokenizer. Pertsch et al. 2025 (FAST paper [64], π0.5 §III.C)
explicitly recommend per-dataset fitting for best compression.
Off by default because the fit requires a separate pre-training
pass over the dataset (~1-2 min on a medium dataset) and depends
on the FAST tokenizer snapshot having a ``.fit()`` method. Opt in
when you want paper-faithful compression; leave off to fall back
on the universal ``physical-intelligence/fast`` codebook."""
fast_tokenizer_cache_dir: str = "~/.cache/lerobot/fast_tokenizers"
"""Where fitted FAST tokenizers are stored. ``~`` expands."""
fast_tokenizer_fit_samples: int = 1024
"""Number of action chunks to sample for the fit. The FAST paper uses
a few thousand; 1024 is a reasonable default for medium datasets."""
# Knowledge insulation — paper §III.B --------------------------------
# When enabled, gradients from the action expert's flow loss are
# blocked from flowing back into the VLM's K/V projections. This
# prevents the action loss from over-fitting the language backbone
# to robot-specific features. Implemented in ``modeling_pi052`` as
# a per-instance monkey-patch on ``paligemma_with_expert.forward``
# that splits queries into VLM and action halves and ``.detach()``-s
# the VLM K/V tensors used in the action-half's attention.
knowledge_insulation: bool = False
"""If True, route every transformer layer through the KI
attention path that blocks action→VLM gradient flow on K/V."""
def __post_init__(self) -> None:
super().__post_init__()
# Backbone needs gradients flowing through the text head when
# we're training it. Override the π0.5 default
# (``train_expert_only=True``) unless the user explicitly opts
# out of text training via ``text_loss_weight=0``.
if self.text_loss_weight > 0 and self.unfreeze_lm_head:
self.train_expert_only = False
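# Worked default (illustrative): with text_loss_weight=1.0, flow_loss_weight=10.0
# and fast_action_loss_weight=1.0 the intended total objective combines as
#   total = 1.0 * text_ce + 10.0 * flow_mse + 1.0 * fast_ce
# matching the paper's α=10 between the text and flow terms. A typical override
# (flag names follow the config fields above; exact train-script syntax may differ):
#   --policy.type=pi052 --policy.knowledge_insulation=true \
#   --policy.auto_fit_fast_tokenizer=true --policy.fast_tokenizer_fit_samples=2048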
@@ -0,0 +1,204 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dataset-specific FAST action tokenizer fitting.
The published ``physical-intelligence/fast`` tokenizer is a *universal*
codebook fitted on a heterogeneous mix of robot datasets. Per Pertsch
et al. 2025 (the FAST paper, [64] in the π0.5 paper) and §III.C of
π0.5 itself, the recommended practice is to **finetune the tokenizer on
your specific dataset's action distribution** before training the
policy, the same way one would adapt a language tokenizer to a domain
corpus. Without this finetune step, action sequences from your robot
may require more tokens per chunk than necessary, lowering effective
compression and slowing convergence of the action-CE loss.
This module provides a single utility, :func:`fit_fast_tokenizer`,
that does the finetune. The training entry point invokes it
automatically when the policy's ``enable_fast_action_loss`` and
``auto_fit_fast_tokenizer`` flags are both ``True`` and no cached
fitted tokenizer is found at ``fast_tokenizer_cache_dir``.
The fitted tokenizer is saved to
``{cache_dir}/{signature}/`` (a short hash over dataset, base tokenizer, sample count and chunk size) so successive training
runs over the same dataset re-use it.
"""
from __future__ import annotations
import hashlib
import logging
from pathlib import Path
from typing import Any
import numpy as np
logger = logging.getLogger(__name__)
def _dataset_signature(
dataset_repo_id: str,
base_tokenizer_name: str,
n_samples: int,
chunk_size: int,
) -> str:
"""Deterministic short hash for naming the cache directory.
Keys on (dataset, base tokenizer, sample count, chunk size) so any
of those changing re-runs the fit. ``chunk_size`` matters because
the tokenizer is fit on chunks of that length.
"""
h = hashlib.sha256()
h.update(dataset_repo_id.encode("utf-8"))
h.update(b"\0")
h.update(base_tokenizer_name.encode("utf-8"))
h.update(b"\0")
h.update(str(n_samples).encode("utf-8"))
h.update(b"\0")
h.update(str(chunk_size).encode("utf-8"))
return h.hexdigest()[:16]
def fit_fast_tokenizer(
*,
dataset_repo_id: str,
cache_dir: str | Path,
base_tokenizer_name: str = "physical-intelligence/fast",
n_samples: int = 1024,
chunk_size: int = 50,
seed: int = 42,
) -> str:
"""Fit a FAST tokenizer on a LeRobot dataset's action distribution.
Args:
dataset_repo_id: HF Hub repo id of the LeRobotDataset to fit on.
cache_dir: Directory under which to save (and look up) fitted
tokenizers. The actual save path is
``{cache_dir}/{signature}``.
base_tokenizer_name: HF identifier for the base FAST tokenizer
to finetune from. ``physical-intelligence/fast`` is the
universal one.
n_samples: Number of action chunks to sample for the fit. The
FAST paper uses a few thousand; ``1024`` is a good default
for medium datasets.
chunk_size: Length of each action chunk (matches
``policy.chunk_size``). The FAST tokenizer is fit on
sequences of this length.
seed: RNG seed for sample selection.
Returns:
The local path to the fitted tokenizer. Passed directly to
``--policy.action_tokenizer_name`` for the training run.
Raises:
ImportError: If the ``transformers`` library doesn't expose
``AutoProcessor`` or the FAST tokenizer doesn't have a
``.fit()`` method (then you're on an older FAST snapshot —
update to the current published model).
FileNotFoundError: If the dataset can't be loaded.
"""
cache_dir = Path(cache_dir)
sig = _dataset_signature(dataset_repo_id, base_tokenizer_name, n_samples, chunk_size)
out_dir = cache_dir / sig
if out_dir.exists() and (out_dir / "preprocessor_config.json").exists():
logger.info(
"FAST tokenizer cache hit: %s — re-using fitted tokenizer for "
"dataset=%s base=%s n_samples=%d",
out_dir, dataset_repo_id, base_tokenizer_name, n_samples,
)
return str(out_dir)
logger.info(
"FAST tokenizer cache miss — fitting on dataset=%s "
"base=%s n_samples=%d chunk_size=%d%s",
dataset_repo_id, base_tokenizer_name, n_samples, chunk_size, out_dir,
)
from transformers import AutoProcessor # noqa: PLC0415
from lerobot.datasets.lerobot_dataset import LeRobotDataset # noqa: PLC0415
# Stream a single episode's worth of action chunks at a time so
# we don't blow memory on huge datasets. Random episode +
# random start offset gives a reasonable spread.
rng = np.random.default_rng(seed)
actions_buf: list[np.ndarray] = []
# Load just the metadata first to know episode boundaries.
ds_meta_only = LeRobotDataset(dataset_repo_id, episodes=[0])
num_episodes = ds_meta_only.meta.total_episodes
del ds_meta_only
samples_per_episode = max(1, n_samples // max(num_episodes, 1))
collected = 0
eps_visited = 0
for ep_idx in rng.permutation(num_episodes):
if collected >= n_samples:
break
ep_idx = int(ep_idx)
try:
ds = LeRobotDataset(dataset_repo_id, episodes=[ep_idx])
except Exception as exc: # noqa: BLE001
logger.warning("FAST fit: skipping episode %d: %s", ep_idx, exc)
continue
if len(ds) < chunk_size:
continue
# Sample ``samples_per_episode`` start indices uniformly within
# the episode.
starts = rng.integers(0, len(ds) - chunk_size + 1, size=samples_per_episode)
for s in starts:
try:
chunk_actions = np.stack(
[
np.asarray(ds[int(s) + j]["action"].cpu().numpy())
for j in range(chunk_size)
],
axis=0,
)
except Exception as exc: # noqa: BLE001
logger.debug("FAST fit: chunk at ep=%d s=%d failed: %s", ep_idx, s, exc)
continue
actions_buf.append(chunk_actions)
collected += 1
if collected >= n_samples:
break
eps_visited += 1
if not actions_buf:
raise RuntimeError(
f"FAST fit collected zero action chunks from {dataset_repo_id!r}. "
"Check that the dataset has an ``action`` column and chunks of "
f"length ≥ {chunk_size}."
)
actions = np.stack(actions_buf, axis=0) # (N, H, D)
logger.info(
"FAST fit: collected %d chunks of shape %s from %d episodes",
actions.shape[0], actions.shape[1:], eps_visited,
)
base = AutoProcessor.from_pretrained(base_tokenizer_name, trust_remote_code=True)
if not hasattr(base, "fit"):
raise ImportError(
f"Base FAST tokenizer {base_tokenizer_name!r} has no ``.fit()`` "
"method — your transformers / model snapshot is too old. Update "
"to the current ``physical-intelligence/fast`` revision."
)
fitted = base.fit(actions)
out_dir.mkdir(parents=True, exist_ok=True)
fitted.save_pretrained(str(out_dir))
logger.info("FAST fit: saved fitted tokenizer to %s", out_dir)
return str(out_dir)
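# Usage sketch (illustrative; the repo id below is a placeholder):
#   fitted_path = fit_fast_tokenizer(
#       dataset_repo_id="your-org/your-dataset",
#       cache_dir=Path("~/.cache/lerobot/fast_tokenizers").expanduser(),
#       n_samples=1024,
#       chunk_size=50,
#   )
#   # then train with --policy.action_tokenizer_name=<fitted_path>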
@@ -0,0 +1,821 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""π0.5 v2 policy — dual-head training & hierarchical inference.
A thin subclass of :class:`PI05Policy` that:
* keeps the PaliGemma ``lm_head`` unfrozen during fine-tuning
(``PI05Policy`` zeroes / freezes it because it never reads from
the head; ``PI052Config.unfreeze_lm_head`` flips that),
* adds a ``text_loss`` term computed via cross-entropy on
``text_labels`` (built by ``PI052TextTokenizerStep``),
* adds :meth:`select_message` for AR text generation at inference
(the high-level step in the π0.5 paper's two-stage inference loop),
* combines both losses in :meth:`forward` per Eq. (1) of the paper:
L = H(x, f_θ_text) + α * ‖ω - a - f_θ_action(...)‖²
with α controllable via ``config.flow_loss_weight``.
The same multi-rate runtime that drives ``SmolVLA2Runtime`` (see
``lerobot.policies.smolvla2.inference``) can drive this policy too:
both expose ``predict_action_chunk`` for the action expert and
``select_message`` for the LM head.
"""
from __future__ import annotations
import logging
import math
import types
from typing import Any
import torch
from torch import Tensor
from torch.nn import functional as F
from lerobot.utils.constants import OBS_LANGUAGE_ATTENTION_MASK, OBS_LANGUAGE_TOKENS
from ..pi05.configuration_pi05 import PI05Config
from ..pi05.modeling_pi05 import PI05Policy
from .configuration_pi052 import PI052Config
logger = logging.getLogger(__name__)
# ----------------------------------------------------------------------
# Loss helpers (shared between fused and prefix-only paths)
# ----------------------------------------------------------------------
def _mask_per_sample(per_sample: Tensor, predict_actions_t: Tensor | None) -> Tensor:
"""Mean over samples where ``predict_actions_t`` is True, else over all."""
if predict_actions_t is None:
return per_sample.mean()
mask = predict_actions_t.to(per_sample.dtype)
return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
def _shifted_ce(logits: Tensor, labels: Tensor) -> Tensor:
"""Next-token CE: hidden at t predicts label at t+1, ignore_index=-100.
Mean over non-ignored positions across the batch. Returns 0 cleanly
when no positions are supervised (clamp(min=1) on the denominator).
"""
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous().long()
valid = shift_labels != -100
loss = F.cross_entropy(
shift_logits.reshape(-1, shift_logits.shape[-1]),
shift_labels.reshape(-1),
ignore_index=-100,
reduction="sum",
)
return loss / valid.sum().clamp(min=1)
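# Shape note (illustrative): with logits (B, T, V) and labels (B, T) where prompt
# positions carry -100, the logits at position t are scored against the label at
# t+1, so a sample with N prompt tokens and M supervised answer tokens contributes
# roughly M valid positions to the mean above.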
def _fast_ce(
fast_logits: Tensor,
action_tokens: Tensor,
action_mask: Tensor,
predict_actions_t: Tensor | None,
) -> Tensor:
"""FAST-CE with both token-pad masking and per-sample action gating.
``action_mask`` is the FAST tokenizer's padding mask; samples whose
recipe sets ``predict_actions=False`` (e.g. plan_generation,
ask_vqa_*) get *all* their FAST positions masked out via the
per-sample gate.
"""
shift_logits = fast_logits[:, :-1, :].contiguous()
shift_targets = action_tokens[:, 1:].contiguous().long()
shift_valid = action_mask[:, 1:].contiguous().bool()
if predict_actions_t is not None:
sample_mask = predict_actions_t[:, None].expand_as(shift_valid)
shift_valid = shift_valid & sample_mask
shift_targets = shift_targets.masked_fill(~shift_valid, -100)
return F.cross_entropy(
shift_logits.reshape(-1, shift_logits.shape[-1]),
shift_targets.reshape(-1),
ignore_index=-100,
reduction="sum",
) / shift_valid.sum().clamp(min=1)
# ----------------------------------------------------------------------
# Knowledge insulation — ported from pi05_full (branch ``feat/add-pi05``)
# ----------------------------------------------------------------------
#
# Per-layer attention that splits the queries into VLM and action
# parts, computing attention for action queries with .detach()'d VLM
# K/V so the action loss's gradient cannot flow back into the VLM's K
# and V projections. Forward output is bit-equivalent to the standard
# layer; backward differs only on the path action_loss → VLM K/V.
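# Minimal standalone sketch of the detach trick (illustrative, independent of the
# Gemma plumbing below): concatenating detached VLM K/V with live action K/V lets
# the action queries still read VLM features while autograd treats them as
# constants, so d(action_loss)/d(K_vlm) == d(action_loss)/d(V_vlm) == 0.
#   k_for_action = torch.cat([k_vlm.detach(), k_action], dim=2)
#   v_for_action = torch.cat([v_vlm.detach(), v_action], dim=2)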
def _compute_layer_ki(
layer_idx,
inputs_embeds,
attention_mask,
position_ids,
adarms_cond,
paligemma,
gemma_expert,
):
from transformers.models.gemma import modeling_gemma # noqa: PLC0415
models = [paligemma.model.language_model, gemma_expert.model]
query_states, key_states, value_states, gates = [], [], [], []
vlm_len = inputs_embeds[0].shape[1]
for i, hidden_states in enumerate(inputs_embeds):
layer = models[i].layers[layer_idx]
hidden_states, gate = layer.input_layernorm(hidden_states, cond=adarms_cond[i])
gates.append(gate)
input_shape = hidden_states.shape[:-1]
hidden_shape = (*input_shape, -1, layer.self_attn.head_dim)
q = layer.self_attn.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
k = layer.self_attn.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
v = layer.self_attn.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
query_states.append(q)
key_states.append(k)
value_states.append(v)
query_states = torch.cat(query_states, dim=2)
key_states = torch.cat(key_states, dim=2)
value_states = torch.cat(value_states, dim=2)
dummy = torch.zeros(
query_states.shape[0],
query_states.shape[2],
query_states.shape[-1],
device=query_states.device,
dtype=query_states.dtype,
)
cos, sin = paligemma.model.language_model.rotary_emb(dummy, position_ids)
query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
query_states, key_states, cos, sin, unsqueeze_dim=1
)
batch_size = query_states.shape[0]
scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
# Split queries / K / V at the VLM-vs-action boundary.
Q_vlm = query_states[:, :, :vlm_len, :]
Q_action = query_states[:, :, vlm_len:, :]
K_vlm = key_states[:, :, :vlm_len, :]
K_action = key_states[:, :, vlm_len:, :]
V_vlm = value_states[:, :, :vlm_len, :]
V_action = value_states[:, :, vlm_len:, :]
# Detach VLM K/V *only* on the path the action queries use.
K_vlm_det = K_vlm.detach()
V_vlm_det = V_vlm.detach()
K_for_vlm = key_states # full (gradients flow)
V_for_vlm = value_states
K_for_action = torch.cat([K_vlm_det, K_action], dim=2)
V_for_action = torch.cat([V_vlm_det, V_action], dim=2)
mask_for_vlm = attention_mask[:, :, :vlm_len, :]
mask_for_action = attention_mask[:, :, vlm_len:, :]
att_vlm, _ = modeling_gemma.eager_attention_forward(
paligemma.model.language_model.layers[layer_idx].self_attn,
Q_vlm, K_for_vlm, V_for_vlm, mask_for_vlm, scaling,
)
att_action, _ = modeling_gemma.eager_attention_forward(
paligemma.model.language_model.layers[layer_idx].self_attn,
Q_action, K_for_action, V_for_action, mask_for_action, scaling,
)
att = torch.cat([att_vlm, att_action], dim=1)
head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
att = att.reshape(batch_size, -1, 1 * 8 * head_dim)
outputs_embeds = []
start = 0
for i, hidden_states in enumerate(inputs_embeds):
layer = models[i].layers[layer_idx]
end = start + hidden_states.shape[1]
if att.dtype != layer.self_attn.o_proj.weight.dtype:
att = att.to(layer.self_attn.o_proj.weight.dtype)
out_emb = layer.self_attn.o_proj(att[:, start:end])
out_emb = modeling_gemma._gated_residual(hidden_states, out_emb, gates[i]) # noqa: SLF001
after_first = out_emb.clone()
out_emb, gate = layer.post_attention_layernorm(out_emb.clone(), cond=adarms_cond[i])
if layer.mlp.up_proj.weight.dtype == torch.bfloat16:
out_emb = out_emb.to(dtype=torch.bfloat16)
out_emb = layer.mlp(out_emb)
out_emb = modeling_gemma._gated_residual(after_first, out_emb, gate) # noqa: SLF001
outputs_embeds.append(out_emb)
start = end
return outputs_embeds
def _paligemma_forward_ki(
self,
attention_mask=None,
position_ids=None,
past_key_values=None,
inputs_embeds=None,
use_cache=None,
adarms_cond=None,
):
"""Replacement ``PaliGemmaWithExpertModel.forward`` that routes the
dual-expert layer pass through :func:`_compute_layer_ki`.
Bound onto the model instance when ``config.knowledge_insulation``
is True (see ``PI052Policy.__init__``). Single-expert branches
(VLM-only or action-only) defer back to the original forward;
KI only matters when actions and VLM tokens are forwarded together.
"""
from ..pi05.modeling_pi05 import layernorm_forward # noqa: PLC0415
if adarms_cond is None:
adarms_cond = [None, None]
# Single-expert paths: defer to the original forward saved in
# PI052Policy.__init__.
if inputs_embeds[0] is None or inputs_embeds[1] is None:
return self._pi052_orig_forward(
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
adarms_cond=adarms_cond,
)
models = [self.paligemma.model.language_model, self.gemma_expert.model]
num_layers = self.paligemma.config.text_config.num_hidden_layers
use_gc = (
hasattr(self.gemma_expert.model, "gradient_checkpointing")
and self.gemma_expert.model.gradient_checkpointing
and self.training
) or (
hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training
)
for layer_idx in range(num_layers):
if use_gc:
inputs_embeds = torch.utils.checkpoint.checkpoint(
_compute_layer_ki,
layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond,
use_reentrant=False, preserve_rng_state=False,
paligemma=self.paligemma, gemma_expert=self.gemma_expert,
)
else:
inputs_embeds = _compute_layer_ki(
layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond,
paligemma=self.paligemma, gemma_expert=self.gemma_expert,
)
outputs_embeds = []
for i, hidden_states in enumerate(inputs_embeds):
out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
outputs_embeds.append(out_emb)
return [outputs_embeds[0], outputs_embeds[1]], None
class PI052Policy(PI05Policy):
"""π0.5 with the PaliGemma LM head re-enabled."""
config_class = PI052Config
name = "pi052"
def __init__(self, config: PI052Config, **kwargs: Any) -> None:
super().__init__(config, **kwargs)
# ``PI05Policy.__init__`` zeroes the PaliGemma ``lm_head`` and
# freezes a few terminal layers when ``train_expert_only`` is
# True (the default). We re-enable the head when the user
# wants text supervision.
if config.text_loss_weight > 0 and config.unfreeze_lm_head:
self._unfreeze_lm_head()
# Knowledge insulation: bind a custom ``forward`` on the
# PaliGemmaWithExpertModel instance that uses
# :func:`_compute_layer_ki` for the dual-expert layer pass.
# The bind is per-instance, so this doesn't leak into stock
# ``pi05`` policies that share the same class.
if getattr(config, "knowledge_insulation", False):
backbone = self.model.paligemma_with_expert
backbone._pi052_orig_forward = backbone.forward
backbone.forward = types.MethodType(_paligemma_forward_ki, backbone)
logger.info(
"PI052: knowledge insulation enabled — action→VLM K/V "
"gradients are blocked in attention."
)
# ------------------------------------------------------------------
# Head unfreeze helper
# ------------------------------------------------------------------
def _unfreeze_lm_head(self) -> None:
"""Walk the PaliGemma submodules and re-enable gradients on
``lm_head`` + the immediately preceding norm / last text-model
layer that ``PI05Policy`` typically freezes."""
backbone = self.model.paligemma_with_expert.paligemma
if hasattr(backbone, "lm_head"):
for p in backbone.lm_head.parameters():
p.requires_grad_(True)
# Also re-enable the text model's final norm and the last
# transformer block, mirroring SmolVLA2's logic, which locates
# them dynamically via the frozen parameters adjacent to the head.
text_model = getattr(backbone, "model", None)
text_model = getattr(text_model, "language_model", text_model)
if text_model is None:
return
norm = getattr(text_model, "norm", None)
if norm is not None:
for p in norm.parameters():
p.requires_grad_(True)
layers = getattr(text_model, "layers", None)
if isinstance(layers, (list, torch.nn.ModuleList)) and len(layers) > 0:
for p in layers[-1].parameters():
p.requires_grad_(True)
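# Illustrative sketch, hypothetical usage rather than module code: after
# building a PI052Policy with text_loss_weight > 0 and unfreeze_lm_head=True,
# you can sanity-check what _unfreeze_lm_head re-enabled by listing the
# trainable PaliGemma parameters. ``policy`` is an assumed instance name.
#
#   backbone = policy.model.paligemma_with_expert.paligemma
#   trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
#   assert any("lm_head" in n for n in trainable)
#   print(len(trainable), "trainable PaliGemma params, e.g.", trainable[:3])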
# ------------------------------------------------------------------
# Forward (dual loss: flow + text)
# ------------------------------------------------------------------
def forward(
self,
batch: dict[str, Tensor],
reduction: str = "mean",
) -> tuple[Tensor, dict]:
"""Dual-head forward: flow-matching loss + text-CE loss.
When the batch carries no recipe routing fields at all
(``text_labels`` and ``predict_actions`` both missing) and the
FAST loss is disabled, we delegate to ``PI05Policy.forward``
unchanged. Otherwise we compute whichever losses are enabled and
sum them with ``flow_loss_weight`` / ``text_loss_weight`` /
``fast_action_loss_weight``.
"""
text_labels = batch.get("text_labels")
predict_actions_t = batch.get("predict_actions")
# Fall through to PI05Policy only on fully unannotated batches
# (no recipe applied → no routing fields). For recipe-applied
# batches we keep control of the loss dispatch even if all
# samples are text-only — delegating would silently train flow
# on text-only frames (PI05Policy.forward ignores
# ``predict_actions``).
if (
text_labels is None
and predict_actions_t is None
and not getattr(self.config, "enable_fast_action_loss", False)
):
return super().forward(batch, reduction=reduction)
run_flow = (
self.config.flow_loss_weight > 0
and (predict_actions_t is None or bool(predict_actions_t.any().item()))
)
run_text = self.config.text_loss_weight > 0 and text_labels is not None
loss_dict: dict[str, Any] = {}
total: Tensor | None = None
# Decide which losses fire this step.
run_fast = (
getattr(self.config, "enable_fast_action_loss", False)
and self.config.fast_action_loss_weight > 0
and (predict_actions_t is None or bool(predict_actions_t.any().item()))
)
action_tokens = action_mask = None
if run_fast:
from lerobot.utils.constants import ACTION_TOKEN_MASK, ACTION_TOKENS # noqa: PLC0415
action_tokens = batch.get(ACTION_TOKENS)
action_mask = batch.get(ACTION_TOKEN_MASK)
if action_tokens is None or action_mask is None:
run_fast = False
# ------------------------------------------------------------
# Dispatch: full fusion when flow is active, otherwise the
# prefix-only text+FAST helper (no suffix forward needed).
#
# Full fusion (flow ON):
# ONE backbone forward with prefix=[images, lang, FAST] +
# suffix=[noisy_actions], suffix→FAST attention masked out.
# All three losses computed from slices of the single output.
#
# Prefix-only fusion (flow OFF, e.g. text-only recipes):
# ONE prefix-only forward, both text + FAST losses computed
# from slices. No suffix forward → cheaper.
# ------------------------------------------------------------
if run_flow:
flow_loss, text_loss, fast_loss = self._compute_all_losses_fused(
batch,
text_labels=text_labels if run_text else None,
action_tokens=action_tokens if run_fast else None,
action_mask=action_mask if run_fast else None,
predict_actions_t=predict_actions_t,
)
loss_dict["flow_loss"] = float(flow_loss.detach().item())
total = self.config.flow_loss_weight * flow_loss
if text_loss is not None:
loss_dict["text_loss"] = float(text_loss.detach().item())
total = total + self.config.text_loss_weight * text_loss
if fast_loss is not None:
loss_dict["fast_action_loss"] = float(fast_loss.detach().item())
total = total + self.config.fast_action_loss_weight * fast_loss
elif run_text or run_fast:
text_loss, fast_loss = self._compute_text_and_fast_loss(
batch,
text_labels=text_labels if run_text else None,
action_tokens=action_tokens if run_fast else None,
action_mask=action_mask if run_fast else None,
predict_actions_t=predict_actions_t,
)
if text_loss is not None:
loss_dict["text_loss"] = float(text_loss.detach().item())
weighted = self.config.text_loss_weight * text_loss
total = weighted if total is None else total + weighted
if fast_loss is not None:
loss_dict["fast_action_loss"] = float(fast_loss.detach().item())
weighted = self.config.fast_action_loss_weight * fast_loss
total = weighted if total is None else total + weighted
if total is None:
# Both flow and text disabled — make this an obvious bug
# rather than a silent zero loss.
raise RuntimeError(
"PI052Policy.forward: both flow_loss_weight and "
"text_loss_weight are 0 (or text_labels missing) — "
"nothing to train."
)
loss_dict["loss"] = float(total.detach().item()) if total.dim() == 0 else float("nan")
if reduction == "none":
return total.expand(batch[OBS_LANGUAGE_TOKENS].shape[0]), loss_dict
return total, loss_dict
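# Illustrative sketch with hypothetical numbers (not config defaults): how
# the weighted sum above composes for a recipe batch with all three heads
# active, and what loss_dict ends up reporting.
#
#   total = (
#       cfg.flow_loss_weight * flow_loss            # e.g. 1.0 * 0.42
#       + cfg.text_loss_weight * text_loss          # e.g. 0.5 * 2.10
#       + cfg.fast_action_loss_weight * fast_loss   # e.g. 0.2 * 3.05
#   )
#   # loss_dict keeps the unweighted components:
#   # {"flow_loss": 0.42, "text_loss": 2.10, "fast_action_loss": 3.05, "loss": 2.08}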
# ------------------------------------------------------------------
# Text loss
# ------------------------------------------------------------------
def _compute_all_losses_fused(
self,
batch: dict[str, Tensor],
text_labels: Tensor | None,
action_tokens: Tensor | None,
action_mask: Tensor | None,
predict_actions_t: Tensor | None = None,
) -> tuple[Tensor, Tensor | None, Tensor | None]:
"""Full fusion: flow + text + FAST in ONE backbone forward.
Builds:
prefix = [images, language, FAST (when provided)]
suffix = [noisy_actions] (action expert via gemma_expert)
Then overrides the unified 2D attention mask to *explicitly*
zero out ``suffix → FAST`` attention. Without this override
the action expert would attend to the discrete FAST tokens
and trivially decode them back to the same continuous
actions it's supposed to predict via flow matching — the
whole training signal collapses.
Both prefix_out and suffix_out are captured from the same
forward. From prefix_out we slice the language and FAST
token positions and compute their CE losses. From suffix_out
we run the existing flow path (action_out_proj MSE).
Returns ``(flow_loss, text_loss, fast_loss)`` where text/fast
can be ``None`` when the caller didn't supply the
corresponding inputs.
"""
from lerobot.utils.constants import ACTION # noqa: PLC0415
from ..pi05.modeling_pi05 import make_att_2d_masks # noqa: PLC0415
# ---- preamble (mirrors PI05Pytorch.forward) ------------------
actions = self.model.prepare_action(batch)
noise = self.model.sample_noise(actions.shape, actions.device)
time = self.model.sample_time(actions.shape[0], actions.device)
time_expanded = time[:, None, None]
x_t = time_expanded * noise + (1 - time_expanded) * actions
u_t = noise - actions
# ---- prefix: images + language + (optional FAST) -------------
images, img_masks = self.model._preprocess_images(batch)
lang_tokens = batch[OBS_LANGUAGE_TOKENS]
lang_masks = batch[OBS_LANGUAGE_ATTENTION_MASK]
prefix_embs, prefix_pad, prefix_att = self.model.embed_prefix(
images, img_masks, lang_tokens, lang_masks
)
non_fast_prefix_len = prefix_embs.shape[1] # images + language only
fast_len = 0
if action_tokens is not None and action_mask is not None:
emb_dim = prefix_embs.shape[-1]
fast_emb = self.model.paligemma_with_expert.embed_language_tokens(action_tokens)
fast_emb = fast_emb * math.sqrt(emb_dim)
fast_len = action_tokens.shape[1]
ones_att = torch.ones(
(action_tokens.shape[0], fast_len),
dtype=torch.bool,
device=prefix_embs.device,
)
prefix_embs = torch.cat([prefix_embs, fast_emb], dim=1)
prefix_pad = torch.cat([prefix_pad, action_mask.to(prefix_pad.dtype)], dim=1)
prefix_att = torch.cat([prefix_att, ones_att], dim=1)
# ---- suffix: noisy actions ----------------------------------
suffix_embs, suffix_pad, suffix_att, adarms_cond = self.model.embed_suffix(x_t, time)
# ---- bf16 alignment (mirrors PI05Pytorch.forward) -----------
first_layer = (
self.model.paligemma_with_expert.paligemma.model.language_model.layers[0]
)
if first_layer.self_attn.q_proj.weight.dtype == torch.bfloat16:
suffix_embs = suffix_embs.to(dtype=torch.bfloat16)
prefix_embs = prefix_embs.to(dtype=torch.bfloat16)
# ---- combined attention -------------------------------------
pad_masks = torch.cat([prefix_pad, suffix_pad], dim=1)
att_masks = torch.cat([prefix_att, suffix_att], dim=1)
att_2d_masks = make_att_2d_masks(pad_masks, att_masks)
# Critical: zero out suffix → FAST attention. Without this the
# action expert reads the FAST tokens and trivially decodes
# them back to the same continuous actions it's supposed to
# predict from noise. Cumulative-block attention from
# ``make_att_2d_masks`` doesn't enforce this on its own
# because suffix tokens have a strictly higher cumsum than
# FAST tokens and therefore attend to them by default.
if fast_len > 0:
fast_start = non_fast_prefix_len
fast_end = non_fast_prefix_len + fast_len # = prefix_pad.shape[1]
att_2d_masks[:, fast_end:, fast_start:fast_end] = False
position_ids = torch.cumsum(pad_masks, dim=1) - 1
att_2d_masks_4d = self.model._prepare_attention_masks_4d(att_2d_masks)
# ---- forward (capture BOTH expert outputs) ------------------
(prefix_out, suffix_out), _ = self.model.paligemma_with_expert.forward(
attention_mask=att_2d_masks_4d,
position_ids=position_ids,
past_key_values=None,
inputs_embeds=[prefix_embs, suffix_embs],
use_cache=False,
adarms_cond=[None, adarms_cond],
)
# ---- flow loss (mirrors PI05Pytorch.forward) ----------------
suffix_out_slice = suffix_out[:, -self.model.config.chunk_size :].to(
dtype=torch.float32
)
v_t = self.model.action_out_proj(suffix_out_slice)
flow_per_dim = F.mse_loss(u_t, v_t, reduction="none")
# Truncate to the actual action dimensionality (PI05 pads
# internally to max_action_dim).
original_action_dim = self.config.output_features[ACTION].shape[0]
flow_per_dim = flow_per_dim[:, :, :original_action_dim]
per_sample_flow = flow_per_dim.mean(dim=(1, 2))
flow_loss = _mask_per_sample(per_sample_flow, predict_actions_t)
# ---- text + FAST CE from prefix_out ------------------------
lm_head = self.model.paligemma_with_expert.paligemma.lm_head
text_loss: Tensor | None = None
if text_labels is not None and prefix_out is not None:
lang_len = text_labels.shape[1]
if fast_len > 0:
text_hidden = prefix_out[:, -(fast_len + lang_len) : -fast_len, :]
else:
text_hidden = prefix_out[:, -lang_len:, :]
text_logits = lm_head(text_hidden.to(lm_head.weight.dtype))
text_loss = _shifted_ce(text_logits, text_labels)
fast_loss: Tensor | None = None
if fast_len > 0 and prefix_out is not None:
fast_hidden = prefix_out[:, -fast_len:, :]
fast_logits = lm_head(fast_hidden.to(lm_head.weight.dtype))
fast_loss = _fast_ce(fast_logits, action_tokens, action_mask, predict_actions_t)
return flow_loss, text_loss, fast_loss
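# Illustrative sketch with toy lengths (not the real sequence sizes): the
# suffix → FAST mask override above on a small boolean mask. With 4 non-FAST
# prefix tokens, 2 FAST tokens and 3 suffix tokens, the suffix rows get their
# FAST columns cleared so the action expert can never read the discrete
# action tokens it is meant to predict from noise.
#
#   import torch
#
#   non_fast, fast_len, suffix = 4, 2, 3
#   total = non_fast + fast_len + suffix
#   att_2d = torch.ones(1, total, total, dtype=torch.bool)   # pretend fully visible
#   fast_start, fast_end = non_fast, non_fast + fast_len
#   att_2d[:, fast_end:, fast_start:fast_end] = False
#   assert not att_2d[0, fast_end:, fast_start:fast_end].any()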
def _compute_text_and_fast_loss(
self,
batch: dict[str, Tensor],
text_labels: Tensor | None,
action_tokens: Tensor | None,
action_mask: Tensor | None,
predict_actions_t: Tensor | None = None,
) -> tuple[Tensor | None, Tensor | None]:
"""Single prefix forward → text CE + FAST CE.
Embed [images, language] (and FAST when requested) once, run
one backbone forward, then slice the resulting hidden states
at the language and FAST positions to compute both CE losses.
Bit-equivalent to running the two losses in separate forwards
because the segment-aware ``make_att_2d_masks`` keeps FAST
tokens invisible to language tokens, so adding FAST to the
prefix doesn't perturb the hidden states at language positions.
Returns ``(text_loss, fast_loss)``. Either can be ``None`` if
the caller doesn't want that head.
"""
from ..pi05.modeling_pi05 import make_att_2d_masks # noqa: PLC0415
images, img_masks = self.model._preprocess_images(batch)
lang_tokens = batch[OBS_LANGUAGE_TOKENS]
lang_masks = batch[OBS_LANGUAGE_ATTENTION_MASK]
prefix_embs, prefix_pad, prefix_att = self.model.embed_prefix(
images, img_masks, lang_tokens, lang_masks
)
fast_len = 0
if action_tokens is not None and action_mask is not None:
emb_dim = prefix_embs.shape[-1]
fast_emb = self.model.paligemma_with_expert.embed_language_tokens(action_tokens)
fast_emb = fast_emb * math.sqrt(emb_dim)
fast_len = action_tokens.shape[1]
ones_att = torch.ones(
(action_tokens.shape[0], fast_len),
dtype=torch.bool,
device=prefix_embs.device,
)
full_embs = torch.cat([prefix_embs, fast_emb], dim=1)
full_pad = torch.cat([prefix_pad, action_mask.to(prefix_pad.dtype)], dim=1)
full_att = torch.cat([prefix_att, ones_att], dim=1)
else:
full_embs = prefix_embs
full_pad = prefix_pad
full_att = prefix_att
att_2d = make_att_2d_masks(full_pad, full_att)
position_ids = torch.cumsum(full_pad, dim=1) - 1
(vlm_out, _), _ = self.model.paligemma_with_expert.forward(
attention_mask=att_2d,
position_ids=position_ids,
past_key_values=None,
inputs_embeds=[full_embs, None],
use_cache=False,
)
if vlm_out is None:
raise RuntimeError(
"PI052 text+fast loss: VLM forward returned no hidden states."
)
lm_head = self.model.paligemma_with_expert.paligemma.lm_head
text_loss: Tensor | None = None
if text_labels is not None:
lang_len = text_labels.shape[1]
# embed_prefix lays out as [images, language]; with FAST
# appended the full sequence is [images, language, FAST].
if fast_len > 0:
text_hidden = vlm_out[:, -(fast_len + lang_len):-fast_len, :]
else:
text_hidden = vlm_out[:, -lang_len:, :]
text_logits = lm_head(text_hidden.to(lm_head.weight.dtype))
text_loss = _shifted_ce(text_logits, text_labels)
fast_loss: Tensor | None = None
if action_tokens is not None and action_mask is not None and fast_len > 0:
fast_hidden = vlm_out[:, -fast_len:, :]
fast_logits = lm_head(fast_hidden.to(lm_head.weight.dtype))
fast_loss = _fast_ce(fast_logits, action_tokens, action_mask, predict_actions_t)
return text_loss, fast_loss
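# Illustrative sketch with toy lengths: the negative-index slicing used above
# to pull the language and FAST positions out of a prefix laid out as
# [images, language, FAST].
#
#   import torch
#
#   img_len, lang_len, fast_len = 5, 3, 2
#   hidden = torch.arange(img_len + lang_len + fast_len)       # stand-in for vlm_out
#   text_hidden = hidden[-(fast_len + lang_len):-fast_len]     # positions 5, 6, 7
#   fast_hidden = hidden[-fast_len:]                           # positions 8, 9
#   assert text_hidden.tolist() == [5, 6, 7] and fast_hidden.tolist() == [8, 9]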
# ------------------------------------------------------------------
# select_message — AR text generation at inference
# ------------------------------------------------------------------
def select_message(
self,
batch: dict[str, Tensor],
*,
max_new_tokens: int = 128,
min_new_tokens: int = 0,
eos_token_id: int | None = None,
temperature: float = 0.0,
top_p: float = 1.0,
tokenizer: Any = None,
) -> str:
"""Generate text continuation from a multimodal prefix.
Mirrors ``SmolVLA2Policy.select_message`` so the same
:class:`lerobot.policies.smolvla2.inference.SmolVLA2Runtime`
can drive π0.5 v2 unchanged.
"""
self.eval()
if tokenizer is None:
from transformers import AutoTokenizer # noqa: PLC0415
tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
if eos_token_id is None:
eos_token_id = tokenizer.eos_token_id
special_ids: set[int] = set()
try:
for sid in (tokenizer.all_special_ids or []):
if sid is not None:
special_ids.add(int(sid))
except Exception: # noqa: BLE001
pass
if eos_token_id is not None:
special_ids.add(int(eos_token_id))
images, img_masks = self.model._preprocess_images(batch)
tokens = batch[OBS_LANGUAGE_TOKENS]
masks = batch[OBS_LANGUAGE_ATTENTION_MASK]
prefix_embs, prefix_pad_masks, prefix_att_masks = self.model.embed_prefix(
images, img_masks, tokens, masks
)
device = prefix_embs.device
bsize = prefix_embs.shape[0]
emb_dim = prefix_embs.shape[-1]
text_emb_scale = math.sqrt(emb_dim)
ones_step = torch.ones((bsize, 1), dtype=torch.bool, device=device)
current_embs = prefix_embs
current_pad = prefix_pad_masks
current_att = prefix_att_masks
generated: list[int] = []
from ..pi05.modeling_pi05 import make_att_2d_masks # noqa: PLC0415
backbone = self.model.paligemma_with_expert
lm_head = backbone.paligemma.lm_head
for _ in range(max_new_tokens):
att_2d = make_att_2d_masks(current_pad, current_att)
position_ids = torch.cumsum(current_pad, dim=1) - 1
(vlm_out, _), _ = backbone.forward(
attention_mask=att_2d,
position_ids=position_ids,
past_key_values=None,
inputs_embeds=[current_embs, None],
use_cache=False,
)
if vlm_out is None:
break
last = vlm_out[:, -1:].to(lm_head.weight.dtype)
logits_step = lm_head(last)[:, -1] # (B, V)
if special_ids and len(generated) < min_new_tokens:
for sid in special_ids:
logits_step[..., sid] = float("-inf")
next_ids = self._sample_next_token(logits_step, temperature, top_p)
tok_id = int(next_ids[0].item())
generated.append(tok_id)
if eos_token_id is not None and tok_id == eos_token_id:
break
# (B, 1) token ids — unsqueeze on dim 1 so this stays correct for batch sizes > 1.
new_emb = backbone.embed_language_tokens(next_ids.unsqueeze(1))
new_emb = new_emb * text_emb_scale
current_embs = torch.cat([current_embs, new_emb], dim=1)
current_pad = torch.cat([current_pad, ones_step], dim=1)
current_att = torch.cat([current_att, ones_step], dim=1)
decoded = tokenizer.decode(generated, skip_special_tokens=True).strip()
if not decoded and generated:
try:
self._last_select_message_debug = (
f"raw_ids={generated[:16]} "
f"decoded_w_special={tokenizer.decode(generated, skip_special_tokens=False)!r}"
)
except Exception: # noqa: BLE001
self._last_select_message_debug = f"raw_ids={generated[:16]}"
else:
self._last_select_message_debug = ""
return decoded
@staticmethod
def _sample_next_token(logits: Tensor, temperature: float, top_p: float) -> Tensor:
if temperature <= 0.0:
return logits.argmax(dim=-1)
scaled = logits / max(temperature, 1e-6)
probs = torch.softmax(scaled, dim=-1)
if top_p < 1.0:
sorted_p, sorted_ix = torch.sort(probs, descending=True, dim=-1)
cum = torch.cumsum(sorted_p, dim=-1)
mask = cum > top_p
mask[..., 0] = False
sorted_p = sorted_p.masked_fill(mask, 0.0)
sorted_p = sorted_p / sorted_p.sum(dim=-1, keepdim=True).clamp_min(1e-8)
choice = torch.multinomial(sorted_p, num_samples=1)
return sorted_ix.gather(-1, choice).squeeze(-1)
return torch.multinomial(probs, num_samples=1).squeeze(-1)
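# Illustrative sketch, standalone with hand-made numbers: the same top-p
# filtering _sample_next_token applies, shown on a toy distribution. With
# probabilities (0.5, 0.3, 0.15, 0.05) and top_p=0.8 only the first two
# tokens survive; the rest are zeroed before renormalisation.
#
#   import torch
#
#   probs = torch.tensor([[0.5, 0.3, 0.15, 0.05]])
#   sorted_p, sorted_ix = torch.sort(probs, descending=True, dim=-1)
#   cum = torch.cumsum(sorted_p, dim=-1)      # 0.50, 0.80, 0.95, 1.00
#   mask = cum > 0.8                          # False, False, True, True
#   mask[..., 0] = False
#   sorted_p = sorted_p.masked_fill(mask, 0.0)
#   sorted_p = sorted_p / sorted_p.sum(dim=-1, keepdim=True)
#   # sorted_p is now [0.625, 0.375, 0.0, 0.0] over token ids 0 and 1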
@@ -0,0 +1,196 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""π0.5 v2 pre/post-processor factory.
When ``config.recipe_path`` is set, the pre-processor pipeline becomes:
rename observations
add batch dim
relative-action prep (inherited from π0.5)
NormalizerProcessorStep
RenderMessagesStep: recipe → messages, target_message_indices,
message_streams (PR 1 of the steerable stack)
PI052TextTokenizerStep: messages → input_ids + label mask +
predict_actions
DeviceProcessorStep
When ``recipe_path`` is ``None`` we delegate to the plain π0.5 pipeline
so unannotated datasets keep working.
Post-processor is unchanged from π0.5.
"""
from __future__ import annotations
from pathlib import Path
from typing import Any
import torch
from lerobot.configs.recipe import TrainingRecipe
from lerobot.processor import (
AbsoluteActionsProcessorStep,
ActionTokenizerProcessorStep,
AddBatchDimensionProcessorStep,
DeviceProcessorStep,
NormalizerProcessorStep,
PolicyAction,
PolicyProcessorPipeline,
RelativeActionsProcessorStep,
RenameObservationsProcessorStep,
RenderMessagesStep,
UnnormalizerProcessorStep,
policy_action_to_transition,
transition_to_policy_action,
)
from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME
from ..pi05.processor_pi05 import make_pi05_pre_post_processors
from .configuration_pi052 import PI052Config
from .text_processor_pi052 import PI052TextTokenizerStep
def make_pi052_pre_post_processors(
config: PI052Config,
dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
dataset_repo_id: str | None = None,
) -> tuple[
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""Build PI0.5-v2's pre/post-processor pipelines.
Falls through to π0.5's stock pipeline when ``recipe_path`` is unset.
"""
if not config.recipe_path:
return make_pi05_pre_post_processors(config, dataset_stats=dataset_stats)
recipe = _load_recipe(config.recipe_path)
relative_step = RelativeActionsProcessorStep(
enabled=config.use_relative_actions,
exclude_joints=getattr(config, "relative_exclude_joints", []),
action_names=getattr(config, "action_feature_names", None),
)
input_steps = [
RenameObservationsProcessorStep(rename_map={}),
AddBatchDimensionProcessorStep(),
relative_step,
NormalizerProcessorStep(
features={**config.input_features, **config.output_features},
norm_map=config.normalization_mapping,
stats=dataset_stats,
),
RenderMessagesStep(recipe=recipe),
PI052TextTokenizerStep(
tokenizer_name="google/paligemma-3b-pt-224",
max_length=config.tokenizer_max_length,
plan_dropout_prob=getattr(config, "plan_dropout_prob", 0.0),
memory_dropout_prob=getattr(config, "memory_dropout_prob", 0.0),
subtask_dropout_prob=getattr(config, "subtask_dropout_prob", 0.0),
),
]
# FAST tokenizer for discrete-action CE supervision (paper §III.C).
# Only inserted when explicitly enabled — keeps the post-training-
# style recipe (flow + text) as the default. When on, the step
# writes ACTION_TOKENS / ACTION_TOKEN_MASK into
# ``COMPLEMENTARY_DATA`` and the modeling forward picks them up.
if getattr(config, "enable_fast_action_loss", False):
# Per Pertsch et al. 2025 (FAST [64], π0.5 §III.C): fit the
# tokenizer on this dataset's action distribution rather than
# using the universal codebook off the shelf. We do this once
# and cache to disk, keyed on (dataset, base, n_samples).
action_tokenizer_path = config.action_tokenizer_name
if (
getattr(config, "auto_fit_fast_tokenizer", False)
and dataset_repo_id is not None
):
from .fit_fast_tokenizer import fit_fast_tokenizer # noqa: PLC0415
cache_dir = Path(config.fast_tokenizer_cache_dir).expanduser()
try:
action_tokenizer_path = fit_fast_tokenizer(
dataset_repo_id=dataset_repo_id,
cache_dir=cache_dir,
base_tokenizer_name=config.action_tokenizer_name,
n_samples=config.fast_tokenizer_fit_samples,
chunk_size=config.chunk_size,
)
except Exception as exc: # noqa: BLE001
import logging # noqa: PLC0415
logging.getLogger(__name__).warning(
"FAST tokenizer fit failed (%s) — falling back to "
"the universal base tokenizer %r. Train will still "
"work but compression will be suboptimal.",
exc, config.action_tokenizer_name,
)
input_steps.append(
ActionTokenizerProcessorStep(
action_tokenizer_name=action_tokenizer_path,
max_action_tokens=config.max_action_tokens,
fast_skip_tokens=config.fast_skip_tokens,
paligemma_tokenizer_name="google/paligemma-3b-pt-224",
)
)
input_steps.append(DeviceProcessorStep(device=config.device))
output_steps = [
UnnormalizerProcessorStep(
features=config.output_features,
norm_map=config.normalization_mapping,
stats=dataset_stats,
),
AbsoluteActionsProcessorStep(
enabled=config.use_relative_actions,
relative_step=relative_step,
),
DeviceProcessorStep(device="cpu"),
]
return (
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
steps=input_steps,
name=POLICY_PREPROCESSOR_DEFAULT_NAME,
),
PolicyProcessorPipeline[PolicyAction, PolicyAction](
steps=output_steps,
name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
to_transition=policy_action_to_transition,
to_output=transition_to_policy_action,
),
)
def _load_recipe(path_str: str) -> TrainingRecipe:
"""Resolve ``path_str`` to a ``TrainingRecipe``.
Accepts an absolute path or a path relative to
``src/lerobot/configs/`` (same lookup rules as
``make_smolvla2_pre_post_processors``).
"""
p = Path(path_str)
if not p.is_absolute() and not p.exists():
from lerobot.configs import recipe as _recipe_module # noqa: PLC0415
configs_dir = Path(_recipe_module.__file__).resolve().parent
candidate = configs_dir / path_str
if candidate.exists():
p = candidate
return TrainingRecipe.from_yaml(p)
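# Illustrative sketch with hypothetical values (file name and dataset id are
# made up): building the pre/post-processor pair for a recipe-annotated
# dataset. The recipe path resolves either absolutely or relative to
# src/lerobot/configs/ via _load_recipe above.
#
#   cfg = PI052Config(recipe_path="recipes/pi052_steerable.yaml")
#   preprocessor, postprocessor = make_pi052_pre_post_processors(
#       cfg,
#       dataset_stats=dataset_stats,        # normalization stats from the dataset
#       dataset_repo_id="user/my_dataset",  # only used for auto_fit_fast_tokenizer
#   )
#   # preprocessor renders recipe messages, tokenizes text (+ optional FAST
#   # action tokens) and moves the batch to cfg.device; postprocessor
#   # unnormalizes predicted actions back to the robot frame on CPU.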

Some files were not shown because too many files have changed in this diff.