Compare commits

...

358 Commits

Author SHA1 Message Date
pepijn223 3427499212 feat(pi052): condition low-level prompt on state + fix eval slowdown
- Inject discretized proprioceptive state (256 bins, pi05 format) into
  low-level action-conditioning prompts in both training
  (PI052TextTokenizerStep) and eval (_with_low_level_subtask_prompt),
  matching the recipe's documented "[images, subtask, state]" intent.
  Higher-level subtask/memory text streams stay state-free.
- Cache the loc-token tokenizer (_get_loc_tokenizer) instead of reloading
  it from disk on every _build_text_batch/select_message call (it ran
  twice per env per replan and dominated eval runtime).
- Add a KV cache to select_message decode (bit-identical output to the
  recompute path) to avoid O(n^2) generation.

Net: pi052 eval ~2.9 s/it -> ~0.1 s/it (~25x).
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-14 13:57:55 +02:00
Pepijn c5965d4971 Merge branch 'main' into feat/smolvla-on-steerable 2026-06-08 11:02:54 +02:00
Maxime Ellerbach 09808183ca feat(rollout): adding episodic strategy (#3717)
* feat(rollout): adding legacy strategy

* adding legacy to existing tests

* updating docs and docstring

* changing misleading docstring

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* adding extra guard like dagged with try except finally

* Potential fix for pull request finding

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* adding reset to initial position

* moving smooth teleop handover to control_utils and adding this behavior to legacy strategy

* reducing duration of the handover

* * renaming to episodic
* changing semantics of the docstring
* fixing leader - follower handover disable torque
* adding optionnal config to disable handover

* wiring the smooth_leader_follower_handover config

* renaming config smooth_leader_to_follower_handover

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
2026-06-06 00:32:38 +02:00
pepijn223 470fdd195d fix(ema): default EMA decay to 0.99
Matches openpi's top-level default (ema_decay=0.99, ~last 100 steps).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-05 16:10:00 +02:00
pepijn223 384feca91a fix(ema): default EMAConfig.enable to False (opt-in)
EMA was on by default, so every training run on the branch (incl. VLA-JEPA
and other non-flow-matching policies) created a full fp32 shadow copy. EMA
only benefits flow-matching/diffusion policies (pi0/pi05/pi052). Make it
opt-in via --ema.enable=true; the pi05/pi052 recipes already pass that flag.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-05 16:09:08 +02:00
pepijn223 7b35af6eca Merge remote-tracking branch 'origin/main' into feat/smolvla-on-steerable
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	uv.lock
2026-06-05 14:38:47 +02:00
pepijn223 aca02ff24c fix(robocasa): align env state/action order to openpi/robocasa convention
LeRobot's RoboCasaEnv used a divergent flat state/action layout vs the
robocasa package (robocasa.utils.env_utils.convert_action) and the openpi
robocasa pipeline. This scrambles I/O when using openpi-convention checkpoints
(e.g. the JAX->PyTorch->LeRobot converted pi05 robocasa model: CloseFridge
20% -> 60% once both orders match openpi).

- convert_action: ee_pos(3)+ee_rot(3)+gripper(1)+base_motion(4)+control_mode(1)
- observation.state: ee_pos_rel(3)+ee_rot_rel(4)+base_pos(3)+base_rot(4)+gripper(2)

Matches openpi examples/robocasa/main.py + RobocasaInputs ordering.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-05 13:47:43 +02:00
pepijn223 de7ba67556 style: drop decorative === comment banners from pi052 split
Replace the === separator banners (against repo style) with plain comments.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 20:21:10 +02:00
pepijn223 c020c0d053 refactor(pi052): split pi05_backbone into pi_gemma + modeling_pi052
Eliminate the standalone pi052/pi05_backbone.py by distributing its contents:
- Generic dual-expert transformer machinery -> lerobot/policies/pi_gemma.py
  (sdpa_attention_forward, compute_layer_complete, PaliGemmaWithExpertModel,
  get_gemma_config; the openpi width/depth config is renamed GemmaConfig ->
  GemmaVariantConfig to avoid clashing with transformers' GemmaConfig). These
  sit next to the existing PiGemma layer code they already depend on.
- pi052-specific model + helpers -> pi052/modeling_pi052.py (PI05Pytorch,
  ActionSelectKwargs, make_att_2d_masks, pad_vector, resize_with_pad_torch,
  create_sinusoidal_pos_embedding, sample_beta, get_safe_dtype).

DEFAULT_IMAGE_SIZE is duplicated as a plain constant in pi_gemma to avoid a
pi_gemma -> pi05 import cycle. Additive to pi_gemma; pi0/pi05 unaffected.
Verified bit-exact on pepijn223/pi052_robocasa_full (embed/predict/forward
identical) and all 34 pi052 tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 20:18:18 +02:00
pepijn223 4cbd91a04e chore: drop one-off bench/build/train scripts from the PR
Remove development-only tooling that doesn't belong in the PR:
- examples/benchmark/* (pi052 step/kernel benchmark slurm + harness)
- examples/port_datasets/slurm_build_robocasa_composite_seen.py and
  src/lerobot/scripts/build_robocasa_composite_seen.py (composite_seen
  dataset build scripts)
- scripts/build_episode_filter.py, scripts/build_robocasa_smoke.sh,
  scripts/train_pi052_human300_exclude_unannotated.sh

None are imported by the library, tests, or entry points.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 20:05:25 +02:00
pepijn223 afe30630cc test(pi052): repair stale-name CE tests for fused linear CE
_fast_ce/_shifted_ce were renamed to _fast_lin_ce/_shifted_lin_ce and changed
from logits-based to Liger fused-linear-CE (hidden @ lm_head_weightᵀ). Update
the tests via thin adapters that pass an identity lm_head_weight (so the
computed logits equal the provided ones), run on CUDA (Liger is GPU-only) and
skip otherwise, and loosen the allclose tolerance to absorb GPU-vs-CPU float
noise on the tiny losses.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 20:03:18 +02:00
pepijn223 a594ad7969 refactor(pi052): self-contained policy; revert pi0/pi05 to upstream main
The smolvla branch had modified the shared pi0/pi05 modeling + pi05 config to
support pi052 (SDPA attention, layernorm/lm_head handling, optimizer
foreach/fused/lm_head_lr_scale, embedding scaling). Decouple pi052 instead:

- Vendor the PI0.5 backbone (PaliGemmaWithExpertModel, PI05Pytorch, helpers)
  into pi052/pi05_backbone.py (verbatim copy, no PI05Policy).
- Flatten PI052Policy to subclass PreTrainedPolicy directly (no longer
  PI05Policy); inline the needed PI05Policy methods.
- Restore optimizer_foreach/fused + get_optimizer_preset on PI052Config.
- Revert pi0, pi0_fast, pi05 modeling and configuration_pi05 to origin/main
  (byte-identical), so the shared policies carry no smolvla modifications.

Behavior verified bit-exact on pepijn223/pi052_robocasa_full: embed_language_
tokens, predict_action_chunk, and the fused flow+text+FAST training loss are
identical before/after (max_abs_diff=0). pi052 tests pass (pre-existing
stale-name collection errors unchanged).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 19:59:27 +02:00
Maxime Ellerbach 2e9cd87bbd feat(policies): add VLA-JEPA (#3568)
* first commit

* feat(policies): add VLA-JEPA

* feat(policies): add VLA-JEPA

* support vla_jepa

* (feat)policies: add VLA-JEPA

* linting

* adding deps to pyproject.toml

* updating uv lock

* adding guards to avoid needing transformers and diffusers for type checking and basic tests

* fixing action and state dim

* fix warnings with qwen processor kwargs

* fixing wm_loss not propagating

* adjusting obs steps, tublets size to match original implementation

* some more fixes to be closer to the original implem

* adding more tests to ensure good coverage

* align VLA-JEPA architecture with original checkpoint

- Remove stale `action_num_heads` / `action_attention_head_dim` config fields;
  DiT head dimensions are now always derived from the preset (DiT-B/L/test).
- Add `num_target_vision_tokens` and `action_max_seq_len` config fields required
  by the action head's future-token embedding and positional embedding tables.
- Fix default `qwen_model_name` to 2B (matches all released checkpoints).
- Rename `ActionEncoder` attrs w1/w2/w3 → layer1/layer2/layer3 to match
  checkpoint key names; replace `nn.Sequential` decoder/state-encoder with
  `_MLP2` (layer1/layer2 naming).
- Fix `VLAJEPAActionHead` to size ActionEncoder and StateEncoder at `inner_dim`
  (DiT input width) rather than `action_hidden_size` (DiT output width).
- Rename `DiT.blocks` → `transformer_blocks` and `attn` → `attn1` to match
  checkpoint; add alternating cross/self attention (even blocks cross-attend to
  Qwen context, odd blocks self-attend).
- Add `DiT-test` preset for unit tests.
- Rewrite `ActionConditionedVideoPredictor` with explicit ViT-style blocks
  (`_PredictorBlock` with fused qkv) to match checkpoint structure; rename
  `encoder`/`norm`/`proj` → `predictor_blocks`/`predictor_norm`/`predictor_proj`.

* propagate action_is_pad masking through VLA-JEPA policy pipeline

Pass the `action_is_pad` tensor from the batch through to the action head
so padded timesteps are excluded from the flow-matching loss.

* update VLA-JEPA tests for arch changes and action_is_pad

- Switch conftest to use `action_model_type="DiT-test"` now that
  `action_num_heads` / `action_attention_head_dim` have been removed.
- Add action_head tests covering fully-padded loss (zero) and equivalence
  of action_is_pad=None vs all-zeros mask.
- Remove obsolete `test_native_to_lerobot_wm_only` test.

* add VLA-JEPA documentation

Covers architecture overview, pretrained checkpoints, config reference,
training/eval commands for LIBERO-10, and guidance on fine-tuning for
single-camera datasets.

* add one-shot script to convert ginwind/VLA-JEPA checkpoints to safetensors (will remove once migrated)

* make default params more aligned with paper and pretrained models
- adding possibility of freezing qwen backbone and world model
- added tests for weight loading

* trying out to re-init the action head to avoid pretraining dimension mismatch

* allow different state dim and action dim

* removing missleading future_action_window_size to just use chunk_size

* lots of changes to make existing weights work, need to massively refactor the pre and post processing

* refactoring into using pre and post processor

* pre-commit cleanup

* fixing doc defaults args

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* adressing dtype zeros issue

* adding guard for diffusers

* fixing training and exal examples

* trying to close success rate gap

* fix qwen norm layer output libero eval is now as expected

* adding instructions for different embodiement + fixing some tests

* smol fix to avoid having default CPU device when training

* fixing misconception about multiview / singleview handling

* removing conversion script

* adding licences

* adding .mdx docs and shortening polivy_vla_jepa_README.md

* removing useless pre-processor

* cleanup

* removing swish in favor of silu

* adding configuration gripper index and threshold

* fixing simlink

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: ginwind <ginwind@mail.ustc.edu.cn>
2026-06-04 19:22:51 +02:00
pepijn223 8292548f0d fix(pi052): stop double-scaling FAST/text token embeddings
embed_language_tokens already applies Gemma's sqrt(hidden) normalizer
(GemmaTextScaledWordEmbedding, transformers >=5.4.0). pi052 multiplied FAST
action-token and autoregressive subtask-text embeddings by sqrt(emb_dim) on
top of that, double-scaling them (~2048x). Remove the manual scaling so FAST
and text tokens are single-scaled, consistent with the pi05 fix and OpenPI.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 18:31:41 +02:00
pepijn223 77cc35b932 fix(pi0,pi05,pi0_fast): stop double-scaling text embeddings
transformers >=5.4.0 (PR #44432) makes Gemma's embed_tokens a
GemmaTextScaledWordEmbedding that already multiplies token embeddings by
sqrt(hidden_size). The manual `* sqrt(embed_dim)` applied on top therefore
double-scaled text (~2048x instead of ~45x), breaking VLM alignment for
models trained/run on stock transformers. Remove the manual scaling and rely
on embed_tokens' internal normalizer (matches main #3603). Image features
stay raw (un-normalized), as before.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 18:22:34 +02:00
pepijn223 f0757fc707 fix(pi0,pi0_fast): scale text embeddings by sqrt(embed_dim) to match OpenPI
OpenPI (pi0 and pi0-FAST) multiplies language token embeddings by
sqrt(embed_dim) — the Gemma embedder normalizer — before the transformer.
LeRobot pi0/pi0_fast omitted it, leaving text tokens ~45x under-scaled
relative to the residual stream (same class of bug as the pi05 image
scaling). pi0: applied in embed_prefix's lang_embed_func. pi0_fast:
applied inside embed_language_tokens so prompt, FAST action tokens, and
autoregressive next-token embeds are all scaled consistently.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 18:14:27 +02:00
pepijn223 a48d4e32a1 fix(pi05): don't scale image features by sqrt(hidden_size)
lerobot/pi05_base was trained in the OpenPI/big_vision regime where image
(soft) tokens are NOT multiplied by the Gemma embedder normalizer
(sqrt(hidden_size)) — only text tokens are. Scaling image features here
over-scaled them ~45x, breaking the pretrained vision-language alignment
and yielding ~0% closed-loop success on RoboCasa across all pi05 runs.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 17:20:34 +02:00
Pepijn 9596e3d53f Merge remote-tracking branch 'origin/feat/smolvla-on-steerable' into feat/smolvla-on-steerable 2026-06-04 17:14:33 +02:00
Pepijn 0a6a799317 Merge feat/language-annotation-pipeline into feat/smolvla-on-steerable
Bring the authoritative annotation pipeline from the annotation branch.
The annotation surface is forced to EXACTLY match feat/language-annotation-
pipeline (the annotation branch is the source of truth for annotation
code), which also removes smolvla's stale copies:
  - deleted: steerable_pipeline/vocabulary.py, tests/annotations/test_
    vocabulary.py, prompts/module_0_vocabulary.txt, module_1_action_record
    .txt, module_3_vqa.txt, module_1_plan.txt, and the old module_* prompt
    names (now plan_*/interjections_*/vqa.txt).
  - synced: all of src/lerobot/annotations/, lerobot_annotate.py,
    examples/annotations/, tests/annotations/, datasets/language.py,
    tests/datasets/test_language.py, docs/annotation_pipeline.mdx.

Non-annotation conflicts resolved by union (keeping both branches' intent):
  - pyproject.toml: keep smolvla's pi extra (+sentencepiece) and add the
    molmoact2 extra from main.
  - policies/factory.py: keep both dataset_repo_id (pi052 FAST tokenizer)
    and dataset_meta (both are referenced); union the policy-type docstring.
  - scripts/lerobot_train.py: keep smolvla's pi052 / use_relative_actions
    processor-rebuild block.
  - uv.lock: regenerated from the merged pyproject.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 17:13:36 +02:00
pepijn e660a51e78 pi052(debug): drop misleading inference/parity dump from text preds
The first-token parity check re-tokenized the decoded (stripped) inference
string, so the leading-space SentencePiece variant always mismatched the
training argmax — a false "DIVERGED" alarm. Remove the autoregressive
inference print and parity comparison (and the now-dead per-sample
select_message generation), keeping only the prompt, ground-truth target,
and teacher-forced argmax accuracy.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-04 13:32:44 +00:00
Pepijn cdd94a703f annotate(config): tighten field comments to one line each
Collapse the remaining multi-line field comments / docstrings in config.py
to single lines (or two where a knob genuinely needs it), keeping the
essential rationale. Comments only — no field or behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 15:12:31 +02:00
Pepijn cd59c8b312 annotate: remove the action_record style/feature entirely
Drop the optional structured per-subtask action records — not a feature
we want to ship.

  * language.py: remove 'action_record' from CORE_STYLES + PERSISTENT_STYLES
    (and the matching assertion in tests/datasets/test_language.py).
  * config.py: delete ActionRecordsConfig (verb/grasp vocabularies,
    frames_per_subtask, emit_record_row) and the PlanConfig.action_records
    field.
  * plan_subtasks_memory.py: delete _extract_action_record and the
    run_episode block that emitted style='action_record' rows; drop the
    now-unused json / to_image_blocks imports.
  * remove the plan_action_record.txt prompt.
  * run_hf_job.py: drop the action_records comment.

Verified: 40 tests pass; pre-commit (ruff, mypy, bandit) clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 14:40:34 +02:00
Pepijn 99baae012f annotate(config): further compact field comments
Tighten the remaining multi-line comment blocks in config.py (derive_task,
frames/window, describe_first, action-record/vqa/vlm fields, video_backend,
repo ids, executor) to 1-3 lines each. Also fix a stale path typo
('examples/annotation' -> the docstring now just says HF Jobs). Comments
only — no field or behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 14:36:02 +02:00
Pepijn 973318ef65 annotate: dedup task_aug + row-normalization; docs module on/off table
Two behavior-preserving simplifications:
  * plan_subtasks_memory.run_episode: the task_aug 'axes' and free-form
    branches built identical deduped rows via copy-pasted seen/append
    loops. Collapse to one branch that picks the variant source, then a
    shared _task_aug_rows() helper does the dedup + row build (-~25 LOC).
  * writer: _normalize_persistent_row / _normalize_event_row shared the
    same camera-validate + struct construction. Extract _normalize_row(),
    keeping the exact key order (the parquet struct schema is inferred
    from insertion order, so timestamp must stay between style and camera).

docs: 'Which modules run' is now a table giving each module's on/off flag
(--plan.enabled / --interjections.enabled / --vqa.enabled) and what it
turns off.

Verified: 40 tests pass (incl. test_writer struct round-trip); pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 14:18:36 +02:00
Pepijn 7471a6b1ed annotate: compress conftest + pyproject comments (fix stale backend note)
The pyproject annotations-extra comment still described the removed
vllm/transformers in-process backends ('vllm preferred ... transformers
fallback', '_make_vllm_client'); rewrite it for the openai-only reality
and trim it. Also condense the conftest lazy-import NOTE. Comments only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 14:12:04 +02:00
Pepijn 20c7a12dd5 annotate: remove dead code, document CLI options, compact config
Dead code (defined but never referenced anywhere in src/tests/examples):
  * reader.py: keyframe_indices, episode_frame_timestamps, lookup_data_path,
    and the now-orphaned gather_data_paths + episode_offsets_per_path
    (lookup_data_path was their only caller).
  * staging.py: iter_staged_episodes.
  * writer.py: normalize_rows_for_writer.
  * config.py VlmConfig: json_mode, batch_size, tensor_parallel_size,
    gpu_memory_utilization, trust_remote_code — consumed only by the
    in-process vllm/transformers backends that were removed; the openai
    auto-serve path carries those vLLM flags via serve_command instead.
    Kept max_model_len (still used as the serve-command default).
  * config.py TaskAugAxesConfig.total property.

Docs: new 'Key options' section in annotation_pipeline.mdx — grouped
tables (dataset in/out, module toggles, --vlm.*, --plan.*, interjections
+ vqa) describing the flags users actually reach for, with defaults.

config.py: compact the verbose field comments + ActionRecordsConfig /
TaskAugAxesConfig docstrings; fix two stale 'verify' references (the
verify pass was removed — it's describe -> segment now) and the stale
'renders record back to subtask text' note (that path was removed).
vlm_client docstring no longer mentions the removed json_mode field.

Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (40 passed); pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 14:05:46 +02:00
Pepijn dbe02f0c4f annotate(plan): condense verbose comments + docstrings
Trim the long inline comment blocks (effective_task / task_aug, action
records, plan-boundary rows, plan-update span closing, windowed +
coverage-stitch sections) and the _generate_plan / run_plan_updates
docstrings to a few lines each. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 13:52:24 +02:00
Pepijn 56cbb5f9ec annotate(example): trim run_hf_job comments to one line each
Same flags and rationale, condensed — each plan-module flag now has a
short one/two-line comment instead of a paragraph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 13:48:55 +02:00
Pepijn 2af2402a0c docs(annotate): cleaner architecture diagram layout
Top-down flow (read episodes → 3 modules fan out → validator → writer →
parquet) with aligned boxes, instead of the cramped bordered version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 11:59:31 +02:00
Pepijn 7bec991cdf docs(annotate): friendlier rewrite + architecture diagram; drop reproducibility section
Rewrite annotation_pipeline.mdx in plainer, easier-to-read language
(shorter sentences, active voice, a plain-text intro), add an ASCII
'How it fits together' architecture diagram, and remove the
'Reproducibility via seed and prompt hashes' section. Content/links are
preserved; only wording and structure change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 11:48:59 +02:00
Pepijn c6f682b3f4 annotate docs: install lerobot from main (post-merge wording)
The example already pins '@main'; update the doc step and the script
docstring from 'the branch under test' to 'lerobot (from main)' now that
the pipeline is merging to main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-04 11:45:38 +02:00
Pepijn eba3ab3741 annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup
Bugs
  * validator: don't re-raise on unknown style. The second column_for_style
    lookup (used to route persistent vs event) now sits in try/except so an
    unknown style is recorded by _check_column_routing and skipped instead
    of crashing the whole validation pass.
  * general_vqa._target_cameras: when restrict_to_default_camera is set but
    the configured camera_key isn't one the provider exposes, warn and fall
    back to all cameras instead of returning a phantom key that KeyErrors
    deep in frame decode.
  * interjections: clamp interjection timestamps to frame_timestamps[0]
    rather than a hardcoded 0.0 (datasets can start at non-zero t).

Docs / code drift
  * annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase
    0 / --vocabulary.* / canonical_vocabulary.json' section (none of it
    exists); describe the real describe->segment + coverage-stitch flow.
    Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of
    this PR' (matches tools.mdx, which already marks the runtime layer as
    not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note
    the default is now a single h200. Add a 'Contributing new modules'
    section inviting module / prompt / quality contributions.
  * executor docstring: six phases, no phantom phase 0.

run_hf_job.py
  * add the Apache 2.0 license header (was flagged repeatedly).
  * default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1
    (scale to h200x4 noted in the docstring).
  * pin the install to @main instead of the feature branch (won't break
    after merge).

Naming / cleanup
  * rename dest_repo_id -> new_repo_id across config / script / example /
    test to match the LeRobot dataset edit tools.
  * rename prompt templates module_N_*.txt -> descriptive (plan_*,
    interjections_*, vqa.txt) and update every load_prompt() call.
  * remove dead _messages_to_prompt (used only by the removed in-process
    backends).
  * declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as
    real init=False dataclass fields instead of getattr monkey-patches.
  * scope bandit B607 to the two ffmpeg subprocess.run sites via
    '# nosec B607' and drop it from the global skip list.

Tests
  * fix stale canned-VLM markers ('ONE realistic interruption' ->
    'compact interjection', 'Update the memory' -> 'compressed semantic
    memory') and drop the dead 'concise hierarchical PLAN' plan responders
    (plan generation is deterministic now) in run_e2e_smoke,
    test_pipeline_recipe_render, test_modules.
  * run_e2e_smoke now asserts interjection + speech rows are produced so a
    stale marker can't silently pass again.
  * drop remaining 'PR 1' / 'PR 2' references from test comments / names.

Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke
(interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit,
prettier) clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 18:30:46 +02:00
Pepijn 3a24e426df language: register action_record in CORE_STYLES so STYLE_REGISTRY contains it
action_record is in PERSISTENT_STYLES but was missing from CORE_STYLES,
so STYLE_REGISTRY (= CORE_STYLES | EXTENDED_STYLES) didn't contain it and
the PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY invariant in
test_style_registry_routes_columns failed. Add it to CORE_STYLES so the
registry, the persistent-set, and column_for_style() stay consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 16:38:06 +02:00
Pepijn b9a0187335 annotate: drop local in-process VLM backends — HF Jobs (openai) only for now
The shipped workflow is Hugging Face Jobs (examples/annotations/run_hf_
job.py): it serves the model with vLLM in the vllm/vllm-openai image and
the pipeline talks to it over the OpenAI-compatible API. The in-process
vllm / transformers local backends added surface (and the vllm
one pinned an old torch) without being part of that path, so they're
removed for now.

  * vlm_client.make_vlm_client: keep only backend='openai' (+ 'stub'
    rejected with the usual guidance). Requesting 'vllm'/'transformers'
    now raises a clear 'not supported for now — use the HF Jobs flow'
    error. Removed _make_vllm_client and _make_transformers_client.
  * config: backend docstring updated (openai-only); default model_id
    bumped to Qwen/Qwen3.6-27B to match run_hf_job.
  * docs/annotation_pipeline.mdx: remove the '## Running locally'
    section; the launcher description now says one vLLM server per GPU
    over the OpenAI API, and the 'One Qwen-VL pass' note drops the
    'vLLM/transformers fallback' wording.

Tests are unaffected (they construct StubVlmClient directly; nothing
referenced the removed backends).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 16:28:40 +02:00
Pepijn a18d969753 tests(annotations): fix stale canned-VLM markers + action_record style assertion
The annotation tests had never actually run in CI (collection failed on
the missing 'datasets' extra); now that they do, three stale assertions
surfaced against the evolved pipeline:

  * test_module1_plan_memory_subtask_smoke: the memory canned-responder
    marker 'Update the memory' no longer appears in module_1_memory.txt
    (now 'compressed semantic memory'), so the stub returned no memory
    row and the {subtask,plan,memory} subset check failed. Marker
    updated to match the current prompt.
  * test_module2_mid_episode_emits_paired_interjection_and_speech: the
    interjection marker 'Write ONE interjection' is now 'Write ONE
    compact interjection' in module_2_interjection.txt, so 0 interjections
    were emitted. Marker updated.
  * tests/datasets/test_language.py::test_style_registry_routes_columns:
    PERSISTENT_STYLES gained 'action_record' in this PR; add it to the
    expected set.

These are test/prompt-marker syncs — no production behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 16:21:17 +02:00
Pepijn 273a8fc335 deps(annotations): drop hard vllm dependency to unblock CI torch/torchcodec resolution
Fast Pytest 'dataset' tier failed collecting tests/datasets/test_video_
decoder_cache.py with 'Could not load libtorchcodec ... undefined symbol:
torch_dtype_float4_e2m1fn_x2' — a torch/torchcodec ABI mismatch.

Root cause: the annotations extra's vllm hard-pins an older torch
(via xformers/xgrammar -> torch 2.8). uv resolves a SINGLE unified lock
across all extras, so vllm capped torch to 2.8 for every tier —
including dataset, whose torchcodec 0.11.1 needs torch 2.11. The
result was torch 2.8 + torchcodec 0.11.1 installed together -> ABI break.
(main has no vllm, so it resolves torch 2.11 + torchcodec 0.11.1 cleanly.)

Fix: remove vllm from the annotations extra. It is not needed by
the shipped workflow — examples/annotations/run_hf_job.py gets vllm from
the vllm/vllm-openai image and talks to it over the OpenAI-compatible
API (--vlm.backend=openai), and vlm_client._make_vllm_client imports vllm
lazily. For the in-process --vlm.backend=vllm path, install vllm
separately (the ImportError now says so).

After the fix uv resolves torch 2.11.0 + torchcodec 0.11.1 (matching
main); uv lock --check is clean. The annotations extra still provides
datasets / transformers / openai.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 16:09:22 +02:00
Pepijn b9246ef61b tests(annotations): guard on the 'dataset' extra so base fast-test tier skips cleanly
Fast Pytest Tests failed at COLLECTION in the base '--extra test' tier
with 'ModuleNotFoundError: No module named datasets': tests/annotations/
conftest.py imported the fixture dataset builder (-> lerobot.datasets ->
the HF 'datasets' lib + pandas/pyarrow), which only ship under the
'dataset' extra, so the whole annotations package crashed.

Fix uses the repo's proven module-level guard pattern (see
tests/datasets/test_language.py), NOT a conftest-level importorskip —
verified empirically that pytest.importorskip raised during conftest
*import* is treated as a collection ERROR (exit 1), while module-level
importorskip is a clean SKIP.

  * conftest.py: import build_annotation_dataset LAZILY inside the
    fixtures so the conftest itself imports cleanly in every tier.
  * test_modules / test_validator / test_writer / test_pipeline_recipe_
    render: add module-level pytest.importorskip('datasets') +
    ('pandas') before the pyarrow / lerobot.* imports (# noqa: E402 to
    match the existing convention). pyarrow-importing modules place the
    guard before the pyarrow import.
  * tests/scripts/test_lerobot_annotate.py: same guard (its _push_to_hub
    path imports lerobot.datasets).

Result:
  - base / hardware / viz tiers (no dataset extra): annotation tests
    skip cleanly; the rest of the suite runs -> exit 0.
  - dataset tier: datasets present -> guards pass through -> annotation
    tests run with the stub VLM. The pipeline modules import only
    stdlib + relative + lerobot.datasets (no module-level datatrove /
    vllm / openai), so they import fine there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 15:57:04 +02:00
Pepijn 870980efd6 Merge branch 'main' into feat/language-annotation-pipeline 2026-06-03 15:46:13 +02:00
Jaimin d1b1c5c8cf docs: fix broken dataset script paths (datasets/v30 -> scripts) (#3695)
The docs pointed at src/lerobot/datasets/v30/, which does not exist.
Both scripts actually live in src/lerobot/scripts/:

- convert_dataset_v21_to_v30.py
- augment_dataset_quantile_stats.py

Updated the four references (one python -m module path and three
file-path invocations) to the correct location, matching each
script's own usage docstring.
2026-06-03 14:48:19 +02:00
Nikodem Bartnik 741c2d0a39 Docs/add lelab (#3707)
* first text draft (no images)

* simplified docs

* fix formatting

* add youtube video

* add a tip about compatibility

* fix broken link
2026-06-03 14:22:05 +02:00
Haoming Song 19fe315971 fix(train): enable relative action overrides for pretrained processors (#3711)
* fix(train): enable relative action overrides for pretrained processors
Keep pretrained processor pipelines when use_relative_actions is enabled and
apply relative/absolute action processor settings through overrides. Rename the
relative action processor registry key to relative_actions_processor.

* fix(config): reject rename_map without pretrained checkpoint

Fail fast when rename_map is set during fresh initialization, since fresh
configs derive feature names from the current dataset and no rename is applied.

---------

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-06-03 11:46:35 +02:00
Khalil Meftah 906b585826 fix(datasets): default private to None in push_to_hub to respect Hub org visibility settings (#3713) 2026-06-02 19:25:13 +02:00
Pepijn 4c86332fe3 feat(annotate): add plan toggle, drop subtask verify pass, 4xH200 job
- PlanConfig.emit_plan (default True): keep subtasks + memory but skip
  the per-boundary "plan" rows and their VLM call when False.
- Remove the subtask_verify pass entirely: pruning dropped legitimate
  subtasks and the stitch step already guarantees full-episode coverage.
  Deletes _verify_subtasks, both call sites, and the now-unused
  module_1_subtask_verify prompt.
- run_hf_job example: 4xH200 (4 vllm servers), emit_plan=false, vqa off.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-02 18:02:13 +02:00
pepijn 23419026d5 pi052: parquet-direct FAST tokenizer fit (fix v3 dataset hang)
``fit_fast_tokenizer`` previously called ``LeRobotDataset(repo_id,
episodes=[N])`` per sampled episode, which on v3-format datasets
routes through HF datasets' split lookup and raises ``ValueError:
Instruction "train" corresponds to no data!`` on every episode. On
``pepijn223/robocasa_pretrain_human300_v4`` (32 k episodes) this looped
through 13,293 skipped episodes for ~2.5 h before the NCCL watchdog
killed the run via the 2 h ALLREDUCE timeout (job 22182985).

Switch to reading the ``action`` column directly from the dataset's
``data/chunk-*/file-*.parquet`` shards (same pattern as the audit
scripts). Verified end-to-end on the 32 k-episode dataset: 1000 chunks
collected from 1000 episodes in 70.7 s.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-02 15:54:31 +00:00
Pepijn 1417fd69b2 docs(annotate): prettier format annotation_pipeline.mdx
Quality-gate fix: ruff-format/markdown prettier hook reflow of the
annotation pipeline doc. No content change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:41:46 +02:00
Pepijn 53c7b4c69a annotate: ruff lint + format pass
Quality-gate fixes after the main merge:
  * UP037: drop redundant quotes from PlanConfig forward-ref annotations
    (action_records / task_aug_axes) — safe under 'from __future__ import
    annotations'.
  * ruff format applied to config.py, executor.py, general_vqa.py,
    plan_subtasks_memory.py, validator.py, lerobot_annotate.py.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:38:18 +02:00
Pepijn 3662c41b85 Merge remote-tracking branch 'origin/main' into feat/language-annotation-pipeline
# Conflicts:
#	uv.lock
2026-06-02 17:36:07 +02:00
Pepijn 518e191337 annotate: windowed subtask generation for constant temporal density
Long episodes no longer get sparse subtasks. Previously a long episode
was subsampled to max_video_frames=32 across its whole duration (~1
frame/4s for a 2-min clip). New opt-in windowing keeps a CONSTANT
frames_per_second density by splitting the episode into fixed-length
windows and running the subtask chain per window.

New PlanConfig.subtask_window_seconds (default 0.0 = off). When > 0 and
the episode is longer than one window:
  * episode is split into consecutive [w0, w1] windows of this length
  * each window's frames are sampled at frames_per_second (so a 32s
    window at 1 fps = 32 frames, filling but not exceeding the per-call
    context budget)
  * the full describe -> segment -> verify chain runs PER window, in
    window-relative time [0, L]; spans are offset back to absolute
  * all windows' spans are merged, frame-snap-deduped, and stitched into
    one contiguous whole-episode cover

Implementation:
  * _episode_video_block / _video_message / _describe_episode /
    _verify_subtasks gain an optional window=(w0,w1); when set they
    embed frames sampled in that absolute range at frames_per_second
    (video_url path skipped — it's whole-episode).
  * _clean_spans gains bounds= (override clamp range, for window-relative
    spans) and dedupe= (skip frame-snap until the merged absolute set).
  * new _generate_subtasks_windowed + _subtasks_for_window orchestrate
    the loop; _generate_subtasks branches to them when window_s > 0.

run_hf_job.py: --plan.subtask_window_seconds=32 (32s windows at 1 fps).
Cost scales with episode length (chain calls × ceil(duration/window)).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 16:26:14 +02:00
Pepijn 3236c6ee4a examples(annotate): switch run_hf_job to Qwen3.6-27B (dense VLM)
Swap the annotation VLM from Qwen3.6-35B-A3B (sparse MoE, ~3B active)
to Qwen3.6-27B (dense, 27B all-active). Per Scale's dense-captioning
study, model capacity is the #1 lever and the dominant failure is
visual grounding — both helped by ~9x more active params. Qwen3.6-27B
is a vision-language model (vision encoder, image + video), same family
so the chat template / video handling / enable_thinking=false flag are
unchanged, and at 27B dense it still fits one H200 per server, so the
two-parallel-server layout (TP=1, one per GPU) is preserved — no
throughput-layout change, just a much stronger model.

Kept: parallel_servers=2, num_gpus=2, max-model-len 32768 (the 32-frame
embedded budget is ~10k tokens, well under), gpu-mem 0.8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 16:16:26 +02:00
Pepijn cd128cbbd5 annotate: add verb-scoped disambiguation rules to subtask prompt
Adopt the one prompt technique Scale's dense-captioning study found
reliably positive: targeted, verb-scoped, visually-grounded
disambiguation rules. Their lesson was that such a rule must fire ONLY
on the spatial situation it names (their narrow 'Stack vs Put' rule
helped; an over-broad directional 'Scoop' rule bled into other verbs
and hurt), so each rule here is phrased visually and scoped to one
confusable pair:
  * stack-vs-put (on top of an object vs on a surface)
  * insert-vs-put (fitted slot vs surface)
  * pick-up/retrieve-vs-put (decide by which way the OBJECT moves:
    gripper closes + object moves with hand = pick up; gripper opens +
    object stays = put — directly targets Scale's dominant
    direction-flip failure)
  * pour-vs-put (tilt + flow vs untilted move)

This is the highest-confidence, lowest-risk change from the Scale
findings; our pipeline already aligns with their 'avoid' list (no
temporal tokens, no overlays, no fancy sampling, no sequential context
injection, uniform sampling, describe-don't-predict framing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 16:10:49 +02:00
Pepijn 1fb46ab300 annotate: cap embedded-frame budget to fit VLM context (fix 32k overflow)
Switching the plan module to embedded frames (use_video_url=false)
exposed a context overflow: at frames_per_second=2.0 with the old
max_video_frames=128 default, a 480x640 episode embeds ~128 frames ≈
33-39k vision tokens, over the model's 32768 context — every plan call
died with 'Input length exceeds maximum context length' (HTTP 400),
crashing the whole annotation job.

The video_url path never hit this because the server downsampled; the
embedded path sends every sampled frame, so the frame count is a hard
token budget.

Fix:
  * config default max_video_frames 128 -> 32 (~8-10k vision tokens,
    comfortable headroom for the prompt + describe/verify passes).
    Frames are still sampled UNIFORMLY across the whole episode, so
    longer episodes are subsampled, not truncated — full temporal
    coverage preserved, just coarser density.
  * run_hf_job.py: frames_per_second 2.0 -> 1.0, explicit
    --plan.max_video_frames=32, with a comment explaining the token
    budget and the 'do not raise toward 128 with embedded frames' rule.

Only the plan module embeds the full episode; VQA (1 frame/tick) and
interjections (4-frame window) were never at risk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 16:02:25 +02:00
Pepijn 79f9a84407 annotate: make full-episode subtask coverage unconditional
Remove the subtask_full_coverage config flag. Stitching subtask spans
into a contiguous full-episode cover is now always applied in
_generate_subtasks — a sparse / gap-ridden subtask timeline is never
desirable for conditioning, so there's no reason to make it optional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 15:36:23 +02:00
Pepijn 799d0e3bcc annotate: stitch subtasks to full-episode coverage
The verify pass prunes subtasks, which could leave the first subtask
starting after t0 or leave gaps between spans — so the subtask timeline
no longer tiled the episode and frames fell through with no active
subtask label.

New deterministic post-step (no VLM call), default on via
PlanConfig.subtask_full_coverage:
  * first subtask start pulled back to the episode's first frame t0
    (idle / approach before the first labelled action folds into it)
  * each subtask end snapped to the next subtask start (gaps closed)
  * last subtask end extended to the last frame t_last

Runs after segment + verify in _generate_subtasks. Starts other than
the first are left as the VLM/verify produced them (already frame-
snapped + distinct), so the cover is contiguous and non-overlapping.
Disable with --plan.subtask_full_coverage=false if a consumer wants
sparse subtasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 15:34:34 +02:00
Pepijn 1fe1463ae0 annotate: enable subtask describe->segment->verify chain by default
Flip PlanConfig.subtask_describe_first and subtask_verify defaults
False -> True. Every subtask annotation now runs the 3-call grounding
+ pruning chain by default, since the single-call path reliably
hallucinates steps from the task text. Costs 2 extra VLM calls/episode;
disable with --plan.subtask_describe_first=false / --plan.subtask_
verify=false on easy datasets where fewer calls matter more than
label fidelity.

run_hf_job.py: drop the now-redundant explicit flags, leave a note that
the chain is default-on and how to opt out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 15:13:50 +02:00
Pepijn dcd368e1f8 annotate: multi-call subtask quality chain (describe -> segment -> verify)
The single-call 'watch video -> emit subtask JSON' pattern makes the
VLM commit to structured output before reasoning about what it saw, so
it pattern-matches the task text and hallucinates steps. Split it into
an opt-in multi-call chain that grounds first and prunes last.

New PlanConfig flags (both default False -> single-call unchanged):
  * subtask_describe_first: a grounding pass narrates ONLY what is
    visible in the video (no subtask JSON yet). That description is
    injected into the segmentation prompt via a new {observation_block}
    placeholder, so the model segments its own grounded observations
    instead of the instruction text. +1 VLM call/episode.
  * subtask_verify: after segmentation, an adversarial pass re-watches
    the video and drops any candidate subtask it cannot see. Can only
    PRUNE (never add/rewrite/move) and fails open (keeps un-verified
    spans if the call returns nothing). +1 VLM call/episode.

Implementation:
  * _generate_subtasks now orchestrates describe -> segment -> verify.
  * Factored span cleaning into _clean_spans (shared by segment + verify
    outputs); added _describe_episode and _verify_subtasks helpers.
  * New prompts module_1_subtask_describe.txt (returns {description})
    and module_1_subtask_verify.txt (returns pruned {subtasks}).
  * module_1_subtasks.txt gains a {observation_block} slot at the top.

run_hf_job.py enables both for the RoboCasa run (3 VLM calls/episode
for subtasks). Combined with single-camera grounding + the embedded-
frame path, this is the high-quality configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 15:12:46 +02:00
Pepijn ba5d4c5cd8 annotate: kill subtask hallucination + single-camera grounding
Two fixes for 'subtasks describe actions not in the video' plus a way
to focus the whole pipeline on one camera.

ANTI-HALLUCINATION
  1. _episode_video_block: when use_video_url is set but clip extraction
     fails, FALL BACK to embedded frames instead of returning an empty
     block. An empty block left the VLM with zero visual grounding, so
     it invented subtasks from the task text alone — the likely root
     cause of hallucinated steps. Now logs a warning and embeds frames.
  2. module_1_subtasks.txt gains a GROUNDING preamble (overrides all
     other rules): label only motion visible in specific frames; never
     invent/anticipate/pad; max_steps is a CEILING not a target; atomic
     demos may be exactly ONE subtask; the VIDEO is ground truth, not
     the instruction text.

SINGLE-CAMERA GROUNDING
  * New VqaConfig.restrict_to_default_camera (default False). When True,
    the VQA module grounds on only the --vlm.camera_key stream instead
    of iterating every camera — matching the plan / interjection
    modules, which already use that single camera. Now the whole
    pipeline can focus on one view (e.g. observation.images.base).

run_hf_job.py updated:
  * use_video_url=false + frames_per_second=2.0 — embed frames directly
    (most reliable; no silent text-only failure mode) with dense
    grounding.
  * vqa.restrict_to_default_camera=true — VQA on the single camera too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 15:08:25 +02:00
Pepijn 7454b4c993 annotate: remove action-record subtask-text replacement entirely
Drops the replace_subtask_text option and the
_render_action_record_to_subtask_text renderer. Action records are now
strictly additive: when action_records.enabled=True the module emits
style='action_record' rows (the typed {verb,object,arm,grasp,dest,
mistake} schema) and NEVER rewrites the subtask text the policy
conditions on.

The render-back-to-text path was the source of corrupted subtasks
(navigation tasks produced 'move stove to stove', manipulation tasks
got spurious 'with left arm using pinch grip' suffixes). Reconstructing
natural-language subtasks from hallucinated structured fields is
inherently fragile, so the capability is removed rather than guarded.

Removed:
  * ActionRecordsConfig.replace_subtask_text field
  * PlanSubtasksMemoryModule._render_action_record_to_subtask_text
  * the span['text'] = canonical_text overwrite in run_episode

Updated docstrings + run_hf_job.py comment accordingly. emit_record_row
(default True) is now the feature's only output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 14:42:36 +02:00
Pepijn c5042a6850 fix(annotate): stop action records + augmentation from corrupting RoboCasa labels
Three compounding bugs made RoboCasa annotation produce off-task
subtasks ('move stove to stove with left arm') and drifting
augmentations ('wander around the kitchen' for 'Navigate to the stove').

1. action_records.replace_subtask_text now defaults False.
   Overwriting the VLM's subtask text with a reconstruction of
   hallucinated {verb,object,arm,grasp,dest} fields is high-risk:
   navigation / non-manipulation tasks don't fit the schema and render
   to nonsense. Records are now additive by default (emit_record_row),
   never silently replacing subtask text. Flip replace_subtask_text on
   only for manipulation datasets verified to render cleanly.

2. _render_action_record_to_subtask_text drops a degenerate
   destination that just echoes the object (verb=move object=stove
   destination=stove -> 'move stove' instead of 'move stove to stove').
   Also routes 'navigate' through the 'to <dest>' preposition family.

3. module_1_task_aug_axes.txt hardened: variants MUST preserve the
   goal/destination. Explicitly forbids 'Navigate to the stove' ->
   'wander around the kitchen'. Only wording / arm / orientation /
   grasp may vary; verb meaning, object, and destination are fixed.

examples/annotations/run_hf_job.py — corrected for RoboCasa:
  * derive_task_from_video=off (was =always). The dataset task string
    is authoritative and is what eval conditions on; =always threw it
    away, re-derived a hallucinated task from the video, and poisoned
    every downstream subtask/plan row. THIS was the dominant cause.
  * n_task_rephrasings=0 + task_aug_axes left off — RoboCasa eval uses
    exact task strings, so augmentation is unused/harmful.
  * action_records left off — manipulation schema doesn't fit atomic /
    navigation tasks.
  * plan_max_steps=6 to keep atomic-task decomposition tight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 14:34:48 +02:00
pepijn223 ff1d58a46f pi052: suppress FAST action tokens in select_message text generation
The FAST action tokenizer maps action codes to the top of the PaliGemma
vocab (id = vocab_size-1-fast_skip_tokens-t). The lower part of that band
sits just below the reserved <loc> block, so it escaped the existing
suppress_loc_tokens mask and leaked into generated subtask/VQA/memory text
as high-codepoint gibberish. Mask the FAST band on every select_message
call so the high-level head emits clean language.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-02 13:07:02 +02:00
Pepijn 98a519e7f2 fix(annotate): default frame provider to video keys, not image keys
VideoFrameProvider derived its default camera and camera list from
meta.camera_keys, which mixes image- and video-stored cameras. The
clip/decode paths read videos/<key>/from_timestamp, which only exists
for video keys, so an image-stored camera sorted first (e.g.
observation.images.wrist) crashed the plan phase with a KeyError.

Restrict the list and default to meta.video_keys. Add a regression test
and point the example job at the dataset's actual video camera. Skip
bandit B607 (ffmpeg/git are intentionally resolved via PATH).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-02 12:09:55 +02:00
Pepijn 5dbf0fac5f annotations(steerable): remove Phase 0 canonical vocabulary discovery
Drops the optional Phase 0 vocabulary-discovery feature entirely.
With the new structured action records (Phase 1a + 1b) providing
cross-episode consistency via the deterministic template renderer,
the older vocabulary-constraint path is redundant and adds a second
constraint mechanism that wasn't well-validated in practice.

Removed:
  * src/lerobot/annotations/steerable_pipeline/vocabulary.py
    (Vocabulary dataclass + VocabularyDiscoveryModule + load_/
    save_vocabulary helpers; canonical_vocabulary.json on-disk format)
  * src/lerobot/annotations/steerable_pipeline/prompts/module_0_vocabulary.txt
    (Phase 0 VLM prompt)
  * tests/annotations/test_vocabulary.py

Pruned wiring across:
  * config.py: VocabularyConfig dataclass + AnnotationPipelineConfig.
    vocabulary field
  * executor.py: vocabulary attribute on Executor + _run_vocabulary_
    phase method + Phase 0 phases.append call in run()
  * modules/plan_subtasks_memory.py: Vocabulary import + vocabulary
    attribute + _subtask_vocabulary_block / _memory_vocabulary_block
    helpers + _canonicalize_subtask / _normalize / _invalid_subtasks
    / _build_subtask_retry_message methods + vocabulary-gated retry
    path in _generate_subtasks + empty-episode warning + _NORMALIZE_
    STRIP_TOKENS constant
  * prompts/module_1_subtasks.txt: {vocabulary_block} placeholder
  * prompts/module_1_memory.txt: {vocabulary_block} placeholder
  * __init__.py: Vocabulary / VocabularyDiscoveryModule / load_
    vocabulary / save_vocabulary / vocabulary_path / VOCABULARY_
    FILENAME re-exports
  * scripts/lerobot_annotate.py: VocabularyDiscoveryModule import +
    instantiation + executor argument
  * examples/annotations/run_hf_job.py: --vocabulary.enabled=false
    flag + docstring references + inline phase-0 comment

The original free-form rephrasings path stays (PlanConfig.
n_task_rephrasings still works when task_aug_axes.enabled=False).
Action records remain the preferred mechanism for cross-episode
subtask consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 11:48:27 +02:00
Pepijn 2bfaf44db2 annotations(steerable): structured action records + 5-axis task augmentation
EgoMimic-inspired additions to the plan module, both opt-in for back-compat.

1. PHASE 1a + 1b: per-subtask structured action records
   * cfg.action_records.enabled=True triggers, after Phase 1 subtask-span
     generation, one extra VLM call per subtask to extract a typed record:
       {verb, object, arm, grasp_type, destination, mistake}
   * A deterministic Python template (_render_action_record_to_subtask_text)
     renders the record back to canonical subtask text. When replace_subtask_
     text=True (default), this REPLACES the VLM's free-form text — eliminates
     cross-episode phrasing drift.
   * When emit_record_row=True (default), the structured record is also
     emitted as a row with style='action_record' (added to PERSISTENT_STYLES)
     so downstream training can consume the typed schema directly.
   * Verb + grasp vocabularies are configurable. Out-of-vocab values are
     rejected at extraction time.

2. STRUCTURED 5-AXIS TASK AUGMENTATION
   * cfg.task_aug_axes.enabled=True replaces the free-form n_task_rephrasings
     path with a structured prompt producing variants along 5 named axes:
       synonym_paraphrase (3)
       omit_arm           (3)
       omit_orientation   (2)
       omit_grasp_method  (2)
       combined_omissions (2)
     Total ~12 variants. Axes with nothing to omit emit fewer entries.
   * Each variant is emitted as a task_aug row at t=0 (existing style).

Inspired by https://github.com/GaTech-RL2/EgoVerse/tree/main/egomimic/scripts/language_process
— they pay Scale AI annotators to fill a structured form and then generate
language via a deterministic prompt. We get the same hallucination-reducing
structure via one extra VLM call per subtask.

Files:
  src/lerobot/datasets/language.py
  src/lerobot/annotations/steerable_pipeline/config.py
  src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
  src/lerobot/annotations/steerable_pipeline/prompts/module_1_action_record.txt
  src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_aug_axes.txt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 11:35:35 +02:00
Pepijn d04ea0ea8a annotations(steerable): structured action records + 5-axis task augmentation
EgoMimic-inspired additions to the plan module, both opt-in for back-compat.

1. PHASE 1a + 1b: per-subtask structured action records
   * cfg.action_records.enabled=True triggers, after Phase 1 subtask-span
     generation, one extra VLM call per subtask to extract a typed record:
       {verb, object, arm, grasp_type, destination, mistake}
   * A deterministic Python template (_render_action_record_to_subtask_text)
     renders the record back to canonical subtask text. When replace_subtask_
     text=True (default), this REPLACES the VLM's free-form text — eliminates
     cross-episode phrasing drift.
   * When emit_record_row=True (default), the structured record is also
     emitted as a row with style='action_record' (added to PERSISTENT_STYLES)
     so downstream training can consume the typed schema directly.
   * Verb + grasp vocabularies are configurable. Out-of-vocab values are
     rejected at extraction time.

2. STRUCTURED 5-AXIS TASK AUGMENTATION
   * cfg.task_aug_axes.enabled=True replaces the free-form n_task_rephrasings
     path with a structured prompt producing variants along 5 named axes:
       synonym_paraphrase (3)
       omit_arm           (3)
       omit_orientation   (2)
       omit_grasp_method  (2)
       combined_omissions (2)
     Total ~12 variants. Axes with nothing to omit emit fewer entries.
   * Each variant is emitted as a task_aug row at t=0 (existing style).

Inspired by https://github.com/GaTech-RL2/EgoVerse/tree/main/egomimic/scripts/language_process
— they pay Scale AI annotators to fill a structured form and then generate
language via a deterministic prompt. We get the same hallucination-reducing
structure via one extra VLM call per subtask.

Files:
  src/lerobot/datasets/language.py
  src/lerobot/annotations/steerable_pipeline/config.py
  src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
  src/lerobot/annotations/steerable_pipeline/prompts/module_1_action_record.txt
  src/lerobot/annotations/steerable_pipeline/prompts/module_1_task_aug_axes.txt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 11:31:42 +02:00
pepijn223 bb2c09965b pi052: hierarchical select_action + RoboCasa eval video overlay
- modeling_pi052: per-env low-level subtask generation in select_action so
  hierarchical inference is correct for eval.batch_size > 1
- render_messages_processor: always emit a fallback low-level prompt so
  observation.language.tokens are produced when recipe annotations are absent
- lerobot_eval: overlay high-level task + predicted subtask onto recorded
  rollout videos (render path only; does not affect policy observations)

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-01 14:35:13 +02:00
Khalil Meftah b8ad81bf39 feat(rewards): add ROBOMETER reward model (#3627)
* feat/add ROBOMETER reward model

* feat(rewards): add Robometer offline progress labeling script

* fix(rewards/robometer): add missing input keys mm_token_type_ids

* chore(rewards/robometer): default to lerobot/Robometer-4b model

* doc(rewards/robometer): update citation and original github link

* feat(rewards/robometer): add image key argument to compute Robometer progress
2026-05-29 21:45:39 +02:00
Haoquan Fang 24017e960c Add MolmoAct2 policy (#3604)
* add molmoact2 policy

* add apache headers to molmoact2 files

* simplify molmoact2 package imports

* align molmoact2 feature validation with eo pattern

* remove molmoact2 processor override from factory

* guard molmoact2 transformers imports

* guard molmoact2 processor transformers import

* add scipy dependency to molmoact2 extra

* use a single molmoact2 action queue

* move molmoact2 config logic into config

* fix molmoact2 hf image key resolution

* load molmoact2 without remote code

* lazy import molmoact2 scipy

* format molmoact2 files

* skip molmoact2 tests without optional deps

* fix molmoact2 pre-commit checks

* validate molmoact2 gripper range
2026-05-27 18:58:37 +02:00
Khalil Meftah e86f5af5bf feat(rewards): add TOPReward reward model (#3629)
* feat(rewards): add TOPReward reward model

* refactor(rewards): clean up TOPReward processor/model

* fix(rewards/topreward): add missing input keys mm_token_type_ids

* fix(rewards/topreward): fix pyproject extra typo and simplify processor (#3653)

Add lerobot[topreward] extra to all in
pyproject.toml, drop the redundant labels arg in scoring, and
collapse the dead-branch shape check in the encoder processor.

* optmize topreward input processing (#3660)

---------

Co-authored-by: Cole <91766445+jcoleharrison@users.noreply.github.com>
Co-authored-by: Haoming Song <haomingsong24@gmail.com>
2026-05-27 14:24:31 +02:00
pepijn 1f1541243a pi052: make `lerobot-eval` work on saved checkpoints
pi052's preprocessor pipelines don't roundtrip through the saved
``policy_preprocessor.json``: ``RenderMessagesStep`` holds a
``TrainingRecipe`` Python object (not JSON-serializable, saved as
``{}``) and ``ActionTokenizerProcessorStep`` saves the fitted FAST
tokenizer's host-only ``~/.cache/lerobot/fast_tokenizers/...`` path.
``PolicyProcessorPipeline.from_pretrained`` then dies with
``RenderMessagesStep.__init__() missing 1 required positional
argument: 'recipe'`` (job 22164494).

The pi052 training path was workable because the recipe-aware steps
were built directly; the runtime path
(``lerobot.scripts.lerobot_pi052_runtime``) sidesteps the loader by
passing ``pretrained_path=None`` to ``make_pre_post_processors`` and
building fresh from ``config.recipe_path``. The standard
``lerobot-eval`` entry point had no such escape hatch.

Two surgical fixes:

* ``factory.make_pre_post_processors``: when ``policy_cfg.type ==
  "pi052"`` AND ``pretrained_path`` is set, bypass the generic
  ``PolicyProcessorPipeline.from_pretrained`` call. Build the
  pipelines fresh via ``make_pi052_pre_post_processors`` (same
  bootstrap the runtime uses) and transplant the saved stateful
  blobs from each step's ``state_file`` reference in the saved JSON
  (today: NormalizerProcessorStep + UnnormalizerProcessorStep
  quantile stats). Pairing is by ``registry_name`` AND position so
  a benign reorder logs a warning instead of silently mis-loading.

* ``PI052Config.use_hf_kernels``: re-add as a deprecated no-op
  field. The flag was removed in d70c8104 (Liger kernels became
  unconditional), but checkpoints saved before that commit
  serialize ``use_hf_kernels: true`` into ``config.json``. Without
  this field draccus rejects the load with ``DecodingError: The
  fields use_hf_kernels are not valid for PI052Config`` (job
  22164492). Mark for removal in a future major bump.

Together these let an external ``lerobot-eval --policy.path=<pi052
checkpoint>`` invocation evaluate the model against any env.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-27 09:14:34 +00:00
Haoming Song 5c98e80430 fix(gr00t): fix Eagle25VL model and processor crash in transformers>=5.4.0, <5.6.0 (#3652)
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-05-26 14:04:22 +02:00
Reece O'Mahoney f65f3f7a4a Fix policy.path in YAML configs (PR #3145 followup) (#3597)
PR #3145 added YAML support for policy.path but left two bugs:

1. extract_path_fields_from_config only deleted config_data[field] when
   no sibling overrides existed. With siblings, the dict stayed in place
   and draccus crashed decoding it as PreTrainedConfig (no 'type' key).
   Sibling overrides go into _config_yaml_overrides and are applied later
   by from_pretrained(), so the field can always be removed.

2. wrap() updated config_path_cli to the cleaned temp file path but
   never propagated it to the draccus.parse fallback branch. cli_args
   still contained --config_path=<original>, so draccus read the
   original YAML with path: still present.

Tests passed because they (a) called extract_path_fields_from_config
directly and (b) included type: alongside path: in the YAML, sidestepping
both bugs.

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-05-26 14:01:19 +02:00
pepijn d70c810416 pi052: drop `use_hf_kernels` flag — always patch Liger kernels
The flag gated a process-global, idempotent Liger patch that swaps
in fused Triton rope / geglu / layer_norm kernels (~4.5 % step time
on H100, bench job 22161421). Since liger-kernel is now a hard
dependency of the loss path (``_shifted_lin_ce`` / ``_fast_lin_ce``
in ``modeling_pi052``), gating the same dep behind an opt-in flag
was redundant — every pi052 run pulls the wheel in either way.

* ``PI052Policy.__init__`` calls ``_enable_hf_kernels()``
  unconditionally; the function still degrades gracefully if the
  wheel happens to be missing (logs a warning, returns).
* Drop ``PI052Config.use_hf_kernels``; the bench numbers and the
  ``fused_linear_cross_entropy`` pointer to ``_shifted_lin_ce`` /
  ``_fast_lin_ce`` are kept as comments next to the docstring.
* Update the warning + ``_shifted_lin_ce`` lazy-import comment to
  drop stale ``use_hf_kernels`` / ``reduce-overhead`` references.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 11:47:49 +00:00
pepijn 4c3ddb1ff5 pi052: wire Liger fused linear CE + DDP-safe FAST tokenizer fit
* Replace ``_shifted_ce`` / ``_fast_ce`` with Liger's
  ``fused_linear_cross_entropy``: the ``(B, T, 257k)`` logits tensor
  is no longer materialised — the kernel chunks over the ``(B*T)``
  axis and computes matmul + softmax + CE in fused Triton blocks.
  ~30 % step speedup and ~12 GB of activation memory freed on the
  dual-CE pi052 recipe. All four call sites in
  ``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``
  updated; the ``.any().item()`` CPU sync is dropped so the loss
  path stays CUDA-graph-capturable.

* DDP-safe FAST tokenizer fit. The cache-hit sentinel previously
  looked for ``preprocessor_config.json`` but
  ``ProcessorMixin.save_pretrained`` writes ``processor_config.json``
  — every rank always cache-missed and re-fit, racing on writes and
  occasionally producing a stale ``.pyc`` that crashed
  ``AutoProcessor.from_pretrained`` with ``AttributeError:
  UniversalActionProcessor``. Fix the sentinel; gate the fit on the
  (local) main process; non-leader ranks poll the cache until the
  leader is done. Caught by job 22162549.

* New recipe ``subtask_mem_vqa_robocasa.yaml`` — subtask + memory +
  per-camera VQA over the three robocasa camera keys produced by the
  port pipeline (``robot0_agentview_left/right``, ``robot0_eye_in_hand``).
  The previously-shipped ``subtask_mem_vqa_speech.yaml`` references
  ``observation.images.front`` / ``wrist`` which don't exist in
  robocasa, so VQA never rendered.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 11:18:16 +00:00
pepijn 8615f3f613 annotate(vqa): tighten bbox + keypoint quality bar
Low-confidence VLM detections were producing many overlapping, loose
boxes per frame (oven + toaster oven + counter + drawer + ...) and
coarse keypoints, hurting downstream policy grounding. Two surgical
fixes:

- module_3_vqa prompt: cap bbox at most 3 high-confidence detections
  (prefer 1 tight box), require specific labels and ≤10% padding,
  allow empty detections list when nothing meets the bar; keypoint
  must be a single pixel-precise feature (handle / button / gripper
  tip) rather than a coarse "somewhere on object" point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are
  coordinate-regression tasks where sampling noise directly degrades
  localization; question phrasing still varies enough at 0.2.

No new config knobs — the count cap lives in the prompt since "top-N
by confidence" is best picked by the VLM itself. Validator already
accepts empty detections.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 08:31:37 +00:00
pepijn 1e7c0d6aa1 annotate(plan): force composite-action subtasks; ban ultra-fine splits
Tighten ``module_1_subtasks.txt`` so the VLM emits one composite
atomic action per subtask instead of decomposing every pick into
``move to X`` / ``grasp X`` / ``lift X``:

- Lock the verb vocabulary to the composite set the low-level
  policy actually learns end-to-end: ``pick up`` (approach + grasp +
  lift), ``put``/``place`` (transport + release), ``push``, ``pull``,
  ``turn``, ``press``, ``open``, ``close``, ``pour``, ``insert``.
  ``go to`` is allowed only as a pure relocation between phases.
- Add an explicit ``Forbidden ultra-fine splits`` block enumerating
  the patterns the VLM was tempted to emit (``move to X``,
  ``reach for X``, ``grasp X``, ``lift X``, ``release X``) and
  instructing it to fold each into its parent composite.
- Rewrite the Good/Bad examples to match the composite contract;
  the previous ``"move to blue cube" / "grasp blue cube" / "lift
  blue cube"`` Good list was actively encouraging the over-
  segmentation pattern this prompt is supposed to prevent.
- Tighten the duration rule: candidates shorter than
  ``min_subtask_seconds`` must be merged into a neighbour rather
  than emitted. Pairs with bumping the runtime floor to 3 s so
  composites have room to land.

Pure prompt change — no code or schema change. Existing canonical-
vocabulary retry path is unaffected (the new verb whitelist lives
in prose, not in the validator).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 05:14:30 +00:00
pepijn 2686450d68 annotate(plan): force composite-action subtasks; tune run_hf_job for robocasa_smoke
Subtask prompt (``module_1_subtasks.txt``):
- Lock the verb vocabulary to composite atomic actions (``pick up``,
  ``put``/``place``, ``push``/``pull``, ``turn``, ``press``, ``open``/
  ``close``, ``pour``, ``insert``, ``go to``).
- Add an explicit ``Forbidden ultra-fine splits`` block instructing
  the VLM to fold ``move to X`` / ``reach for X`` / ``grasp X`` /
  ``lift X`` / ``release X`` into the parent composite. Previous
  examples actively encouraged the over-segmentation pattern.
- Rewrite the Good/Bad examples around the composite contract.

Job config (``examples/annotations/run_hf_job.py``):
- Point at ``pepijn223/robocasa_smoke_2atomic_v3`` on ``h200x4``.
- ``--vlm.camera_key=robot0_agentview_left`` (real key for the
  dataset; the prior ``observation.images.wrist`` did not exist
  and would have silenced the VQA module).
- ``--vlm.serve_command`` ``--max-model-len 131072`` (4x): keeps
  90 s @ 1 Hz episode video blocks under context even at full
  Qwen vision resolution. On 1x H200 (144 GB) the 35B-FP8 model
  has plenty of room for the bigger KV cache.
- ``--vocabulary.enabled=false`` — heterogeneous dataset, no
  benefit from a single canonical vocabulary.
- ``--plan.derive_task_from_video=off``, ``--plan.n_task_rephrasings=0``
  — reuse the dataset's own ``episode_task`` strings as-is.
- ``--plan.min_subtask_seconds=3.0``, ``--plan.plan_max_steps=6`` —
  give the new composite-action rules room to land (1.5 s floor
  was too small to host a full grasp-or-place composite).
- ``--vqa.vqa_emission_hz=3.0`` — denser VQA grounding.
- Timeout 24h, episode_parallelism=64, client_concurrency=256 to
  scale to the 25k-trajectory regime when the same recipe is
  pointed at a larger dataset.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 05:14:23 +00:00
pepijn 920c6ef5a2 docs(annotate): disable phase-0 vocabulary discovery by default in run_hf_job
Heterogeneous datasets (different tasks/scenes across episodes) don't
share a single small subtask + memory vocabulary, so the canonical
vocabulary phase narrowed every episode to the wrong target distribution.
Flip the example to free-form generation by default and document the
``--vocabulary.enabled=true`` switch for homogeneous datasets where the
canonical vocabulary still helps the downstream policy.

No pipeline-code changes: ``VocabularyConfig.enabled`` already gates
phase 0 (see ``executor.py:_run_vocabulary_phase`` and
``VocabularyConfig`` docstring) and falls back to free-form generation.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-26 04:42:10 +00:00
pepijn 4913356564 pi052: SDPA attention port + selective AC + bench harness
Replaces the per-layer ``modeling_gemma.eager_attention_forward`` call
with ``torch.nn.functional.scaled_dot_product_attention`` in
``compute_layer_complete`` (pi05) and ``_compute_layer_ki`` (pi052).
PyTorch SDPA picks the memory-efficient kernel for the
block-bidirectional 4D additive mask the dual-expert model uses (FA2 /
FA3 reject it because they only accept causal / sliding-window / varlen
patterns). The shared ``sdpa_attention_forward`` helper mirrors the
eager signature so the call sites are unchanged.

Selective AC: removes the redundant outer ``_apply_checkpoint(forward_func, ...)``
wrap in ``PI05Pytorch.forward``. Per-layer checkpointing inside
``PaliGemmaWithExpertModel.forward`` already handles activation
recompute; the outer wrap was double-recomputing the whole backbone.
+14% steps/sec on its own (job 22161405 vs 22161398, 1xH100).

groot: drop ``@strict`` on ``GR00TN15Config`` — newer ``huggingface_hub``
rejects ``@strict`` on non-dataclass ``PretrainedConfig`` subclasses,
which was blocking imports of any sibling policy through
``lerobot.policies.factory``.

New ``examples/benchmark/bench_pi052_step.py`` (+ slurm sweeps v1..v8)
times PI052Policy.forward+backward (optionally with AdamW) on
synthetic inputs. Headline numbers on 1xH100 with KI=True, GC=True,
L=512, 4.14 B trainable params, AdamW state in bf16:

  pre-SDPA eager BS=8                 610ms   19.5 GiB  ->  13.1 samples/s
  sdpa  BS=8  + compile=default       413ms   19.5 GiB  ->  19.3 samples/s
  sdpa  BS=16 + compile=default       715ms   37.3 GiB  ->  22.4 samples/s
  sdpa  BS=32 + compile=default      1325ms   44.8 GiB  ->  24.2 samples/s
  sdpa  BS=40 + compile=default      1665ms   48.6 GiB  ->  24.0 samples/s

Parity tests in ``tests/policies/pi052/test_pi052_sdpa_attention.py``
cover fp32 / bf16 / GQA / MHA forward + backward — output and grads
match the eager path within bf16 tolerance.

Also ships ``examples/benchmark/fsdp_pi052.yaml`` (FSDP2 accelerate
config wrapping GemmaDecoderLayer + SiglipEncoderLayer) for the
follow-up multi-GPU memory sharding work.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-25 21:59:20 +00:00
pepijn 673cc6b0fe pi052: opt-in Liger fused kernels (rope + geglu + layer_norm)
Adds ``PI052Config.use_hf_kernels`` (default off). When enabled,
``PI052Policy.__init__`` calls ``apply_liger_kernel_to_paligemma``
before the backbone is built so PaliGemma / Gemma / Siglip layers
pick up Liger's fused Triton forwards.

Measured at BS=16 / L=512 / H100 80GB with KI+GC on (bench job
22161421, see ``examples/benchmark/bench_pi052_kernels.slurm``):

  rope only        →  -2.5% step time
  geglu only       →  -2.2% step time
  layer_norm only  →  -1.1% step time
  all three        →  -4.5% step time, peak_mem unchanged

``cross_entropy`` / ``fused_linear_cross_entropy`` are deliberately
skipped — pi052 calls ``F.cross_entropy`` directly and bypasses
``PaliGemmaForConditionalGeneration.forward``, so neither patch
fires without invasive model-code changes (left for a follow-up).
``rms_norm`` measured as noise on this workload (GC dominates),
so it stays off to keep the patch surface minimal.

Requires ``pip install liger-kernel``; falls back to a warning if
missing so the default path is unaffected.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-25 20:50:07 +00:00
Pepijn 2ed6519a93 ema: enable by default (matches openpi JAX behavior)
Flip EMAConfig.enable default from False -> True. Every training run
now maintains an EMA shadow of the policy and uses it for eval + W&B
example dumps. Disable per-run with --ema.enable=false for short or
memory-constrained training.

Rationale:
  * openpi (JAX, official) ships EMA on for every shipped config,
    decay=0.99 by default and 0.999 for pi05_libero. The openpi
    PyTorch port explicitly lists EMA as unsupported, a gap LeRobot
    main inherited. Flipping the default closes that gap for every
    LeRobot policy that ships through lerobot-train.
  * EMA is established best practice for diffusion / flow-matching
    policies (Diffusion Policy §V.D; standard in DDPM/EDM/Stable
    Diffusion training recipes). For autoregressive policies the
    extra cost is real but the safety net (smoother eval, better
    final checkpoint) doesn't hurt.

Trade-offs to be aware of:
  * Memory: 1x model params in fp32 shadow (~13 GB for pi052's
    3.3B params; <500 MB for ACT/Diffusion-Policy class). Memory-
    constrained users on consumer GPUs may need --ema.enable=false.
  * Checkpoint disk: extra .pt file in training_state/, size ~=
    pretrained_model/model.safetensors. Over a 100k-step run with
    save_freq=20000 that's 5x the model size in extra disk.
  * Eval scores will now reflect EMA model instead of live model -
    expected to be 1-3% higher on closed-loop tasks per the
    diffusion-policy literature; might surprise users who memorize
    their last run's numbers.

Opt out:
  --ema.enable=false           # disable entirely
  --ema.use_for_eval=false     # keep EMA but eval reflects live
  --ema.use_for_wandb_examples=false   # keep EMA but W&B reflects live

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:58:46 +02:00
Pepijn 72ea531017 train: switch EMA from custom ModelEMA to ema-pytorch
Replace the 250-line src/lerobot/utils/ema.py with a direct dependency
on ema-pytorch (lucidrains' canonical PyTorch EMA library). Same
semantics, decay=0.999 default unchanged, but offloads the maintenance
burden to a maintained library used by every diffusion repo.

Why ema-pytorch:
  * Standard PyTorch EMA library; battle-tested across diffusion +
    speech + image-gen codebases.
  * Tiny pure-python dep (no compiled code).
  * Cleaner consumer-side API: ema.ema_model is a full nn.Module
    clone of the policy, so eval / wandb just pass it through instead
    of context-managed swap/restore on the live model.

What changed mechanically:
  * pyproject.toml: add 'ema-pytorch>=0.7.7,<1.0.0' to core deps.
  * deleted src/lerobot/utils/ema.py (the custom ModelEMA).
  * scripts/lerobot_train.py:
      - import EMA from ema_pytorch
      - instantiate with beta=cfg.ema.decay,
        update_after_step=cfg.ema.warmup_steps, update_every=1,
        include_online_model=False (accelerator owns live model
        lifecycle; double-registration would double-count params).
      - ema.update() (no args) — library tracks the online model
        internally.
      - Eval block: pass eval_target_policy = ema.ema_model (when
        cfg.ema.use_for_eval) instead of swap context manager.
      - W&B examples: same pattern.
      - Save: torch.save(ema.state_dict(), .../ema_state.pt) instead
        of custom safetensors writer. .pt format is consistent with
        the rest of training_state which already mixes safetensors +
        json + (now) pt.
      - Resume: ema.load_state_dict(torch.load(.../ema_state.pt)).
      - WandB observability: ema/step (count of ema.update calls),
        ema/initted (bool from library), ema/beta (constant from
        cfg).
  * configs/default.py: EMAConfig.decay stays 0.999 (matches
    openpi's pi05_libero); docstring updated to reflect ema-pytrch
    semantics for warmup_steps (now maps to update_after_step — a hard
    skip, not a smooth decay ramp).

Behavior preserved:
  * Defaults: enable=False, decay=0.999, warmup_steps=0,
    use_for_eval=True, use_for_wandb_examples=True.
  * Same CLI: --ema.enable=true, --ema.decay=X, etc.
  * Same checkpoint layout (training_state/ema_state.pt next to
    optimizer_state.safetensors etc.); resumes silently if present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:51:23 +02:00
Pepijn 56a934ec55 train: EMA of policy parameters (opt-in via --ema.enable=true)
Adds Exponential Moving Average of trainable policy parameters with
warmup, eval-time swap, checkpoint save/resume, and wandb observability.

For diffusion / flow-matching policies (pi052's flow expert exactly
qualifies), averaging late-training parameter oscillations yields a
smoother model that generalises substantially better at inference —
~1–3% absolute success-rate improvement on closed-loop tasks per the
diffusion-policy lit (Chi et al. 2023 §V.D; standard in DDPM/EDM).

New module: src/lerobot/utils/ema.py
  ModelEMA class with:
    * fp32 shadow of every requires_grad parameter
    * decay warmup: min(decay, (1+n)/(10+n)) for first warmup_steps updates
    * update(model) -> effective_decay (for logging)
    * apply_to(model) context manager: temp-swap weights, restore on exit
    * copy_to(model): permanent overwrite
    * save() / load_from_file(): safetensors + JSON sidecar for metadata
    * state_dict() / load_state_dict() for in-process round-tripping

New config: src/lerobot/configs/default.py EMAConfig + wired into
TrainPipelineConfig as 'ema: EMAConfig'.
  Fields:
    enable: bool = False         (off by default, back-compat)
    decay: float = 0.999         (standard; 0.75 for fast Diffusion-Policy)
    warmup_steps: int = 0        (no warmup by default)
    use_for_eval: bool = True    (eval swaps in EMA weights)
    use_for_wandb_examples: bool = True
                                 (W&B training-examples table uses EMA
                                  for predicted-action columns -> matches
                                  what eval / deployment would see)

Training loop integration (src/lerobot/scripts/lerobot_train.py):
  1. After accelerator.prepare + policy.train(), instantiate ModelEMA
     on the main process if cfg.ema.enable. Resume from
     checkpoint_path/training_state/ema_state.safetensors if present.
  2. After each update_policy() call, ema.update(unwrap_model(policy))
     returns the effective decay (logged to wandb during warmup).
  3. The save_checkpoint() block also ema.save(...) the shadow next to
     the existing optimizer/scheduler/rng training state. Resume picks
     it up automatically in (1).
  4. The eval block (cfg.env && is_eval_step) wraps eval_policy_all in
     ema.apply_to() when use_for_eval=True. Live weights restored
     byte-for-byte on context exit.
  5. The W&B training-example dump wraps log_training_examples in
     ema.apply_to() when use_for_wandb_examples=True so the predicted-
     action columns match the eval/deployment behavior.
  6. Two new wandb scalars: ema/effective_decay, ema/num_updates.

Cost:
  Memory: 1x model params in fp32 (~13 GB for pi052's 3.3B params).
          Lives only on main-process GPU. CPU offload available via
          ModelEMA(device='cpu') if needed.
  Compute: one elementwise update per step (~1% of step time).
  Eval: 2x checkpoint files in training_state/ (live optimizer state
        + ema shadow). Negligible relative to model.safetensors.

Usage:
  lerobot-train ... --ema.enable=true
  lerobot-train ... --ema.enable=true --ema.decay=0.9999  # very slow EMA
  lerobot-train ... --ema.enable=true --ema.warmup_steps=1000

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:27:14 +02:00
Pepijn 738e317caa pi052: PaLM-style z-loss on text CE (default weight 1e-4)
Penalise the log-partition function z = log Σ exp(logits) drifting away
from zero on text-CE supervised positions. Without it, large-vocab
models (PaliGemma's 257k vocab) can let logsumexp grow unboundedly
while CE stays low — a uniform additive logit bias cancels in softmax
but pushes the partition function out of bounds, causing numerical
instability and generation drift.

PaLM appendix B / Chinchilla report z-loss is essential for stable
large-vocab CE. It is especially valuable for pi052 because the recent
default lm_head_lr_scale=5.0 amplifies head-drift risk: the 5x boost
keeps the head pinned to fine-tuning targets, and z-loss caps the
partition function so the head can't just bias all logits high uniformly.

Implementation:
  * _shifted_ce(logits, labels, z_loss_weight=0.0) gains the new arg
    with default 0.0 (back-compat for any other caller).
  * Both call sites in PI052Policy.forward read self.config.text_ce_
    z_loss_weight and pass it through.
  * PI052Config.text_ce_z_loss_weight defaults to 1e-4 (commonly cited
    PaLM value); set to 0 to disable.

Cheap to compute: one extra logsumexp shares the softmax kernel that
F.cross_entropy already runs. No memory overhead beyond a (B*T,) tensor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:08:56 +02:00
Pepijn 8ba3b187a1 pi052: bump lm_head_lr_scale default to 5.0 (keep base LR at 2.5e-5)
The base optimizer LR (2.5e-5, cosine to 2.5e-6, 1k warmup, AdamW
(0.9, 0.95), wd 0.01, grad_clip 1.0) is the openpi/π0.5 setting used
for the RoboCasa leaderboard baselines and is well-validated for 3B-
class VLAs with a paligemma backbone. Leave it alone.

The one place pi052 needs to diverge from pi05 is the LM-head LR
multiplier:

  * pi05 has no text supervision -> head doesn't get gradients ->
    lm_head_lr_scale is moot, stays at 1.0.
  * pi052 always has text supervision via the recipe (subtask /
    memory / VQA). Under KI, the LM head only sees gradients on
    ~30-45% of the batch (the text-CE mask share). Under aggressive
    cosine decay the head drifts back toward PaliGemma's pretrained
    <loc> first-token bias, despite teacher-forced CE staying near 0.

5x is the documented fix (see PI05Config.lm_head_lr_scale docstring
and PI05Policy.get_optim_params, which is already wired to split the
LM head + tied embed_tokens into their own param group while sharing
the same cosine lambda). Flipping the default here lifts the fix from
opt-in to on-by-default for every pi052 run, with zero downside on
text-free recipes (head still gets no gradients to scale).

Other LR knobs reviewed and intentionally NOT changed:
  - optimizer_lr=2.5e-5: openpi-validated, matches leaderboard.
  - scheduler_warmup_steps=1000: standard for VLA finetuning.
  - scheduler_decay_steps=30000: auto-scales for short runs.
  - optimizer_betas=(0.9, 0.95): GPT/LLM convention, works for
    flow-matching + LM-CE.
  - optimizer_weight_decay=0.01, grad_clip=1.0: standard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:57:43 +02:00
Pepijn 057c794ffe wandb: flip training-example logging defaults to on (every 5000 steps)
The training-example wandb.Table dump (camera images + text fields +
GT/predicted action chunk endpoints) was opt-in. Flip defaults so any
run with --wandb.enable=true gets visual training observability for free.

  log_examples_freq:           0     -> 5000   (push table every 5k steps)
  log_examples_n:              4     -> 4      (unchanged)
  log_examples_predict_actions: False -> True   (extra forward in eval mode)

Runs without --wandb.enable=true are unaffected (the training loop gate
checks wandb_logger is not None first). Set log_examples_freq=0 to opt
out of the dump even with wandb enabled; set log_examples_predict_actions
=false to skip the extra inference forward pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 18:00:04 +02:00
Pepijn b1e83f556c train: periodic wandb log of training examples (images + text + actions)
Adds an opt-in cadence for pushing rich training examples to W&B,
independent of the scalar log_freq. Off by default; turn on with
--wandb.log_examples_freq=5000 (one wandb.Table dump every 5k steps).

WandBConfig (configs/default.py):
  + log_examples_freq: int = 0       # 0 disables
  + log_examples_n: int = 4          # batch elements per dump
  + log_examples_predict_actions: bool = False
                                     # opt-in extra forward pass to
                                     # show predicted vs GT action chunk

WandBLogger.log_training_examples (common/wandb_utils.py):
  Builds one wandb.Table row per sampled batch element with:
    * one wandb.Image column per camera (auto handles CHW/HWC,
      uint8/float32 [0,1])
    * any text fields present in the batch (task / subtask /
      memory / instruction)
    * gt_action_first / gt_action_last (chunk endpoints)
    * pred_action_first / pred_action_last when --wandb.log_examples_
      predict_actions=true (policy.eval() + no_grad; restores train
      mode after)
  Defensive: per-camera failures don't poison the row; predict_action_
  chunk exceptions are logged and the predicted columns are dropped.

Training loop (scripts/lerobot_train.py):
  One new gated block right after the existing scalar log_step clause.
  Reads batch + dataset.meta.camera_keys, hands them to
  log_training_examples. Wrapped in try/except so a bad sample never
  kills the run.

Usage:
  lerobot-train ... \
    --wandb.enable=true --wandb.project=robocasa_composite_seen \
    --wandb.log_examples_freq=5000 \
    --wandb.log_examples_n=4 \
    --wandb.log_examples_predict_actions=true

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 16:57:15 +02:00
Pepijn da3e87ee86 Merge branch 'feat/smolvla-on-steerable' of https://github.com/huggingface/lerobot into feat/smolvla-on-steerable 2026-05-25 16:56:50 +02:00
Pepijn 1e9a6d044d Merge remote-tracking branch 'origin/feat/language-annotation-pipeline' into feat/smolvla-on-steerable
# Conflicts:
#	src/lerobot/datasets/__init__.py
#	src/lerobot/policies/__init__.py
#	src/lerobot/policies/factory.py
#	src/lerobot/processor/render_messages_processor.py
#	uv.lock
2026-05-25 16:56:22 +02:00
pepijn 3fdfcb912a examples(port_datasets): generalize RoboCasa builder + add smoke script
- Add ATOMIC_TASKS, COMPOSITE_UNSEEN_TASKS and four new --task-set keys
  (atomic, composite_unseen, composite_all, composite_atomic) so the same
  builder produces the 50-task target benchmark or the 300-task Human300
  pretraining slice (via --split=pretrain --task-set=all) without
  duplicating logic.
- Stop hardcoding the composite_seen tag on the HF push; tags are now
  derived from --split / --source / --task-set so atomic, composite_all,
  and pretrain runs land with accurate metadata.
- Refresh module docstring to match the broader scope.
- Add scripts/build_robocasa_smoke.sh: 2-atomic-task smoke dataset
  (~1k episodes, ~131k frames) for fast end-to-end training validation
  before kicking off Human300-scale runs.
2026-05-25 14:54:00 +00:00
Pepijn c37b1fc7d0 Merge origin/feat/language-annotation-pipeline (8 fix(annotate) commits + vocabulary phase) 2026-05-25 15:47:25 +02:00
Pepijn 9020635b14 Merge branch 'main' into feat/language-annotation-pipeline
Resolves conflicts from 32 commits on main:

* docs/source/_toctree.yml — keep both new toc entries
  (annotation_pipeline + video_encoding_parameters).
* docs/source/language_and_recipes.mdx — adopt main's section
  ordering (Layer 2 before "Temporal semantics") and float32
  timestamp dtype to match the codebase.
* src/lerobot/configs/__init__.py — keep both export sets
  (recipe + video encoder).
* src/lerobot/datasets/dataset_metadata.py — drop redundant lazy
  imports (top-level imports cover both LANGUAGE_COLUMNS and
  DEFAULT_TOOLS); adopt main's @tools.setter for info.json
  write-back.
* src/lerobot/datasets/feature_utils.py — call the real
  validate_feature_language() instead of returning "".
* src/lerobot/datasets/language.py — float32 timestamps to match
  pa.float32() used in video_utils.py and the rest of the codebase.
* src/lerobot/datasets/language_render.py — adopt main's
  unwrap_scalar() helper (drops two hand-rolled .item()/list
  unwrappers); float32 in docstring.
* src/lerobot/processor/render_messages_processor.py — drop
  PR-local _scalar() helper, use shared unwrap_scalar().
* tests/datasets/test_language.py — adopt main's new float32 dtype
  + validate_feature_language warning tests.
* tests/datasets/test_dataset_metadata.py — adopt main's new
  tools.setter persist/clear tests.
* uv.lock — regenerated cleanly from main's resolver.

90 of 92 touched tests pass. Two pre-existing test failures
(test_module1_plan_memory_subtask_smoke,
test_module2_mid_episode_emits_paired_interjection_and_speech in
tests/annotations/test_modules.py) are unrelated to this merge —
that test file doesn't exist on main, so the failures originate on
the branch and are addressed by the 8 newer fix(annotate) commits
already on origin that will land in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:46:32 +02:00
Pepijn 83d0c390da pi052: drop debug scaffolding left over from training/inference bug hunts
Three diagnostic surfaces shipped in PR3 that don't belong in a clean
release:

* ``LEROBOT_DUMP_RECIPE_SAMPLES`` env-var dump (~70 LOC in
  text_processor_pi052.py): pretty-prints the next N rendered samples
  with ``[TGT]...[/TGT]`` markers over supervised spans. One-off
  training-inspection tool — no production user, never wired into a
  CLI flag, only useful while iterating on the recipe. Drop the module
  constants, the ``_is_dump_rank`` / ``_dump_recipe_sample`` helpers,
  the call site, and the now-unused ``import os``.

* ``_log_obs_tensors_once()`` in lerobot_pi052_runtime.py: the
  docstring literally says "Used to bisect train/inference mismatches"
  — a debugging artifact from when the LM head was collapsing on the
  live robot. Logged unconditionally at WARNING level from both the
  dataset-driven and robot-driven providers, with no ``--verbose``
  gate. Drop the function, both call sites, and the ``_logged`` /
  ``_obs_logged`` flag dicts that fed them. (``_resize_logged`` is
  kept — it gates the operationally useful camera-size sanity log.)

* Defensive ``unsqueeze(0)`` block in the dataset observation
  provider: papered over an upstream bug where some preprocessor step
  could produce an unbatched tensor. ``AddBatchDimensionProcessorStep``
  is reliable in the current pipeline — pi052 tests still pass with
  the block removed. If the bug ever resurfaces it should be fixed
  at the source, not silently re-batched here.

Net: -169 LOC. All 30 ``tests/policies/pi052/`` tests pass.

The ``<loc>`` token plumbing (``register_paligemma_loc_tokens``,
``_loc_token``, ``suppress_loc_tokens`` runtime gate) is left as-is —
it's the actual mechanism for VQA spatial answers, not scaffolding,
and the ``suppress_loc_tokens=True`` callers on subtask/memory/
interjection paths and ``=False`` on the VQA path are intentional
asymmetric behaviour, not a bug-routing knob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:07:43 +02:00
Pepijn 1ff10b935c Merge branch 'feat/language-annotation-pipeline' into feat/smolvla-on-steerable
Resolves conflicts from 66 commits on the base branch:

* pyproject.toml — keep base's transformers>=5.4.0,<5.6.0; add the
  sentencepiece-dep entry pi052 (FAST action tokenizer) needs.
* policies/__init__.py — keep pi052 export; drop the
  RewardClassifierConfig export that base removed.
* policies/factory.py — docstring list resolution (keep pi052; drop
  reward_classifier, removed by base).
* annotations/steerable_pipeline/executor.py — adopt base's renamed
  _ensure_annotation_metadata_in_info (it already advertises the say
  tool); drop pi052's older _ensure_tools_in_info call.
* configs/train.py — keep pi052's vqa_target_fraction; adopt base's
  SampleWeightingConfig (legacy RA-BC inline params already covered
  by the migration shim base added).
* scripts/lerobot_train.py — merge pi052's per-policy processor
  rebuild + dataset_repo_id pass-through with base's active_cfg /
  is_reward_model_training tightening, and re-route vqa-weighted
  sampler to active_cfg.drop_n_last_frames.
* datasets/language_render.py — adopt base's _select_one + timestamp
  tolerance (drops pi052's stale _select_latest / per-style sort_key).
* tests — adopt base's parametrized per-camera blend + tolerance
  test; drop pi052 tests that overlap with base's tighter rewrites;
  keep pi052's flow-only / VQA-blend coverage; add a
  test_canonical_recipe_loads check on subtask_mem_vqa_speech.yaml.
* policies/pi052/processor_pi052.py — import RenderMessagesStep
  directly from render_messages_processor (base intentionally
  dropped it from lerobot.processor's re-exports).
* uv.lock — regenerated cleanly from base + pi052's pocket-tts /
  beartype.

All 67 touched tests pass (30 pi052 + 37 recipe / language-render /
pipeline / render-messages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:47:09 +02:00
Pepijn 67bdf4690e examples(port_datasets): rewrite RoboCasa composite_seen builder
Replace the earlier wrapper (which depended on robocasa.scripts.download
+ dataset_registry) with a self-contained pipeline that:

* downloads each task tarball directly from Box via box_links_ds.json
* converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30
* standardizes camera keys under observation.images.robot0_* and
  flattens observation.state by concatenating base/EE/gripper subkeys
  when the source dataset stores them separately
* builds per-rank unified shards then aggregates into one dataset

Filter: composite_seen task-set restricts discovery to the 16 multi-step
target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use
--task-set=all to keep every discovered task in the split/source slice;
--tasks=... overrides for arbitrary subsets.

Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task.

Adapted from a battle-tested port_robocasa.py reference shared by the
user; the only semantic addition is the task-set filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:27:42 +02:00
Pepijn 8085feab6e pi052(runtime): factor out shared observation-prep boilerplate
Both observation providers in lerobot_pi052_runtime.py ended a sample
dict the same way — strip the runtime-owned language columns and hand
the policy a device-resident ``observation.*``-only subset. Extract
two tiny helpers (``_strip_runtime_owned_language_cols`` and
``_select_observation_to_device``) so the dataset and robot paths
read as a clear linear pipeline. Path-specific concerns (defensive
unsqueeze on the dataset path; camera resize + state-vector sanity
logging on the robot path) stay inline at the call sites.

Behaviour unchanged; all 30 ``tests/policies/pi052/`` tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:25:08 +02:00
Pepijn a088c10c80 examples(port_datasets): SLURM+datatrove RoboCasa composite_seen build
Parallel variant of build_robocasa_composite_seen.py modeled after the
existing slurm_port_shards.py / slurm_aggregate_shards.py pattern.

Two-phase datatrove pipeline:
  * Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task),
    each worker downloads its assigned tar via RoboCasa's own
    download_datasets helper. Network-bound, idempotent.
  * Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets
    over the 16 extracted directories. Submitted with depends=phase1 so
    SLURM only releases it once all 16 downloads succeed.

Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve
helpers from the single-machine script via aliased imports — single
source of truth for 'what does it mean to download a composite_seen
task'.

Local (--slurm 0) mode runs the two phases sequentially in-process for
debugging on a workstation.

Usage on SLURM:
    uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \
        --output-dir=/scratch/${USER}/robocasa_composite_seen \
        --hub-repo-id=${HF_USER}/robocasa_composite_seen \
        --logs-dir=/scratch/${USER}/logs/robocasa \
        --partition=cpu --push-to-hub

Prereq: uv sync --extra annotations  (pulls datatrove)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:10:05 +02:00
Pepijn 9c3d5ab7ce scripts: build_robocasa_composite_seen — aggregate 16 target tasks
RoboCasa 1.0 ships its target/human demos in LeRobot format (parquet +
mp4) as lerobot.tar archives distributed via Box. This script wraps
RoboCasa's own download_datasets helper to pull each of the 16
composite_seen tasks, opens each extracted directory as a
LeRobotDataset, and merges them into a single combined dataset via
merge_datasets (a thin wrapper over aggregate_datasets that revalidates
fps/robot_type/features, unifies task indices, concatenates videos and
parquet, and recomputes stats).

The 16-task slice corresponds exactly to the 'Composite-Seen' column of
the published RoboCasa365 leaderboard, so the resulting dataset is the
right substrate for an apples-to-apples pi05 vs pi052 comparison on
multi-step kitchen manipulation.

Usage:
    uv run python -m lerobot.scripts.build_robocasa_composite_seen \
        --output-dir=/data/lerobot/robocasa_composite_seen \
        --hub-repo-id=${HF_USER}/robocasa_composite_seen \
        --push-to-hub

Idempotent: re-running skips already-downloaded tasks. Defensive
fallbacks handle RoboCasa API drift in get_ds_path / download_datasets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:01:28 +02:00
Pepijn e84f97a8c1 smolvla2(runtime): interactive task picker + drop action diagnostic
Task picker:
The dataset bootstrap used to silently overwrite args.task with the
canonical training task. Replace that with an interactive picker
(_select_task_interactively) that shows every unique task in
ds_meta.tasks as a numbered menu (canonical task first as default) plus
a 'type a custom task' option. --task on the CLI still skips the
picker, and non-TTY runs fall back to the bootstrap task so scripted
invocations are unchanged.

Action diagnostic removal:
Drop the [act] log block in LowLevelForward.run (|a|_mean / spread /
normalized + unnormalized first/last + state) that was added while
debugging the 'barely moving' issue. Robot motion is now healthy, the
output is noise in steady-state, and it depended on stashing the
postprocessor on runtime.state — also removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:59:08 +02:00
Pepijn 6d2b8c80ab smolvla2(runtime): wire MemoryUpdateFwd into the inference pipeline
MemoryUpdateFwd was importable but never installed, so subtask_change
events fired by HighLevelSubtaskFwd had no listener and current_memory
stayed at its initial None value — the runtime panel always showed
'memory (not set)' even when the policy was trained with the
memory_update recipe (e.g. subtask_mem_vqa_speech.yaml, weight 0.15).

Insert MemoryUpdateFwd between HighLevelSubtaskFwd and AskVQAFwd so
the event is visible the same tick it is emitted, and refresh the
stale comment that claimed memory was not in scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:52:44 +02:00
Pepijn 793c7c4ddd feat(runtime): --subtask_chunks_per_gen throttles HL gen vs action chunks
Adds a per-chunk-boundary counter to HighLevelSubtaskFwd: subtask gen
fires only once every N chunk boundaries (default 1 = current
behavior). Lets the operator run e.g. 5 flow-matching action chunks
per LM-head subtask gen so the subtask doesn't churn every 1.7s while
the previous one is still being executed — saves compute and avoids
re-planning the action trajectory mid-grasp.

  --subtask_chunks_per_gen=5    # 5 chunks per subtask refresh

The counter starts at 0 so the very first chunk boundary fires
immediately (no startup delay). Trigger is rearmed when skipping so
a low high_level_hz doesn't lose slots.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:34:59 +02:00
Pepijn db927ab40b feat(runtime): action chunk diagnostic — log normalized + unnormalized values
Adds a per-chunk log line in LowLevelForward that surfaces what the
action expert actually emits and what the robot receives after the
postprocessor unnormalizes it, so "barely moving" can be diagnosed
at a glance:

  [act] T=50 |a|_mean=0.234 spread=0.512
  [act] norm  first=[0.12, -0.31, ...]  last=[0.45, -0.22, ...]
  [act] joint first=[3.2, -47.8, ...]  last=[12.4, -41.0, ...]  state=[0.5, -55.3, ...]

|a|_mean ~ 0.3–0.6 with spread ~ 0.3+ and visible delta from first to
last → healthy trajectory. |a|_mean near 0 across the chunk → model
defaulting to median pose. joint values that don't differ much from
state → safety cap or model output near current state.

Postprocessor is stashed on runtime.state["_postprocessor"] at startup
so the diagnostic can replay the same unnormalize the dispatcher uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:10:52 +02:00
pepijn 471b2b1b1d fix(annotate): bump same-frame subtasks onto distinct frames
If two consecutive VLM-emitted subtask spans have ``start`` timestamps
that round to the same source frame after ``snap_to_frame`` (e.g. on
short episodes the VLM sometimes nominates two ~adjacent action
boundaries within one 30 Hz step), the writer emits two
``style=subtask`` rows at the identical persistent timestamp. The
training-time renderer's default binding
``subtask: active_at(t, style=subtask)`` then raises:

    ValueError: Ambiguous resolver for style='subtask';
                add role=..., tool_name=..., or camera=... to disambiguate.

… and the whole training run dies on the first batch.

Observed concretely on ``pepijn223/super_poulain_vocab2`` (job
22159979): episodes 3 and 30 each had two subtask rows at the same
timestamp (``release yellow cube`` + ``retract arm`` snapping to the
same frame).

Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list
and, whenever a snapped start collides with one already used, push the
later span onto the next free frame timestamp. Both subtasks survive
on distinct timestamps; the renderer can now disambiguate. If the
episode genuinely has no later free frame (extremely unlikely — would
require a same-timestamp collision on the very last frame of the
episode), the later span is dropped with a warning rather than left
to poison the render.

New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames``
locks in the contract; full vocabulary suite is 14/14 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 19:31:44 +00:00
pepijn a15e16c072 fix(annotate): replace fuzzy subtask snapping with strict match + one-shot retry
The Jaccard-overlap snap was warping VLM output into wrong canonical
labels — e.g. an off-vocab "consult the wizard" span would silently
become "grasp blue cube" if that scored highest. Even with a higher
floor the operator can't tell which subtasks were paraphrases vs
genuine mislabels in the resulting dataset.

Replace with strict exact-match validation + a single targeted retry:

  1. Generate subtasks as before.
  2. If any returned subtask's normalised form (lowercased, articles
     stripped, whitespace collapsed) isn't in the canonical vocab,
     fire one retry call naming the offending strings and re-sending
     the full canonical list. The retry prompt requires byte-identical
     output from the vocab.
  3. After the retry, validate again. Spans still off-vocab are
     dropped — no fuzzy snapping ever produces a different canonical
     label than the VLM actually emitted.
  4. If every span ends up off-vocab even after the retry, warn loudly
     so the operator extends ``meta/canonical_vocabulary.json`` to
     cover the missing phase. The episode is left with empty subtasks
     rather than silently fabricated ones — visibility > sweep-under-
     the-rug.

Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the
normalisation helper out so the retry-validation and the final
canonicalisation share one source of truth.

Tests:
  - test_plan_module_accepts_article_only_difference: "grasp the blue
    cube" still maps to canonical "grasp blue cube" (article-tolerant).
  - test_plan_module_retries_when_subtask_off_vocab: paraphrase
    triggers the retry which the VLM corrects in pass 2.
  - test_plan_module_drops_off_vocab_subtask_after_retry: VLM that
    refuses to correct → bad span dropped, in-vocab span kept.
  - test_plan_module_empty_when_all_off_vocab_after_retry: every
    span off-vocab → episode left empty (no warping).
All 13 vocabulary tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 09:57:27 +00:00
pepijn 336af85c09 fix(annotate): never leave an episode with zero canonical subtasks
When the canonical vocabulary is enabled and the VLM produces spans
that don't overlap any canonical label, the previous Jaccard-floor
(0.5) dropped them and the episode came out with no subtasks at all
— invisible to the downstream policy. Observed on
``pepijn223/super_poulain_vocab``: some episodes had empty subtask
columns because every VLM-emitted phrase scored below 0.5 against
the discovered vocabulary.

Two-pass canonicalisation:

  - First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to
    let mild paraphrases through) and drops everything below.
  - If that first pass leaves the episode with **zero** subtasks,
    fall back to a second pass that always snaps each VLM span to
    its nearest canonical label by Jaccard (no floor). The episode
    ends up with subtasks even when the vocabulary missed a phase
    — a slightly-wrong canonical label is still closer to the right
    motion than nothing at all.
  - Log loudly when the fallback fires so the operator can spot
    coverage gaps in ``meta/canonical_vocabulary.json``.
  - Log a per-episode count at INFO when some (but not all) spans
    were dropped so it's visible without spamming the run output.

Promote the Jaccard floor + ignore-tokens to class constants so
they're a single edit point. Add ``force=True`` parameter to
``_canonicalize_subtask`` for the no-floor fallback path.

New test ``test_plan_module_snaps_when_all_off_vocab`` covers the
fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is
adjusted to keep at least one in-vocab span so the floor path can
still fire and is exercised. All 12 vocabulary tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 12:44:03 +00:00
pepijn 54221ceea2 feat(annotate): let the VLM decide vocabulary size
Hardcoding ``n_subtask_target=10`` and ``n_memory_target=6`` baked task
complexity into the config — a simple pick-and-place needs ~6, a
multi-step recipe needs ~20. The VLM already sees the clips, so let it
pick the count itself from what's recurring across episodes.

Drop both knobs from ``VocabularyConfig`` and the ``module_0_vocabulary``
prompt template. The prompt now says "decide the count yourself based
on what you see — the smallest set that still covers every recurring
phase" and adds an "each label must recur across the demos" rule so
the VLM filters out one-off motions.

Update the launcher script + docs to remove the old knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:46:31 +00:00
pepijn 369ab17110 fix(annotate): update run_hf_job CLI args for renamed namespaces + phase 0
Three stale things in the launcher script:

  - ``--module_1/2/3.*`` no longer exist; review commit fd18beb renamed
    the CLI namespaces to ``--plan/interjections/vqa``. Forwarded all
    eight existing args to their new names.
  - ``--push_to_hub`` is now a bool; the destination repo lives at
    ``--dest_repo_id``. Split the single positional into both args.
  - ``openai`` was missing from the pip install list, which the prior
    review review (claude bot, 2026-05-08) flagged — the default vlm
    backend is ``openai`` so the job would have ImportError'd. Added.

Also expose the new phase 0 (canonical vocabulary discovery) knobs
explicitly: ``--vocabulary.sample_episodes``, ``--n_subtask_target``,
``--n_memory_target``. Defaults are sane (3 / 10 / 6) but worth
flagging in the example so the operator knows what they're running.

Update the docstring + section comments to match the current phase
layout (vocabulary → plan → interjections → vqa → writer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:43:06 +00:00
pepijn 86a7edc590 feat(annotate): phase 0 — derive canonical vocabulary from sample episodes
The pipeline previously emitted near-unique subtask + memory phrasings
per episode (free-form LLM rephrasing). On the downstream low-level
policy that collapses the action expert's conditioning to noise: every
episode pairs a different paraphrase with similar motions, so the
expert learns a flat scene-prior that ignores the subtask string —
then at inference the high-level head invents *yet another* paraphrase
and the expert produces tiny "uncertain hover" chunks.

Add a vocabulary-discovery phase (phase 0) that runs once per dataset:

  - watches the first ``vocabulary.sample_episodes`` (default 3)
    episode videos as one Qwen-VL prompt,
  - asks the VLM to derive ~``n_subtask_target`` canonical imperative
    subtask labels and ~``n_memory_target`` first-person past-tense
    memory milestones that recur across the demos,
  - persists them to ``meta/canonical_vocabulary.json`` (human-
    inspectable, hand-editable), and
  - wires the resulting ``Vocabulary`` into the ``plan`` module so
    every per-episode subtask + memory call is constrained to those
    exact strings (both as prompt-side instructions *and* post-VLM
    validation: paraphrases snap to the closest canonical entry via
    token-set overlap; below a 0.5 Jaccard floor the subtask is
    dropped rather than warped into something semantically wrong).

Operator workflow:

  - first run discovers the vocabulary, writes the JSON, and runs
    the ``plan`` module against it,
  - subsequent runs reuse the on-disk file (``reuse_existing=True``
    default) so hand-edits stick,
  - set ``--vocabulary.enabled=False`` to fall back to free-form
    generation (the original behaviour).

The discovery prompt forbids gerunds / third-person / adverbs and
caps the lists to the requested counts, matching the Hi-Robot /
π0.6-MEM convention of small per-environment vocabularies. The
``plan`` module's subtask + memory prompts grow a conditional
``{vocabulary_block}`` slot rendered only when a vocabulary is
present; without one the templates collapse to their previous
free-form form.

Tests: 11 new unit tests under tests/annotations/test_vocabulary.py
cover the on-disk round-trip, discovery against the fixture dataset,
``reuse_existing`` short-circuit, paraphrase canonicalisation, off-
vocab subtask dropping, and the no-vocabulary pass-through path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:40:05 +00:00
pepijn 77a16db529 fix(smolvla2): make HighLevelSubtaskFwd actually fire at low hz + quiet startup log
Two runtime fixes that surfaced from on-robot testing.

(1) HighLevelSubtaskFwd was double-gated: HzTrigger fires every period
(e.g. every 5s at --high_level_hz=0.2) AND the step requires the
action queue to be empty. The queue-empty window is brief (~tens of
ms between drain and refill) and almost never coincides with the
low-hz timer, so HL effectively never fired and the subtask shown
in the runtime panel stayed on the dataset's frame-0 annotation.

Add HzTrigger.rearm() and have HighLevelSubtaskFwd call it when
skipping due to queue-non-empty — the trigger stays armed and tries
again on the next tick instead of waiting another full period.
LowLevelForward keeps the original "skip" semantics because chunk_hz
is meant as a true upper bound on chunk-generation rate.

(2) The "robot state at startup" warning in _build_robot_observation_provider
was meant to fire once but wasn't gated by _resize_logged like the
sibling "camera ... live=AxB" warning. Result: it spammed every
observation tick (~1-2s). Gate it on first_call (snapshot of
_resize_logged["done"]) so both logs fire once at session start.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:04:12 +00:00
Pepijn 8194897994 fix(deps): cap placo below 0.9.16 and harden kinematics import (#3647)
* fix(deps): cap placo below 0.9.16 and harden kinematics import

placo 0.9.16 links against liburdfdom_sensor.so.4, which is unavailable
on Ubuntu 24.04 (noble ships urdfdom 3.x). Importing placo on that base
crashes with:

  ImportError: liburdfdom_sensor.so.4.0: cannot open shared object file

This broke nightly Latest Deps tests (CPU and GPU) when the lockfile
upgrade picked placo 0.9.16, since lerobot.model.kinematics
unconditionally imports placo when _placo_available is true, and that
check (importlib.util.find_spec) cannot detect dlopen failures of
transitive shared libraries — so unrelated subsystems (RL actor,
gym_manipulator) became unimportable.

Two changes:

1. Pin placo to <0.9.16 in pyproject.toml + regenerate uv.lock
   (0.9.16 → 0.9.15). Short-term unblock for nightly CI until system
   urdfdom 4.x is broadly available.

2. Harden the import guard in src/lerobot/model/kinematics.py:
   wrap 'import placo' in try/except ImportError so a missing
   transitive .so no longer crashes module import. RobotKinematics
   instantiation now raises an informative ImportError citing the
   underlying dlopen failure via _raise_if_placo_unusable().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(kinematics): hoist _placo_runtime_error to module scope for mypy

Mypy walks the TYPE_CHECKING branch in which the runtime else-block is
not executed, so _placo_runtime_error was only defined at runtime and
mypy reported 'Name "_placo_runtime_error" is not defined' on the
three references inside _raise_if_placo_unusable. Declare the symbol
unconditionally at module scope with a default of None; the runtime
import-failure branch still assigns to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(kinematics): drop verbose comments around placo import guard

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 12:03:07 +02:00
pepijn ca1b951e7b feat(pi05): expose lm_head_lr_scale for stronger text-CE gradient
With knowledge_insulation=True the LM head only receives gradients on
text-CE samples (e.g. ~45% of the mix for subtask_mem.yaml). Under
aggressive cosine LR decay this is enough for the head's first-token
distribution to drift back toward PaliGemma's pretrained <loc>
detection prior — teacher-forced argmax stays high while autoregressive
generation collapses to <locDDDD> tokens.

Add `lm_head_lr_scale` (default 1.0, no behavior change) on PI05Config.
When != 1.0, PI05Policy.get_optim_params splits the policy into two
param groups: the PaliGemma lm_head projection plus its tied
embed_tokens at lr * lm_head_lr_scale, and the rest at lr. The cosine
scheduler multiplies both groups by the same lambda each step, so the
ratio is preserved across decay.

Recommended starting point for pi052 + subtask_mem.yaml runs: 5.0,
combined with a higher scheduler_decay_lr floor (e.g. 5e-6 instead of
1e-6) so the head doesn't get starved in the second half of training.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 09:56:46 +00:00
pepijn 9d30d91021 fix(pi052,smolvla2): unblock text generation when LM head drifted to <loc>
PaliGemma's pretraining puts heavy first-token mass on its <loc0000>..
<loc1023> ids at any "Assistant:" continuation. Our pi052 fine-tunes
with knowledge_insulation=True and a small text-CE budget (~45% of
samples) drift back toward that prior on long runs at low LR — teacher-
forced argmax stays at 100% (CE only measures next-token given correct
prefix) while autoregressive first-token selection collapses onto <loc>.
On the running poulain11 checkpoint at step 8000 this manifests as a
stream of <locDDDD> tokens for every subtask call — confirmed locally
against the saved checkpoint on a dataset frame.

Add a `suppress_loc_tokens` knob to `PI052Policy.select_message` that
masks ids [256000, 257024) to -inf before sampling, and pass it from
the three text-only inference steps (HighLevelSubtaskFwd,
MemoryUpdateFwd, UserInterjectionFwd). VQA steps keep the default
False so spatial answers can still emit locs. Verified end-to-end:
suppressed → "the robot arm moves the blue block to the green basket".

Also fix `_msgs_for_memory`: it was emitting the older
`User: ${task}\nPlan:..\nMemory:..` / `Assistant: ${subtask}` template,
which no longer matches the `memory_update` recipe layout
(`User: ${task}` / `Assistant: Previous memory: ..` /
`User: Completed subtask: ..`). The new prompt mirrors the training
recipe; `HighLevelSubtaskFwd` stashes the just-completed subtask in
`state['prior_subtask']` so the memory prompt can render
`Completed subtask: ..` for `MemoryUpdateFwd`.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 09:50:14 +00:00
Haoming Song 9f437d86b6 fix(groot): align GR00TN15Config with transformers config dataclasses (#3606)
* fix(gr00t): fix gr00t config dataclass init TypeError

* fix(groot): guard strict config decorator without transformers for passing CI

---------

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-05-22 10:31:04 +02:00
Haoming Song b74a551d38 fix(pi0, pi05): stabilize torch.compile and expand test coverage (#3610)
* chore(gr00t): sync with #3606 for fixing gr00t config crash

* fix(pi0&pi05): fix graph break caused by deepcopy of past_key_values in sample_actions

* fix(pi0&pi05): fix frequent recompile caused by compute_layer_complete

* feat(test): add compile test and benchamrk for pi0 and pi05

* feat(test): add comprehensive testing for pi0 and pi05. Including processor, forward, sample action, etc.
2026-05-22 10:29:34 +02:00
Nikodem Bartnik c0a2e9814d fix examples (#3623)
- Fixed broken API examples in Lerobot Imitation Learning Documentation
- Teleoperation with cameras improved by adding a fixed frequency in the loop (without it the cameras feed gets very slow)
- Wrapped record example script in main() to avoid problems on Mac
- Previously teleoperation example was using SO-ARM and teleoperation with cameras was using Koch. I changed it to use SO-ARM in all of the examples.
- Added section on how to train with HF Jobs - CLI and Python examples
- Replaced lerobot-record with lerobot-rollout in policies examples
2026-05-21 22:14:07 +02:00
pepijn e050d0fe0a fix(recipes): use active_at for memory_update, rebalance subtask_mem
memory_update was bound to `emitted_at(t, style=memory)`, which requires
the frame's exact timestamp to match a memory annotation. Memory rows are
placed at subtask-boundary timestamps and at 30 fps that's ~1% of frames,
so 99% of memory_update draws couldn't render and silently fell through
to _fallback_low_level_render — injecting task-conditioned low-level
training on ~30% of samples (subtask_mem.yaml).

Switch to `active_at`. At inference `MemoryUpdateFwd` is triggered on
`subtask_change` events, but the model only needs to learn the stateless
mapping (prior_memory, completed_subtask) -> current_memory. active_at
supervises this mapping on every frame inside a subtask interval, against
varied observations; the trigger lives outside the model. Net effect:
memory_update renders on ~87% of frames, the fallback leak drops from
~30% to ~4%, and memory CE gets a meaningful (not 0.3%) training share.

subtask_mem.yaml: rebalance to 0.30 / 0.55 / 0.15 so memory CE is
~13% effective and the freed weight goes to low_level_execution.
subtask_mem_vqa_speech.yaml: keep weights (memory_update=0.10 was
already balanced against the other text-CE branches).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 14:53:13 +00:00
pepijn 2ca030fa28 fix(pi052): build processors from current config
When fine-tuning from pi05_base, reuse only the pretrained weights so pi052 still generates recipe text labels and FAST action labels.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 13:54:29 +00:00
pepijn 36f828221c fix(pi05): preserve pretrained paligemma lm head
Keep the PaliGemma LM head in float32 and initialize it from pretrained weights or token embeddings when loading pi05 checkpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 13:25:24 +00:00
Pepijn d41d874581 fix(pi052): debug parity harness truncates prompt instead of masking
The parity check in debug_text_predictions was producing false ✗
DIVERGED reports. Root cause: I built the "inference" batch by
zero-masking the attention past the supervised span, but kept the
full 512-token padded sequence. select_message reads the prompt-end
hidden state via ``vlm_out[:, -1:]`` — the LAST position of the
prefix — which in a padded batch is a padding-token hidden state,
not the last prompt token. PaliGemma's prior on those padded
positions reliably argmaxes to <loc0879>, falsely flagging a
training/inference mismatch.

Fix: truncate both tokens AND mask to length == first_sup before
calling select_message, mirroring what the real runtime does
(``tokenizer(prompt)`` returns un-padded ids). Now the parity check
compares like-with-like.

The actual training argmax in the dump was sensible English
("' move the blue cube into the green bin'" at acc=6/9) — the head
is learning correctly. The "<loc>" salad was purely the harness
reading from the wrong position.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 15:09:36 +02:00
Khalil Meftah bac4f61eae refactor: support custom progress parquet overlays (#3640) 2026-05-21 14:32:10 +02:00
Pepijn efa05f0ada fix(train): unwrap DDP policy in debug_text_predictions hook
At training time the policy is wrapped by Accelerator/DDP into a
.module attribute and custom methods are NOT proxied through the
wrapper, so ``hasattr(policy, "debug_text_predictions")`` was False
and the periodic dump was silently no-op'ing. Walk through .module
indirection to reach the raw PI052Policy that defines the method.

Also surface why the dump didn't fire (no method / empty supervised
positions / generation error) so users can see what's blocking it
instead of staring at silence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:41:20 +02:00
Pepijn e98b6f726b feat(train): debug dump runs inference too, with parity check
Extends the periodic LM-head dump (LEROBOT_DEBUG_PREDS_EVERY) to ALSO
run select_message autoregressively on the same prompt prefix and show:

  prompt                          : '<bos>User: ... Assistant: '
  target  (ground truth)          : ' close the gripper ...'
  training argmax (teacher-fed)   : ' close the gri lift ...'  acc=12/15=80%
  inference (autoregressive)      : ' close the gripper around ...'
  first-token parity              : train=3387 (' close') vs infer=3387 (' close')  ✓ MATCH

The first-token parity check is decisive: training-side argmax at the
prompt-end position and inference's first generated token both compute
``argmax(lm_head(h_last_prompt))`` on identical context, so they MUST
match. Any divergence signals a training↔inference bug (mask, dtype,
KI routing, embedding scale, etc.). Subsequent tokens can diverge
because training uses teacher forcing while inference free-runs.

debug_text_predictions now also returns an ``inference`` list keyed
by sample, each entry carrying ``first_sup_pos`` and ``decoded``.
Limited to 24 new tokens per sample to keep the dump fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:27:32 +02:00
Pepijn f7747d02a9 feat(train): periodic LM-head prediction dump for live debugging
Adds an opt-in diagnostic that, every N training steps, dumps 5 batch
samples plus the LM head's argmax prediction at every supervised
position alongside the label and a ✓/✗ marker — the cheapest signal
for "is text training actually learning what we expect, or collapsing
to a fixed token". Refills the recipe-sample dump budget on the same
cadence so the raw input shapes are also re-dumped.

Opt in via env var:
  LEROBOT_DEBUG_PREDS_EVERY=1000 lerobot-train ...

PI052 implements ``debug_text_predictions`` (mirrors the text-loss
forward but returns argmax instead of CE); other policies are silently
skipped. The dump runs in eval() mode under no_grad, slicing the
current batch to N samples — no extra data fetch, no train-state
mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:23:05 +02:00
pepijn 86ecd4bc2e add subtask memory training recipe
Add a recipe that blends subtask prediction, low-level execution, and memory update supervision.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 09:56:10 +00:00
pepijn 28b86449a2 fix(pi05): cast attention masks to model dtype
Ensure attention masks follow the backbone dtype during bf16 inference to avoid mixed dtype failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 09:52:46 +00:00
Virgileboat f4b834844e Feat/clean can bus (#3526)
* change timeout  for handshake

* enforce last state read when querry

* change import order

* fix(motors): flush stale robstride RX and harden feedback drain

* robstride: remove redundant timeout and max_messages casts

* bugfix + %-style

* update exception catch
2026-05-21 11:44:04 +02:00
Pepijn 5bb2da4da6 fix(pi052): VQA target format = "label <loc><loc>" not "<loc><loc> label"
The trained model collapsed to spewing 40+ <loc> tokens for *every*
prompt — subtask, memory, anything — because VQA targets were supervised
to *start* with <loc>. With ~25% of all text samples beginning with a
<loc> token, the LM head learned "Assistant: → <loc>" as a strong
attractor; once one loc is emitted, autoregression chains the rest.

Flip the format so every text target — subtask, memory, speech, AND VQA
— starts with a regular word. The model still learns the <loc>
vocabulary for the spatial portion of the answer, but loc can no
longer be the first generation step out of a clean prompt.

Examples:
  point  : "green box <loc0162><loc0759>"
  bbox   : "cube <loc0082>…<loc0409>"
  multi  : "blue <locs> ; yellow <locs>"

The runtime parser (parse_loc_answer) strips loc tokens and uses the
remainder as label, so it's order-tolerant and works under either
format. Old loc-first checkpoints still parse cleanly at inference;
new training will use label-first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 18:56:48 +02:00
Pepijn f7b989ad97 fix(pi052): read backbone dtype from q_proj, not first parameter
select_message's bf16 cast used next(paligemma.parameters()).dtype,
which lands on a fp32-kept param (norm / embedding) under
to_bfloat16_for_selected_params. Mask stayed fp32 while q/k/v were
bf16 → SDPA still raised "invalid dtype for bias". Read the dtype
from layers[0].self_attn.q_proj.weight instead — q_proj is always
cast with the rest, so its dtype matches what SDPA sees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 18:46:08 +02:00
Pepijn 3b4376aa33 fix(pi052): cast attention bias to model dtype for bf16 inference
`_prepare_attention_masks_4d` always returns fp32 (the 0.0 / -inf
literals); with bf16 weights, HF PaliGemma's SDPA path raises
"invalid dtype for bias - should match query's dtype" and
select_message returns empty every step. Cast in both attention
sites: `_compute_layer_ki` (training, when both experts run) and
`select_message` (inference, VLM-only branch). Bf16 training +
bf16 inference now run end to end with no dtype mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 18:42:26 +02:00
Pepijn a0233f53f4 feat(annotate): default VLM to Qwen3.6-35B-A3B-FP8
Match the production target used in examples/annotations/run_hf_job.py.
Per Scale Labs' dense-captioning ablations, model capacity dominates
prompt-engineering gains; defaulting to the larger model avoids
shipping a worst-tier configuration out of the box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 11:46:59 +02:00
Pepijn 34269a5d78 fix(pi052): register PaliGemma <loc> tokens so they tokenize as single ids
THE bug behind the <loc>-salad. PaliGemma's vocab reserves ids
[256000, 257023] for <locDDDD> detection / pointing tokens, but the
stock AutoTokenizer does NOT match them on raw text — it BPE-splits
<loc0162> into SEVEN pieces (<, loc, 0, 1, 6, 2, >). So a VQA target
like "<loc0162><loc0759> green box<eos>" tokenized to 16 pieces, not
5, and training the LM head supervised those generic BPE pieces
instead of one detection-vocab id. The piece logits got pumped up
across ~25% of supervised positions; at inference they dominated
every turn — even subtask prompts produced <loc>-salad followed by
the actual answer.

Register the 1024 <locDDDD> tokens via tokenizer.add_tokens once on
load, in every path the policy uses: PI052TextTokenizerStep (training
encode), _build_text_batch_pi052 (runtime encode), and
select_message's default tokenizer (runtime decode). Verified
empirically with the real PaliGemma tokenizer: VQA target now
tokenizes to 5 ids matching the loc-vocab range (256162, 256759, ...)
with correct offset_mapping.

This unlocks PaliGemma's actual detection prior; <loc>-salad cannot
recur because each <locDDDD> is a single class on the LM head, not a
character sequence the head accidentally learns to extend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 11:41:41 +02:00
Pepijn 75507491bf fix(pi052): VQA <loc> conversion treats coords as 0-1000 normalized
Confirmed empirically on the published dataset: VQA bbox/keypoint
coordinates are Qwen2.5-VL's 0–1000 normalized grounding output, NOT
pixels. Scanning 8207 samples showed x and y both spanning 0..1000
with ~30% of values exceeding the camera's pixel dimensions (which is
impossible if they were pixels).

_vqa_answer_to_loc was dividing by the observation image's H/W, so
e.g. point [742, 158] on a 640x480 wrist cam clamped x to <loc1023>
(the far-right edge) instead of mapping to <loc0760> (~74% across).
Fix: divide by 1000 — the actual Qwen scale. The conversion is now
camera-resolution-independent, so _camera_image_shapes and the
image_shapes plumbing through __call__ / _encode_messages /
_messages_vqa_to_loc are dropped. Tests updated to the new signature
and the 0–1000 round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:21:28 +02:00
Pepijn 88519cb14c fix(pi052): quantile-normalize actions before FAST tokenizer fit
base.fit() rejected the data with "Vocab size 1024 is too small for
the range of tokens 9339": the FAST tokenizer was fit on raw
motor-unit actions, whose DCT-token range vastly exceeds the 1024
codebook.

Two problems, one fix. (1) Raw actions blow up the token range. (2) At
training time ActionTokenizerProcessorStep runs after the QUANTILES
NormalizerProcessorStep, so it encodes normalized actions — fitting on
raw actions mismatches that space. Replicate QUANTILES normalization
(per-dim [q01,q99] -> [-1,1], clipped) before base.fit() so the fit and
the training-time encode see the same distribution and the token range
fits the codebook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:02:20 +02:00
Pepijn bc0c993b25 fix(pi052): FAST tokenizer fit read actions from column, not ds[i]
fit_fast_tokenizer collected action chunks via ds[i]["action"], which
builds a full training item — delta-timestamp expansion, video decode,
image transforms. A single video-decode failure threw, was swallowed
at debug level, and silently starved the fit of every chunk → "FAST
fit collected zero action chunks", falling back to the universal
tokenizer.

Read the ``action`` column straight from the HF dataset instead: it
carries no video, so it is immune to decode errors and far faster.
Also fail fast with a clear message when the dataset has no ``action``
feature or all episodes are shorter than chunk_size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:51:53 +02:00
Pepijn ddf4bc2063 fix(pi052): knowledge insulation crashed on wrong _gated_residual import
_compute_layer_ki called modeling_gemma._gated_residual, but that
adaRMSNorm gated-residual helper is a lerobot helper in pi_gemma, not
part of HF transformers — so enabling knowledge_insulation crashed with
AttributeError on the first training step. Import _gated_residual from
pi_gemma, matching pi05's own layer code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:48:02 +02:00
Pepijn b7317b6c29 test(pi052): round-trip coverage for VQA <loc> conversion
Pins JSON pixel coords -> PaliGemma <loc> -> runtime parse back: the
conversion preserves coordinate order (JSON x-first, <loc> y-first) and
per-axis normalization, losing only <loc>-grid quantization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:24:24 +02:00
Pepijn c026aed8f8 feat(pi052): train VQA spatial answers in PaliGemma <loc> format
Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate
JSON, which fights PaliGemma's detection prior and leaks <loc>-token
salad at inference. Convert them to PaliGemma's native <locNNNN>
vocabulary instead so the LM head reuses that prior.

Training side (text_processor_pi052.py): a target turn whose content
parses as a bbox/keypoint answer is rewritten to <loc> text, using the
camera frame's native (H, W) from the observation and the preceding
image block. Non-spatial answers, subtask/memory targets and SmolVLA2
keep their JSON form — the dataset stays backbone-agnostic.

Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects
<loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized
[0,1] coords with a normalized flag; draw_vqa_overlay denormalizes
against the chosen camera frame's pixel size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 20:23:46 +02:00
pepijn e425dfd624 fix(processor): fallback to task message when recipe misses
Keep action-only samples trainable by rendering the task as a low-level user message when no recipe branch matches.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 15:32:09 +00:00
Pepijn 15f79b5e5e fix(pi052): supervise an EOS token at the end of each text target
PI052TextTokenizerStep masked text_labels over the assistant turn's
*content only* — the trailing newline was excluded and no EOS token was
ever a supervised label. So the LM head was never given a stop signal:
at inference select_message decoded to max_new_tokens, producing the
runaway subtask paragraphs and the "}"}"}-style VQA tails.

_format_messages now appends the tokenizer's EOS to each supervised
target turn and extends that turn's span to cover it, so the EOS lands
in text_labels. _shifted_ce then trains "<last content token> -> EOS"
and the model learns to terminate; select_message stops on it.

Inference callers (the runtime's _build_text_batch_pi052) pass no
target_indices / eos_token, so no EOS is baked into the prompt — the
model generates it. Verified end-to-end with the PaliGemma tokenizer:
the supervised span is `<content><eos>` and the trailing newline stays
unsupervised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:22:22 +02:00
pepijn 2ea0da2d9f fix(annotate): tag uploaded dataset revision
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 12:44:35 +00:00
Pepijn 725ac95b0d feat(runtime): make the interactive runtime drive PI052 too
The runtime's text path was hard-wired to SmolVLA2: _build_text_batch
read policy.config.vlm_model_name (which PI052Config doesn't have) and
built a SmolVLM2 chat-template prompt. PI052/PaliGemma is not
chat-pretrained and trains on a flat `User: ... \nAssistant: ...`
prompt, so the runtime crashed or fed an out-of-distribution prefix.

- _build_text_batch now dispatches on policy.config.type: smolvla2 ->
  chat template (renamed _build_text_batch_chat); pi052 -> flat
  role-prefixed text via PI052TextTokenizerStep's own _format_messages /
  _strip_blocks / _flatten_say_tool_calls, so the inference prefix
  matches PI052 training exactly.
- Add a lerobot-pi052-runtime entry point (alias of the same main; the
  policy type is read from the checkpoint) so the command name isn't
  misleading. argparse prog now defaults to the invoked command name.

PI052's select_message / predict_action_chunk already work with the
runtime; this was the one SmolVLA2-only coupling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:28:55 +02:00
Pepijn 7b64e5498d revert(annotate): move memory + speech prompts to base PR (#3471)
The first-person memory narrative, task-rephrasing and initial-speech
prompt tweaks belong in the annotation pipeline itself. Applied to
feat/language-annotation-pipeline (#3471); reverting them here to the
merge-base so they drop out of this PR's diff. general_vqa.py keeps its
docstring fix since it references a recipe this PR introduces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:17:52 +02:00
Pepijn 134a707c7a feat(annotate): first-person memory narrative + shorter speech prompts
- module_1_memory: rewrite as an explicit first-person, past-tense
  narrative ("I picked up...", "I opened...") matching the MEM
  (Torne 2026) running-memory style, instead of "one or two short
  sentences" with no person/tense guidance.
- module_1_task_rephrasings: bias rephrasings toward short imperative.
- module_2_initial_speech: prefer very short robot acknowledgements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:17:30 +02:00
Pepijn 182f10184f revert(annotate): move pipeline changes to base PR (#3471)
The deterministic-plan rewrite, single-frame VQA (K 3->1), dataset
version tagging, telegraphic-subtask prompt and shorter interjection
prompt belong in the annotation pipeline itself, not in the SmolVLA
training PR. They have been applied to feat/language-annotation-
pipeline (#3471). Reverting these six files here to the merge-base so
they drop out of this PR's diff; #3491 will inherit the canonical
versions when it next rebases on its base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:07:23 +02:00
Pepijn ce47075d6b feat(annotate): deterministic plan, single-frame VQA, dataset tagging
Port the steerable-pipeline refinements developed on feat/smolvla-on-
steerable back into the annotation pipeline itself:

- module_1_subtasks: imperative verb-first telegraphic labels with a
  consistent-object-noun rule and good/bad examples (no hard word cap).
- _generate_plan: drop the VLM round-trip; the plan is now a
  deterministic numbered list of still-todo subtasks, re-emitted at
  every subtask boundary so it shrinks as work progresses. Removes
  module_1_plan.txt.
- VqaConfig.K 3 -> 1: a VQA pair anchors exactly its emission frame, no
  stale-label temporal smear.
- lerobot-annotate: tag the pushed dataset with its codebase_version so
  LeRobotDataset can resolve a revision and load it.
- module_2_interjection: shorter, more natural mid-task cues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:06:15 +02:00
Pepijn 26013da699 feat(annotations): enforce imperative verb-first subtask phrasing
Rewrite module_1_subtasks prompt to produce short imperative commands
("pick up the orange") instead of third-person narration ("the robot
arm moves to the orange"). Drops the verbose "how, not what" rule and
adds a good/bad few-shot table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:53:20 +02:00
pepijn bb31988915 fix(pi052): pass 4d masks to prefix-only forwards
Convert PI052 prefix-only attention masks before calling PaliGemma so text-only batches and generation use the same mask shape as fused training.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 21:07:13 +00:00
pepijn 2629175d2d fix(pi05): use fused AdamW by default
Route full PI05/PI052 fine-tuning through PyTorch's fused AdamW path to avoid the single-tensor Adam denominator allocation near GPU memory limits.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 19:23:17 +00:00
pepijn 2b4c5f49e3 fix(pi05): disable foreach AdamW by default
Avoid the multi-tensor AdamW temporary that can OOM full PI05/PI052 fine-tuning near GPU memory limits.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 18:58:17 +00:00
pepijn 22c9c4905e fix(pi052): avoid dense CE over padded tokens
Select only supervised text and FAST action-code positions before cross-entropy to avoid full-vocabulary loss tensors over padded sequences.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 18:40:34 +00:00
pepijn 7960cc14ec fix(pi052): call policy preprocessing helpers
Use PI05Policy helpers for action padding and image preprocessing in PI052 fused losses instead of looking them up on the inner PI05Pytorch module.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 17:52:47 +00:00
pepijn 1750a87104 fix(pi052): handle batched rendered messages
Tokenize batched recipe outputs in PI052 so training batches with nested message lists do not crash before model forward.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 17:41:58 +00:00
pepijn 0e2dc1b76f fix(pi052): supervise only FAST action-code tokens
Mask the FAST auxiliary loss to discrete action-code tokens so wrapper formatting tokens do not affect action co-training.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-18 17:38:34 +00:00
Pepijn 474c5478d9 tune(annotations): VQA emission anchors a single frame (K 3 -> 1)
Module 3 anchored each VQA emission tick to K=3 consecutive frames
(~0.1s at 30fps). The VLM grounds the answer — bbox/keypoint
coordinates especially — against the first frame's image, so copying it
onto frames 2-3 smears a stale label over a moving scene.

Default K=1: a VQA pair lands on exactly its emission frame, no
temporal smear. VQA frames get sparser; the WeightedEpisodeAwareSampler
(vqa_target_fraction) is the knob to compensate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:24:36 +02:00
Pepijn f72b28738a fix(annotate): default keyframe decode to ffmpeg CLI (thread-safe)
The decoder chain tried torchcodec first, then ffmpeg. torchcodec is
not thread-safe: under the executor's 16-wide concurrent decode in the
interjections phase it SIGSEGVs (exit 139) before the ffmpeg fallback
is ever reached — uncatchable, so it kills the whole job.

Default the auto chain to ffmpeg only. Per-frame ffmpeg decode runs in
an isolated child process: crash-safe and concurrency-safe (the plan
phase already proved 16 parallel ffmpeg subprocesses are fine).
torchcodec / pyav remain available via an explicit video_backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:40:29 +02:00
Pepijn 1bd53cc7da fix(annotate): decode keyframes via ffmpeg CLI fallback
PyAV segfaulted (exit 139) decoding the AV1 streams modern LeRobot
datasets use — a SIGSEGV that the per-episode try/except cannot catch,
killing the whole job when the interjections phase started.

Replace the PyAV fallback with _decode_frames_ffmpeg, which shells out
to the ffmpeg CLI: a full ffmpeg build decodes AV1, and a child-process
crash is a catchable non-zero exit rather than a segfault. Decoder chain
is now torchcodec -> ffmpeg. _decode_frames_av stays available behind
video_backend="pyav" for callers that want it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:08:31 +02:00
Pepijn 0f5f0e4091 refactor(recipes): rename recipes, drop pi05_hirobot
- hirobot.yaml            -> subtasks_vqa.yaml
- hirobot_memory.yaml     -> subtask_mem_vqa_speech.yaml
- pi05_hirobot.yaml       -> deleted (stale: uses plan, top-camera names;
  superseded by the two recipes above)
- smolvla2_hirobot.yaml   -> deleted (was untracked stale junk)

Updated the smolvla2 / pi052 `recipe_path` config defaults, all
docstring / comment references, the annotation-pipeline + recipe docs,
and the three tests that loaded pi05_hirobot.yaml (repointed to the
renamed recipes; the low-level-branch and pipeline-render assertions
now accept a flow-only `low_level` stream as valid supervision, since
the new recipes' low_level_execution has no text-CE target).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:02:15 +02:00
Pepijn 7128bb1769 fix(annotate): decode keyframes via PyAV directly
The pyav fallback routed through lerobot's decode_video_frames(backend=
"pyav"), which uses torchvision.io.VideoReader — removed in torchvision
0.23+. On modern torch stacks (e.g. vllm-openai with torchvision 0.26)
both torchcodec and that path fail, leaving interjection/vqa prompts
without visual context.

Add _decode_frames_av: a self-contained PyAV decoder that picks the
nearest frame per timestamp. It is the always-available tail of the
decoder chain (torchcodec -> pyav) and the target of --video_backend=pyav.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:45:04 +02:00
Pepijn 426d48dbbf fix(pi052): port the smolvla2 text-head fixes to pi052
pi052 had the same text-CE collapse bug smolvla2 had — PaliGemma's
embed_prefix flags the language block att=0, so make_att_2d_masks makes
it fully bidirectional and the text cross-entropy degenerates into a
copy task. Ported the three model-specific fixes:

- _mark_target_span_causal: set att=1 on supervised target language
  positions so the text-CE is genuine causal next-token prediction.
  Applied in both _compute_all_losses_fused and _compute_text_and_fast_loss.
- flow_loss_weight 10.0 -> 5.0: the paper's a=10 swamps the LM head once
  the flow-only low_level recipe fires often (matches SmolVLA2Config).
- _flatten_say_tool_calls in the text tokenizer: serialize `say` tool
  calls into a <say>...</say> marker so the spoken reply is tokenized
  and supervised (PaliGemma's flat prompt has no structured calls, so
  they were dropped entirely).

select_message needed no change: pi052's prefix is [images, language]
with no trailing state token, so it already decodes from the last
language token.

Regression tests mirror the smolvla2 attention-masking + tool-call suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:42:19 +02:00
Pepijn fbcb9225f5 feat: oversample sparse VQA annotations (recipe consumption + weighted sampler)
VQA annotations are sparse, so VQA was badly underrepresented in training:
its effective share was weight x density, and blend draws that picked an
ask_vqa* sub-recipe for a non-VQA frame were wasted entirely.

Two pieces:

1. Recipe-side consumption (language_render.py): render_sample now routes
   any frame that carries a VQA annotation to a matching ask_vqa* sub-recipe,
   regardless of the weighted blend draw. No VQA annotation is wasted and no
   draw lands on a non-renderable VQA recipe — VQA's recipe-side share now
   equals the VQA-annotation density.

2. Dataset-side oversampling (WeightedEpisodeAwareSampler + vqa_target_fraction):
   a new weighted, episode-aware sampler draws frames with replacement by
   per-frame weight. When TrainPipelineConfig.vqa_target_fraction is set, the
   train script scans language_events, weights VQA frames so they make up
   ~that fraction of the training stream, and uses the weighted sampler. This
   is what actually lets VQA exceed its natural density. Default None keeps
   uniform episode-aware sampling unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:30:00 +02:00
Pepijn 31e0c15e55 fix(annotate): pyav fallback when torchcodec keyframe decode fails
VideoFrameProvider decoded keyframes via torchcodec only. Some containers
(e.g. vllm-openai) ship a torchcodec that cannot push packets to the
decoder ("Operation not permitted"), silently degrading interjection/vqa
prompts to no visual context.

_decode now retries with pyav when the default backend raises, and a new
`video_backend` config field lets callers pin the backend explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:23:53 +02:00
Pepijn c5676ef1b3 feat(annotate): add dest_repo_id for separate push target
Adds an optional `dest_repo_id` to AnnotationPipelineConfig. When set,
`push_to_hub` uploads the annotated dataset there instead of overwriting
the source `repo_id`, restoring separate source/destination repos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:05:23 +02:00
Pepijn b319ccf688 fix(smolvla2): only prompt for a camera when a VQA overlay is drawn
The VLM already sees every camera, so the operator never needs to name
one to ask a question. Move the camera prompt to after generation and
only fire it when the answer actually carries a bounding box / point
(whose pixel coordinates are camera-specific and need a target frame).
Non-spatial answers (count / attribute / spatial / plain text) now skip
the prompt entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:50:19 +02:00
Pepijn 3174e14bc0 fix(smolvla2): feed all cameras to VQA generation, not just the chosen one
handle_vqa_query filtered the observation down to the single chosen
camera before calling the VLM. But training feeds every camera: the
ask_vqa_* recipes' image blocks are stripped before tokenization and
the frames reach the model via OBS_IMAGES_*, where embed_prefix
consumes all config.image_features regardless of the per-camera recipe
tag. Filtering to one camera changed the image-token count in the
prefix (the dropped camera zero-padded with mask=0) — a prefix shape
the model never saw at training.

Now the full observation is passed to select_message; the chosen
camera is used only to pick which frame the bbox/point overlay is
drawn on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:46:38 +02:00
Pepijn dc530e10fe feat(smolvla2): VQA example prompts in the panel; drop quotes from hints
Command arguments never needed quotes (`_strip_quotes` only strips a
matching pair if present) — `/question point to the yellow cube` works.
The hints wrongly implied `""` were required; all hints/help now show
`/action <task>` / `/question <text>`.

Also adds a reference line to the state panel showing the two
overlay-producing VQA prompt shapes:
  /question point to the yellow cube   -> point overlay
  /question detect the blue cube       -> bounding-box overlay
plus the same examples in /help.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:42:32 +02:00
Pepijn e7c5613a39 refactor(smolvla2): command-driven runtime — no startup prompts
Replace the startup mode prompt + task picker with a single
command-driven prompt. The runtime now comes up immediately at the
command line in `paused` mode (robot idle) and the operator drives it:

  /action "task"     run the robot on a task (bare = resume, number = timed burst)
  /pause             stop the action loop — robot holds position
  /question "..."    pause and answer one VQA question (camera prompt + overlay)
  /help / stop

- Removed _select_mode_interactively / _select_task_interactively /
  _dataset_task_strings (the interactive pickers).
- mode value renamed "question" -> "paused"; --mode choices are now
  action|paused (default paused).
- /question takes the question inline and runs it via _handle_slash_command
  (pauses first, so the policy isn't used concurrently).
- The ENTER-to-start gate only fires when starting in action mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:37:51 +02:00
Pepijn 516ffc7687 feat(smolvla2): --mode flag, skip task picker with --task, timed /action
Lets the operator skip the interactive startup entirely and go straight
to the command line:

- New --mode {action,question} arg; when given, the startup mode prompt
  is skipped.
- When --task is passed explicitly on the CLI, the startup task picker
  is skipped (the dataset-bootstrap task still shows the picker so you
  can override it).

Also adds a timed action burst: /action <seconds> runs the robot for N
seconds, then the autonomous loop auto-reverts to question mode and
clears the action queue. Plain /action stays unlimited.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:26:12 +02:00
Pepijn 7a68bf13d9 feat(recipes): add hirobot_memory — hirobot + memory + spoken tool-call replies
New recipe alongside hirobot.yaml (kept as the lean baseline). Superset
that adds two text-supervised sub-recipes:

- memory_update: compress progress into a memory note.
- user_interjection_response: reply to a user interjection with a `say`
  tool call only (no plan/subtask text). The SmolVLA2 chat tokenizer
  flattens the call to a `<say>...</say>` marker the runtime parses back.

Plan is intentionally omitted; memory is the only persistent high-level
state. Weights: low_level 0.40, subtask 0.25, memory 0.10, interjection
0.10, vqa 0.075 x2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:21:41 +02:00
Pepijn 15229468d0 feat(smolvla2): startup mode prompt; rename /vlm mode to /question
Add a mode prompt at startup, shown before the task picker, so the
operator chooses action (run the robot) vs question (VQA only) up front
instead of having to discover /vlm mid-run.

Also rename the VQA mode from "vlm" to the clearer "question":
- state["mode"] value is now "action" | "question"
- the command is /question (/vlm and /vqa kept as aliases)
- panels, hints and help text updated to match

handle_vqa_query now reports via both push_log and direct stdout, so
VQA answers / overlay paths are visible in autonomous question mode
where the panel redraw is suspended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:17:03 +02:00
Pepijn a9cea3e8dd fix(smolvla2): make the autonomous REPL usable for slash commands / VQA
The autonomous panel redraw cleared the screen every 0.5s, so the "> "
prompt and the one-shot command hint vanished — the operator could not
see what to type or what they were typing, making /vlm unreachable.

- Suspend the timer redraw entirely while in /vlm mode (the action loop
  is paused, nothing changes in the background) so the VQA question and
  camera prompt stay on a stable screen.
- Re-print the "> " prompt after each redraw so it is always visible.
- Show an always-on command hint in the panel (/vlm, /help, /action)
  instead of relying on the startup line that scrolls away.
- Redraw immediately after a slash command so the mode flip is visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:10:13 +02:00
Pepijn 89d4846590 fix(smolvla2): always show the startup task picker on a TTY
The picker was skipped whenever a task was already resolved — which is
always the case with --dataset.repo_id, since the dataset's canonical
task is auto-filled. The operator never got to choose. Now the picker
always runs on an interactive terminal: the resolved task is shown as
"(current)" and selected by an empty Enter, so the dataset-canonical
default still works while letting the operator pick another task or
type a custom one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:04:53 +02:00
Pepijn Kooijmans 9dfc9084e1 review: decode keyframes via video_utils.decode_video_frames
Addresses three of CarolinePascal's frames.py comments (the fourth, the
subprocess re-encode, waits on #3611):

- replace the bespoke _decode_pyav_direct PyAV decoder with
  lerobot.datasets.video_utils.decode_video_frames (torchcodec backend,
  PyAV fallback) — torchvision's VideoReader removal no longer applies
- frames flow through the provider as torch.Tensor (C, H, W uint8); PIL
  is materialised only at the VLM-message boundary in to_image_blocks /
  to_video_block, where the chat backends need it
- _decode now returns exactly one frame per timestamp (or [] on failure),
  so frames_at pairs them with strict=True

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:00:38 +02:00
Pepijn Kooijmans fd18beb3a1 review: address CarolinePascal feedback
- name the three modules everywhere (plan / interjections / vqa) instead
  of module_1/2/3 — config classes, config fields, executor params,
  staging keys and phase names now carry the module name
- rename examples/annotation -> examples/annotations; add the Apache
  header to run_hf_job.py
- drop the unused GeneralVqaModule._generate_one
- remove "PR 1" references from comments/docstrings
- frames.py: rely on the always-defined LeRobotDatasetMetadata.camera_keys
- executor.py: read/write meta/info.json via load_info / write_info
- reader.py: load meta/tasks.parquet via io_utils.load_tasks
- make --push_to_hub a bool; push the annotated dataset back to --repo_id
- move the on-disk test dataset builder into tests/fixtures
  (build_annotation_dataset); run_e2e_smoke reuses it
- clarify in the docs that the vqa module grounds each pair on a single
  frame (K = per-tick anchor count)
- hoist stdlib dynamic imports to module scope

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:03:25 +02:00
Pepijn 26cb38a7d0 feat(smolvla2): startup task picker, /vlm mode toggle, interactive VQA overlay
Three additions to the SmolVLA2 interactive runtime:

1. Startup task picker — when no --task is given, the runtime lists the
   dataset's task strings as a numbered menu (plus a custom-task option)
   instead of silently waiting for the first stdin line.

2. Mode toggle — /action and /vlm slash commands flip a persistent run
   mode. /vlm pauses the whole action loop (HighLevelSubtaskFwd,
   LowLevelForward and DispatchAction gate on state["mode"]) and clears
   the action queue so the robot holds position; /action resumes it.
   The mode is shown in the state panel.

3. Interactive VQA — in /vlm mode a typed line is a VQA question. The
   new inference/vqa.py module asks which camera to ground on, runs the
   VLM on that single camera, and when the answer is a bbox/keypoint it
   draws the overlay, saves a PNG to ./vqa_overlays/ and auto-opens it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:20:57 +02:00
Pepijn bfb8cfb432 fix(smolvla2): flatten say tool_calls into <say> marker before tokenizing
The chat tokenizer passed assistant `tool_calls` straight to
`apply_chat_template`, which renders them as a structured JSON
`<tool_call>` block — so the LM head was trained to emit JSON. But the
inference parser `_split_plan_and_say` looks for a `<say>...</say>`
marker, which the model never saw in training, so the `say` tool never
fired at inference.

`_flatten_say_tool_calls` is the missing training-time serializer (the
one `_split_plan_and_say`'s docstring already assumed existed): it
rewrites a `say` tool call into a `<say>...</say>` marker inside the
content text before the chat template runs, so the template only
tokenizes plain text and the supervised target span trains the model to
emit exactly the marker the runtime parses back (Pi 0.5-style flat
tool-call serialization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:47:31 +02:00
Pepijn 5e3b9ba82c tune(smolvla2): override optimizer_lr to 2.5e-5 for pretrained-LM fine-tuning
SmolVLA's 1e-4 is safe only because it freezes the language head. SmolVLA2
unfreezes lm_head + the last text layer and fine-tunes the pretrained
SmolVLM2 language weights; 1e-4 is too aggressive there and destabilises
generation into degenerate repetition. Match pi05's 2.5e-5 peak LR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:41:13 +02:00
Pepijn 083d3cd419 tune(smolvla2): soften flow:text loss split from 10:1 to 5:1
The Pi 0.5 α=10 split assumed text is a rare auxiliary task. With the
flow-only `low_level` recipe (~40% of the blend) now rendering, the flow
term fires often and at 10x weight dominates the shared VLM backbone,
starving the text head into degenerate repetition decoding. A 5:1 split
keeps actions primary while leaving the language head enough gradient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 16:00:08 +02:00
Pepijn bf996c7938 fix(datasets): render flow-only low_level recipes instead of dropping them
A recipe whose only supervision is the action-expert flow loss (e.g.
`low_level_execution`: `user(${subtask})` with `stream: low_level` and no
`target` turn) was rejected at render time by `_render_message_recipe` and
`_validate_rendered`, both of which required at least one target turn.

The result: every blend draw of the flow-only recipe rendered to `None`,
`predict_actions` was never set, `run_flow` never fired, and the action
expert received no flow loss — leaving it at random init. Both gates now
also accept a `low_level`-stream turn as valid supervision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 13:20:39 +02:00
Pepijn 0d88eaf8eb test(smolvla2): attention masking of the language target span
Regression coverage for the text-CE collapse bug fixed in 3cd348ff.
Pure-function tests over ``_mark_target_span_causal`` /
``_locate_lang_range`` / ``make_att_2d_masks`` — no model load, fast.

Pins:
* the target span flips to att=1, prompt/images stay att=0;
* target tokens attend causally among themselves (no peeking at
  future targets) — genuine next-token prediction;
* targets still attend bidirectionally to images + the user prompt;
* the action-expert (state) token still attends to every target;
* a no-target subtask (low_level_execution user turn, labels all
  -100) leaves the mask bidirectional;
* an explicit test documenting the bug: the raw embed_prefix mask
  lets the first target token see the last — the copy-task collapse.

Skips cleanly when transformers isn't installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:28:44 +02:00
Pepijn 3cd348ffe2 fix(smolvla2): causal mask on the text-CE target span (THE collapse bug)
Root cause of every collapsed inference run. ``embed_prefix`` flags
all language tokens ``att=0``; ``make_att_2d_masks`` turns that into
a single fully BIDIRECTIONAL block. So during the text-loss forward,
a supervised subtask token's hidden state attends to the very tokens
it is trained to predict. The cross-entropy degenerates into a copy
task — ``text_loss → ~3e-5`` not because the model learned to
predict subtasks but because it can see the answer.

At inference ``select_message`` decodes autoregressively (causally):
each token must be predicted WITHOUT seeing it — a task the model
was never actually trained on. Hence the universal collapse: a
coherent first token or two ("grasp the yellow cube"), then a loop
("cover cover cover", "icatorsicators", "the the the").

Fix: ``_mark_target_span_causal`` sets ``att=1`` on the language
positions that are supervised targets (``text_labels != -100``).
With make_att_2d_masks's cumulative-block rule each target token
then attends to images + the user prompt bidirectionally and to
EARLIER target tokens only — genuine causal next-token prediction,
matching select_message. Applied in both ``_compute_text_loss`` and
``_compute_fused_loss``. Per-sample correct: high_level_subtask
targets become causal; low_level_execution subtasks (a user turn,
labels all -100) stay bidirectional so the action expert reads them
as bidirectional context. The action expert is otherwise unaffected
— the suffix has a strictly higher cumsum and still attends to the
whole prefix.

Requires retraining: this changes the training objective. Existing
checkpoints were all trained on the degenerate copy task and cannot
generate text. Expect ``text_loss`` to settle MUCH higher than 3e-5
after this — that is correct; it is now a real prediction task.

NOTE: pi052's text path (PaliGemma prefix-LM) has the same
bidirectional-block structure and needs the analogous fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:24:44 +02:00
Pepijn db03fc6dc4 fix(smolvla2): select_message must decode from the language position
``embed_prefix`` lays the prefix out as ``[images, lang, state]`` with
the state token LAST. Training supervises the text head on the
*language* positions (``_compute_text_loss`` / ``_compute_fused_loss``
slice ``prefix_out[lang_start:lang_end]`` and run lm_head there).

But ``select_message`` started AR generation from the full prefix and
read ``prefix_out[:, -1:]`` — the **state token** — to decode the
first subtask token. The state token's hidden state exists for the
action expert to read; the lm_head was never trained to produce
subtask text from it. So inference decoded the high-level head from a
position entirely outside the training distribution: the text head
collapses (``the arm the arm``, ``grasp the surface population``,
``_333 absburg…``) no matter how cleanly ``text_loss`` converged.

Fix: truncate the state token off the prefix before the AR loop, so
``prefix_out[:, -1:]`` is the last language token (right after the
``Assistant:`` generation prompt) — exactly where training supervised.

Inference-only change — no retraining needed; existing checkpoints
benefit immediately. The action path (``predict_action_chunk``) is
untouched: state belongs in the action expert's prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:05:16 +02:00
Pepijn 56068d37ea fix(smolvla2): default load_vlm_weights=True — don't train from scratch
SmolVLAConfig defaults ``load_vlm_weights=False``. With that and no
``--policy.path``, ``SmolVLMWithExpert.__init__`` builds the VLM via
``SmolVLMForConditionalGeneration(config=...)`` — i.e. a fully
**random-initialised** 500M backbone, including a random ``lm_head``.

For plain SmolVLA that's a deliberate "pre-train the expert" mode.
For SmolVLA2 it's a footgun: the high-level text head *is* the
SmolVLM2 ``lm_head``. Training subtask prediction from a random
language model can only memorise — which is exactly the repetition
collapse seen on the real robot ("the arm the arm the arm …").

SmolVLA2 now defaults ``load_vlm_weights=True`` so every run
fine-tunes the pretrained ``HuggingFaceTB/SmolVLM2-500M-Video-Instruct``
backbone (vision tower + language model + lm_head). The action
expert still trains from scratch on the robot data (standard SmolVLA
fine-tuning); start it from pretrained too by fine-tuning a full
``lerobot/smolvla_base`` checkpoint via ``--policy.path``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:44:00 +02:00
Pepijn e727688052 annotate: telegraphic subtasks — ≤4 words, verb+object, consistent nouns
Tighten the subtask prompt further per real-data feedback. The old
≤5-word cap still produced things like "release the yellow block
into the green bin" (8 words, articles, destination, and "block"
where the task said "cube").

New rules:
* Hard cap ≤ 4 words, ideally 2-3. Form: VERB + (color) + OBJECT.
* No articles, no destinations, no adverbs, no "robot/arm/gripper".
* Must reuse the exact object nouns from the task — no block/cube,
  bin/box/container drift across the episode.
* Concrete good/bad examples anchored on the cube task.

Shorter, templated, consistent targets are far more robust for the
autoregressive LM head — fewer tokens to drift on, fewer dominant
n-grams to repetition-collapse into.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:14:42 +02:00
Pepijn f1a0a663cc fix(inference): gibberish detector catches long repetition collapse
The ``_looks_like_gibberish`` low-unique-token check was gated on
``len(stripped) < 80``, so an LM head that loops an n-gram for the
whole 256-token budget — "the arm the arm … the the the the" —
sailed straight through (``gibberish:0`` in the panel) and the
garbage subtask got accepted and fed to the action expert.

Added a length-independent check: ``>= 8 tokens`` but unique-token
count ``<= max(3, tokens // 10)`` ⇒ repetition collapse. Now the
runtime rejects the looped output and keeps the previous (real)
subtask instead of propagating nonsense.

This is a guard, not a cure — the underlying issue is the LM head
on the current checkpoint being undertrained / collapsed; re-
annotate with the short prompts and train longer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 13:52:26 +02:00
Pepijn 6e64c20cf1 runtime: stop seeding plan/memory from the dataset (unused)
The current recipe trains neither plan nor memory, and no inference
step consumes them — ``_msgs_for_subtask`` renders the bare task and
``LowLevelForward`` conditions on the subtask. Bootstrapping
``current_plan`` / ``current_memory`` from the dataset's
``language_persistent`` annotations therefore only placed a stale,
do-nothing plan in the status panel.

Keep seeding ``current_subtask`` — it's a useful first-frame
fallback for ``LowLevelForward`` before ``HighLevelSubtaskFwd``
produces its first subtask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 13:47:33 +02:00
Pepijn b29cccb37e runtime: restore the subtask hierarchy — generated subtask drives actions
Reverts the previous "condition actions on the task" shortcut.
The action expert is conditioned on the SUBTASK again:

* ``low_level_execution`` recipe back to ``user(${subtask})``.
* ``LowLevelForward`` conditions on ``current_subtask`` (falls back
  to the task only on the first frame, before the high-level loop
  has produced a subtask).
* ``HighLevelSubtaskFwd`` re-added to the runtime pipeline so the
  subtask is actually generated each high-level tick and written to
  ``current_subtask`` before ``LowLevelForward`` consumes it.
* ``_msgs_for_subtask`` now renders just ``${task}`` (no
  ``Plan: ``/``Memory: `` lines) to match the current
  ``high_level_subtask`` recipe, whose user turn is the bare task.

So the loop is: task → HighLevelSubtaskFwd (LM head) → subtask →
LowLevelForward → action chunk conditioned on that subtask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 13:43:04 +02:00
Pepijn f161e27e96 recipe+runtime: condition the action expert on the task, not the subtask
Real-robot runs shook and failed the task despite a low flow loss.
Root cause: train/inference conditioning mismatch — not a flow-loss
bug (``_compute_fused_loss``'s flow path is byte-identical to
``SmolVLAModel.forward``).

At training, ``low_level_execution`` conditioned the action expert
on ``${subtask}``, and every frame's subtask was the correct one
for that frame. At inference the runtime has no high-level subtask
generator (VQA-only pipeline), so ``current_subtask`` was frozen —
the action expert got "move towards the blue cube" for the entire
episode. Once the arm reached the cube, that (image, subtask) pair
never occurred in training → OOD conditioning → incoherent flow
output → shaking.

Fix: ``low_level_execution`` now renders ``user(${task})``. The
task is stable for the whole episode and always available, so the
action expert's conditioning is identical at train and inference
with no high-level loop required. ``LowLevelForward`` updated to
build the same ``[user(task)]`` prompt.

``high_level_subtask`` still trains the text head to predict
subtasks (kept for when a reliable subtask loop is reintroduced) —
it's just no longer on the action expert's critical path.

Requires re-training for the recipe change to take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 13:40:15 +02:00
Pepijn d5f293a1c9 recipe+runtime: VQA + subtask only — drop plan & memory
Scope reduction while the core subtask + action loop is validated:

Recipe (hirobot.yaml)
* Removed ``plan_generation`` sub-recipe entirely.
* Removed the memory tail from ``high_level_subtask`` (the
  ``new_memory`` binding + the second assistant turn).
* ``high_level_subtask`` user turn is now just ``${task}`` — no
  ``Plan: …\nMemory: …`` context.
* Weights rebalanced over the four remaining sub-recipes:
  high_level_subtask 0.40, low_level_execution 0.40,
  ask_vqa_top/wrist 0.10 each.

Runtime (inference/runtime.py)
* Pipeline trimmed to VQA + the action loop:
  AskVQAFwd → LowLevelForward → DispatchAction → DispatchToolCalls.
* Dropped HighLevelSubtaskFwd / MemoryUpdateFwd / UserInterjectionFwd
  from the default pipeline. They remain importable from
  ``inference.steps`` for when plan/memory/subtask generation is
  brought back. The action expert conditions on the task string
  directly via LowLevelForward's ``current_subtask or task``
  fallback.

This commit lands on top of a rollback of the previous two commits
(repetition_penalty / no_repeat_ngram_size knobs, and the
deterministic plan-walker) — both were bandaids for the LM-head
repetition collapse that the reduced-scope recipe sidesteps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:02:06 +02:00
Pepijn 95033733fc deps: add sentencepiece to the pi extra (FAST action tokenizer)
PI052 and PI0_FAST both load ``physical-intelligence/fast`` as
their action tokenizer. That tokenizer's HF backend requires
``sentencepiece`` to instantiate (or ``tiktoken``); without it
``AutoProcessor.from_pretrained`` raises:

  ValueError: Couldn't instantiate the backend tokenizer from one of:
  (1) a tokenizers library serialization file,
  (2) a slow tokenizer instance to convert or
  (3) an equivalent slow tokenizer class to instantiate and convert.
  You need to have sentencepiece or tiktoken installed [...]

It wasn't listed in pyproject so fresh installs missed it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:52:55 +02:00
Pepijn c3503b774f fix(debug): dumper now shows real stream + target flags
The dumper was printing ``stream=None target=None`` for every
message because it read those fields off the message dicts, but
the recipe renderer keeps them in parallel arrays
(``message_streams`` / ``target_message_indices`` in
COMPLEMENTARY_DATA) so the chat template doesn't see unknown
keys. Zip them back into the dump-time dicts so the printed
metadata is accurate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:43:51 +02:00
Pepijn 99ebee4d16 annotate: tighter subtask + memory prompts (≤5 / ≤10 words)
Both feed into the high-level prompt and the plan rendering, so
keeping them short directly reduces the rendered ``${task}\nPlan:
…\nMemory: …`` prefix the model has to chew through at inference.

Subtasks
* Hard cap: ≤ 5 words. Verb + object only, drop articles/adverbs.
* Concrete good/bad examples to anchor the VLM.

Memory
* Hard cap: ≤ 10 words. Telegraphic noun→location fragments
  ("bowl in box, lid open"), no past-tense verbs, drop attributes
  that don't matter for downstream subtasks.
* Allow empty string when no material change occurred — keeps the
  rendered memory line literally blank instead of forcing a no-op
  sentence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:28:09 +02:00
Pepijn a8ca5128b8 fix(annotate): re-emit plan at every subtask boundary
Previously only emitted a plan at t=0 and on interjections, so the
active plan rendered into training carried "done" subtasks until
the next interjection. With the new "plan = remaining subtasks"
summariser this meant the plan was stale between boundaries.

Emit a fresh plan row at every subtask start. ``active_at(t)`` then
returns a plan that contains exactly the subtasks whose start ≥
the current span's start — completed subtasks fall off the plan
the moment the next subtask begins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:26:49 +02:00
Pepijn dd97c33814 refactor(annotate): plan = summary of still-todo subtasks, drop VLM call
The plan was being generated by a separate VLM call (one per
episode + one per interjection refresh) with a prompt that asked
the model to "compress the subtasks into a compact hierarchical
plan". In practice the plans came out longer than necessary and
sometimes drifted from the actual subtask sequence the runtime
would execute.

Replaced ``_generate_plan`` with a deterministic numbered list
of the upcoming subtasks. At a refresh time the list shrinks to
subtasks whose start ≥ refresh_t — the plan describes what's
*left* to do, so it gets shorter as work progresses.

Saves the per-episode + per-interjection VLM round-trip in the
annotation pipeline and keeps train-time plan text bit-aligned
with the subtask annotations the rest of Module 1 emits.

Removed the now-unused ``prompts/module_1_plan.txt``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:55:02 +02:00
Pepijn fa45ba631b fix(policies,recipe): register PI052Config + allow flow-only sub-recipes
Two regressions surfaced by the first training run:

1. ``--policy.type=pi052`` failed with ``invalid choice``. PI052Config
   wasn't imported in ``policies/__init__.py``, so its
   ``@register_subclass("pi052")`` decorator never ran and draccus
   didn't see it as a valid policy type. Mirror PI05Config /
   SmolVLA2Config in the top-level imports + __all__.

2. ``low_level_execution`` (user-only ``${subtask}`` recipe used for
   π0.5-style flow conditioning) tripped
   ``ValueError: Message recipes must contain at least one target
   turn.`` The validator was too strict — a recipe with only a
   ``stream: low_level`` turn still drives meaningful supervision
   (flow MSE on the action expert via ``predict_actions=True``).
   Allow either ``target: true`` OR ``stream: low_level`` to satisfy
   the "supervises something" requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:51:47 +02:00
Pepijn ffd8c92ce5 fix(inference): always emit Plan:/Memory: labels in the high-level prompt
The recipe renders ``"\${task}\nPlan: \${plan}\nMemory: \${memory}"``
unconditionally — when a binding resolves to None,
``language_render._substitute`` substitutes an empty string, so the
training-time user turn always contains the literal ``Plan: `` /
``Memory: `` prefixes even with empty values.

The inference message builders were skipping those lines entirely
when ``state['current_plan']`` / ``state['current_memory']`` was
empty, producing a different prompt shape on early frames (before
the plan-generation step runs) and on datasets without plan/memory
annotations.

Factored a shared ``_hirobot_user_head`` helper used by
``_msgs_for_subtask``, ``_msgs_for_memory``, and the legacy
``_control_context_messages`` so they all match training byte-for-
byte regardless of which bindings are populated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:42:29 +02:00
Pepijn 841d3c47e1 feat(debug): LEROBOT_DUMP_RECIPE_SAMPLES=N dumps the first N rendered samples
Adds a one-shot debug dumper to both chat processors. When the env
var ``LEROBOT_DUMP_RECIPE_SAMPLES`` is set to a positive integer N,
the next N samples processed (rank-0 only) get pretty-printed:

* the recipe-rendered messages (role / stream / target / content),
* the full tokenized prompt (decoded back),
* inline ``[TGT]...[/TGT]`` markers over the spans the LM head is
  supervised on,
* token count + target-token count,
* ``predict_actions`` flag.

Usage:

  LEROBOT_DUMP_RECIPE_SAMPLES=5 sbatch train_smolvla2.slurm

After N dumps the helper becomes a no-op; training continues
unaffected. Works for both smolvla2 (chat-template renderer) and
pi052 (plain ``Role: content`` concat renderer); each processor has
its own copy to avoid cross-package imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:21:46 +02:00
Pepijn 2c920ab178 refactor(recipes): consolidate to shared hirobot.yaml + audit fixes
The smolvla2 and pi052 recipe blends had drifted to identical content
twice in a row; collapse them to a single ``recipes/hirobot.yaml``
both policies point at. Each backbone's text tokenizer (chat-template
for SmolVLA2, plain ``Role: content`` for PI052) handles the
rendering differences downstream — the recipe spec is shared.

Audit fixes folded into the same commit:

* **Train/inference prefix mismatch on the action expert**
  ``_build_text_batch`` always passed ``add_generation_prompt=True``,
  appending ``<|im_start|>assistant\\n`` tokens that the action
  expert never saw at training (the chat tokenizer renders with
  ``add_generation_prompt=False``). Parameterized the helper and
  pass ``False`` from ``LowLevelForward``; ``select_message`` paths
  still default to ``True`` for AR text generation.

* **PI052 fallthrough could silently train flow on text-only frames**
  When ``text_loss_weight=0`` AND every sample was high-level
  (``predict_actions.any()==False``), the previous heuristic
  delegated to ``PI05Policy.forward``, which ignores
  ``predict_actions`` and runs flow on every sample. Reverted to
  delegating only on fully unannotated batches.

* **SmolVLA2 silent zero-loss training**
  ``forward`` returned ``loss=0`` (no error) when neither flow nor
  text path fired. Now raises ``RuntimeError`` with the weights and
  routing flags — fails loud like PI052 already does.

* **PI052 dropout-seed key**
  Was reading ``complementary["dataset_index"]`` (only set by
  ``MultiDataset`` and means "which sub-dataset", not row index)
  with fallback to ``frame_index`` (never set) — every sample got
  seed=0, so per-component dropout was deterministic across the
  epoch. Switched to ``complementary["index"]`` to match SmolVLA2
  and the canonical ``BatchProcessor`` convention.

* **Dead ``DEFAULT_TOOLS`` import**
  Removed from ``chat_processor_smolvla2.py`` — unused since the
  default-tools list was switched to ``[]`` in the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:28 +02:00
Pepijn 9f630e2a41 fix(recipes,training): stop tool prompt leak + drop subtask copy-supervision
CRITICAL (smolvla2) — the SmolVLM2 chat template was rendering the
``say`` tool's JSON schema as a system message on every training
sample because ``DEFAULT_TOOLS`` was the default in
``SmolVLA2ChatTokenizerStep``. That schema was only relevant to
the now-removed ``user_interjection_response`` recipe; with it
gone the schema is dead weight that polluted every action-expert
prefix AND created a train/inference mismatch (the inference
``_build_text_batch`` doesn't pass ``tools=``). Default is now
``[]``; callers needing tools can still set them via
``with_tools(meta.tools)``.

LIKELY-BUG — ``low_level_execution`` had ``target: true`` on its
assistant turn, so text-CE trained the LM head to predict the
same subtask string the user just stated (trivial "copy previous
turn" supervision that diluted LM head capacity). Dropped the
assistant turn entirely; ``high_level_subtask`` (w=0.50) already
owns subtask prediction from real context.

The chat-tokenizer's ``predict_actions`` detection used to scan
target streams only. With the new no-target low_level recipe it
would mis-fire as False. Switched both
``chat_processor_smolvla2.py`` and ``text_processor_pi052.py`` to
scan all message streams — any ``stream: low_level`` on the
sample is enough to trigger flow loss.

Inference: the low-level loop sends only ``[user(subtask)]`` now,
matching the new recipe shape.

PI052 — hardened the forward fallthrough so a degenerate batch
where every sample's recipe is text-only AND text supervision is
disabled (text_loss_weight<=0 or text_labels missing) cleanly
delegates to ``PI05Policy.forward`` instead of raising
"nothing to train".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 14:59:01 +02:00
Pepijn 7a32f8a72a refactor(recipes): π0.5-style split — action expert conditions on subtask only
Previously ``action_execution`` rendered ``task + plan + memory +
subtask`` into one prefix and ran the flow loss on it. That meant
the action expert was conditioned on the full hierarchical context
(closer to π0.7 §V.A), not just the subtask.

The π0.5 paper's hierarchical inference has the action expert see
only the *subtask* (plus images and state). Split the recipe to
match:

  high_level_subtask  (0.50)
    user(task + plan + memory) → assistant(subtask)
    [+ assistant(new_memory) at boundary frames]
    All ``stream: high_level`` → text-CE only, no flow loss.

  low_level_execution (0.30)
    user(subtask) → assistant(subtask)
    Both ``stream: low_level`` → flow loss fires; text CE on the
    subtask is a small redundant extra signal. Prefix the action
    expert sees: [images, subtask, state].

  plan_generation (0.10) — unchanged.
  ask_vqa_{top,wrist} (0.05 each) — unchanged.

Runtime: the low-level loop in ``smolvla2/inference/steps.py``
now sends ``[user(subtask), assistant(subtask)]`` to
``predict_action_chunk`` instead of the full task+plan+memory
context. Falls back to ``state['task']`` when no subtask has been
generated yet so the first frame still has something to condition
on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 14:13:07 +02:00
Pepijn 129aa207e3 fix(smolvla2,pi052): training-correctness audit fixes
CRITICAL (smolvla2) — text-CE was applied to the wrong prefix slice.
``num_state`` was being read from ``state.shape[1]`` (the raw
max_state_dim, ~14-32) instead of the *number of state tokens*
(always 1). Compounded by the trailing-padding issue (state is
not at the end of the padded prefix when ``seq_len < prefix_length``),
the lang slice was landing on image / padding hidden states.

New ``_locate_lang_range`` finds the state position via
``att_masks.nonzero()`` (the only ``1`` in the mask), making the
slice robust to both bugs. Used by ``_compute_text_loss`` and
``_compute_fused_loss``.

LIKELY-BUG (smolvla2) — ``_unfreeze_lm_head`` only re-enabled
``lm_head`` and ``text_model.model.norm.weight``. SmolVLA's parent
ALSO freezes the last 1-2 transformer layers, so text-loss
gradients died in a frozen final block. Now mirrors the parent's
freeze targets and unfreezes the matching ``layers.{N-1}`` (and
``N-2`` when num_vlm % num_expert == 0).

CRITICAL (pi052) — flow and FAST CE were not per-sample masked
under per-sample-routing. Text-only recipe samples
(``plan_generation``, ``ask_vqa_*``) contributed to flow/FAST
loss with prompts that deliberately omit the subtask, corrupting
the signal. Threaded ``predict_actions_t`` through both
``_compute_all_losses_fused`` and ``_compute_text_and_fast_loss``;
flow uses ``(per_sample * mask).sum() / mask.sum()``, FAST uses
``shift_valid & sample_mask`` before ``masked_fill(-100)``.

OTHER
* PI052Policy.forward now falls through to PI05Policy.forward on
  unannotated batches (no text_labels, no predict_actions, no FAST).
* fit_fast_tokenizer cache key now includes ``chunk_size`` — changing
  the chunk size no longer silently loads a wrongly-fit tokenizer.
* Removed dead ``_compute_text_loss`` / ``_compute_fast_action_loss``
  in pi052 (superseded by the fused helpers).
* Fixed stale "no-op stub" docstring on ``knowledge_insulation`` —
  it's been fully wired since the per-layer KI forward port.
* Stripped unused ``copy`` / ``resize_with_pad`` imports.
* Extracted ``_shifted_ce`` / ``_mask_per_sample`` / ``_fast_ce``
  helpers shared between fused and prefix-only paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 14:08:06 +02:00
Pepijn e3ad1c59fc feat(recipes): add plan_generation sub-recipe to smolvla2 + pi052 blends
New text-only sub-recipe at 0.10 weight on both blends:

    user      :  ${task}
    assistant :  ${current_plan}   (high_level target)

Bound to ``active_at(t, style=plan)`` so it supervises the
currently-active plan on every frame, gated by ``if_present`` to
skip frames without a plan annotation.

Weights rebalanced: action_execution 0.85 → 0.75, plan_generation
0.10, VQA top/wrist 0.075 each (sums to 1.0).

Added matching runtime builder ``_msgs_for_plan`` in
``smolvla2/inference/steps.py`` so the high-level loop can call
``select_message`` with the bare-task prompt at episode start /
replanning events.

Closes a gap vs. Pi 0.7 §V — without this recipe the model could
read ``${plan}`` from the prompt but never had to produce one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:51:37 +02:00
Pepijn 9ff62cb08c docs(recipes): trim header comments, drop diversity-knobs note in run_hf_job
Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a
short header + a one-line note on the boundary-frame memory tail.

Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about
a since-merged experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:55:03 +02:00
Pepijn b2aa372fcf refactor(recipes): fold memory into action_execution, drop interjection, fuse smolvla2 forward
Recipe changes:
* action_execution now bundles the memory update as a second
  assistant target gated on a new ``new_memory`` binding (fires
  only at subtask-boundary frames). No "Completed subtask: X"
  filler — the model emits the new subtask AND the updated
  memory back-to-back in one prefix.
* user_interjection_response sub-recipe removed (current
  datasets don't have interjection / say() annotations).
* Standalone memory_update sub-recipe removed (folded above).
* Weights rebalanced: action_execution 0.85, ask_vqa_top/wrist
  0.075 each (sums to 1.0).

Runtime ``_msgs_for_memory`` updated to match the new
boundary-frame prompt layout.

Modeling:
* SmolVLA2Policy now fuses the flow + text losses into a SINGLE
  backbone forward via ``_compute_fused_loss`` (one
  vlm_with_expert pass with [prefix, suffix] embeds, then both
  lm_head CE on lang slice + action_out_proj MSE on suffix).
  Mirrors pi052's existing ``_compute_all_losses_fused`` —
  saves one backbone pass per training step.

Examples:
* Removed the two training SLURM scaffolds; they were
  out-of-date with the recipe refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:51:09 +02:00
Pepijn 058b8f3958 refactor(recipes): two-flavor design — one fused action_execution + text-only events
Both smolvla2_hirobot.yaml and pi052_hirobot.yaml are rewritten as a
clean two-flavor blend, modelled on Pi 0.7 §V.A (Subtask instructions)
and the hierarchical inference pattern from Pi 0.5 §IV.D.

Flavor 1 — action_execution (60% weight, "main path")
-----------------------------------------------------

One always-on recipe that fuses **all** available context (task,
plan, memory) into a single user prompt and uses the current subtask
as the supervised assistant target. This single recipe supervises
*both* objectives:

  * subtask prediction (text CE on the assistant span via lm_head)
  * action chunks (flow MSE on the action expert via
    stream: low_level, target: true; plus FAST CE on action tokens
    when enable_fast_action_loss=True)

At inference, the *same* prompt structure drives both inference
modes:

  * select_message(user_prompt_only) → LM head generates the next
    subtask. Matches action_execution's training distribution
    exactly (prompt is the user turn, target is the subtask).
  * predict_action_chunk(user_prompt + assistant_subtask) → action
    expert produces the chunk. Matches action_execution's full
    prompt+target.

This replaces what used to be a separate high_level_subtask recipe
plus a low_level_execution recipe; both were supervising the same
subtask text, so collapsing them into one is correct and removes
the redundant text-CE gradient.

Flavor 2 — event-driven text-only recipes
-----------------------------------------

Each of these supervises the LM head to predict a specific kind of
text given a specific event-triggered context. ``stream: high_level``
on all targets so they never trigger predict_actions / flow loss.
``if_present`` guards ensure they only fire on frames where the
event annotation is present.

  * memory_update           (10%)  new memory at subtask boundary
  * user_interjection_response (15%) new plan + say(...) on input
  * ask_vqa_top             (7.5%) front-camera VQA
  * ask_vqa_wrist           (7.5%) wrist-camera VQA

Total weight = 1.0.

Prompt format consistency
-------------------------

User prompt template ``${task}\nPlan: ${plan}\nMemory: ${memory}``
matches what ``inference/steps.py::_msgs_for_subtask`` and
``_control_context_messages`` already emit at inference time. No
"Task: " prefix — the bare task string is used as the leading
content with literal "Plan: " / "Memory: " labels for the
subsequent components.

What changed structurally
-------------------------

  - low_level_execution            DROPPED  (folded into action_execution)
  - high_level_subtask             DROPPED  (subtask supervision moved into action_execution)
  + action_execution               NEW      (the fused main recipe)
    memory_update                  kept, prompt cleaned up
    user_interjection_response     kept, prompt cleaned up
    ask_vqa_top / ask_vqa_wrist    kept

Runtime compatibility
---------------------

No runtime change needed — ``SmolVLA2Runtime`` and the inference
helpers already build their high-level prompt as just the user turn
(task + plan + memory) and append a ``current_subtask`` assistant
turn for the low-level call. Both match the new ``action_execution``
prompt shape exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:35:51 +02:00
Pepijn b873fe454c perf(pi052): full fusion — text + FAST + flow in ONE backbone forward
Previously the forward did 2 backbone passes when all heads were
active: one for flow (via super().forward) and one for the fused
text+FAST helper. This commit reduces it to **one pass** — same
compute as flow-only training.

New ``_compute_all_losses_fused`` builds:

    prefix = [images, language, FAST (when provided)]
    suffix = [noisy_actions]  (action expert via gemma_expert)

and runs a single ``paligemma_with_expert.forward`` with
``inputs_embeds=[prefix_embs, suffix_embs]`` (both experts active
in the same call). Captures *both* prefix_out and suffix_out, slices
each for its respective loss:

    flow MSE     ← suffix_out  (existing action_out_proj + MSE path)
    text  CE     ← prefix_out at language positions (lm_head + CE)
    FAST  CE     ← prefix_out at FAST positions (lm_head + CE)

Critical attention mask override
--------------------------------

``make_att_2d_masks`` produces a cumulative-block attention mask in
which suffix tokens (highest cumsum) attend to *every* lower-cumsum
position by default, including FAST tokens. If we let that stand the
action expert reads the discrete FAST tokens and trivially decodes
them back to the same continuous actions the flow head is supposed
to predict from noise — the entire training signal collapses to a
copy operation.

The fix is a single line right after make_att_2d_masks:

    att_2d_masks[:, fast_end:, fast_start:fast_end] = False

Explicitly zeros out *suffix → FAST* attention. Everything else
remains correct under the cumsum semantics:

  * prefix images/language stay bidirectional among themselves
  * FAST stays causal within itself, attending bidirectionally
    to images+language
  * FAST cannot see suffix (cumsum < suffix cumsum, default)
  * suffix attends bidirectionally among itself, to images+language,
    and now NOT to FAST (this override)

Bit-equivalent to the previous separated forward path for text+FAST
losses (the prefix hidden states at language and FAST positions are
unchanged whether suffix is present or not — the prefix doesn't
attend to suffix). For flow loss, suffix→FAST being masked is the
correct behaviour we *want* — if anything the previous separated
path was less correct for production use because the joint
gradient signal through the action expert was missing the prefix
extension.

Forward routing in ``forward()``
--------------------------------

  * run_flow=True  →  _compute_all_losses_fused (one forward, all
                      three losses)
  * run_flow=False, run_text or run_fast → _compute_text_and_fast_loss
                      (one prefix-only forward, two CE losses, no
                      suffix → cheaper than fusion)
  * neither       →  RuntimeError (explicit; both losses disabled)

Wall-time per step
------------------

  Before this commit:  flow + (text+FAST fused) = 2 forwards
  After this commit:   (flow+text+FAST fused)   = 1 forward

Compute parity with flow-only training when all three heads active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:28:38 +02:00
Pepijn 83d7250a22 fix(recipes): low_level_execution needs if_present:subtask guard too
Same bug we fixed for high_level_subtask, just on the other
subtask-supervised sub-recipe. ``low_level_execution`` targets
``${subtask}`` (the current active span) but had no
``if_present`` guard. When ``active_at(t, style=subtask)`` returned
None at a frame (gaps in the annotation, or the very first/last
frames of an episode if the annotator's spans don't fully tile),
the assistant message rendered with empty content. The chat
tokenizer still included it in ``target_message_indices`` → text CE
supervised whatever the chat-template's empty assistant turn
decoded to (usually a single ``\n``). That trains the LM head's
prior at the first generation position toward ``\n``, the same
collapse we observed with the original ``${next_subtask}`` target.

Fix: ``if_present: subtask`` on the assistant target in
``low_level_execution`` for both ``smolvla2_hirobot.yaml`` and
``pi052_hirobot.yaml``.

Side effect: frames without an active subtask span no longer
contribute to the flow loss either (the only ``low_level`` target
is skipped, ``predict_actions = bool(targets_by_stream.get("low_level"))``
becomes False). For a well-annotated dataset where subtask spans
tile the whole episode this is a no-op. For datasets with gaps,
those gap frames lose flow supervision — strictly better than the
degenerate text-CE alternative.

Sub-recipe audit summary (no other changes needed):

  * memory_update                 — all if_present guards present, OK
  * user_interjection_response    — all if_present guards present, OK
  * high_level_subtask            — fixed earlier, OK
  * low_level_execution           — fixed by this commit
  * ask_vqa_top / ask_vqa_wrist   — query+answer both guarded, OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:22:45 +02:00
Pepijn 35f9063a6c perf(pi052): fuse text + FAST loss into a single prefix forward
Previously the forward did three backbone passes per training step
when all heads were active: one for flow (via super().forward), one
for text CE, and one for FAST CE. That's ~3× the compute of
flow-only training.

The text and FAST losses share their prefix forward exactly — both
are CE on the LM head, evaluated at different slices of the same
hidden states. Adding FAST tokens after language in the prefix is
bit-equivalent for the text loss because the mask_ar convention in
``make_att_2d_masks`` keeps FAST tokens in a strictly-later causal
block: language tokens never see FAST, so their hidden states are
unchanged.

New ``_compute_text_and_fast_loss``:

  * embeds [images, language] once
  * optionally appends [FAST] (when run_fast is True)
  * one backbone forward
  * slices ``vlm_out[:, -(fast_len + lang_len):-fast_len]`` for
    language hidden states (or ``vlm_out[:, -lang_len:]`` when no
    FAST) → text CE
  * slices ``vlm_out[:, -fast_len:]`` for FAST hidden states →
    FAST CE
  * returns both losses, either of which can be None when the
    caller doesn't want that head.

forward() now calls this fused helper instead of running the two
separate ``_compute_text_loss`` / ``_compute_fast_action_loss``
methods. Those remain in the file for callers that only want one
head (e.g. ablations).

Why flow isn't fused
--------------------

Flow MSE comes from the action-expert (suffix) hidden states, which
attend to the prefix. If we just concat FAST onto the prefix and let
the action expert attend to it, the expert can trivially decode FAST
back to continuous actions — overfitting via shortcut. Preventing
that requires a custom segment-aware attention mask (action expert
can attend to images+language but NOT to subtask/FAST), which is
what pi05_full does in ``compute_layer_complete_knowledge_insulation``.
That's the full-fusion path; deferred as a follow-up since the
text+FAST fusion already recovers most of the compute.

End-to-end forward pass count
-----------------------------

Before: 1 (flow) + 1 (text) + 1 (FAST) = 3 backbone forwards
After:  1 (flow) + 1 (text+FAST fused) = 2 backbone forwards

~33% wall-time reduction per training step when all three heads
are active.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:08:34 +02:00
Pepijn 17c0800461 fix(pi052): FAST loss masking + predict_actions gating + smolvla2 review
FAST loss changes
-----------------

1. Gate by ``predict_actions`` (same routing as flow loss). The
   ActionTokenizerProcessorStep tokenises actions for *every*
   sample regardless of which sub-recipe rendered it; for text-only
   recipes (high_level_subtask, memory_update, ...) the action
   tokens are still in the batch but mustn't be supervised. Skip
   the FAST forward+CE entirely when no sample in the batch has
   ``predict_actions=True``.

2. Switch from "multiply-by-mask" masking to ``ignore_index=-100``.
   The old pattern computed per-token CE for all positions, then
   zeroed out invalid ones. Two issues: (a) any out-of-vocab target
   id at a padded position would have crashed cross_entropy before
   the mask got a chance to zero it out, and (b) the pattern is
   needlessly clever. Now ``shift_targets.masked_fill(~mask, -100)``
   followed by ``ignore_index=-100`` cleanly drops invalid positions.
   Matches the smolvla2 text-loss convention.

3. Clean up unused ``bsize`` variable in _compute_fast_action_loss
   and expand the attention-mask docstring with the
   ``make_att_2d_masks`` mask_ar convention spec (causal vs
   bidirectional blocks).

smolvla2 audit (reference review, no code change)
-------------------------------------------------

Compared smolvla2/modeling_smolvla2.py against pi052/modeling_pi052.py
to catch parallel bugs. Findings:

* No ``paligemma.language_model`` vs ``paligemma.model.language_model``
  issue — smolvla2 uses SmolVLM (different class, different attribute
  layout) so the bug doesn't apply.

* ``fill_kv_cache=True`` is correctly passed to smolvla's
  ``vlm_with_expert.forward`` — that class *does* accept the kwarg
  (unlike pi05's PaliGemmaWithExpertModel.forward, which is why
  pi052 must omit it).

* Text-loss alignment is correct: ``_compute_text_loss`` computes
  ``lang_start`` / ``lang_end`` from the known prefix layout
  (``[image_blocks..., lang, state]``) and slices ``prefix_out``
  to just the language positions before applying ``lm_head``. The
  parallel bug I fixed in pi052 (lm_head over the full prefix,
  shape-mismatched against text_labels) was *not* present in
  smolvla2.

* Per-sample flow routing via ``predict_actions``: correctly masks
  per-sample by calling the parent ``forward(..., reduction='none')``
  and applying the predict_actions mask before the mean. pi052 only
  has the batch-level any() gate — a parallel improvement for pi052
  would require modifying PI05Pytorch.forward to support per-sample
  reduction, deferred.

* ``reduction="none"`` returns ``total.expand(bsize)``: identical
  scalar-broadcast limitation in both policies. Acknowledged but
  low priority (only RA-BC weighting uses the per-sample path and
  it's documented as a known approximation in smolvla2).

* Chat tokenizer correctly handles batched/unbatched messages,
  pads with -100 for label positions, builds attention masks. No
  bugs found.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:05:37 +02:00
Pepijn c8763e0ad5 fix(pi052): four real bugs in the modeling code + flip defaults
Defaults
--------
* enable_fast_action_loss: False -> True   (match paper §III.B-C Eq.1)
* auto_fit_fast_tokenizer: True -> False   (opt-in; needs base.fit())

Bug fixes
---------

1. Wrong attribute path on PaliGemma. The KI port copied
   pi05_full's ``paligemma.language_model.layers[...]`` literally,
   but the production pi05 wrapper exposes the text model at
   ``paligemma.model.language_model``. With KI enabled, every layer
   would have raised AttributeError on first forward. Fixed all
   references in _compute_layer_ki + _paligemma_forward_ki.

2. ``fill_kv_cache=True`` passed to PaliGemmaWithExpertModel.forward.
   That kwarg is a SmolVLA-only concept; pi05's signature has no
   such argument, so every forward call from pi052 (text loss, FAST
   loss, select_message) would have crashed with TypeError. Dropped
   from all four call sites — pi05's forward already handles the
   cache via past_key_values, and re-forwarding the cumulative
   sequence each step in select_message is fine for our short
   subtask completions.

3. Text-loss shape mismatch. _compute_text_loss applied lm_head to
   the *full* vlm_out (image tokens + language tokens), then tried
   to cross-entropy that against text_labels which only covers the
   language portion — the .view(-1) calls would produce two
   tensors of different lengths and CE would fail. Now slices
   vlm_out to the last text_labels.shape[1] positions before
   running lm_head, matching the [images, language] order
   embed_prefix produces.

4. Dead-code conditional in _paligemma_forward_ki's single-expert
   fallback. The ``if hasattr(...) else self._pi052_orig_forward``
   ternary always took the wrong branch because the attribute is
   always set (we save it in PI052Policy.__init__). Simplified to
   just call self._pi052_orig_forward directly.

After this commit, pi052 should be runnable end-to-end for the
first time with all three loss heads + KI active. Still worth a
100-step smoke test before kicking off a long run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:58:40 +02:00
Pepijn 0f4faddc01 feat(pi052): auto-fit FAST tokenizer per-dataset before training
Per Pertsch et al. 2025 (FAST paper, [64] in π0.5) and π0.5 §III.C,
the recommended practice is to *fit* the FAST action tokenizer on
the specific dataset's action distribution rather than using the
published universal codebook off the shelf. The universal tokenizer
works on any 6-DoF action sequence but produces suboptimal
compression, which slows CE convergence and wastes vocab capacity.

New utility ``lerobot.policies.pi052.fit_fast_tokenizer``:

  * samples N action chunks from the LeRobotDataset (default 1024)
  * loads ``physical-intelligence/fast`` as the base
  * calls ``.fit(actions)`` (the AutoProcessor API the HF model card
    documents) — produces a per-dataset codebook
  * saves to ``{cache_dir}/{sha256(dataset, base, n_samples)[:16]}/``
  * returns the local path, ready to feed
    ``ActionTokenizerProcessorStep(action_tokenizer_name=...)``.

Cache is keyed on (dataset, base tokenizer, sample count) so changing
any of them re-runs the fit. Re-running training on the same dataset
re-uses the cache (one fit per dataset per machine).

Auto-fit wiring:

  * PI052Config gets ``auto_fit_fast_tokenizer`` (default True),
    ``fast_tokenizer_cache_dir`` (default ~/.cache/lerobot/...),
    ``fast_tokenizer_fit_samples`` (default 1024).
  * make_pi052_pre_post_processors now takes ``dataset_repo_id``;
    when ``enable_fast_action_loss`` and ``auto_fit_fast_tokenizer``
    are both True and a repo_id is provided, the factory calls
    ``fit_fast_tokenizer`` before constructing the processor step
    and points it at the fitted path.
  * ProcessorConfigKwargs gains ``dataset_repo_id``; the global
    factory dispatch threads it through for ``pi052`` policies.
  * lerobot_train.py populates ``processor_kwargs['dataset_repo_id']``
    from ``--dataset.repo_id`` for pi052 runs.

Failure mode: if ``.fit()`` fails (e.g. older transformers without
the method, or no usable action chunks in the dataset), the factory
logs a warning and falls back to the universal base tokenizer. Train
still works; you just lose the compression improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:52:31 +02:00
Pepijn 8dc0af3c28 feat(pi052): FAST action CE loss + knowledge insulation + processor wiring
Three additions ported from ``pi05_full`` on branch ``feat/add-pi05``,
giving pi052 full paper-§III.B-C training capabilities alongside the
recipe-driven text supervision it already had:

* **Config flags** in PI052Config:
    - ``enable_fast_action_loss``  default False
    - ``action_tokenizer_name``    default "physical-intelligence/fast"
    - ``max_action_tokens``        default 256
    - ``fast_skip_tokens``         default 128
    - ``fast_action_loss_weight``  default 1.0
    - ``knowledge_insulation``     default False

* **Processor wiring** (processor_pi052.py): when
  ``enable_fast_action_loss=True``, append an
  ``ActionTokenizerProcessorStep`` after the text tokenizer. It
  tokenises the action tensor with the FAST tokenizer and writes
  ACTION_TOKENS / ACTION_TOKEN_MASK into ``COMPLEMENTARY_DATA`` —
  the existing batch-collation pipeline forwards them as
  ``batch['action.tokens']`` / ``batch['action.token_mask']``.

* **FAST CE loss** (modeling_pi052.py::_compute_fast_action_loss):
  Re-embeds the prefix [images, language], appends the FAST token
  embeddings (using PaliGemma's shared embed_language_tokens),
  forwards through the backbone, slices the trailing
  ``fast_len`` positions, applies the LM head, computes shifted
  next-token CE with the action-mask gating the loss. The loss is
  summed into ``forward()``'s total with ``fast_action_loss_weight``.

* **Knowledge insulation** (modeling_pi052.py::_compute_layer_ki +
  _paligemma_forward_ki): port of pi05_full's per-layer attention
  that detaches VLM K/V on the action-query path so action loss
  gradients cannot flow back into the VLM's K/V projections. Bound
  per-instance via ``types.MethodType`` so it doesn't leak into
  stock ``pi05`` policies that share PaliGemmaWithExpertModel.
  Activated automatically when ``config.knowledge_insulation=True``.

Combined with the existing recipe-driven text head, pi052 now
supports the full three-loss objective:

   L = text_w·H(text) + fast_w·H(FAST actions) + flow_w·MSE(flow)

matching Eq. (1) of arxiv:2504.16054 §IV.D (α=10 by default for the
flow term, 1.0 each for text and FAST CE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:46:21 +02:00
Pepijn 8eba704f15 Revert "chore(training): align pi052_hirobot.slurm with the operator's actual command"
This reverts commit ecbac17196.
2026-05-13 11:03:58 +02:00
Pepijn ecbac17196 chore(training): align pi052_hirobot.slurm with the operator's actual command
Match the working SmolVLA2 launch pattern so the two SLURM scripts
are interchangeable:

  * literal NUM_PROCESSES / BATCH_SIZE / STEPS (no env-var defaults)
  * STEPS=10000 to match the next SmolVLA2 run
  * save_freq=$STEPS so only the final checkpoint is saved
  * dropouts 0.1/0.1/0.1 (mild — matches the operator's iteration)
  * flow_loss_weight / text_loss_weight come from the PI052Config
    defaults (10.0 / 1.0 per Pi 0.5 paper §IV.D), no need to pass
    them explicitly

Job name and policy_repo_id mirror the SmolVLA2 ``_tool-g2`` naming
so the two runs can be compared side-by-side in WandB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:03:09 +02:00
Pepijn 12cce8f2cc fix(smolvla2): align flow_loss_weight default with Pi 0.5 paper's α=10
Pi 0.5 paper §IV.D Eq. (1) sets the loss balance to α=10 between text
CE and flow MSE: actions are the primary output and the flow head
should dominate the gradient signal. SmolVLA2 was defaulting both
weights to 1.0, which inverts that — text CE (~0.5-2.0 nats) ends up
larger than flow MSE (~0.1-1.0), so the action expert gets less
gradient than the LM head despite being the primary task.

Match the paper's split: text_loss_weight=1.0, flow_loss_weight=10.0.
Same as ``pi052`` (the new full reproduction policy).

Also pin the values explicitly in the SLURM launcher so the choice is
visible and overridable per-run rather than buried in the config
default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:02:17 +02:00
Pepijn ef5879a02a feat(pi052): π0.5 v2 — full reproduction of the π0.5 paper recipe
New ``lerobot.policies.pi052`` (parallel to ``smolvla2``) that adds
text-prediction + hierarchical-inference on top of the existing π0.5
implementation. Mirrors the paper's §IV.D dual-head training:

  L = H(text) + α * ‖ω - a - f_θ_action(...)‖²,  α = 10

Components:

  * ``configuration_pi052.py``     thin PI05Config subclass; adds
                                    recipe_path, text/flow loss weights
                                    (default α=10 per paper), prompt
                                    dropout knobs, ``unfreeze_lm_head``.
  * ``text_processor_pi052.py``    PI052TextTokenizerStep — concatenates
                                    rendered messages as ``Role: ...``
                                    plain text (PaliGemma has no chat
                                    template), tokenises with the
                                    PaliGemma tokenizer, builds a label
                                    mask covering supervised target
                                    spans. Includes Pi 0.7 §V.E
                                    per-component prompt dropout.
  * ``processor_pi052.py``         make_pi052_pre_post_processors —
                                    Rename + Batch + Relative +
                                    Normalize + RenderMessagesStep +
                                    PI052TextTokenizerStep + Device.
                                    Falls back to π0.5's plain pipeline
                                    when recipe_path is unset.
  * ``modeling_pi052.py``          PI052Policy(PI05Policy) — re-enables
                                    PaliGemma ``lm_head``, computes
                                    text_loss via CE on the supervised
                                    span, sums with flow_loss in
                                    forward(), and adds select_message
                                    for AR text generation at inference
                                    (same surface as
                                    SmolVLA2Policy.select_message so
                                    SmolVLA2Runtime drives it unchanged).

Plus the supporting plumbing:

  * recipe ``configs/recipes/pi052_hirobot.yaml`` — same Hi-Robot blend
    as smolvla2_hirobot.yaml, with the same ``${subtask}`` /
    ``if_present`` supervision fix (current span at every frame, not
    ``${next_subtask}``).
  * SLURM ``examples/training/pi052_hirobot.slurm`` — full training
    command matching the SmolVLA2 launcher.
  * factory registration: ``--policy.type=pi052`` resolves to
    PI052Policy with the new processor.

Same multi-rate runtime (``lerobot.policies.smolvla2.inference``)
drives this policy too — both expose ``predict_action_chunk`` for the
action expert and ``select_message`` for the LM head.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:59:26 +02:00
Pepijn 1d24301b67 chore(training): STEPS=15000 default + dropout walked back to 0.30/0.30/0.20
After _tool-good (2000 steps, 0.50/0.50/0.20 dropout) the LM head's
distribution at position 0 shifted from EOS to subtask-vocabulary
tokens but emitted bag-of-words ("cube arm and") rather than well-
formed sentences. That's the expected mid-fine-tuning phase: token-
level supervision has landed, sequence-level grammar hasn't.

Two changes for the next retrain:

  * STEPS=15000 (from 2000) — chat-pretrained backbones need O(10k+)
    steps to walk their pretraining priors down far enough to commit
    to the fine-tuned distribution structurally, not just at the
    token level. _tool-g2's bag-of-words output proves the model is
    on the right path; it just needs more gradient signal.

  * plan/memory dropout 0.50 -> 0.30 — 0.50 was probably too
    aggressive for a small dataset. Half the training samples had
    crucial context missing, which slows down learning the full
    conditional structure. 0.30 still regularises against prompt
    leakage but lets the model learn proper grammar first; the
    higher dropout can be revisited once the head is solid.

Subtask dropout stays at 0.20 since subtask isn't in the high-level
prompt anyway (recipe fix removed the "Current subtask:" message).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:46:19 +02:00
Pepijn 3a20ea337e feat(smolvla2-runtime): --text_min_new_tokens / --text_temperature CLI debug knobs
The recipe fix (target=${subtask} instead of ${next_subtask}) shifted
the LM head's failure mode from "emit newlines" to "emit EOS at
position 0". On the new ``_tool-good`` checkpoint inference produces
exactly one token (``<end_of_utterance>``, id 49279) and decodes to
empty. That's the chat-pretrained backbone's short-turn EOS prior
not yet being overridden by 2000 steps of fine-tuning supervision.

Expose three knobs so the operator can probe whether the head has
real subtask-token probability mass *under* the EOS argmax without
recompiling or retraining:

  --text_min_new_tokens=N    suppress EOS for the first N tokens
  --text_temperature=T       sample at temperature T
  --text_top_p=P             nucleus filtering at top-p

These are explicitly off-policy (training was greedy / no min-tokens),
so they shouldn't ship in production runs — but they let us tell
whether the model has *learned* subtask prediction (just under EOS)
or hasn't yet. If forcing min_new_tokens=3 with temperature=0.5
produces a sensible subtask, the model is fine and just needs more
training steps to walk EOS down. If it produces gibberish, training
hasn't progressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:39:33 +02:00
Pepijn b6fb536460 chore(training): bump plan/memory dropout to 0.50 to force vision-grounding
After the recipe fix (target=${subtask} at every frame) the model
can still reach low text_loss by reading the answer off the plan in
the prompt: at training the prompt contains the 6-step plan, and the
current subtask is one of those steps, so the model just learns
"active step N matches subtask N" and never needs to look at the
image. Symptom at inference: subtask string is set but never updates
because the model isn't really conditioning on the visual progress.

Drop plan and memory with p=0.50 each — half of training frames the
prompt is just "${task}" (constant for this dataset) + visual prefix,
which is the only place the answer can come from. Forces the LM head
to actually use vision.

``subtask_dropout`` stays at 0.20 because subtask isn't in the
high-level prompt anymore (recipe fix removed the "Current subtask:
X" message); the knob still affects other sub-recipes that reference
it as context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:31:00 +02:00
pepijn bfd3bb1791 fix(smolvla2): handle batched sample indices in chat tokenizer
Normalize tensor and sequence sample indices before prompt dropout so distributed batched preprocessing does not try to cast full index tensors to scalars.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 16:56:13 +00:00
Pepijn 4908433f9a chore(training): align smolvla2_hirobot.slurm with what's actually run
Match the operator's current training command for the _tool6 retrain:

  * default DATASET / POLICY_REPO_ID / JOB_NAME point at the tool6
    iteration (super_poulain_full_tool3 → smolvla2_hirobot_super_poulain_tool6)
  * STEPS default 2000 (short enough to iterate; bump to 10k for full)
  * save_freq=$STEPS so the only checkpoint is the final one
  * OUTPUT_DIR includes step count so successive runs don't clobber
  * Drop the wider augmentation envelope I added earlier — back to
    default ColorJitter ranges (brightness ±20% etc) since the
    high_level_subtask recipe fix (current-subtask supervision) is
    expected to fix the LM-head collapse on its own; the augmentation
    is just the standard regulariser, not a load-bearing widener.
  * prompt-dropout fractions stay at the original 0.15 / 0.15 / 0.20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:45:38 +02:00
Pepijn 6ce1f36002 fix(smolvla2): supervise high-level head with *current* subtask at every frame
The high_level_subtask recipe targeted ``nth_next(style=subtask, offset=1)``,
which on the last span of any episode resolves to None. The recipe had no
``if_present`` guard on the target, so the renderer emitted an empty
assistant turn and cross-entropy supervised the model on the chat
template's structural newlines (``\n``). Across the dataset this trained
the LM head's argmax at position 0 to collapse to ``\n`` whenever no
transition was imminent (i.e. most frames). Visible failure mode at
inference: the head emits 40+ newlines + ``<end_of_utterance>`` every
chunk boundary while the action expert keeps working — confirmed by
running the dry-run on dataset frame 0 with the dataset's own image
and seeing the same ``\n × 44`` collapse.

Switch to the Pi 0.5 / Pi 0.7 supervision pattern: at every frame, the
assistant target is the *current* active subtask span text (via
``${subtask}`` → ``active_at(t, style=subtask)``). Always non-empty,
always scene-grounded, ``if_present: subtask`` skips frames with no
active span instead of emitting a degenerate empty turn.

Runtime callsite update: ``_msgs_for_subtask`` no longer feeds a
"Current subtask: X" user message into the prompt (that would be
circular — we'd be telling the model the answer). Transition
detection moves into the runtime — when the predicted subtask differs
from ``state['current_subtask']``, the existing ``set_if_changed``
path fires ``subtask_change`` and downstream memory updates. Same
event surface, supervision target is now always meaningful.

Requires re-annotating the dataset and retraining for the fix to land
in the checkpoint, but the recipe + runtime change is what enables it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:42:59 +02:00
Pepijn 731576be80 chore(smolvla2-runtime): auto-fire one tick at dry-run startup
Previously the dry-run REPL only ticked on user input (empty Enter
just redrew), so the bisection test "does the LM head produce text on
start_frame=0?" required typing something arbitrary to drive a tick.
Just run ``step_once`` at startup — the obs diagnostic *and* the
subtask gen both fire automatically, the diag row populates, and the
operator can read the result before pressing any key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:34:42 +02:00
Pepijn 47fb8318b1 chore(training): widen augmentation envelope after live-robot diagnostic
The tensor-level comparison between dry-run (dataset frame) and live-
robot inference proved the runtime is bug-free — same shape, dtype,
device, channel order, batch dim, and normalization on both paths.
The remaining variable: front-camera mean brightness was 0.26 live vs
0.39 on the dataset frame, ~33% darker. Training augmentation only
covered ±20% brightness, so the live scene sits just outside the
supervised envelope and the LM head collapses to its dominant prior.

Widen the augmentation knobs for the next retrain:

  * brightness    0.8–1.2  → 0.5–1.6   (covers ~30% darker / 60% lighter)
  * contrast      0.8–1.2  → 0.6–1.5
  * saturation    0.5–1.5  → 0.3–1.7
  * hue          ±0.05    → ±0.10
  * affine        ±5°/±5%  → ±15°/±15% (covers cube placement / camera drift)
  * max_num_transforms 3 → 4

And bump prompt-component dropout (subtask 0.20 → 0.30) so the LM
can't lean on stale memorised plan/memory at inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:25:41 +02:00
Pepijn 53172873e3 chore(smolvla2-runtime): probe obs once at dry-run startup
The dry-run REPL only fires a tick when the user types, so the
``_log_obs_tensors_once`` diagnostic never reached stdout (the
provider was never called). Probe the provider once at startup —
the result is discarded; we only care about the obs log it triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:21:58 +02:00
Pepijn fcdae0ce8e chore(smolvla2-runtime): tensor-level obs print for both inference paths
Helper that prints (once per provider lifetime) every
``observation.*`` tensor the policy is about to see, with its shape,
dtype, device, and per-channel min/max/mean/std. Wired into both the
dry-run dataset path and the live-robot path.

Now we can bisect train/inference mismatch *at the tensor level* —
if the same checkpoint produces coherent text on one path's tensors
and ``\n`` on the other's, and the printed tensor stats differ
materially, the bug is in the observation prep, not in the model or
the training distribution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:19:18 +02:00
Pepijn 4852b9f952 feat(smolvla2-runtime): --dataset.augment_at_inference for the bisection test
Apply the training-time torchvision-v2 ColorJitter / SharpnessJitter /
RandomAffine pipeline to dataset frames in dry-run, so we can isolate
whether the LM head's collapse to '\n' on live frames is:

  * pure scene-content OOD (unaugmented dataset frames work, mildly
    augmented ones still work — model has learned the augmentation
    distribution, only fails when the scene content itself diverges)
  * hyper-specific memorisation (dry-run with augmentation also
    collapses to '\n' — head is nailed to the exact unperturbed
    training samples and only the retrain helps)

Usage:

  lerobot-smolvla2-runtime --no_robot --policy.path=... \
    --dataset.repo_id=... --dataset.episode=0 \
    --dataset.start_frame=1000 \
    --dataset.augment_at_inference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:14:57 +02:00
Pepijn 0410705aff chore(smolvla2-runtime): print live state vector once at startup
So the operator can compare live joint values to the dataset's
``observation.state`` mean/std and spot when the robot's home pose is
several σ off the supervised support region. State OOD is the
remaining viable hypothesis for why the live LM head collapses to
``\n`` even though images are pixel-shape-matched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:12:27 +02:00
Pepijn 398a8cf730 chore(smolvla2-runtime): log first-tick resize so train/inference match is verifiable
Print one warning the first time the robot observation provider runs
through, showing live camera resolution and the dataset's training
resolution, plus whether we resized. Lets the operator confirm at a
glance that the visual prefix really is being fed at the same shape
the model saw at training — instead of guessing whether the resize
fired silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 18:06:00 +02:00
Pepijn ab5c1dc392 fix(smolvla2-runtime): match training visual distribution on robot frames
Root cause for the LM head's empty-completion symptom on the live robot
(while the same checkpoint produced sensible subtask/plan/memory in
``--no_robot`` dry-run on dataset frames): the camera observation was
flowing into the model at its native resolution. A Mac/USB webcam
hands us 1280×720 or 1920×1080; the dataset was recorded at the
feature schema's ``observation.images.*['shape']`` resolution
(typically 480×640). SmolVLA's internal ``resize_with_pad(512, 512)``
*does* fit both — but with very different pad geometry, so visual
tokens at each tile carry different content than at training. Action
expert tolerates this; the tightly-supervised LM head goes OOD and
the head's distribution at position 0 collapses to its dominant mode
(``\n`` ×N then ``<end_of_utterance>`` for this checkpoint).

The fix: in ``_build_robot_observation_provider``, pre-compute the
camera-key → (H, W) target from ``ds_features`` and ``cv2.resize``
each live frame to that shape before tensorising. The downstream
``resize_with_pad`` then sees the same input geometry as training and
the LM head returns to producing readable subtask text under plain
greedy decoding — the same as dry-run.

Also drops the inference-time patches (``min_new_tokens``,
``temperature``, ``top_p`` overrides) on the four high-level callers.
They were band-aids around the visual-distribution shift, not a real
LM problem, and they drift inference off the training distribution.
Greedy argmax is what training matched. The ``select_message``
signature still accepts the knobs for callers that want them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:59:24 +02:00
Pepijn 1292304c42 fix(smolvla2): suppress all special tokens during min_new_tokens window
Previous attempt only masked the tokenizer's eos_token_id during the
min_new_tokens prefix. The empty-completion symptom persisted because a
memorised SmolVLM head doesn't just want EOS — its top-1 at position 0
is *some* special token, and when EOS is masked the argmax shifts to a
sibling (``<|im_end|>``, ``<image>``, ``<fake_token_around_image>``,
``<row_X_col_Y>``, …). Those tokens survive generation but then get
stripped by ``decode(skip_special_tokens=True)``, so the runtime still
saw ``last_raw='(empty)'`` every chunk boundary.

Mask the full ``tokenizer.all_special_ids`` set instead. Forces the
head to commit to a normal vocabulary token before it can close or
quietly poison the turn.

Also: when decode returns empty but tokens *were* generated, expose
the raw token ids and the special-tokens-included decoded string via
``policy._last_select_message_debug``. The runtime surfaces this in
the scrollback so the operator can see what the head is actually
emitting — distinguishing "head EOS-ing" from "head emitting image
placeholders" from "head emitting chat-template fragments".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:49:53 +02:00
Pepijn b95eebff77 fix(smolvla2): force min_new_tokens + sampling so memorised LM emits something
Real-robot run confirmed the LM head is producing 0 tokens at every
chunk boundary (empty:N counter climbing, no exception in scrollback):
the model EOS-es at decode step 0. That's the memorisation collapse —
training reached text_loss=6e-6 by overfitting one trajectory whose
supervised subtask turn ended in EOS, and at inference the head's
argmax for token 0 is EOS regardless of the actual frame.

Two changes in select_message:

  * ``min_new_tokens`` parameter masks the EOS logit to -inf until at
    least N real tokens have been decoded. Without this the head's
    "EOS first" prior produces an empty completion every single time.

  * The runtime callers now pass ``min_new_tokens=5..10`` plus
    ``temperature=0.4..0.5`` + ``top_p=0.9``. Sampling at moderate
    temperature with nucleus filtering also helps break the greedy
    argmax collapse — when the model has memorised one continuation,
    greedy keeps replaying it; nucleus sampling forces it to commit
    to *some* coherent continuation that's well-supported by the
    prefix even when greedy's top-1 is degenerate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:48:08 +02:00
Pepijn fbcac95662 feat(smolvla2-runtime): scrollback in autonomous panel + empty-gen counter
Two improvements for diagnosing why ``last_raw`` stays empty:

1. The autonomous panel-redraw thread calls console.clear() every
   0.5 s, wiping any log lines the runtime printed since the last
   redraw. So warnings from generation (``[warn] subtask gen failed:
   ...``, ``[info] subtask gen rejected (gibberish): ...``) flashed
   for milliseconds and disappeared, leaving the operator blind.

   Capture log_lines from each tick into a bounded scrollback
   (last 12 entries) and render them inside the panel itself, below
   the diag row. They now stick across redraws until rotated out.

2. ``empty`` counter for subtask gen. Persistent empty completions
   are their own failure mode — the LM head EOS-es immediately from
   the chat-template generation prompt, distinct from "generated
   something but filter rejected it". The diag row now reads:

     subtask diag    repeat:0  gibberish:0  empty:14  last_raw: '(empty)'
                                            ^^^^^^^
   plus a periodic log line every 10 empties so the cause is also
   surfaced in the scrollback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:42:13 +02:00
Pepijn b9db4d21a2 fix(smolvla2): high-level steps must run before LowLevelForward refills
Both HighLevelSubtaskFwd and LowLevelForward are gated on
'action queue is empty'. With LowLevelForward listed first, it refilled
the queue on the empty-queue tick before HighLevelSubtaskFwd got to
check — so the gate I added in the previous commit made the high-level
step a permanent no-op after the initial bootstrap. Visible symptom:
subtask string never advances past whatever bootstrap seeded, no
subtask_change events, memory stays unset, and the new overfit
diagnostics never appear on the panel because last_subtask_raw is
never written.

Move all high-level steps (subtask, memory, interjection, vqa) ahead
of LowLevelForward. On an empty-queue tick the subtask refreshes
first, the new string flows into the next chunk's prompt, then
LowLevelForward generates the chunk, then DispatchAction drains it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:38:06 +02:00
Pepijn aecb80a9d2 feat(smolvla2-runtime): overfit/memorisation diagnostics on the panel
The autonomous-mode panel now surfaces what the model is *actually*
producing at every chunk boundary, not just what got accepted:

  * last_subtask_raw       most recent generation (accepted or not)
  * subtask_repeat_count   times the same accepted string regenerated
  * subtask_gibberish_count rejections by the gibberish filter
  * memory_gibberish_count / plan_gibberish_count for the other heads

These let the operator see memorisation collapse without scrolling
back through logs:

  subtask diag    repeat:8  gibberish:0  last_raw: '<same string>'
                  ^^^^^^^^^^ → model can't move past current phase

  subtask diag    repeat:0  gibberish:14  last_raw: 'Ass:::'
                  ^^^^^^^^^^^^^^^^^^^^^^ → LM collapsed to template salad

Also silences the per-action ``Relative goal position magnitude had
to be clamped`` warning. The clamp fires every dispatch tick when the
model emits stale joint targets, flooding the panel at ctrl_hz=30.
Replaced the bare ``logging.warning`` call in robots/utils.py with a
module logger so it can be selectively raised to ERROR. Operators
who need the per-tick clamp detail can use ``-v``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:31:04 +02:00
Pepijn c98c695127 feat(smolvla2-runtime): 'rephrase:' prefix to swap task string in place
Adds a third stdin channel alongside 'task:' and bare interjections:

  rephrase: <text>

Swaps state['task'] with the new string while preserving plan/memory/
subtask. Lets the operator probe how robust the model is to wording
variations of the same task — the trained augmentation provided
n_task_rephrasings≈30 task wordings per dataset task, and this is the
direct way to exercise that distribution at inference without
generating a fresh plan via user_interjection_response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:26:59 +02:00
Pepijn d528078aca fix(smolvla2-runtime): allow task switching mid-run via 'task:' prefix
Both stdin handlers (autonomous mode and rich REPL) gated 'task:' to
'only if no task is set yet' — once the initial task existed, typing
'task: <new task>' silently fell through to the interjection branch.
Make 'task:' always override the active task and clear stale
plan/memory/subtask so the next high-level pass regenerates context
from scratch for the new task.

For rephrasings within the same task, the interjection path
(user_interjection_response recipe) is still the right channel — it
refreshes the plan and emits a paired <say> in one trained call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:24:16 +02:00
Pepijn a648da0455 fix(smolvla2): unblock action dispatch when high-level LLM stalls loop
The runtime is single-threaded. `HighLevelSubtaskFwd` at HzTrigger(1.0)
fires every loop iteration on MPS because each `select_message` call
takes ~2 s, longer than its 1/hz period. The whole tick stretches to
~2.5 s, so `DispatchAction` (HzTrigger 30) only pops a single action per
loop iteration — the queue drains at ~0.4 actions/sec instead of 30 and
the robot barely moves between chunk refreshes.

Two changes, both purely about scheduling — no threading:

* Gate `HighLevelSubtaskFwd` to fire only when the action queue is
  empty, matching `LowLevelForward`'s refresh condition. The slow LLM
  call now happens during the "think" phase between chunks, not on
  every dispatch tick. Restores a clean sense → think → act cycle.

* `DispatchAction` catches up via wall-clock: when the trigger fires
  after a stall, pop `round(elapsed * hz)` entries and send only the
  most recent. Open-loop chunks are timestamped at ctrl_hz; sending
  stale joint targets one-by-one would just lag the robot further
  behind. The dynamixel smooths to the latest goal anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:23:09 +02:00
Pepijn d866c2c9fd fix(smolvla2): only regenerate chunk when queue is fully drained
The previous refresh threshold (queue > chunk_size // 2) made each
new chunk *telescope* past the previous one: at queue=25, we kicked
off a new chunk forward from the current observation, but by the
time the new chunk's first action was actually dispatched, the
robot had executed the remaining 25 actions of the previous chunk
— so the new chunk was planned from an observation 25+ steps stale.

Canonical sense → think → act loop: execute the full chunk, then
re-observe and replan. Refresh only when the queue is empty. Every
step of every chunk still gets dispatched to the robot (no
behaviour change there), but each chunk is now planned from an
observation that's at most one chunk's worth of dispatch latency
old, not "previous chunk's worth of stale state on top of that".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 17:15:02 +02:00
Pepijn 01e2228b24 feat(smolvla2): per-component prompt dropout + augmented training script
Two complementary regularisers to attack the
``text_loss=6e-6 = memorised one dataset`` failure mode that's
making the model collapse on real-robot input:

1. **Per-component prompt dropout** (Pi0.7 §V.E / plan's
   ``feat/pi05-prompt-dropout`` follow-up).
   ``SmolVLA2ChatTokenizerStep`` gains
   ``plan_dropout_prob`` / ``memory_dropout_prob`` /
   ``subtask_dropout_prob`` knobs (default 0.0 — opt-in). At training,
   non-target messages whose rendered content starts with
   ``Plan:`` / ``Memory:`` / ``Current subtask:`` etc. are dropped
   with their respective probability before tokenisation, with a
   deterministic per-sample RNG keyed off the dataset ``index``.
   ``target_message_indices`` is re-mapped so the supervision still
   lands on the right turn. Forces the model to handle missing
   plan/memory/subtask context — directly attacks the real-robot
   collapse where a stale or empty plan field puts the prompt OOD.

   Surfaced on ``SmolVLA2Config`` as three floats so they're
   ``--policy.<knob>=<value>``-controllable from the train CLI;
   plumbed through ``make_smolvla2_pre_post_processors``.

2. **Image augmentation** is already wired in lerobot via
   ``--dataset.image_transforms.enable=true`` (torchvision v2
   ColorJitter + SharpnessJitter + RandomAffine, default 3 of 6
   sampled per frame). No code change needed — just a CLI flag.

``examples/training/smolvla2_hirobot.slurm`` shows the full
training command with both enabled. Drop-in replacement for the
ad-hoc SLURM script Pepijn was using locally; same args, plus the
three dropout probs and the image-transforms flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:52:32 +02:00
Pepijn c36de3a3e8 fix(smolvla2): enqueue full chunk via predict_action_chunk
``LowLevelForward`` was calling ``select_action()`` once per
``chunk_hz`` tick. SmolVLA's ``select_action`` is a thin queue-pop:
it returns one action per call and only re-runs the expensive
flow-matching forward when its private internal queue empties.
Result: we got one action back per chunk_hz tick (1Hz default),
``DispatchAction`` at ctrl_hz=30 popped it instantly, then queue
sat empty for ~1s waiting for the next tick. Net throughput was
1 dispatched action/sec instead of the 30 we wanted.

Switch to ``predict_action_chunk`` and enqueue every step of the
returned ``(batch, n_action_steps, action_dim)`` chunk. Refresh
only when the queue is below half a chunk so we don't burn one
flow-matching forward per chunk_hz tick — saves ~5x inference cost
on this hot path. At ctrl_hz=30, chunk_size=50, the queue drains
in ~1.7s before the next refresh, giving smooth dispatch at the
control rate the robot was trained on.

Side effect: ``state['last_chunk_size']`` records how many actions
the most recent chunk produced — useful for the panel later if we
want to surface "chunks generated" alongside "dispatched".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:27:23 +02:00
Pepijn cbfaf2c544 feat(smolvla2): action-dispatch counter + tighter gibberish filter
Real-robot run was unreadable for two reasons:

1. The panel surfaced ``queued actions: 0`` (always zero — dispatch
   pops faster than chunk_hz generates) and gave no signal that
   actions were actually reaching the robot. The only sign of life
   was the safety-clamp warning lines scrolling past.

2. The text head consistently collapses to ``the`` / ``Ass``
   fragments on real-camera input (memorisation wall). The old
   gibberish filter caught ``":":":"`` JSON salad but let
   single-token fragments through, and the ``[info] subtask gen
   produced no text this tick`` line flooded the panel every second.

Changes:

  * ``DispatchAction`` bumps ``state["actions_dispatched"]`` each
    tick; panel renders it next to queue depth. Operator can see
    the policy IS issuing actions even when text is broken.
  * ``_looks_like_gibberish`` now also rejects:
    - too few unique alphabetic tokens (``the``, ``the the``, ...)
    - chat-template marker leakage (``Assistant:``, ``Ass\\n::``)
    catching the actual failure mode on real-robot frames.
  * Gibberish rejections log only the first occurrence + every 30th
    after that, with a count, so the panel stays legible.
  * Empty completions no longer log at all (was every tick).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:22:36 +02:00
Pepijn d0278ea093 feat(smolvla2): render state panel in autonomous mode too
Dry-run REPL had a clean ANSI-clear-+-rich-panel layout via
``_redraw`` showing task / subtask / plan / memory / queued-actions /
pending-tool-calls; autonomous mode just had bare ``> `` plus log
lines scrolling past the user. Same data, two presentations.

Extract ``_make_state_panel_renderer(runtime, mode_label=...)`` and
use it from both ``_run_repl`` (called per user input) and
``_run_autonomous`` (called both on user input *and* on a 0.5s
background timer so subtask / plan / memory refreshes from the
runtime's own loop become visible without the user typing anything).
Title bar shows ``dry-run`` vs ``autonomous`` so it's obvious which
mode you're in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:16:28 +02:00
Pepijn 15f6b08b0e fix(smolvla2): use canonical _strip_lerobot_blocks for inference msgs
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects:

  * ``list[block]`` → keep text blocks, drop images
  * ``None``        → ``[{type:text, text:""}]``
  * ``str`` / other → ``[{type:text, text:str(content)}]``

Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there's no prior
memory, which then renders as no-content / role-marker-only and the
model hallucinates ``Assistant:`` fragments. Subtask gen got further
because its prompt always has at least the task string.

Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:07:39 +02:00
Pepijn fc715db4a3 fix(smolvla2): coerce str content to list-of-blocks for chat template
SmolVLM's chat template (and many other multimodal templates) declares
``message['content']`` as a list of typed blocks and iterates it
expecting dicts with a ``'type'`` field:

    {% for line in message['content'] %}
      {% if line['type'] == 'text' %}{{ line['text'] }}
      {% elif line['type'] == 'image' %}{{ '<image>' }}
      {% endif %}
    {% endfor %}

When the caller passes ``content`` as a plain ``str`` (which we did
throughout ``_msgs_for_subtask`` / ``_msgs_for_memory`` etc.), Jinja
silently iterates the string character-by-character. ``'P'['type']``
returns nothing; neither branch fires; *no text tokens get emitted*.
The model receives a prompt containing only role markers
(``User:<end_of_utterance>\nAssistant:``) and predictably continues by
emitting ``Assistant:`` fragments — the gibberish ``subtask: Ass\n::``
on the runtime panel.

Before calling ``apply_chat_template``, walk the messages and rewrite
any string ``content`` into ``[{'type': 'text', 'text': content}]``.
The template's text branch then fires correctly and the model sees
the actual user/assistant text, not just structural tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:01:53 +02:00
Pepijn fe4bd2b6ba fix(smolvla2): pass flat batch dict to preprocessor (no manual wrap)
``PolicyProcessorPipeline.__call__`` already wraps its input via
``to_transition`` (defaulting to ``batch_to_transition``) before
running the steps, and unwraps via ``to_output`` (defaulting to
``transition_to_batch``) afterwards. The input format is therefore a
*flat batch dict* keyed by ``observation.*`` / ``action`` / etc., not
an ``EnvTransition``.

Previous attempt pre-wrapped the observation into a transition with
``TransitionKey.OBSERVATION`` as the key, then handed *that* to the
pipeline — which fed it to ``batch_to_transition``, which looked for
top-level ``observation.*`` entries, found none (they were nested
inside the enum key), and produced an empty observation. Every step
then bailed with ``ObservationProcessorStep requires an observation
in the transition.``

Pass the flat dict from ``build_inference_frame`` straight to the
preprocessor — it does the wrap/unwrap itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:54:48 +02:00
Pepijn 3f7436ff8a fix(smolvla2): use TransitionKey enum (not .value) as transition keys
``EnvTransition`` is declared as a ``TypedDict`` keyed by
``TransitionKey.OBSERVATION.value`` (the string ``'observation'``),
but every concrete ``ProcessorStep`` in the pipeline indexes the
transition with the enum *member* (``transition[TransitionKey.
OBSERVATION]`` / ``transition.get(TransitionKey.OBSERVATION)``).
Those are two different keys in a Python dict — string key vs enum
key — so steps couldn't find the observation we'd placed under the
string variant, and bailed every tick with
``ObservationProcessorStep requires an observation in the
transition``.

Build the transition with the enum members directly. Matches how
``BatchProcessor``, ``RelativeActionProcessor``, ``HilProcessor``,
etc. read the dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:50:22 +02:00
Pepijn 992d13d4e9 fix(smolvla2): use build_inference_frame for raw robot observations
``robot.get_observation()`` on omx_follower (and most lerobot robots)
returns:

  * per-joint scalar floats with ``.pos`` suffix
    (``shoulder_pan.pos: 0.123``, ``shoulder_lift.pos: 0.456``, ...)
  * per-camera ndarrays keyed by the camera config name (``wrist:
    ndarray(H,W,3)``)

But the trained policy expects:

  * single ``observation.state: tensor[N_joints]`` vector
  * image keys prefixed: ``observation.images.<cam_key>:
    tensor[1, 3, H, W]``

``prepare_observation_for_inference`` only handles the tensor /
batch-dim / device step — it crashes on scalar floats with
``expected np.ndarray (got float)``. The right helper is
``build_inference_frame`` which uses the dataset's feature schema
(``ds_meta.features``) to:

  1. extract the right raw keys per dataset feature,
  2. fold ``shoulder_pan.pos`` / ``shoulder_lift.pos`` / ...
     into a single ``observation.state`` ndarray,
  3. prefix camera keys with ``observation.images.``,
  4. delegate to ``prepare_observation_for_inference`` for the
     tensor / batch / device step.

Pass ``ds_meta.features`` into the observation provider and switch
to ``build_inference_frame`` when available; fall back to the bare
``prepare_observation_for_inference`` only when no dataset is
provided (rare — autonomous mode already requires it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:47:59 +02:00
Pepijn afe40a016b fix(smolvla2): wrap robot obs in EnvTransition before preprocessor
The policy preprocessor pipeline is transition-shaped — its steps
read ``TransitionKey.OBSERVATION`` off an ``EnvTransition`` dict, not
a flat ``RobotObservation`` dict. Passing the raw observation through
made every step bail with
``ObservationProcessorStep requires an observation in the transition``,
which the runtime swallowed at warning level. ``select_message`` then
got called with no ``observation.images.*`` features and crashed
with ``All image features are missing from the batch``.

Mirror ``lerobot-record``'s preamble:
  1. ``prepare_observation_for_inference`` → numpy → torch, ``CHW``
     image layout, ``[0,1]`` scaling, add batch dim, move to device.
  2. Wrap into an ``EnvTransition`` (``{TransitionKey.OBSERVATION.value:
     ...}`` plus ``COMPLEMENTARY_DATA: {}`` and ``None``s for the rest)
     so transition-aware steps see the keys they expect.
  3. Run preprocessor.
  4. Unwrap the transition's ``OBSERVATION`` slot to get the final
     flat dict the policy's ``select_action`` / ``select_message``
     consume.

Image features now reach the policy; the autonomous loop produces
real actions instead of swallowing warnings every tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:44:24 +02:00
Pepijn 41095e3cc3 fix(smolvla2): instantiate CameraConfig subclasses from JSON dicts
``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but
``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner
value must be the actual ``CameraConfig`` subclass instance for the
chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts
blew up in ``RobotConfig.__post_init__`` with
``AttributeError: 'dict' object has no attribute 'width'`` when it
iterated cameras and tried to read attributes.

Look up the right subclass per-camera by its ``"type"`` field via
``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import
dance we already do for ``RobotConfig``: eagerly walk
``lerobot.cameras``'s submodules so the registry is populated
before lookup). Construct an instance with the rest of the dict's
fields. On an unknown camera type, raise a clean ``ValueError``
listing the available choices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:39:28 +02:00
Pepijn e0fa957569 fix(smolvla2): eagerly import robot submodules before get_choice_class
``RobotConfig._choice_registry`` is populated as a side-effect of
each robot's ``@RobotConfig.register_subclass`` decorator running,
and those decorators only fire when the corresponding
``lerobot.robots.<name>`` module is imported. The package's
``__init__.py`` doesn't import them — instead ``make_robot_from_config``
does it lazily in its big if/elif chain.

``_build_robot`` jumped the gun: called ``RobotConfig.get_choice_class
(robot_type)`` before any robot module had been imported, so the
registry was empty and every ``--robot.type=<X>`` produced
``KeyError: 'X'`` (e.g. ``KeyError: 'omx_follower'``).

Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules`` and
``importlib.import_module`` each one before the lookup. ~200ms on the
first invocation, negligible for an autonomous run. On a real
``KeyError`` (typo / unsupported robot), raise a clean ``ValueError``
listing the registry's available choices instead of a bare KeyError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:31:58 +02:00
Pepijn c661d81409 fix(smolvla2): use RobotConfig.max_relative_target, drop --max_action_norm
The hand-rolled action-norm safety clip duplicated what every
``RobotConfig`` already exposes — ``max_relative_target`` — and at
the wrong layer (after postprocess but before send_action, instead
of inside the robot driver where every other lerobot entry point
puts it). The norm clip also rejected entire actions instead of
clipping per-motor relative motion, so a single rogue joint would
kill the whole tick.

Replace with ``--robot.max_relative_target``: a string parsed as
either a bare float (uniform per-motor cap) or a JSON object
mapping motor name → cap. Passed through to
``RobotConfig(max_relative_target=...)`` at robot construction;
the driver's ``send_action`` clips each commanded joint position
relative to the current measured one before issuing it on the bus —
same behaviour ``lerobot-record`` ships.

Also bump ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new
chunk per second is what the trained checkpoint can comfortably
keep up with on common hardware and gives smoother motion than
sub-second chunk regenerations (no RTC interpolation between
chunks yet).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:41:57 +02:00
Pepijn 965d42825f review: skip-count fix, atomic writes, dedupe span reconstruction, role guards
**#1 Plan-update phase reports correct skip count.**
``_run_plan_update_phase`` only ran ``run_plan_updates`` for episodes
with at least one interjection but hardcoded ``episodes_skipped=0``.
The summary undercounted skipped episodes. Now returns
``len(records) - processed`` so processed + skipped == total.

**#2 ``run_hf_job.py`` installs ``openai``.**
The ``CMD`` block does ``pip install --no-deps lerobot[branch]`` then
explicitly lists transitive deps. ``openai`` was missing — and since
``VlmConfig.backend`` defaults to ``"openai"``, the job would have
``ImportError``'d when ``vlm_client._make_openai_client`` ran.

**#3 Dedupe subtask-span reconstruction.**
Module 1's ``_reconstruct_subtasks_from_rows`` (no ``and spans`` guard)
and Module 2's ``_read_subtask_spans`` (with the guard) had near-
identical logic. Promoted to ``reconstruct_subtask_spans`` in
``reader.py`` using the safer guarded form. Both modules now import
the single helper.

**#5 Atomic staging.py JSONL writes.**
Mirroring the parquet-writer fix from an earlier review round:
``EpisodeStaging.write`` now writes to a sibling ``.tmp`` and
``Path.replace`` atomically. A crash mid-write can no longer leave a
half-written JSONL that ``read()`` would then fail to parse.

**#6 Atomic ``info.json`` write.**
Same pattern in ``executor._ensure_annotation_metadata_in_info`` —
``info.json`` is load-bearing for dataset metadata, so partial writes
brick the dataset.

**#7 Writer's role-key guard.**
``_normalize_persistent_row`` and ``_normalize_event_row`` accessed
``row["role"]`` directly while every other field used ``.get()``.
Pre-validate ``"role" in row`` and raise a friendly ``ValueError``
naming the row, so a future module that accidentally drops ``role``
fails with a triagable message instead of a bare KeyError deep in the
writer.

**#8 Last subtask span's ``end`` extends to episode end.**
``reconstruct_subtask_spans`` (the new shared helper) takes an optional
``episode_end_t``. When provided, the final span's ``end`` is closed
to that timestamp instead of equalling its own ``start`` (zero
duration). Both Module 1's plan-update pass and Module 2's interjection
anchoring pass ``record.frame_timestamps[-1]``, so downstream "current
subtask at refresh_t" lookups no longer miss refreshes that land
inside the final span.

Sweep: 66 passed, 0 failed. Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:18:09 +02:00
Pepijn 1238a0cd47 test(annotate): unstale the two failing module tests
Both tests were stale relative to design changes that landed earlier on
this branch. Update the tests to match the current production contract.

**``test_module1_attaches_video_block_to_subtask_prompt``**

The test took ``captured[0]`` and asserted on its content blocks, but
Module 1 issues several sub-prompts and the rephrasings call (which is
text-only, no video block) usually lands first. Two fixes:

* The test's intent is "the subtask prompt carries the video block" —
  not "the first prompt carries it". Pick the call by content
  (``"atomic subtasks"`` keyword in the text block) so the test is
  resilient to future reordering of unrelated sub-prompts.
* Set ``n_task_rephrasings=0`` so the rephrasings call is skipped
  entirely — keeps the test focused on ``_generate_subtasks``.

**``test_module2_mid_episode_emits_paired_interjection_and_speech``**

Two issues both rooted in design changes on the branch:

1. ``InterjectionsAndSpeechModule._mid_episode_interjections`` now
   anchors interjections on subtask boundaries from Module 1's staging
   tree, bailing out with zero rows when no spans exist. The production
   executor runs Module 1 first; the test ran Module 2 in isolation.
   Reproduce the contract by seeding two ``style=subtask`` rows in the
   staging before calling Module 2 — gives it the single ``0 → 1``
   boundary it needs.
2. The test's stub responder used the marker ``"ONE realistic
   interruption"`` to match the interjection prompt, but that string is
   from a previous prompt version. The current
   ``module_2_interjection.txt`` says ``"Write ONE interjection..."`` —
   the old prompt asked for counterfactual interjections (e.g. "skip the
   wipe"), the new one anchors on the upcoming subtask. Marker updated
   to ``"Write ONE interjection"``; canned response wording aligned to
   the new design.

Sweep on the language stack: 66 passed, 0 failed (was 64 passed, 2
failed). Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:59:27 +02:00
Pepijn 53c7641885 review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).

**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.

**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.

**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.

**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
  Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
  ``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
  so without it ``_make_openai_client`` would ImportError on a fresh
  ``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
  ``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
  Promoted to ``snap_to_frame`` in ``reader.py`` (next to
  ``EpisodeRecord``); both modules now import it. Backwards-compat alias
  not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
  call. Now memoizes via a private dataclass field so repeat calls from
  different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
  that was dead because the prior block already handled and reset
  ``escape``. Replaced with a comment explaining the invariant.

**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
  real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
  ``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
  collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
  ``_server_is_up`` — the URL is the user-configured local-server endpoint
  the CLI itself spawned, not arbitrary user input.

**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.

Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:53:43 +02:00
Pepijn 088c8371df refactor(annotate): consolidate Module 1's prompt → VLM → JSON-extract pattern
Five Module 1 sub-prompts (`_derive_task_from_video`,
`_generate_task_rephrasings`, `_generate_subtasks`, `_generate_plan`,
`_generate_memory`) all repeated the same shape:

    result = self.vlm.generate_json([messages])[0]
    if isinstance(result, dict) and isinstance(result.get(<field>), <type>):
        ...

…each spelled with slightly different field names + post-processing.

Three small helpers replace it:

* `_vlm_field(messages, field)` — single VLM call, returns
  ``result[field]`` or ``None``. Centralizes the
  ``generate_json([m])[0]`` + ``isinstance(dict)`` dance.
* `_text_message(text)` — wraps a string in the canonical user-message
  shape every text-only prompt builds inline.
* `_video_message(record, prompt)` — combines the episode video block
  with a prompt; replaces the duplicated video-block construction
  inside `_generate_subtasks` (which previously inlined the same
  ``use_video_url``/``frames_per_second``/``max_video_frames`` branches
  that `_episode_video_block` already implements).

Net -35 LOC. Each call site now is 3-5 lines instead of 10-20. The
public method signatures are unchanged so tests don't move.

Drive-by: `_task_seems_bad` collapsed via SIM103 fix; `zip` in
`run_plan_updates` annotated `strict=True` per ruff B905.

Tests: same 2 pre-existing module-impl failures
(`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`) —
they were failing on `origin/feat/language-annotation-pipeline` before
this commit and continue to do so for the same reasons. 61/63 in the
language stack pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:29:45 +02:00
Pepijn 3a52a18b0e Merge branch 'feat/language-columns' into feat/language-annotation-pipeline
Resolve conflicts and pull in the latest PR 1 fixes.

Conflicts:
- pyproject.toml: PR 1 added `lerobot-rollout` and PR 2 added
  `lerobot-annotate` to the same `[project.scripts]` block. Kept both.
- uv.lock: dropped both sides and regenerated against the merged
  `pyproject.toml` (PR 2 dropped the `datatrove` dep when distribution
  moved to HF Jobs; PR 1's lock didn't have it).

Test follow-up:
- `tests/annotations/test_pipeline_recipe_render.py` — PR 1 deleted
  `src/lerobot/configs/recipes/pi05_hirobot.yaml` (review feedback:
  remove the canonical-recipe file; recipes are user-supplied). The
  cross-PR contract this test guards is "the recipe DSL renders
  non-empty messages from pipeline output", which doesn't depend on
  any specific YAML, so the test now builds an inline blend recipe
  with the same coverage. Passes.

Sweep: 82 passed, 2 failed (pre-existing module-impl bugs:
`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`).
The PR 1 carryover (`test_emitted_at_raises_on_ambiguous_per_camera_vqa`)
is now passing — the merge brought in PR 1's tightened `_select_one`
ambiguity check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:13:11 +02:00
Pepijn dad2cf1178 refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch
The executor previously claimed it would "optionally hand off" to
datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it
already runs phases inline in every code path, and HF Jobs (see
``examples/annotation/run_hf_job.py``) is the actual distribution
strategy. Stop pretending we have an executor selector.

* `executor.py`: drop `select_executor_class`, the "kind" log line, and
  the references to LocalPipelineExecutor / SlurmPipelineExecutor.
  Module docstring now says distribution is delegated to HF Jobs.
* `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`,
  `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only
  `episode_parallelism`. While here, prune the longer "why" docstrings
  on every field down to the load-bearing bits — full story moves to
  `docs/source/annotation_pipeline.mdx`.
* `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the
  `[annotations]` extra; the dep was only there for the (never used)
  cluster executors. Comment block notes the new HF-Jobs delegation.
* `reader.py`, `lerobot_annotate.py`: drop their own datatrove /
  flavor-namespace mentions.
* `docs/source/annotation_pipeline.mdx`:
  - remove the flavor-namespace / sidecar paragraph (out of scope —
    "multiple revisions = multiple copies" is dataset-level policy);
  - remove the "writer drops the legacy `subtask_index` column" note
    (already covered by PR 1's intentional-break call-out);
  - remove the chat-template + `apply_chat_template(messages, tools=...)`
    line (covered by Tools doc);
  - replace the "executor picks Local vs Slurm" paragraph with
    `--executor.episode_parallelism` and a pointer to HF Jobs;
  - rewrite the style→recipe section to talk about "recipes" generically
    instead of pinning a specific YAML;
  - add a "Running on Hugging Face Jobs" section pointing at
    `examples/annotation/run_hf_job.py`;
  - add a "Running locally" example matching the CLI's docstring
    (`uv run lerobot-annotate --root=... --vlm.model_id=...`);
  - extend the paper-inspirations list with Pi0.7 and Steerable VLA
    Policies (Zhao 2025) for Module 3.

Tests: same 3 pre-existing failures as before this commit (2 module
assertions still in flight; 1 carryover from PR 1). 41/44 pass.
Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
Pepijn bce5387e04 Merge branch 'main' into feat/language-columns 2026-05-08 10:29:49 +02:00
Pepijn 85576acc29 docs(tools): drop follow-up-PR references
Reword the two callouts in `tools.mdx` to describe the runtime layer
in present tense ("not part of the catalog layer shipped today",
"those modules don't yet exist in the tree") instead of pointing at a
specific follow-up PR. Keeps the doc honest about what works now
without coupling it to a particular release order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 20:29:42 +02:00
Pepijn e7e5fca5de review: emitted_at uses 0.1s tolerance; MessageTurn requires stream at construction
* **Float tolerance in `emitted_at` for persistent styles.** The
  ``_timestamp(row) == t`` exact-equality check silently missed any
  caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``)
  even though the parquet timestamp would only differ by ULPs. Added
  ``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance``
  instead, with a docstring explaining why exact equality wasn't
  enough and why 0.1 s is safe at typical 30–100 Hz control rates.
  Test asserts the new behavior at half-window (matches) and
  double-window (no match) using the constant so it stays in sync.

* **`MessageTurn.stream` is required at construction.** It was typed
  ``MessageStream | None = None`` so YAML could omit ``stream:`` and
  pass the dataclass invariant — but ``_validate_rendered`` rejected
  ``None`` streams later, surfacing the error at the first sample
  instead of at recipe load. Now ``__post_init__`` raises
  ``ValueError`` if ``stream`` is ``None``, with the list of valid
  streams in the message. The redundant late-stage check in
  ``_validate_rendered`` is replaced with a one-line comment that
  cites the upstream invariant. Test pins the new construction-time
  rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:55:08 +02:00
Pepijn beb22afd81 review: dedupe regex, centralize column names, harden collate, more tests
* **#2 — dedupe `_PLACEHOLDER_RE`.** The same regex was compiled in
  `recipe.py` and `language_render.py`. Promote to module-level
  `PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares
  template syntax) and import from `language_render.py`.
* **#3 — centralize language column names.** `io_utils.py` had
  hardcoded `{"language_persistent", "language_events"}` literals at
  two sites. Replace with `LANGUAGE_COLUMNS` import so a future column
  rename can't silently desync.
* **#4 — defensive collate preserved-keys.** `lerobot_collate_fn`
  silently filtered language fields from samples that didn't have
  them, which would hand downstream consumers a preserved list
  shorter than the tensor batch. Now: if any sample carries a key,
  every sample in the batch must carry it; otherwise raise a
  `ValueError` so the upstream rendering bug surfaces at the boundary.
* **#5 — `_scalar` rejects non-singleton lists.** Previously a zero-
  or multi-element list fell through and triggered confusing
  `float([])` errors downstream. Now raises `ValueError` with the
  actual length.
* **#6 — refactor `_extract_complementary_data`.** Replace 11 lines
  of `key = {... if ... else {}}` plus an 11-line splat dict with a
  single `_COMPLEMENTARY_KEYS` tuple iterated once.
* **#7 — document `EXTENDED_STYLES`.** Was an empty `set()` with no
  comment. Add a docstring explaining it's an intentional extension
  point: downstream modules append project-local styles before
  `column_for_style` is called.
* **#9 — `tools.mdx` notes the runtime layer is future work.** The
  page referenced `src/lerobot/tools/`, `registry.py`, and
  `get_tools(meta)` — none exist in this PR. Added a callout at the
  start of "How to add your own tool" plus a note on the
  implementations paragraph.
* **#10 — tests for YAML round-trip, malformed rows, blend
  validation.** `test_recipe.py` grew from 1 case to 12 covering:
  blend-or-messages exclusivity, target-turn requirement, blend
  emptiness, weight presence/positivity, nested-blend rejection,
  `from_dict` with nested blends, `from_yaml` / `load_recipe`
  agreement, top-level non-mapping rejection. Added a malformed-row
  test for `_normalize_rows` that asserts non-dict entries raise
  `TypeError`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:06:38 +02:00
Pepijn 33a4b4a5a0 feat(smolvla2): autonomous robot mode in lerobot-smolvla2-runtime
The runtime CLI was deliberately scoped to dry-run only: it
hard-coded ``robot_executor=None`` and printed a "real-robot
integration is a follow-up" warning even when ``--no_robot`` was
omitted. The runtime *engine* was already structured for real-robot
operation (separate ``LowLevelForward`` chunk-rate generation +
``DispatchAction`` ctrl-rate dispatch with a ``robot_executor``
hook); only the wiring was missing.

Add the wiring:

  * ``_load_policy_and_preprocessor`` now also returns the
    postprocessor (action denormaliser).
  * ``--robot.type`` / ``--robot.port`` / ``--robot.id`` /
    ``--robot.cameras`` (JSON) build a ``Robot`` via
    ``make_robot_from_config`` and connect it.
  * ``_build_robot_observation_provider`` reads
    ``robot.get_observation()`` each call, drops the language
    columns (runtime drives messages itself), and runs the policy's
    preprocessor (rename → batch → device → normalise).
  * ``_build_robot_action_executor`` postprocesses the policy's
    action tensor (denormalise), converts to the ``{joint: value}``
    dict via ``make_robot_action(action, ds_meta.features)``, and
    calls ``robot.send_action(...)``. Optional ``--max_action_norm``
    safety clip rejects ticks whose action L2 norm exceeds the
    threshold (kill-switch when bringing up a new robot).
  * ``_run_autonomous`` runs ``runtime.run()`` in a background
    thread (the policy must keep generating chunks at chunk_hz and
    dispatching at ctrl_hz regardless of stdin) and handles user
    interjections / VQA queries from the foreground stdin loop.
    Confirmation prompt before start (skip with ``--auto_start``);
    Ctrl+C stops the thread and disconnects the robot cleanly.
  * Autonomous mode requires ``--dataset.repo_id`` for action stats
    / feature shapes — pass the same dataset the policy was trained
    on. The bootstrap path that pulls canonical task / plan / memory
    runs in both REPL and autonomous modes so the model's first
    prompt matches training distribution.

Dry-run REPL behaviour is unchanged when ``--robot.type`` is not
passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 18:30:56 +02:00
Pepijn d55b581ca1 fix(language): address review — tools accessor, motion docs, conditional collate
* **`meta.tools` actually reads `info.json["tools"]`.** `DatasetInfo`
  had no `tools` field, so `from_dict` silently dropped the key (it
  warned about unknown fields then discarded them) and the property
  always returned `DEFAULT_TOOLS`. Added `tools: list[dict] | None`
  to the dataclass; `to_dict()` drops it when unset so existing
  datasets keep a clean `info.json`. Fixed the accessor to read
  `self.info.tools` (the previous `.get(...)` would have raised
  AttributeError on the dataclass anyway). Added regression tests:
  fallback when absent, round-trip from disk, and round-trip
  through `DatasetInfo.from_dict` / `to_dict`.

* **`motion` is not view-dependent — fix the docs.** The mdx claimed
  rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES
  = {"vqa", "trace"}` and the validator agrees: motion primitives are
  joint/Cartesian-frame, not pixel-space. Updated both call-out
  paragraphs in `language_and_recipes.mdx`.

* **Conditional `collate_fn` swap.** Added `meta.has_language_columns`
  and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it,
  so non-language datasets keep PyTorch's `default_collate`. Also
  added a pass-through test in `test_collate.py` that asserts on a
  plain tensor batch the custom collate matches `default_collate`
  key-for-key, plus a test for the `None`-sample drop path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:51:06 +02:00
Pepijn 24d2ffe3c6 fix(language): keep base install green — drop processor re-export, gate dataset-extra tests
`lerobot.processor` re-exported `RenderMessagesStep` at the package
level, so importing anything from `lerobot.processor` pulled in
`lerobot.datasets.language` → `lerobot.datasets/__init__.py` →
`require_package("datasets")`, which fails in the Tier 1 base install
that intentionally omits the `[dataset]` extra. The chain bricked
collection for unrelated suites (`tests/policies/pi0_pi05/...`,
`tests/envs/...`, etc.).

* Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The
  only consumer (the test) already imports from the submodule.
  Document the deliberate omission in the module docstring.
* Add `pytest.importorskip("datasets", ...)` (and `pandas` where
  needed) at the top of the four PR-added tests that exercise the
  language stack:
  - tests/datasets/test_language.py
  - tests/datasets/test_language_render.py
  - tests/processor/test_render_messages_processor.py
  - tests/utils/test_collate.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:12:54 +02:00
Pepijn 789f29aa56 chore: fix CI — collapse short ValueError to one line, refresh uv.lock
* `ruff format` on CI (newer version) wants the short `camera=None`
  ValueError on a single line.
* `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0`
  pin (and picked up upstream `s390x` marker fixes for cuda packages).
  CI runs `uv sync --locked` which rejected the divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:05:42 +02:00
Pepijn a356b12c41 fix(language): always raise on ambiguous resolver matches
`_select_one` previously skipped its ambiguity check whenever any of
`role`/`tool_name`/`camera` was set, on the assumption that the caller
had already pinned down a unique row. That left a real ambiguity hole
for VQA: with two cameras emitting `(vqa, assistant)` at the same
frame, `emitted_at(..., role="assistant")` silently picked the first
sorted row instead of telling the recipe to add `camera=...`. The
existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test
already encoded the desired behavior.

Tighten the check: any time `len(rows) > 1` we now raise with the
selectors echoed back, so users see exactly which fields they passed
and that more is needed to disambiguate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:00:45 +02:00
Pepijn e8327b8e62 refactor(language): unify resolver dispatch and prune redundant test scaffolding
* Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`;
  only `emitted_at` actually consults events. The dispatcher in
  `_resolve_spec` now passes events conditionally.
* Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a
  single `_row_sort_key` and drop the `sort_key` parameter from
  `_select_one`. Event rows lack `timestamp` (it is implicit in the
  frame) and now default to `0.0` for sort purposes — the
  `(style, role)` tiebreaker is unchanged.
* Inline `_select_latest` into `active_at` (its only caller).
* Collapse `emitted_at`'s dual-branch into one `_select_one` call.
* Tighten `_validate_persistent_resolver` to a single
  `column_for_style(style) != LANGUAGE_PERSISTENT` check.
* Parameterize `test_per_camera_blend_renders_both_views` over the two
  cameras and factor the sub-recipe builder into `_vqa_subrecipe` so
  the test no longer hand-rolls two near-identical recipe blocks.

Net -98 LOC; behavior, public resolver names, and test expectations
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:15:45 +02:00
Pepijn c450298147 Apply ruff and prettier formatting after merge
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:10:41 +02:00
Pepijn 5c30b14929 Merge remote-tracking branch 'origin/main' into feat/language-columns 2026-05-06 12:09:13 +02:00
Pepijn a764c3e1d6 fix(datasets,annotate): tag pushed dataset + clean revision error
Two bugs combining to make the brand-new ``_tool3`` dataset
unloadable:

1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated
   dataset folder but never creates a codebase-version tag, so
   ``api/datasets/<repo>/refs`` returns ``"tags": []``. Then
   ``LeRobotDatasetMetadata`` → ``get_safe_version`` →
   ``get_repo_versions`` returns empty and the loader raises
   ``RevisionNotFoundError``.

2. ``RevisionNotFoundError`` itself was unconstructible: its
   ``HfHubHTTPError.__init__`` indexes ``response.headers``
   unconditionally on current ``huggingface_hub`` versions, so
   constructing it without a real ``Response`` blew up with
   ``AttributeError: 'NoneType' object has no attribute 'headers'``,
   masking the real "no tag" message.

Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and
``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset',
exist_ok=True)`` so the dataset is loadable straight from the Hub on
the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree
``CODEBASE_VERSION`` if info.json is missing/malformed; on tag
creation failure, prints the manual one-liner the user needs.

Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which
inherits HfHubHTTPError) for what is really a config issue, not an
HTTP failure. Raise plain ``RuntimeError`` with the same message —
the caller actually sees what's wrong instead of an upstream
attribute error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:23:18 +02:00
Pepijn b416f287f2 fix(datasets): raise readable error when repo has no version tags
``RevisionNotFoundError`` inherits from
``huggingface_hub.HfHubHTTPError`` which made ``response`` a required
keyword-only argument on recent versions. Constructing it with just a
message string blew up with
``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only
argument: 'response'`` instead of surfacing the actual problem (the
dataset/checkpoint repo doesn't exist on the Hub yet).

Pass ``response=None`` explicitly. Fall back to the bare-message form
for older ``huggingface_hub`` versions that don't accept the kwarg.
Also clarify the message to call out the most common cause: typing a
hub repo id that hasn't been pushed yet (instead of just "needs a
version tag").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:12:40 +02:00
Pepijn aa749d4947 chore(annotate): throttle Module 3 + executor parallelism to fix vLLM stall
Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and
``executor.episode_parallelism=32``. With 2 cameras per dataset that
produced ~12× the original VQA call volume, all submitted concurrently.
Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's
KV cache pegged at 94% with 800+ in-flight requests, and the
multimodal cache corrupted with ``AssertionError: Expected a cached
item for mm_hash='...'`` (a known vLLM bug under image-heavy
concurrency). Module 1 and 2 ran fine; Module 3 was the bottleneck.

Pull back the multipliers to land in a sustainable spot:

  * module_3.K: 3 (kept) — three diverse questions per emission,
    where the diversity actually helps the LM head.
  * module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original
    emission rate. Net VQA volume is now ~3× original (K alone) on
    a single camera, ~6× across both cameras — manageable.
  * module_2.max_interjections_per_episode: 9 → 6 — still 2× the
    default, fewer than the prior 3× to keep total request volume
    in check.
  * vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the
    multimodal request path so the mm_cache doesn't desync.
  * executor.episode_parallelism: 32 → 16 — half the episodes
    in flight at once, so peak vLLM load is ~half.

n_task_rephrasings stays at 30 (text-only, doesn't load the image
path) and vlm.temperature stays at 0.7. The diversity gains are
preserved; only the throughput knobs come down.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:07:18 +02:00
Pepijn 1394a6ab5d chore(annotate): bump diversity knobs ~3x to fight memorisation
Following Pi0.7 §V (prompt expansion / diverse context conditioning),
push more atom variants per episode and higher VLM sampling
temperature so the training distribution has enough wording diversity
that the LM head is forced to use its parameters rather than memorise
specific (prompt, target) pairs.

Changes vs prior annotation pass:

  * vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call
    now produces diverse phrasings; same prompt yields different
    completions across emissions.
  * module_1.n_task_rephrasings: 10 → 30 — three times as many
    ``task_aug`` rows in language_persistent. ``${task}`` already
    rotates through them deterministically per sample_idx (see
    ``_resolve_task`` in language_render.py).
  * module_2.max_interjections_per_episode: 3 (default) → 9 — more
    ``user_interjection_response`` training samples + more plan
    refresh events.
  * module_3.K: 1 → 3 — three VQA pairs per emission tick instead of
    one. Combined with the hz bump below, ~6× more VQA samples.
  * module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission
    rate within each subtask span.

Pushes to a new hub repo (``_tool3``) so the working ``_tool2``
dataset stays intact for comparison. ``${task}`` already wired to
rotate through ``task_aug`` rows, so no renderer change needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:32:05 +02:00
Pepijn db9118f16f fix(smolvla2): reject gibberish high-level generations
Memorised models can collapse to dominant-mode outputs (the
JSON-token salad ``":":":":...`` from VQA training) when the prompt
drifts even slightly from training distribution. Without a guard,
that gibberish lands in ``current_subtask`` / ``current_plan`` /
``current_memory``, which feeds the next tick's prompt and cascades
into worse outputs. The user observed exactly this: a clean run
followed by a tick that wrote ``" " "`` into plan and memory, then
slow recovery several ticks later.

Add ``_looks_like_gibberish`` heuristic (alpha density, repeating
chars, JSON-prefix sniff) and apply it before mutating state in
``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``.
Bad generations are logged inline (``[info] subtask gen rejected
(gibberish): "":":":..."``) so the user can see what was dropped, but
the state stays at its last-known-good value (typically the dataset
bootstrap) instead of being polluted.

VQA path is intentionally exempt — its training targets *are*
JSON-shaped, so the heuristic would false-positive on them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:07:25 +02:00
Pepijn 7a945d7bdc fix(smolvla2): bootstrap canonical task + plan/memory from dataset
The user-typed task and the dataset's canonical task differ in
wording (capitalisation, ``green box`` vs ``green bin``, etc.). With
``text_loss`` driven down to ~6e-6 across 78 epochs the model is
memorised on the *exact* rendered training prompts: any wording drift
puts the prompt out of distribution and the model collapses to its
dominant training mode (VQA JSON output).

When ``--dataset.repo_id`` is set, automatically:
  * read the canonical task string from the chosen episode (and use
    it as ``--task`` when the user didn't pass one);
  * pull the active ``plan`` / ``memory`` / ``subtask`` rows from the
    persistent slice (latest row whose timestamp ≤ start frame's
    timestamp — same semantics as the renderer's ``active_at``) and
    seed them into the runtime state.

The first prompt the runtime builds at REPL start now mirrors what
the recipe rendered during training (task + active plan + active
memory + optional current subtask). The user can still override any
of these by typing.

Memorisation itself is upstream (training mix collapsed to too few
unique high-level targets); this commit only fixes the inference-side
prompt mismatch that was making the memorisation surface as gibberish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:00:36 +02:00
Pepijn a47e535b02 fix(smolvla2): per-recipe inference prompts to match training shape
The four high-level steps shared one generic
``_control_context_messages`` that jammed task + plan + memory +
completed_subtask into a single user message. The recipes in
``smolvla2_hirobot.yaml`` each have a *specific* multi-message layout
(``memory_update``: ``user(task) → assistant(prev memory) →
user(completed subtask)``; ``high_level_subtask``: ``user(task+plan+
memory) → user(current subtask)``; ``user_interjection_response``:
``user(task) → assistant(prev plan) → user(interjection)``). After
``apply_chat_template`` those layouts produce different prompts than
the runtime's flattened single-user-turn version, and the model fell
back to its dominant training mode (VQA JSON output) — generating
``":":":":":":...`` repetition.

Add four per-recipe prompt builders (``_msgs_for_subtask``,
``_msgs_for_memory``, ``_msgs_for_interjection``, ``_msgs_for_vqa``),
each mirroring its sub-recipe's exact message structure including
the ``if_present`` skips. Wire each high-level step to its matching
builder. Inference prompts now line up with what the model saw in
training, so generation should produce coherent text instead of
repeated tokens.

Generic ``_control_context_messages`` is kept (still used by tests
and the no-recipe fallback path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:47:22 +02:00
Pepijn 6d9b431b54 fix(smolvla2): match training's text-loss forward in select_message
Previous rewrite drove generation through ``vlm.generate()`` (the
standard SmolVLM path), which ignores SmolVLA's custom ``embed_prefix``
that interleaves images + lang + state. Result: the model received a
prompt format it had never been trained on at inference and emitted
JSON-fragment gibberish (``" " " ,",","`` ``cube lift {"...``).

Revert to the cumulative-buffer AR loop driven through
``vlm_with_expert.forward`` — the *same* forward call ``_compute_text_loss``
makes during training (``inputs_embeds=[prefix_embs, None],
use_cache=False, fill_kv_cache=True``). With ``fill_kv_cache=True``,
every layer routes through ``forward_attn_layer``, which gracefully
skips ``None`` expert inputs (``if hidden_states is None or layer is
None: continue``); cross-attention layers — which would otherwise hard-
require a non-None expert input — are bypassed entirely.

Inference now sees the same prefix structure as training: images +
lang + state, with new tokens appended to the lang region. The text
distribution matches what the model was trained to produce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:42:15 +02:00
Pepijn 347e706326 fix(smolvla2): drop pixel_values from select_message generate path
SmolVLA's image preprocessor sizes frames to whatever the action
expert was trained on, but SmolVLM's standard vision tower expects
its own default tile grid (e.g. 384/14 → 27×27 patches). The
mismatch surfaces deep in the post-vision reshape as
``RuntimeError: shape '[2, 34, 34, 768]' is invalid for input of
size 1843200`` — the model has 1200 patches but expects 34×34=1156.

Drop ``pixel_values`` from ``vlm.generate(...)`` so SmolVLM runs as
a text-only LM at REPL time. The high-level branches (subtask /
plan / memory) are dominated by their text context anyway, so this
is acceptable for dry-run inference. VQA loses its image grounding
— that will be marked as expected for the dry-run path until a
follow-up either re-processes images through SmolVLM's own
``ImageProcessor`` to match its tile grid, or gives
``vlm_with_expert`` a real AR text decode mode that handles state
and image embeddings the way training does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:36:53 +02:00
Pepijn fa8ae1e89b fix(smolvla2): drive select_message through SmolVLM.generate
The hand-rolled AR loop in ``select_message`` was fighting the
underlying ``vlm_with_expert.forward`` design, which assumes the
"prefix-once + suffix-always-via-expert" pattern that ``denoise_step``
uses for action chunks. Cross-attn layers (every other layer with
``attention_mode='cross_attn'`` + ``self_attn_every_n_layers=2``)
hard-require an expert input on every call: passing
``inputs_embeds=[current_embs, None]`` crashed at
``expert_layer.input_layernorm(None)`` with ``'NoneType' object has
no attribute 'dtype'``. Earlier KV-cache attempts ran into the
matching ``[15, 139] vs [15, 1]`` shape mismatch because the cache
gets *overwritten*, not appended, on each ``fill_kv_cache=True`` call
— there's just no AR-text-decode mode in this forward.

Stop fighting it: drive AR text generation through the underlying
SmolVLM via ``vlm.generate(input_ids=..., attention_mask=...,
pixel_values=...)``. KV caching, sampling/greedy, EOS handling all
come from HF's standard implementation. Trade-off: ``state`` drops
out of the prefix at inference (no slot for it on the standard
SmolVLM path), so high-level generations may drift from training
distribution slightly. That's acceptable for the dry-run REPL — the
high-level branches (subtask / plan / memory / vqa) are mostly
vision+language conditioned anyway, and the action expert (where
state actually matters) goes through the unchanged ``select_action``
path.

Image features the runtime merged in (``observation.images.*``) are
stacked into the ``[B, num_images, C, H, W]`` shape SmolVLM expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:39:34 +02:00
Pepijn 3ff6c6860e fix(smolvla2): rewrite select_message decode loop without KV cache
SmolVLA's ``vlm_with_expert.forward`` doesn't actually support
incremental KV cache growth — its only ``fill_kv_cache=True`` mode
*overwrites* the cache with the latest call's key/value states, and
its only ``fill_kv_cache=False`` mode concatenates ``cache + new``
into a local ``key_states`` for one matmul without ever updating the
cache itself. The original ``select_message`` decode loop tried to
use ``fill_kv_cache=True`` per step, which clobbered the cache to
1 token after the first decode and threw
``Expected size for first two dimensions of batch2 tensor to be:
[15, 139] but got: [15, 1]`` — the attention mask still expected
139 keys but the cached + new key_states only had 1.

Match the pattern ``denoise_step`` already uses successfully:
maintain a cumulative ``(embs, pad, att)`` buffer that starts as the
prefix and grows by one bool/embedding row per step. Each step
forwards the *full* sequence with ``use_cache=False,
fill_kv_cache=False, past_key_values=None`` so the matmul shapes
always line up. Generated-token rows are tagged ``pad=1, att=1``
which makes them fully causal among themselves while still able to
attend back to the entire prefix (per ``make_att_2d_masks``
semantics: a token can attend to any earlier token whose cumulative
``att`` count is ≤ its own).

Image encoding is still done once via the initial ``embed_prefix``
call — the expensive part doesn't repeat. The remaining cost is
O(n²) text-only transformer forwards, which is fine for the dry-run
REPL's 50–100 token responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:15:28 +02:00
Pepijn fd89efb545 fix(smolvla2): 3D attention mask in select_message decode loop
SmolVLA's ``eager_attention_forward`` does
``masked = torch.where(attention_mask[:, None, :, :], ...)``, which
requires a 3D ``[B, query_len, key_len]`` bool tensor so the
broadcast to 4D works. ``select_message``'s prefix forward got this
right (passes ``prefix_2d`` from ``make_att_2d_masks``), but the
KV-cache decoding loop built ``new_attn = torch.ones((bsize,
cur_pos + 1))`` — 2D — and the very first decode step blew up with
``IndexError: too many indices for tensor of dimension 2``.

During KV-cache decoding ``query_len = 1`` and
``key_len = cur_pos + 1`` (prefix + every token already generated),
so the right shape is ``[B, 1, cur_pos + 1]``. Match the layout
SmolVLA's working ``denoise_step`` uses for the equivalent
``prefix_pad_2d_masks`` build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:08:52 +02:00
Pepijn 2776b57c9e fix(smolvla2): bool attention mask + clean Claude-Code-style REPL
Two issues that combined to make the REPL unusable:

1. ``BatchEncoding.attention_mask`` is a ``Long`` tensor, but SmolVLA's
   ``eager_attention_forward`` does
   ``torch.where(attention_mask[..., None, :, :], ...)`` which
   requires a *bool* condition. Every forward raised ``where expected
   condition to be a boolean tensor, but got a tensor with dtype Long``
   and the diagnostic surfaced it cleanly in the REPL — but generation
   produced nothing useful. Cast to ``bool`` in ``_build_text_batch``
   so the prefix forward goes through.

2. The interactive REPL used ``rich.live.Live`` panels stacked on top
   of ``logging.basicConfig(level=DEBUG)`` HTTP request lines from
   ``httpcore`` / ``httpx`` / ``huggingface_hub``. The two rendering
   loops fought each other in the user's terminal and the output was
   illegible: hundreds of debug lines interleaved with re-rendered
   panels.

   Replace ``Live`` with a simple block redraw — clear screen, print
   the state block, print any robot log lines, then a single ``> ``
   prompt. State changes are visible above the prompt, the way Claude
   Code's REPL renders. No flicker, no re-render races.

   ``_silence_noisy_loggers`` drops the chatty third-party HTTP /
   download / model-init loggers to WARNING. ``-v`` still enables
   DEBUG on the lerobot loggers; if the user needs the HTTP traces,
   they can flip those individually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:03:47 +02:00
Pepijn 0fb5f04965 fix(smolvla2): handle BatchEncoding return from apply_chat_template
``tokenizer.apply_chat_template(..., tokenize=True, return_tensors='pt')``
on newer transformers returns a ``BatchEncoding`` (dict-like) rather
than a raw ``Tensor`` — particularly when the underlying call routes
through a processor. ``_build_text_batch`` only handled the ``Tensor``
and ``list`` shapes, so the encoding object reached SmolVLA's
``embed_language_tokens`` and ``F.embedding`` blew up with
``argument 'indices' must be Tensor, not BatchEncoding`` on every
high-level forward.

Normalise the return:
  * ``BatchEncoding`` / ``dict`` → take ``input_ids`` (and the encoder's
    ``attention_mask`` when present, since ``pad_token_id`` can be
    ``None`` for SmolVLM and the fall-back ``ids != pad_token_id``
    breaks then),
  * ``list[int]`` / ``list[list[int]]`` → wrap in a long tensor,
  * ``Tensor`` → keep as-is.

After unwrapping, ensure shape ``(1, seq)`` and that ``attention_mask``
is a tensor on the same device as ``ids``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:59:57 +02:00
Pepijn 7296ac97af fix(smolvla2): make silent generation failures visible in REPL
Two failure modes were combining to make the runtime "look dead":

1. ``_build_text_batch`` produced lang tokens via
   ``apply_chat_template(return_tensors='pt')`` on CPU, but the policy
   sits on the configured device (mps / cuda). The first prefix-embed
   inside ``select_message`` then raised a device-mismatch on every
   call. The bare ``except Exception`` in ``_generate_with_policy``
   swallowed it at debug level — no logs, no chat output, no visible
   sign anything had run.

2. Even when generation succeeded but returned an empty string
   (greedy EOS, unhappy chat template, etc.), the high-level steps
   silently no-op'd, so users saw nothing.

Move tokens to ``policy.config.device`` in ``_build_text_batch`` so
the prefix forward succeeds in the common case. Bump the swallowing
log level to ``warning`` (with optional traceback under ``-v``), and
when ``state`` is given route the same diagnostic into the REPL log
via ``push_log`` so the user sees ``[warn] subtask gen failed: ...``
inline. Also push an ``[info] ... produced no text this tick`` line
when generation runs but yields nothing, so empty completions are
distinguishable from "step never ran". Apply the same surface to
``LowLevelForward.select_action`` failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:47:34 +02:00
Pepijn 9cbbcfb6a2 fix(smolvla2): tokenize lang prompt inline before select_action
LowLevelForward was handing the observation provider's output straight
to ``policy.select_action``, but SmolVLA's ``_get_action_chunk``
indexes ``batch[OBS_LANGUAGE_TOKENS]`` and crashes with ``KeyError:
'observation.language.tokens'`` when the key isn't there. Our provider
deliberately strips the dataset's language columns (the runtime drives
messages itself), so nothing else was producing those tokens — the
chunk path crashed on the very first tick after task was set.

Build a low-level prompt from current runtime state inline (task /
plan / memory as the user turn, current subtask appended as a
continuation assistant turn when known), tokenize it with the same
helper the high-level steps use, and merge ``lang_tokens`` /
``lang_masks`` into the observation before the call. Skip the step
when no task is set yet, and swallow ``select_action`` exceptions at
debug level so a missing observation feature doesn't kill the REPL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:40:18 +02:00
Pepijn fea41b29f5 fix(datasets): probe parquet for language columns before strict cast
``_load_hf_dataset`` was building the strict cast schema only from
``meta/info.json["features"]``. Datasets annotated by
``lerobot-annotate`` but still tagged at the older codebase version
(no ``language_persistent`` / ``language_events`` entry in
``info.json``) carry both columns in the parquet itself but not in the
features dict, so ``Dataset.from_parquet`` blew up with
``CastError: column names don't match`` when trying to project a
9-column parquet onto a 7-column schema.

Probe one parquet shard's actual schema; if either language column is
present in the parquet but missing from ``features``, graft it on
using PR 1's ``language_persistent_column_feature`` /
``language_events_column_feature`` helpers. No-op when neither column
is present (fully backwards-compatible with v3.0 datasets), no-op when
both are already registered (fully forwards-compatible with future
v3.1 ``info.json`` writes).

This unblocks dry-run inference on PR 2-annotated datasets that
weren't re-tagged to v3.1 — including the ones in the field today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:31:19 +02:00
Pepijn 7b4d281ef5 fix(smolvla2): build preprocessor fresh, don't round-trip the recipe
``PolicyProcessorPipeline.from_pretrained`` reconstructs each saved
step by passing the persisted JSON config back to ``__init__``, but
``RenderMessagesStep.recipe`` (a ``TrainingRecipe``) doesn't survive
the JSON round-trip — the saved entry is ``{}`` and the reconstructor
crashes with ``missing 1 required argument: 'recipe'``.

Bypass the round-trip in the runtime CLI by passing
``pretrained_path=None`` to ``make_pre_post_processors``. That re-runs
``make_smolvla2_pre_post_processors``, which reloads the recipe YAML
referenced by ``cfg.recipe_path`` and wires it back into the step
correctly. ``NormalizerProcessorStep`` still gets stats from
``ds_meta.stats`` so normalization matches training.

Proper fix is to make ``RenderMessagesStep`` serializable (e.g. by
persisting the recipe path / contents); this commit keeps it scoped to
the runtime path so dry-run testing isn't blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:27:12 +02:00
Pepijn 29bb8bb20e fix(tools): unblock pocket-tts resolution (>=1.0.0,<3.0.0)
The previous bound `>=0.1.0,<1.0.0` matched zero published versions —
pocket-tts went straight to 1.0.0 on PyPI, with 0.x never released.
That made `uv sync --extra tools` (and any sync that pulls the `dev` /
`all` superset) fail with "requirements are unsatisfiable" on every
Python version uv tried, including 3.12.

Bump to `>=1.0.0,<3.0.0` so 1.x and 2.x are reachable. SayTool only
touches `TTSModel.load_model()`, `get_state_for_audio_prompt`,
`generate_audio`, and `sample_rate` — small enough surface that 1.x
and 2.x should both work; tighten if a real API break shows up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:15:20 +02:00
Pepijn 3fe686ce9f feat(smolvla2): runtime accepts Hub IDs + dataset-driven dry-run
The runtime CLI's loader was broken — it imported a `make_policy_from_path`
that doesn't exist in `lerobot.policies.factory` — and the high-level text
steps generated plan / subtask / memory / VQA from a text-only batch with
no images or state, so dry-runs drifted from the training distribution.

Switch to the standard `PreTrainedConfig.from_pretrained` +
`make_policy(cfg, ds_meta=...)` flow so `--policy.path` accepts both local
directories and Hub repo ids, and add a `--dataset.repo_id` path that walks
a chosen episode and feeds preprocessed observations into every forward
pass — including the four high-level steps (`HighLevelSubtaskFwd`,
`MemoryUpdateFwd`, `UserInterjectionFwd`, `AskVQAFwd`). Frames are routed
through the saved preprocessor pipeline with `language_persistent` /
`language_events` stripped so the recipe-render step stays a no-op (the
runtime supplies its own messages from current state).

Also wires the rich-based two-zone REPL layout (`ui.py`) that the script
was already importing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:09:19 +02:00
pepijn a1b8134ef1 fix(smolvla2): train on rendered language batches
Keep annotated language columns through collation, render batched recipe samples, and make SmolVLA2 text loss robust enough for distributed training on the steerable dataset.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-05 08:55:56 +00:00
pepijn 8fa8323c91 fix(annotate): sync language metadata after parquet rewrite
Ensure annotated datasets advertise language columns in meta/info.json so non-streaming dataset loads cast against the rewritten parquet schema.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-04 15:17:15 +00:00
Pepijn 5f7c6ba61d feat(annotate): compact steerable annotation prompts
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-04 15:57:04 +02:00
Pepijn 223cc8a9e2 feat(smolvla2): inference runtime — select_message + multi-rate REPL
Closes the loop on PR 3: SmolVLA2 can now be queried interactively at
inference, dispatching the same five sub-recipe shapes it was trained
on (action chunks, subtask gen, memory updates, plan/speech on
interjection, VQA on questions).

Modeling fixes + additions
--------------------------

- ``_compute_text_loss``: standard next-token CE shift was missing
  (logits at position t were CE'd against the label at t — identity-
  mapped, learning nothing). Adds ``logits[:, :-1]`` /
  ``labels[:, 1:]`` shift to match HuggingFace ``LlamaForCausalLM``.

- New ``select_message`` on ``SmolVLA2Policy``: AR text generation
  with KV caching, mirroring SmolVLA's ``select_action`` pattern.
  Single prefix forward fills the cache, then per-token forwards
  reuse it. Greedy + top-p nucleus sampling. Returns the decoded
  string with the prompt stripped.

Runtime package — ``src/lerobot/policies/smolvla2/inference/``
-------------------------------------------------------------

- ``triggers.py`` — ``Trigger`` Protocol + ``HzTrigger`` /
  ``EventTrigger`` + ``TickClock``. The whole runtime ticks at
  ``max_rate_hz=50`` and each step gates itself off its own
  cadence.

- ``runtime_state.py`` — runtime state dict factory plus tiny
  helpers (``take_event``, ``set_if_changed``, ``push_log``).
  Stable keys are documented at the top of the module.

- ``steps.py`` — :class:`InferenceStep` base + concrete steps:
  ``LowLevelForward`` / ``DispatchAction`` (action path),
  ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` /
  ``UserInterjectionFwd`` / ``AskVQAFwd`` (text paths),
  ``DispatchToolCalls`` (tool registry → ``Tool.call``). Each
  text step builds a chat-template prompt from current
  ``RuntimeState`` (task / plan / memory / subtask) matching
  what ``smolvla2_hirobot.yaml`` renders during training.
  Includes a tiny ``<say>...</say>`` parser for the
  ``user_interjection_response`` branch's combined plan + speech
  output.

- ``runtime.py`` — :class:`SmolVLA2Runtime` composes the pipeline,
  drives ticks via ``TickClock``, polls a user-supplied
  ``event_collector`` per tick, and prints state-change log lines.

- ``repl.py`` — :class:`StdinReader` non-blocking line reader
  with simple intent classification: ``stop`` / ``quit`` /
  ``exit`` → terminate; ``?`` suffix → ``user_vqa_query`` event;
  first line → set task; other lines → ``user_interjection``.

CLI
---

- ``src/lerobot/scripts/lerobot_smolvla2_runtime.py``: console
  script ``lerobot-smolvla2-runtime`` that loads a checkpoint,
  optionally instantiates ``SayTool`` (pocket-tts), wires up
  ``SmolVLA2Runtime`` + ``StdinReader``, and runs.

  Real-robot wiring (observation_provider / robot_executor) is
  intentionally left as a follow-up — v1 is dry-run / language-
  only so the REPL works without robot hardware.

  Registered in ``pyproject.toml`` ``[project.scripts]``.

Known follow-ups
----------------

- Real-robot integration: today ``LowLevelForward`` only fires when
  an observation_provider is wired. The CLI prints a warning if
  ``--no_robot`` is omitted.
- ``select_message`` runs an extra prefix forward; could share with
  the action path's prefix when both are needed in the same tick.
- Tests: no end-to-end runtime test yet (would need a tiny SmolVLM
  fixture). The components compile and the public surface is
  exercised by the CLI's argument-parsing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:04:00 +02:00
Pepijn af6d8ebd5b feat(smolvla2): dual-head forward — flow loss + lm_head text loss
The third and final commit of PR 3's SmolVLA2 work. Wires the actual
training signal through:

* ``predict_actions[i] = True``  → sample i contributes to flow loss
* ``text_labels[i, t] != -100``  → token t of sample i contributes to
                                    LM-head cross-entropy

Both routing knobs come from ``SmolVLA2ChatTokenizerStep`` (previous
commit on this branch), which builds them from the recipe's
``message_streams`` / ``target_message_indices``. The per-sample
``predict_actions`` mask preserves the Pi0.5 convention from the
plan's Section I.7: "True iff any low_level target exists".

Implementation:

- ``forward`` reads ``text_labels`` and ``predict_actions`` from the
  batch. When neither is present (vanilla SmolVLA usage with no
  recipe), delegates to ``SmolVLAPolicy.forward`` so unannotated
  datasets keep training as before — full backward compatibility.
- ``flow_loss``: super().forward(reduction="none") returns the
  per-sample (B,) flow loss; we mask non-action samples with the
  ``predict_actions`` bool and renormalize by the count of action
  samples. ``flow_loss_weight = 0`` in the config disables this
  branch entirely (text-only training).
- ``text_loss``: a prefix-only forward through the VLM (no action
  expert / suffix), slicing the lang-token range out of the
  resulting hidden states (``embed_prefix`` orders the prefix as
  ``[image_blocks..., lang, state]`` so the slice is unambiguous).
  Apply ``vlm.lm_head`` to those hidden states, cross-entropy with
  ``text_labels`` (ignore_index=-100). ``text_loss_weight = 0``
  disables this branch (reverts to flow-only behaviour, matching
  SmolVLA exactly).
- The two losses are summed with the config-supplied weights.

Mixed-stream samples (one batch containing both action targets and
text-only sub-recipes) are handled correctly: each sample contributes
where its labels are valid and is masked elsewhere.

Limitations / known follow-ups:

- Text loss runs an additional prefix-only forward separate from the
  flow path's prefix forward. The forwards could share their prefix
  computation; for clarity of this first commit they don't.
  Optimization is straightforward when needed.
- Per-sample loss for ``reduction="none"`` is not yet meaningfully
  defined for the dual path — we broadcast the scalar to (B,) for
  caller compatibility (e.g. RA-BC weighting will need follow-up).
- Inference ``select_action`` is unchanged from SmolVLA today —
  it predicts actions only. A separate "generate text"
  ``select_message`` path is the natural next step for runtime
  use of the LM head (memory updates, plan refreshes, VQA answers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:54:57 +02:00
Pepijn 37b1eb218a feat(smolvla2): chat-template processor + label mask + predict_actions
Wires PR 1's recipe stack into the SmolVLA2 pipeline so multi-target
sub-recipes (memory_update, ask_vqa, user_interjection_response,
high_level_subtask) carry meaningful supervision through to the model.

- New ``chat_processor_smolvla2.py`` with
  ``SmolVLA2ChatTokenizerStep``: reads ``messages`` /
  ``message_streams`` / ``target_message_indices`` from the rendered
  sample (PR 1 ``RenderMessagesStep``), calls
  ``apply_chat_template(messages, tools=DEFAULT_TOOLS, ...)`` on the
  SmolVLM tokenizer, and writes:

    OBS_LANGUAGE_TOKENS / _ATTENTION_MASK   ← chat-templated prompt
    text_labels                              ← -100 except target msg tokens
    predict_actions                          ← True iff any low_level target

  Builds the label mask robustly by re-rendering the chat through
  each target's prefix and reading off the prefix length — same
  tokenizer, same tools, so the prefix tokens are guaranteed to be
  a prefix of the full sequence. Image/video content blocks
  (LeRobot ``feature``-keyed) are stripped before tokenizing; the
  actual image tensors flow through SmolVLA's existing
  ``OBS_IMAGES_*`` channels and ``embed_prefix`` puts them before
  the language embeddings, matching the chat-template-stripped
  text order.

- ``processor_smolvla2.py``: when ``config.recipe_path`` is set,
  build a new pipeline with ``RenderMessagesStep`` +
  ``SmolVLA2ChatTokenizerStep`` instead of SmolVLA's plain
  ``TokenizerProcessorStep``. When ``recipe_path`` is ``None``,
  fall back to SmolVLA's pipeline so unannotated datasets still
  work unchanged. Resolves recipe paths relative to
  ``src/lerobot/configs/`` so ``recipes/smolvla2_hirobot.yaml``
  works directly.

The next commit on this branch picks up ``text_labels`` and
``predict_actions`` from the batch and routes them through the
SmolVLM ``lm_head`` for the actual dual-loss training.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:21:03 +02:00
Pepijn 52e1fd35cb feat(tools): src/lerobot/tools/ — runnable tool registry + SayTool
Ships the runtime side of the OpenAI-style function-calling stack
introduced in PR 1 (catalog in ``meta/info.json["tools"]``) and PR 2
(annotation pipeline writes the catalog after a run). One file per
tool — heavy deps stay isolated.

Layout:

- ``base.py`` — :class:`Tool` Protocol: ``name``, ``schema``,
  ``call(arguments)``. Runtime-checkable so tests can use
  ``isinstance(...)``.
- ``registry.py`` — :data:`TOOL_REGISTRY` (name → class) plus
  ``get_tools(meta, **kwargs)`` that instantiates every entry whose
  ``function.name`` is registered. Tools whose name is unknown are
  silently skipped — the schema still rides through the chat
  template, the model just can't actually invoke that tool at
  inference.
- ``say.py`` — :class:`SayTool` wrapping Kyutai's pocket-tts
  (CPU-only, ~100M params, ~6× real-time on a MacBook Air M4).
  Lazy model load: pocket-tts is imported and the voice state
  computed on first ``call(...)`` (or eagerly via ``preload()``).
  Returns the PCM tensor; optionally writes a ``.wav`` to
  ``output_dir`` for offline inspection.
- ``__init__.py`` — re-exports the public surface.

Optional install:

    pip install lerobot[tools]

The ``[tools]`` extra in ``pyproject.toml`` pulls in ``pocket-tts`` +
``scipy`` (for the wav writer). Adding more tools later means a new
file + a registry entry — no new extras unless the tool brings new
deps.

To add your own tool, follow the three-step guide in
``docs/source/tools.mdx`` (PR 1):

  1. Drop ``src/lerobot/tools/<my_tool>.py`` with a ``Tool``-conforming
     class.
  2. Register the class in ``TOOL_REGISTRY`` (this file).
  3. Pre-populate ``meta/info.json["tools"]`` with the schema (or let
     ``lerobot-annotate`` add it on the next run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:58:04 +02:00
Pepijn 7459dfccb6 feat(policies): scaffold smolvla2 (smolvla + lm_head re-enabled)
PR 3 of the steerable-annotation plan retargeted from Pi0.5 to SmolVLA
because the recipe stack (PR 1 + PR 2) outputs HF/TRL-compatible chat
which a chat-pretrained backbone consumes natively. SmolVLA strips the
SmolVLM ``lm_head`` though, so it can only do flow-matching action
prediction. SmolVLA2 keeps the LM head so the same model can train on
the full Hi Robot / MEM / ECoT blend defined in the plan:

  * action-only sub-recipes  (low_level_execution)        flow loss
  * text-only sub-recipes    (memory_update / ask_vqa /   CE loss on
                              user_interjection_response)  lm_head
  * mixed sub-recipes                                      both summed

This first commit lays down the structural scaffold:

- ``src/lerobot/policies/smolvla2/`` — new package with thin subclasses
  of ``SmolVLAConfig`` / ``SmolVLAPolicy`` so we don't fork the 900-line
  modeling code. ``SmolVLA2Config`` adds ``recipe_path``,
  ``apply_chat_template``, ``text_loss_weight``, ``flow_loss_weight``,
  and ``unfreeze_lm_head``. ``SmolVLA2Policy`` unfreezes the SmolVLM
  ``lm_head`` (and the surrounding norm + last text-model layer SmolVLA
  freezes) when ``unfreeze_lm_head=True`` and ``text_loss_weight>0``.
- ``factory.py`` registers ``smolvla2`` in ``get_policy_class``,
  ``make_policy_config``, and the pre/post-processor builder. Important:
  the ``smolvla2`` branch lives BEFORE the ``isinstance(config,
  SmolVLAConfig)`` check because ``SmolVLA2Config`` subclasses
  ``SmolVLAConfig`` — without the ordering, SmolVLA2 would silently
  pick up SmolVLA's processor.
- ``configs/recipes/smolvla2_hirobot.yaml`` — canonical Hi Robot blend
  for SmolVLA2. Same shape as ``pi05_hirobot.yaml`` (PR 1) so the
  recipe stack stays uniform across policy backbones.

Behaviour today is identical to SmolVLA: the modeling forward
delegates to ``SmolVLAPolicy.forward`` and the processor delegates to
``make_smolvla_pre_post_processors``. The next commit on this branch
adds the chat-template processor + ``text_labels`` / ``predict_actions``
batch keys; the commit after that wires the actual text-loss path
through ``vlm.lm_head``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:55:23 +02:00
Pepijn 73740ecf4b feat(annotate): write tool catalog to meta/info.json after annotation
After every ``lerobot-annotate`` run, the executor ensures
``meta/info.json["tools"]`` contains at minimum the canonical ``say``
schema, while preserving any tools the user pre-declared on the
dataset. Chat-template consumers (PR 3 SmolVLA2 / Pi0.5 / dataset
visualizer) read the catalog through
``LeRobotDatasetMetadata.tools`` and pass it to
``apply_chat_template(messages, tools=meta.tools, ...)``.

- ``executor.py``: new ``_ensure_tools_in_info`` helper called
  after the parquet rewrite. Idempotent and additive — merges by
  ``function.name``, only writes back if the list changed.
- ``writer.py``: drops the duplicated ``SAY_TOOL_SCHEMA`` /
  ``DEFAULT_TOOLS`` constants in favour of importing from
  ``lerobot.datasets.language`` (PR 1's single source of truth).
  Re-exported so existing imports keep working.
- ``annotation_pipeline.mdx``: replace the "code constant only" note
  with a pointer to the new Tools doc and a description of the
  meta/info.json behaviour, including how to pre-declare custom
  tools before annotation runs.

This is the storage half of the tools work; PR 3 ships the runnable
implementations under ``src/lerobot/tools/`` (one file per tool,
first up: ``say.py`` wired to Kyutai's pocket-tts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:51:38 +02:00
Pepijn 1b81e49214 feat(annotate): task rephrasings + video-derived task fallback
Module 1 now produces ``task_aug`` rows (registered in PR 1) so the
PR-1 ``${task}`` resolver can rotate phrasings deterministically per
``sample_idx``. Plus an opt-in video-derived task that bypasses the
canonical ``meta/tasks.parquet`` task when it's empty, low-quality, or
explicitly disabled — every downstream Module-1 prompt then uses the
derived task as its grounding.

- ``Module1Config``: adds ``n_task_rephrasings`` (default 10) and
  ``derive_task_from_video`` ∈ ``{off, if_short, always}`` (default
  ``if_short``: triggers when canonical is empty, < 3 words, or matches
  a placeholder string like ``debug`` / ``unnamed`` / ``tbd``).
- ``plan_subtasks_memory.py``: ``run_episode`` now resolves an
  ``effective_task`` (canonical OR video-derived) and threads it
  through ``_generate_subtasks`` / ``_generate_plan`` /
  ``_generate_memory`` so subtasks, plans, and memory are all grounded
  in the same task string. Then generates ``n`` rephrasings of the
  effective task and writes them as ``task_aug`` rows at ``t=0`` with
  ``role=user``. The effective task itself is included as the first
  variant so the rotation is guaranteed to cover the source-of-truth
  phrasing.
- New prompts: ``module_1_video_task.txt`` (one-shot video → task),
  ``module_1_task_rephrasings.txt`` (text-only paraphraser, ``n`` per
  call).
- ``meta/tasks.parquet`` is NOT modified — derived tasks live only in
  ``language_persistent``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d813c75b76 fix(annotate): align interjections with the actual demo trajectory
qwen36moe-11 surfaced a deeper semantic problem with mid-episode
interjections: they were generated as *counterfactual* user requests
("actually skip the wipe", "use the blue one instead") but teleop data
is frozen — the robot in the video already executed everything,
including the steps the user "asked to skip". The training signal was
therefore self-contradictory: interjection text said one thing, the
robot's subsequent action stream did the opposite.

Flip the framing. Anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask. The
robot's visible next behavior IS the interjection's effect, so:

  interjection text → plan refresh → action stream

are all consistent with the same observed video.

Concretely:

- ``interjections_and_speech.py``: instead of sampling random
  timestamps from ``frame_timestamps``, walk Module 1's subtask spans
  and sample from the (subtask N → subtask N+1) transitions. Pass both
  the just-finished and the upcoming subtask texts into the prompt.

- ``_window_timestamps``: re-center the multi-frame video window on
  the boundary itself (half the frames cover the end of the previous
  subtask, half cover the start of the next one) so the VLM has the
  same visual conditioning the policy will see at training time.

- ``module_2_interjection.txt``: rewritten. The prompt now states
  explicitly that this is offline data, the robot already committed to
  the next subtask, and the interjection must be a natural request
  that aligns with — not contradicts — the next subtask. Removes the
  "negative task / situated correction" Hi Robot framing because those
  scenarios require online execution to be coherent.

Plan-refresh logic from the previous commit (forwarding interjection
text into the refresh prompt) is unchanged and now reinforces the same
direction: the refreshed plan emphasizes the upcoming subtask the
interjection just asked for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 3434d2ef22 fix(annotate): ground interjections in video + propagate text to plan refresh
qwen36moe-10 showed three Module-2 / plan-refresh quality issues that
are not architecture problems — they're prompt-grounding bugs:

1. Interjection prompt passed ``current_subtask = record.episode_task``
   (the WHOLE-episode task), not the actual subtask in force at the
   chosen timestamp. The VLM had no signal about what was visible at
   that moment, so its interjections were generic ("actually skip X"
   where X had nothing to do with the visible activity).

2. Interjection prompt only attached a single frame
   (``frames_at(record, [t_snap])``). With one frozen image the VLM
   couldn't read the ongoing motion. Module 1 already gets the whole
   episode video for subtask decomposition, which is why subtasks are
   well-grounded; Module 2 was the outlier.

3. The plan-refresh prompt told the model "a plan refresh after a user
   interjection at t=X.YZs" but never showed it the interjection
   *text*. So the refreshed plan couldn't actually reflect the user's
   correction — at best it recombined the same step list.

Fix:

- ``interjections_and_speech.py``: Module 2 reads Module 1's subtask
  rows from the same staging tree (executor orders module_1 → module_2
  so they're already there) and resolves the actual ``current_subtask``
  at each chosen timestamp. Pulls a small clip
  (``interjection_window_seconds`` × ``interjection_window_frames``,
  defaulting to 4 frames over the leading 2 s) instead of one frame.
  Drops the silently-zeroing ``len(candidate_ts) // 4`` cap on the
  interjection count.

- ``module_2_interjection.txt``: prompt is rewritten to reference the
  multi-frame visual context and require the interjection to mention
  something visible OR named in the current subtask, not invented.

- ``plan_subtasks_memory.py``: ``run_plan_updates`` now accepts and
  threads through interjection texts. ``_generate_plan(refresh_t,
  interjection)`` injects both the current subtask AND the interjection
  text into the prompt so the refreshed plan can drop / reorder /
  constrain steps to match the user's correction. (Plan still refreshes
  ONLY at user interjections — subtask generation runs ~1 Hz at
  inference, plan re-emission is event-driven.)

- ``executor.py``: forwards ``interjection_texts`` alongside
  ``interjection_times`` to ``run_plan_updates``.

- ``Module2Config``: bumps ``max_interjections_per_episode`` default
  from 1 to 3 and exposes ``interjection_window_seconds`` /
  ``interjection_window_frames``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b71e10da6b refactor(annotate): drop dataset-level `tools` parquet column
PR 2 used to write a top-level ``tools`` column on every parquet shard
holding the JSON schema for the ``say`` tool, broadcast identically
across every row. That extends PR 1's schema for no real information
gain — the schema is a fixed code constant, parquet's RLE/dict encoding
collapses it on disk anyway, and HF/TRL chat-template consumers can
just import the constant directly.

PR 2 should fill in PR 1's existing schema, not add to it. So:

- ``writer.py``: stop emitting the ``tools`` column. Strip any legacy
  ``tools`` column from older shards on rerun so the schema converges to
  v3.1. ``SAY_TOOL_SCHEMA`` stays as a public constant (now joined by
  ``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``); chat-template policies and the
  visualizer import them directly.
- ``test_writer.py``: replace the "tools column present" assertion with
  one that explicitly checks the column is absent, plus a new test
  asserting the constant's shape.
- ``test_pipeline_recipe_render.py``: drop the tools-column read; assert
  it's not present in the rewritten parquet.
- ``annotation_pipeline.mdx``: update the writer description to note the
  parquet stays small and the schema lives as a code constant.

If multi-tool-set support ever becomes real (datasets with different
tool inventories), the right home is ``meta/info.json["tools"]`` —
adding it later is non-breaking; ripping out a parquet column already
shipped is not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 0f6e3230df fix(annotate): decode video frames with PyAV directly
``lerobot.datasets.video_utils.decode_video_frames`` routes
``backend="pyav"`` through ``decode_video_frames_torchvision`` →
``torchvision.io.VideoReader``, but ``VideoReader`` was removed in
torchvision >= 0.22 (the vllm/vllm-openai:latest container ships with
torchvision 0.25). That made every Module 3 frame decode raise
``AttributeError: module 'torchvision.io' has no attribute 'VideoReader'``,
which the previous catch-all silently turned into an empty image list,
which then made every Module 3 prompt skip via the
``not _has_image_block(messages)`` branch and produce zero VQA rows.

Bypass ``video_utils`` entirely. The annotation pipeline only needs
a handful of PIL frames per (episode, ts), so a direct PyAV decode is
both simpler and insulated from torchvision API churn. ``av`` is already
in the install set, no new dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 2f2e42c4aa log(annotate): warn loudly on first video decode failure
VideoFrameProvider._decode used to swallow every exception silently and
return []. That made Module 3 (VQA) produce zero rows whenever local
video decoding broke (codec, backend, missing file, ...) because every
prompt got skipped via the ``not _has_image_block(messages)`` branch in
general_vqa.py — without any signal in the job log.

Log the first failure with full exception info (subsequent failures
stay quiet to avoid log spam) so this fast-path is debuggable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 5ee0104739 log(annotate): surface resolved frame-provider cameras at startup
Print the default and full camera list once at the top of every run so a
silent Module-3-no-op (cam_keys=[]) is visible in the job log instead of
only being discoverable by counting parquet rows after upload.

Also warn loudly when Module 3 is enabled but no cameras resolved, with
a hint about the --vlm.camera_key fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn e064cfcb04 fix(annotate): seed Module 3 cameras from camera_keys + camera_key fallback
Module 3 fast-pathed out (50 episodes in 0.6s) when
``frame_provider.camera_keys`` came back empty even though Module 1/2
worked, because they use ``frame_provider.camera_key`` (singular) and
were happy with the explicit ``--vlm.camera_key=...`` override.

Two fixes:

- ``frames.py``: read ``meta.camera_keys`` (covers both video- and
  image-stored cameras) instead of ``meta.video_keys`` (video-only),
  matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If
  metadata still surfaces nothing but the caller explicitly passed
  ``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by
  definition known to exist on the dataset.
- ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees
  zero cameras so this never silently produces zero VQA again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b3d9494831 docs(annotate): add HF Jobs runner example for lerobot-annotate
A ready-to-run example of launching the annotation pipeline on a
Hugging Face job (h200x2) with two vllm replicas serving
Qwen3.6-35B-A3B-FP8. Lives next to other end-to-end recipes under
examples/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 1217fdb6f0 feat(annotate): emit VQA per-camera and propagate camera field
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.

- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
  and a `camera_key=` argument on `frames_at` / `video_for_episode`.
  `VideoFrameProvider` exposes every `observation.images.*` key the
  dataset declares (not just the first) and keys its decode cache on
  `(episode, camera, timestamp)` so per-camera reads don't collide.
  Module 1 / 2 keep their old single-camera behaviour by leaving
  `camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
  .camera_keys` for each emission tick, builds one prompt per camera,
  batches all of them through the VLM, and stamps the resulting rows
  with `camera=<that key>`. Empty `camera_keys` (null provider) makes
  the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
  carry `camera` through and call `validate_camera_field` so the
  invariant is enforced at the writer boundary. Event sort key now
  includes `camera` for deterministic ordering when several cameras
  share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
  field; `_check_camera_field` enforces the invariant and cross-checks
  every view-dependent row's `camera` against the dataset's known video
  keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
  `(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
  `camera_keys` into the validator so the cross-check uses the actual
  dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
  `camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
  configures two cameras and asserts both are represented, that every
  emitted row has a `camera` tag, and that uniqueness holds per
  `(timestamp, camera, role)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d0388e1142 fix(annotate): transcode subclips to H.264 instead of stream-copy
Modern LeRobot datasets store videos in AV1, which vllm's libav build
cannot decode (the video processor returns 0 frames and downstream
chokes with ZeroDivisionError). Re-encode each per-episode subclip
with libx264 (preset ultrafast, crf 23) so the resulting mp4 is
universally decodable. Strip audio with -an for a smaller payload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 524aa59faa feat(annotate): pack multiple vllm replicas per GPU via num_gpus
Adds VlmConfig.num_gpus so parallel_servers can exceed the physical
GPU count. Replicas are round-robin-assigned to GPUs (e.g.
parallel_servers=4 + num_gpus=2 → replicas pinned to GPUs 0,1,0,1).
Backward-compatible: num_gpus=0 keeps the existing 1-replica-per-GPU
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 27f7829b09 feat(annotate): forward chat_template_kwargs to OpenAI extra_body
Lets callers pass per-request template flags such as
{"enable_thinking": false} for Qwen3.5/Qwen3.6 models, where the
default thinking preamble otherwise consumes the entire max_new_tokens
budget before any JSON is emitted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7f8bf108e8 fix(annotate): include prompt .txt files in wheel
The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 855ff027f8 refactor(annotate): drop HF Inference Providers code path
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3b797bb118 feat(annotate): --vlm.push_to_hub uploads the annotated dataset
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn aea04721ae feat(annotate): parallelize episodes within each module phase
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn ab5479129a fix(annotate): probe /v1/models for spawn-helper readiness
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn e6d4ac6f02 fix(annotate): lock-protect per-line writes for parallel server streams
8 server-streaming threads writing chars unsynchronized cause UTF-8
sequences from different servers to interleave mid-byte, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 5722d365c5 feat(annotate): client_concurrency for parallel in-flight requests
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.

Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3d7e60cee4 feat(annotate): parallel_servers spawns N independent vllm replicas
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.

Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7b767d4d60 feat(annotate): per-episode progress logs in executor 2026-04-30 18:48:35 +02:00
Pepijn f1e3ab7794 fix(annotate): don't crash pipeline on persistent JSON parse failure
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 585341ba9f fix(annotate): robust JSON extraction (think tags + first balanced object)
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 23ff346027 fix(annotate): stream child stdout char-by-char so tqdm \\r progress flushes 2026-04-30 18:48:35 +02:00
Pepijn 3c5cbe7af4 test(annotate): adjust video-block test for fps-based frame sampling 2026-04-30 18:48:35 +02:00
Pepijn f2cbd97635 feat(annotate): Module 1 samples image frames at fps rate
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.

Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn c06c8d594a feat(annotate): use cached HF token from huggingface-cli login
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn cd495a3a9d feat(annotate): default to HF Inference Providers, no local GPU needed
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.

Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c99ac45cd1 feat(annotate): one-flag HF Inference Providers backend
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 13aaafeae0 fix(annotate): omit mm_processor_kwargs by default; transformers serve rejects it
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 2129648bf4 fix(annotate): mm_processor_kwargs in extra_body; inline file URLs as data URLs
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
  block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
  and inline it as a base64 data:video/mp4 URL so the server sees the
  bytes directly.

Both surface as std::bad_alloc on the server side when wrong, which is
unhelpful but explains what we hit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn f5cd3f6e4e fix(annotate): detect server ready via stdout banner, not /v1/models polls
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ecf5766301 fix(annotate): visible auto_serve via stdout prints + live server log stream
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 11597d4f71 fix(annotate): auto_serve defaults to True; probe before spawning
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 8b9c598cf4 feat(annotate): auto_serve mode spawns and tears down inference server
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.

Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn b325475b38 feat(annotate): video_url block for openai backend
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).

A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.

Enable with --module_1.use_video_url=true (and --vlm.backend=openai).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ef137ff86a feat(annotate): openai-compatible backend for transformers/ktransformers serve
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:

- launch the server out-of-process (transformers serve, vllm serve,
  ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
  --vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...

Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c5df821a96 fix(annotate): use vllm.chat() API for multimodal prompts
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 7ec3d7999c fix(annotate): drop guided_decoding=dict (api differs across vllm)
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 712d63abbd fix(annotate): tolerate decoder returning fewer frames than requested
pyav (and sometimes torchcodec) decode can return fewer frames than
requested timestamps when some timestamps fall outside the video file's
content range. Drop the strict=True on the zip and rely on the
None-filter to discard missing frames.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 6653999983 fix(annotate): default video decode backend to pyav
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 4bdbedc9a0 fix(annotate): default trust_remote_code=False for HF loaders
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn e240305e8e fix(annotate): default transformers backend to manual GPU placement
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.

LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ccd189b264 fix(annotate): LEROBOT_DISABLE_CUDNN escape hatch for conv3d crash
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.

Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ef1242bbd4 fix(annotate): expose gpu_memory_utilization and max_model_len for vllm
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.

- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
  encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
  default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn ebf4a04d41 fix(annotate): pass trust_remote_code=True to HF auto-classes
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 4419b4ef1b fix(annotate): low_cpu_mem_usage=True on transformers load path
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn ff06ca82d2 fix(annotate): use device_map='auto' for transformers backend
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn fcb01e73eb fix(annotate): try AutoModelForImageTextToText first, fall back to AutoModelForVision2Seq
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 268f8d1f53 fix(annotate): replace Literal types with str for older draccus
Older draccus versions (e.g. 0.10.x bundled in some envs) lack a decoder
for typing.Literal and raise:
  No decoding function for type typing.Literal['vllm', 'transformers', 'stub']

Switching VlmConfig.backend from Literal to str works under every
draccus version. The runtime branch in vlm_client.make_vlm_client
already validates the value and raises ValueError on unknown backends,
so the constraint stays enforced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 663fff0ae2 feat(annotate): Module 1 sees the whole episode as one video block
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks — no stride, no count, no keyframe count knob to
tune.

- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
  max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
  uniformly across the episode duration; ``_NullProvider`` returns []
  for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
  ``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
  the prompt template asks the model to "watch the whole clip, then
  segment it" with cut points decided from gripper/contact/regrasp
  events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
  ``max_video_frames: int = 32`` (model-capacity bound, not annotation
  logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
  the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
  of the old "Decompose the demonstration" phrase that no longer
  appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
  the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 9d6af804bf feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.

- New `frames.py`:
  - `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
    dataset's first `observation.images.*` stream via
    `LeRobotDatasetMetadata.get_video_file_path` and
    `decode_video_frames`, with the same `from_timestamp` shift the main
    dataset uses.
  - Per-process LRU cache so co-timestamped Module 1 plan-update + Module
    2 calls share decode work.
  - `make_frame_provider` falls back to a null provider when the dataset
    has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
  prepend image blocks before the text block.
  - Module 1 attaches `keyframes_per_episode` keyframes to the subtask
    decomposition prompt.
  - Module 2 attaches the frame at the interjection timestamp.
  - Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
  `Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
  `--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
  on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
  emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
  Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
  attachment behaviour and the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn f763f85213 feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.

Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
  refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
  mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
  configurable cadence with one-retry JSON validation.

Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.

Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.

Branched from PR 1 (`feat/language-columns`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn e3e9374e2c feat(language): tool catalog in meta/info.json + LeRobotDatasetMetadata.tools
Stores OpenAI-style function schemas at ``meta/info.json["tools"]`` so
datasets can declare which tools are available (today: just ``say``;
tomorrow: per-dataset extensions). The ``DEFAULT_TOOLS`` constant
fills in for unannotated datasets so chat-template consumers don't
have to special-case anything.

Three pieces:

- ``language.py``: ``SAY_TOOL_SCHEMA`` and ``DEFAULT_TOOLS``
  constants. Single source of truth — PR 2's writer and PR 3's
  runtime tool registry will both import from here instead of
  duplicating the dict.
- ``dataset_metadata.py``: ``LeRobotDatasetMetadata.tools`` property
  reads ``info.json["tools"]`` and falls back to ``DEFAULT_TOOLS``.
  Returns deep-copied dicts so callers can mutate the result safely.
- ``docs/source/tools.mdx``: spec page covering the catalog, per-row
  invocations, and the three-step "how to add a new tool" workflow
  (declare schema, implement, register). Linked from the docs
  toctree under the Datasets section.

This lays the groundwork for PR 2's pipeline writing the catalog out
during annotation, and PR 3's ``src/lerobot/tools/`` package shipping
runnable implementations (one file per tool — first up:
``say.py`` wrapping Kyutai's pocket-tts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:44:58 +02:00
Pepijn c1a0c601e2 feat(language): task_aug style + automatic ${task} rephrasing rotation
Adds task-prompt diversity (Xiao 2022 / CAST) without touching
``meta/tasks.parquet`` or forcing recipes to opt in. The plan reserved
``task_aug`` as a future style; this lands it now.

- ``language.py``: add ``task_aug`` to ``CORE_STYLES`` and
  ``PERSISTENT_STYLES``. ``column_for_style("task_aug")`` returns
  ``language_persistent`` so PR 2 writers route it correctly.

- ``language_render.py``: ``_resolve_task`` now consults the persistent
  slice for rows of ``style="task_aug", role="user"``. When any exist
  it picks one deterministically by ``sample_idx`` (blake2b-keyed, not
  Python's randomized hash) so an epoch sees every rephrasing of every
  episode while the same sample still resolves identically across
  reruns. Falls back to the canonical ``meta/tasks.parquet`` task when
  no rephrasings are present, so existing datasets and unannotated runs
  keep their behaviour. Explicit ``task=`` overrides still win.

- Tests: rephrasing coverage across samples, determinism on repeat
  ``sample_idx``, fallback when persistent has no ``task_aug`` rows,
  and explicit override priority.

Recipes get this for free: any ``${task}`` placeholder rotates through
the available rephrasings. Recipes that want the literal canonical task
can override the binding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:45:39 +02:00
Pepijn 1ca38d9748 fix(language): drop motion from VIEW_DEPENDENT_STYLES
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.

The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:54:12 +02:00
Pepijn 5a6aa64570 feat(language): per-camera tagging on view-dependent styles
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.

- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
  to both Arrow struct types and the HF datasets feature mappings;
  introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
  `is_view_dependent_style` and `validate_camera_field` helpers (camera
  required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
  resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
  `_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
  VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
  Without a `camera` filter, multi-row matches keep raising the existing
  ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
  `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
  the matching image block), keeping the original 0.20 budget and
  documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
  `is_view_dependent_style`, `validate_camera_field` (both required and
  forbidden directions), per-camera `emitted_at` filtering, and the
  ambiguity error when two cameras emit `(vqa, assistant)` at the same
  timestamp without a `camera=` filter. RenderMessagesStep + dataset
  passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
  the per-camera resolver pattern, and the canonical recipe convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:48:17 +02:00
Pepijn 0b06790da0 feat(language): add motion (persistent) and trace (event-only) styles
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:21:49 +02:00
Pepijn b43dc39ba4 Add docstrings to all new helpers; revert uv.lock
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:15:03 +02:00
Pepijn 2b71221194 Address review: split persistent/event schemas, drop event timestamps
- recipe.py: derive _VALID_ROLES/_VALID_STREAMS from MessageRole/MessageStream Literals
- dataset_metadata.py: keep CODEBASE_VERSION at v3.0
- language.py: remove RESERVED_STYLES; split arrow/feature schemas into
  persistent (with timestamp) and event (without timestamp); add docstrings
- language_render.py: events use frame-row timestamp implicitly; no
  per-event timestamp filtering or sorting
- converters.py: drop unused subtask_key passthrough
- add docstrings to new public APIs (recipe, render_messages_processor, collate)
- update tests for split schemas; revert uv.lock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 13:38:23 +02:00
Pepijn 8833d735a1 Add extensive language support 2026-04-27 10:56:32 +02:00
196 changed files with 41770 additions and 1056 deletions
+6
View File
@@ -178,3 +178,9 @@ test-smolvla-ete-eval:
--env.episode_length=5 \
--eval.n_episodes=1 \
--eval.batch_size=1
# E2E annotation pipeline smoke test against a tiny in-memory fixture
# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
# backend, so it does not require a real model checkpoint or GPU.
annotation-e2e:
uv run python -m tests.annotations.run_e2e_smoke
+12
View File
@@ -9,6 +9,8 @@
- sections:
- local: il_robots
title: Imitation Learning for Robots
- local: lelab
title: LeLab - Lerobot GUI
- local: bring_your_own_policies
title: Adding a Policy
- local: integrate_hardware
@@ -43,6 +45,8 @@
title: Language Columns and Recipes
- local: tools
title: Tools
- local: annotation_pipeline
title: Annotation Pipeline
- local: video_encoding_parameters
title: Video encoding parameters
- local: streaming_video_encoding
@@ -59,6 +63,10 @@
title: π₀-FAST (Pi0Fast)
- local: pi05
title: π₀.₅ (Pi05)
- local: molmoact2
title: MolmoAct2
- local: vla_jepa
title: VLA-JEPA
- local: eo1
title: EO-1
- local: groot
@@ -73,6 +81,10 @@
- sections:
- local: sarm
title: SARM
- local: robometer
title: ROBOMETER
- local: topreward
title: TOPReward
title: "Reward Models"
- sections:
- local: inference
+6 -10
View File
@@ -79,17 +79,13 @@ If your local computer doesn't have a powerful GPU, you can utilize Google Colab
Once training is complete, you can evaluate your ACT policy using the `lerobot-record` command with your trained policy. This will run inference and record evaluation episodes:
```bash
lerobot-record \
--robot.type=so100_follower \
lerobot-rollout \
--strategy.type=base \
--policy.path=${HF_USER}/act_policy \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_robot \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--display_data=true \
--dataset.repo_id=${HF_USER}/eval_act_your_dataset \
--dataset.num_episodes=10 \
--dataset.single_task="Your task description" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder.vcodec=auto \
--policy.path=${HF_USER}/act_policy
--task="Your task description" \ # can be skipped for ACT
--duration=60
```
+281
View File
@@ -0,0 +1,281 @@
# Annotation Pipeline
`lerobot-annotate` watches each episode's video with a vision-language
model (VLM) and writes natural-language annotations back into your
dataset. It fills the two language columns from the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — straight into
`data/chunk-*/file-*.parquet`.
In short: point it at a LeRobot dataset, and it adds subtasks, plans,
memory, interjections, speech, and visual Q&A that a policy can be
trained on.
## How it fits together
```text
your dataset lerobot-annotate
(LeRobot v3.1)
┌─────────────────────────────────────────────────────┐
│ read episodes │
└──────────────────────────┬──────────────────────────┘
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌───────────────┐ ┌──────────┐ one shared Qwen-VL
│ plan │ │ interjections │ │ vqa │ ◀── server (vLLM, OpenAI
└────┬─────┘ └───────┬───────┘ └────┬─────┘ API) drives all three
└────────────────────┼─────────────────────┘
│ each module stages raw JSONL
▼ into .annotate_staging/
┌─────────────────┐
│ validator │ ◀── checks everything
└────────┬────────┘
┌─────────────────┐
│ writer │
└────────┬────────┘
data/chunk-*/file-*.parquet
(+ meta/info.json tools)
```
Three modules (`plan`, `interjections`, `vqa`) all talk to **one** shared
VLM. Each module stages its output to disk, a validator checks it, and a
single writer rewrites the dataset shards in place.
## What the pipeline produces
Each module emits a few kinds of annotation ("styles"), routed to one of
the two language columns:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | --------------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | `plan` |
| `plan` (initial + refresh on interjection) | `language_persistent` | `plan` |
| `memory` (MEM-style compression) | `language_persistent` | `plan` |
| `task_aug` (rephrasings of the task) | `language_persistent` | `plan` |
| `interjection` | `language_events` | `interjections` |
| speech tool-call atom (`style=null`, `say`) | `language_events` | `interjections` |
| `vqa` (user / assistant pair) | `language_events` | `vqa` |
### How subtasks are generated
The `plan` module doesn't ask the VLM for subtasks in one shot. Instead
it uses a two-step **describe → segment** flow:
1. **Describe** — the VLM narrates only what it actually sees in the
chosen camera (no guessing about the task).
2. **Segment** — that description is fed back in, and the VLM splits the
episode into consecutive atomic subtasks.
The resulting spans are then stitched into a gap-free, full-episode
cover, so **every frame has exactly one active subtask**. See
[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
for the production settings (single camera, embedded frames, windowed
subtask generation).
### Tools
The writer does **not** add a `tools` column to the parquet. The tool
catalog lives in `meta/info.json["tools"]` instead (see [Tools](./tools)).
After every run, the pipeline makes sure the canonical `say` schema is in
that list, keeping any tools you declared beforehand.
Want to add your own tool? Edit `meta/info.json["tools"]` directly — the
pipeline preserves whatever is already there. That makes the tool visible
to the chat template, so the model can learn to _generate_ the call. The
runtime layer that actually _executes_ a generated call (the `Tool`
protocol / `TOOL_REGISTRY` under `src/lerobot/tools/`) is not part of
this PR — the [Tools](./tools) doc marks those pieces as
not-yet-implemented.
## Running on Hugging Face Jobs
Annotation runs on [Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs).
The repo ships a launcher script you copy and tweak for your dataset:
```bash
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
```
[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
starts a single-GPU `h200` job (bump it to `h200x4` for big datasets)
that:
1. installs `lerobot` (from `main`) plus the annotation extras,
2. boots one vLLM server per GPU (using the `vllm/vllm-openai` image) and
drives it over the OpenAI-compatible API,
3. runs the `plan` / `interjections` / `vqa` modules across the dataset
with `lerobot-annotate`,
4. with `--push_to_hub=true`, uploads the result to `--new_repo_id` (or
back to `--repo_id` in place if you leave that unset).
To use a different dataset, model, or hub repo, edit the `CMD` block in
the script. Every flag there maps directly to a `lerobot-annotate` flag
(run `lerobot-annotate --help` for the full list).
## Key options
These are the flags you'll reach for most often. Run
`lerobot-annotate --help` for everything else; the defaults are tuned for
short manipulation episodes.
### Dataset in / out
| Flag | Default | What it does |
| ----------------- | ------- | ----------------------------------------------------------------------- |
| `--repo_id` | — | Hub dataset to annotate (downloaded if `--root` unset). |
| `--root` | — | Annotate a local dataset directory instead. |
| `--new_repo_id` | — | Push the result to a new repo (leaves the source repo untouched). |
| `--push_to_hub` | `false` | Upload after annotating (to `--new_repo_id`, else back to `--repo_id`). |
| `--only_episodes` | all | Annotate just these episode indices (handy for a test run). |
| `--seed` | `1729` | Seeds the RNGs that pick interjection timestamps + VQA question types. |
### Which modules run
Every module is on by default and can be toggled independently (set to
`false` to skip it, e.g. to iterate on one module at a time):
| Flag | Default | Turns off |
| ------------------------- | ------- | ----------------------------------- |
| `--plan.enabled` | `true` | subtasks + plan + memory + task_aug |
| `--interjections.enabled` | `true` | interjections + speech atoms |
| `--vqa.enabled` | `true` | the VQA pairs |
### The VLM (`--vlm.*`)
| Flag | Default | What it does |
| -------------------------- | ------------------ | ----------------------------------------------------------------------------------- |
| `--vlm.model_id` | `Qwen/Qwen3.6-27B` | The model to serve and prompt. |
| `--vlm.camera_key` | first `images.*` | Which camera every prompt is grounded on. |
| `--vlm.serve_command` | auto | The exact `vllm serve …` command (set TP size, GPU memory, `--max-model-len` here). |
| `--vlm.parallel_servers` | `1` | Independent servers for round-robin routing (one per GPU). |
| `--vlm.num_gpus` | `0` | GPUs per server (`0` = one each). |
| `--vlm.client_concurrency` | `16` | In-flight requests across all servers. |
| `--vlm.max_new_tokens` | `512` | Generation cap per call. |
| `--vlm.temperature` | `0.2` | Sampling temperature. |
### Subtasks / plan / memory (`--plan.*`)
| Flag | Default | What it does |
| ------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------- |
| `--plan.frames_per_second` | `1.0` | How densely the episode video is sampled. |
| `--plan.max_video_frames` | `32` | Hard cap on frames per call (context-budget guard — don't exceed ~32 for a 32k context). |
| `--plan.subtask_window_seconds` | `0` | Split long episodes into fixed windows for constant frame density (`0` = whole episode). |
| `--plan.plan_max_steps` | `8` | Upper bound on subtasks per episode. |
| `--plan.subtask_describe_first` | `true` | Run the describe→segment grounding pass (best subtask quality; +1 call/episode). |
| `--plan.emit_plan` | `true` | Emit the numbered `plan` rows (`false` = subtasks + memory only). |
| `--plan.n_task_rephrasings` | `10` | How many `task_aug` rephrasings to emit (`0` disables). |
| `--plan.derive_task_from_video` | `if_short` | Use the dataset task as-is (`off`), only when it's missing/short (`if_short`), or always re-derive from video (`always`). |
| `--plan.use_video_url` | `false` | Send a server-side video clip instead of embedded frames. |
### Interjections + VQA
| Flag | Default | What it does |
| ----------------------------------------------- | ------- | ---------------------------------------------------------- |
| `--interjections.max_interjections_per_episode` | `3` | Cap on interjection/speech pairs per episode. |
| `--vqa.vqa_emission_hz` | `1.0` | How often VQA pairs are emitted. |
| `--vqa.restrict_to_default_camera` | `false` | Ground VQA only on `--vlm.camera_key` (else every camera). |
| `--executor.episode_parallelism` | `16` | Episodes processed concurrently within each phase. |
## Contributing new modules
The pipeline is built to grow, and **contributions are very welcome** —
a brand-new module (say, trajectory traces or affordances), a new prompt
template, a smarter grounding flow, or quality fixes to the existing
`plan` / `interjections` / `vqa` modules.
Every module lives under
`src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
client and the keyframe cache, writes its raw output to the staging
tree, and plugs into the executor as its own phase. Got an idea? Open an
issue or PR on [the repo](https://github.com/huggingface/lerobot).
## How recipes consume the output
The annotations are meant to be read by recipes (see
[Language Columns and Recipes](./language_and_recipes)). Typically:
- low-level / high-level / memory-update branches read
`subtask` / `plan` / `memory` from `language_persistent`.
- an interjection-response branch reads `interjection` events plus the
paired speech atom (merged into one assistant turn via `tool_calls_from`)
and the matching `plan` refresh at the same timestamp.
- a VQA branch reads the `(vqa, user)` and `(vqa, assistant)` pairs from
`language_events`.
## Why state and events are split
Two ideas shape the design:
1. **Persistent state vs. exact events.** Persistent rows (`subtask`,
`plan`, `memory`) apply to the whole episode and answer "what's true
right now?". Event rows (`interjection`, `vqa`, speech) appear only on
the one frame whose timestamp matches. Timestamps are copied straight
from the source parquet — never recomputed in floating point.
2. **One VLM pass.** All three modules share a single VLM client (the
OpenAI-compatible client talking to the job's vLLM server), so you pay
for one model load per dataset, not three.
## Re-running a single module
Each module stages its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. This makes
prompt iteration cheap: re-running one module overwrites only its own
JSONL, then the writer recomposes the final parquet. Disable modules you
don't want with `--plan.enabled=false` (and likewise
`--interjections.enabled` / `--vqa.enabled`) to test one at a time.
## What the validator checks
Before the writer runs, `StagingValidator` confirms:
- every event row lands exactly on a real frame timestamp;
- no speech / interjection pairs are left orphaned;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (a warning, not an error);
- each VQA assistant `content` is valid JSON in one of the
bbox / keypoint / count / attribute / spatial shapes;
- every row goes to the column chosen by `column_for_style(style)`.
Any error aborts the writer. Pass `--skip_validation=true` to override
while debugging.
## Where each module's ideas come from
- **`plan` — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
for atom granularity ("pick up one piece of lettuce", "place bowl to
box"); Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07))
for "how, not what" detail.
- **`plan` — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596)):
keep only the minimal relevant information — preserve outcomes, drop
specific attributes.
- **`interjections`.** Hi Robot's scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom
(`tool_calls=[{type:function, function:{name:"say", arguments:{text:...}}}]`).
- **`vqa`.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693)) for
grounded features (pixel bounding boxes `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable VLA Policies
([Zhao 2025](https://arxiv.org/abs/2509.07626)) for multi-abstraction
grounding. Pi0.7 also grounds answers across abstraction levels.
When improving a module, tweak its prompt template in
`src/lerobot/annotations/steerable_pipeline/prompts/` rather than
rewriting from scratch.
## Roughly how much it costs
Per episode, the pipeline makes about `max_steps` plan calls,
`max_interjections_per_episode` interjection calls, and
`vqa_emission_hz × episode_seconds` VQA calls. With the defaults (8
subtasks, 1 interjection, 1 Hz × 3 pairs) on a 30-second episode, that's
~50 VLM calls.
Storage stays small: `language_persistent` is at most tens of KB per
episode (parquet dictionary-encodes the one entry that repeats across
frames), and `language_events` is empty on most frames — its size scales
with the number of emissions, not `num_frames × num_emissions`.
+5 -5
View File
@@ -105,10 +105,12 @@ These results demonstrate GR00T's strong generalization capabilities across dive
### Evaluate in your hardware setup
Once you have trained your model using your parameters you can run inference in your downstream task. Follow the instructions in [Imitation Learning for Robots](./il_robots). For example:
Once you have trained your model using your parameters you can run inference in your downstream task. Follow the instructions in [Policy Deployment (lerobot-rollout)](./inference). For example:
```bash
lerobot-record \
lerobot-rollout\
--strategy.type=sentry \
--strategy.upload_every_n_episodes=5 \
--robot.type=bi_so_follower \
--robot.left_arm_port=/dev/ttyACM1 \
--robot.right_arm_port=/dev/ttyACM0 \
@@ -119,14 +121,12 @@ lerobot-record \
}' \
--display_data=true \
--dataset.repo_id=<user>/eval_groot-bimanual \
--dataset.num_episodes=10 \
--dataset.single_task="Grab and handover the red cube to the other arm" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder.vcodec=auto \
--policy.path=<user>/groot-bimanual \ # your trained model
--dataset.episode_time_s=30 \
--dataset.reset_time_s=10
--duration=600
```
## License
+210 -108
View File
@@ -68,13 +68,13 @@ from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
robot_config = SO101FollowerConfig(
port="/dev/tty.usbmodem58760431541",
id="my_red_robot_arm",
port="/dev/tty.usbmodem5AB90687491",
id="my_follower_arm",
)
teleop_config = SO101LeaderConfig(
port="/dev/tty.usbmodem58760431551",
id="my_blue_leader_arm",
port="/dev/tty.usbmodem5AB90689011",
id="my_leader_arm",
)
robot = SO101Follower(robot_config)
@@ -108,13 +108,13 @@ With `rerun`, you can teleoperate again while simultaneously visualizing the cam
<hfoption id="Command">
```bash
lerobot-teleoperate \
--robot.type=koch_follower \
--robot.port=/dev/tty.usbmodem58760431541 \
--robot.id=my_awesome_follower_arm \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
--teleop.type=koch_leader \
--teleop.port=/dev/tty.usbmodem58760431551 \
--teleop.id=my_awesome_leader_arm \
--robot.type=so101_follower \
--robot.port=/dev/tty.usbmodem5AB90687491 \
--robot.id=my_follower_arm \
--robot.cameras="{front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--teleop.type=so101_leader \
--teleop.port=/dev/tty.usbmodem5AB90689011 \
--teleop.id=my_leader_arm \
--display_data=true
```
</hfoption>
@@ -122,34 +122,48 @@ lerobot-teleoperate \
<!-- prettier-ignore-start -->
```python
import time
from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.teleoperators.koch_leader import KochLeader, KochLeaderConfig
from lerobot.robots.koch_follower import KochFollower, KochFollowerConfig
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data, shutdown_rerun
camera_config = {
"front": OpenCVCameraConfig(index_or_path=0, width=1920, height=1080, fps=30)
}
robot_config = KochFollowerConfig(
port="/dev/tty.usbmodem585A0076841",
id="my_red_robot_arm",
cameras=camera_config
robot_config = SO101FollowerConfig(
port="/dev/tty.usbmodem5AB90687491",
id="my_follower_arm",
cameras={
"wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
"top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
}
)
teleop_config = KochLeaderConfig(
port="/dev/tty.usbmodem58760431551",
id="my_blue_leader_arm",
teleop_config = SO101LeaderConfig(
port="/dev/tty.usbmodem5AB90689011",
id="my_leader_arm",
)
robot = KochFollower(robot_config)
teleop_device = KochLeader(teleop_config)
init_rerun(session_name="teleoperation")
robot = SO101Follower(robot_config)
teleop_device = SO101Leader(teleop_config)
robot.connect()
teleop_device.connect()
TARGET_HZ = 30
TIME_PER_FRAME = 1.0 / TARGET_HZ
while True:
start_time = time.perf_counter()
observation = robot.get_observation()
action = teleop_device.get_action()
robot.send_action(action)
log_rerun_data(observation=observation, action=action)
elapsed_time = time.perf_counter() - start_time
sleep_time = TIME_PER_FRAME - elapsed_time
if sleep_time > 0:
time.sleep(sleep_time)
```
<!-- prettier-ignore-end -->
@@ -202,10 +216,11 @@ lerobot-record \
<!-- prettier-ignore-start -->
```python
from lerobot.cameras.opencv import OpenCVCameraConfig
from lerobot.datasets import LeRobotDataset
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.utils.feature_utils import hw_to_dataset_features
from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
from lerobot.teleoperators.so_leader.config_so_leader import SO101LeaderConfig
from lerobot.teleoperators.so_leader.so_leader import SO101Leader
from lerobot.common.control_utils import init_keyboard_listener
from lerobot.utils.utils import log_say
from lerobot.utils.visualization_utils import init_rerun
@@ -218,71 +233,56 @@ EPISODE_TIME_SEC = 60
RESET_TIME_SEC = 10
TASK_DESCRIPTION = "My task description"
# Create robot configuration
robot_config = SO100FollowerConfig(
id="my_awesome_follower_arm",
cameras={
"front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS) # Optional: fourcc="MJPG" for troubleshooting OpenCV async error.
},
port="/dev/tty.usbmodem58760434471",
)
teleop_config = SO100LeaderConfig(
id="my_awesome_leader_arm",
port="/dev/tty.usbmodem585A0077581",
)
# Initialize the robot and teleoperator
robot = SO100Follower(robot_config)
teleop = SO100Leader(teleop_config)
# Configure the dataset features
action_features = hw_to_dataset_features(robot.action_features, "action")
obs_features = hw_to_dataset_features(robot.observation_features, "observation")
dataset_features = {**action_features, **obs_features}
# Create the dataset
dataset = LeRobotDataset.create(
repo_id="<hf_username>/<dataset_repo_id>",
fps=FPS,
features=dataset_features,
robot_type=robot.name,
use_videos=True,
image_writer_threads=4,
)
# Initialize the keyboard listener and rerun visualization
_, events = init_keyboard_listener()
init_rerun(session_name="recording")
# Connect the robot and teleoperator
robot.connect()
teleop.connect()
# Create the required processors
teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
episode_idx = 0
while episode_idx < NUM_EPISODES and not events["stop_recording"]:
log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")
record_loop(
robot=robot,
events=events,
fps=FPS,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
teleop=teleop,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
def main():
# Create robot configuration
robot_config = SO101FollowerConfig(
port="/dev/tty.usbmodem5AB90687491",
id="my_follower_arm",
cameras={
"wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
"top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
}
)
# Reset the environment if not stopping or re-recording
if not events["stop_recording"] and (episode_idx < NUM_EPISODES - 1 or events["rerecord_episode"]):
log_say("Reset the environment")
teleop_config = SO101LeaderConfig(
port="/dev/tty.usbmodem5AB90689011",
id="my_leader_arm",
)
# Initialize the robot and teleoperator
robot = SO101Follower(robot_config)
teleop = SO101Leader(teleop_config)
# Configure the dataset features
action_features = hw_to_dataset_features(robot.action_features, "action")
obs_features = hw_to_dataset_features(robot.observation_features, "observation")
dataset_features = {**action_features, **obs_features}
# Create the dataset
dataset = LeRobotDataset.create(
repo_id="<hf_username>/<dataset_repo_id>",
fps=FPS,
features=dataset_features,
robot_type=robot.name,
use_videos=True,
image_writer_threads=4,
)
# Initialize the keyboard listener and rerun visualization
_, events = init_keyboard_listener()
init_rerun(session_name="recording")
# Connect the robot and teleoperator
robot.connect()
teleop.connect()
# Create the required processors
teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
episode_idx = 0
while episode_idx < NUM_EPISODES and not events["stop_recording"]:
log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")
record_loop(
robot=robot,
events=events,
@@ -291,26 +291,50 @@ while episode_idx < NUM_EPISODES and not events["stop_recording"]:
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
teleop=teleop,
control_time_s=RESET_TIME_SEC,
dataset=dataset,
control_time_s=EPISODE_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
)
if events["rerecord_episode"]:
log_say("Re-recording episode")
events["rerecord_episode"] = False
events["exit_early"] = False
dataset.clear_episode_buffer()
continue
# Reset the environment if not stopping or re-recording
if not events["stop_recording"] and (episode_idx < NUM_EPISODES - 1 or events["rerecord_episode"]):
log_say("Reset the environment")
record_loop(
robot=robot,
events=events,
fps=FPS,
teleop_action_processor=teleop_action_processor,
robot_action_processor=robot_action_processor,
robot_observation_processor=robot_observation_processor,
teleop=teleop,
control_time_s=RESET_TIME_SEC,
single_task=TASK_DESCRIPTION,
display_data=True,
)
dataset.save_episode()
episode_idx += 1
if events["rerecord_episode"]:
log_say("Re-recording episode")
events["rerecord_episode"] = False
events["exit_early"] = False
dataset.clear_episode_buffer()
continue
# Clean up
log_say("Stop recording")
robot.disconnect()
teleop.disconnect()
dataset.push_to_hub()
dataset.save_episode()
episode_idx += 1
# finalize dataset
log_say("Finalizing dataset...")
dataset.finalize()
# Clean up
log_say("Stop recording")
robot.disconnect()
teleop.disconnect()
dataset.push_to_hub()
if __name__ == "__main__":
main()
```
<!-- prettier-ignore-end -->
@@ -348,7 +372,7 @@ The `record` function provides a suite of tools for capturing and managing data
##### 2. Checkpointing and Resuming
- Checkpoints are automatically created during recording.
- If an issue occurs, you can resume by re-running the same command with `--resume=true`. When resuming a recording, `--dataset.num_episodes` must be set to the **number of additional episodes to be recorded**, and not to the targeted total number of episodes in the dataset !
- If an issue occurs or you want to record additional episodes in the same dataset, you can resume by re-running the same command with `--resume=true`. When resuming a recording, `--dataset.num_episodes` must be set to the **number of additional episodes to be recorded**, and not to the targeted total number of episodes in the dataset! Make sure that you also set `--dataset.root="local_path"`, it's a local path to save the new part of the dataset and is required to resume.
- To start recording from scratch, **manually delete** the dataset directory.
##### 3. Recording Parameters
@@ -422,7 +446,7 @@ from lerobot.utils.utils import log_say
episode_idx = 0
robot_config = SO100FollowerConfig(port="/dev/tty.usbmodem58760434471", id="my_awesome_follower_arm")
robot_config = SO100FollowerConfig(port="/dev/tty.usbmodem5AB90687491", id="my_follower_arm")
robot = SO100Follower(robot_config)
robot.connect()
@@ -490,6 +514,83 @@ Additionally you can provide extra `tags` or specify a `license` for your model
If your local computer doesn't have a powerful GPU you could utilize Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).
#### Train using Hugging Face Jobs
Hugging Face jobs let's you easily select hardware and run the training in the cloud. So if you don't have a powerful GPU or you need more VRAM or just want to train a model much faster use HF Jobs! It's pay as you go and you simply pay for each second of use, you can see the pricing and additional information [here](https://huggingface.co/docs/hub/jobs).
To run the training use this command:
<hfoptions id="train_with_hf_jobs">
<hfoption id="Command">
```bash
hf jobs run \
--flavor a10g-small \
--timeout 4h \
--secrets HF_TOKEN \
huggingface/lerobot-gpu:latest \
-- \
python -m lerobot.scripts.lerobot_train \
--dataset.repo_id=username/dataset \
--policy.type=act \
--steps=5000 \
--batch_size=16 \
--policy.device=cuda \
--policy.repo_id=username/your_policy \
--log_freq=100
```
</hfoption>
<hfoption id="API example">
<!-- prettier-ignore-start -->
```python
from huggingface_hub import run_job, get_token
run_name = "act_so101_hf_jobs"
dataset_id = "username/dataset"
user_hub_id = "username"
command_args = [
"python", "-m", "lerobot.scripts.lerobot_train",
"--dataset.repo_id", dataset_id,
"--policy.type", "act",
"--steps", "5000",
"--batch_size", "16",
"--num_workers", "4",
"--policy.device", "cuda",
"--log_freq", "100",
"--save_freq", "1000",
"--save_checkpoint", "true",
"--wandb.enable", "false",
"--policy.repo_id", f"{user_hub_id}/{run_name}"
]
print(f"Submitting job '{run_name}' to Hugging Face Infrastructure...")
job_info = run_job(
image="huggingface/lerobot-gpu:latest",
command=command_args,
flavor="a10g-small",
timeout="4h",
secrets={"HF_TOKEN": get_token()}
)
print("\n🚀 Job successfully launched!")
print(f"🔹 Job ID: {job_info.id}")
print(f"🔗 Live UI Dashboard & Logs: {job_info.url}")
```
<!-- prettier-ignore-end -->
</hfoption>
</hfoptions>
You can modify the `--flavor` to use different hardware, for example: `t4-small`, `a100-large`, `h200`. Use `hf jobs hardware` to see the full list with pricing.
Depending on the model you want to train and the hardware you selected you can also modify the `--batch_size` and `--number_of_workers`.
For longer training sessions increase the timeout.
Once the training is started you can go to [Jobs](https://huggingface.co/settings/jobs) and see if your jobs is running as well as all the outputs. Sometimes it takes a few minutes to schedule your job so be patient.
After training the model will be pushed to hub and you can use it as any other model with LeRobot.
#### Upload policy checkpoints
Once training is done, upload the latest checkpoint with:
@@ -546,5 +647,6 @@ The `--strategy.type` flag selects the execution mode:
- `sentry`: Continuous recording with auto-upload (useful for large-scale evaluation)
- `highlight`: Ring buffer recording with keystroke save (useful for capturing interesting events)
- `dagger`: Human-in-the-loop data collection (see [HIL Data Collection](./hil_data_collection))
- `episodic`: Episode-oriented policy recording with reset phases between episodes
All strategies support `--inference.type=rtc` for smooth execution with slow VLA models (Pi0, Pi0.5, SmolVLA).
+38
View File
@@ -157,6 +157,44 @@ Foot pedal input is also supported via `--strategy.input_device=pedal`. Configur
| `--strategy.input_device` | Input device: `keyboard` or `pedal` (default: keyboard) |
| `--teleop.type` | **Required.** Teleoperator type |
### Episodic (`--strategy.type=episodic`)
Episode-oriented recording that mirrors the behavior of `lerobot-record`. The policy drives the robot for each episode; an optional teleoperator can drive the robot during the reset phase between episodes.
```bash
lerobot-rollout \
--strategy.type=episodic \
--policy.path=${HF_USER}/my_policy \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM0 \
--teleop.type=so100_leader \
--teleop.port=/dev/ttyACM1 \
--dataset.repo_id=${HF_USER}/my_eval_data \
--dataset.num_episodes=20 \
--dataset.episode_time_s=30 \
--dataset.reset_time_s=10 \
--dataset.single_task="Pick up the red cube"
```
Teleop is optional — if omitted the robot holds its position during the reset phase.
**Keyboard controls:**
| Key | Action |
| ----------- | -------------------------------- |
| `→` (right) | End the current episode early |
| `←` (left) | Discard episode and re-record it |
| `ESC` | Stop the recording session |
| Flag | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------- |
| `--dataset.num_episodes` | Number of episodes to record |
| `--dataset.episode_time_s` | Duration of each recording episode in seconds |
| `--dataset.reset_time_s` | Duration of the reset phase between episodes in seconds |
| `--teleop.type` | Optional. Teleoperator to drive the robot during resets |
| `--strategy.reset_to_initial_position` | Whether to reset the robot to its initial position between episodes |
| `--strategy.smooth_leader_to_follower_handover` | Whether to turn on or off the leader -> follower smooth handover behavior. |
---
## Inference Backends
+5
View File
@@ -141,6 +141,11 @@ sample["target_message_indices"]
The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.
## Blends
Blend recipes select one weighted sub-recipe deterministically from the sample index.
`recipes/subtasks_vqa.yaml` trains the core blend — high-level subtask prediction, low-level execution, and VQA. `recipes/subtask_mem_vqa_speech.yaml` is the fuller variant that also adds memory updates and spoken interjection responses.
## Graceful absence
If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
+29
View File
@@ -0,0 +1,29 @@
# LeLab - LeRobot Guide
LeLab is a graphical user interface built on top of the LeRobot library, designed to make robotics accessible without needing to memorize CLI commands. From a single app you can configure your robot, teleoperate it, collect datasets, train policies locally or on cloud GPUs via HF Jobs, and deploy trained models back onto your robot. It's the easiest way to go from an unboxed SO-101 to a working policy, and a great companion for anyone learning the LeRobot workflow. Source code and issues live on GitHub: [huggingface/leLab](https://github.com/huggingface/leLab).
> [!TIP]
> For now LeLab is compatible only with SO-ARM101
<Youtube id="VqyKUuW9V1g" />
### Installation
Requires [`uv`](https://docs.astral.sh/uv/getting-started/installation/). Install and launch in one command:
```
uv tool install git+https://github.com/huggingface/leLab.git && lelab
```
After install, run `lelab` from your terminal anytime to start the app.
### Features
- **Add robots** — Select arm type (leader/follower), calibrate each joint from the middle position, and attach cameras.
- **Teleoperation** — Control the follower arm with the leader and see a live 3D visualization of the arms.
- **Dataset recording** — Define a task description, number of episodes, and episode/reset durations. Press spacebar to advance between episodes. 30+ episodes recommended.
- **Local training** — Train a policy directly on your own machine with a selected dataset, policy type, batch size, and step count.
- **Cloud training with HF Jobs** — Train on powerful GPUs via [HF Jobs](https://huggingface.co/docs/huggingface_hub/en/guides/jobs) with transparent pricing. Run `hf auth login` first. See the [Compute HW Guide](hardware_guide) for hardware/batch size tips.
- **Training visualization** — Watch progress live in the app, with checkpoints saved automatically.
- **Run trained policies** — Pick any model from your jobs list and run inference on your robot with one click.
- **Use community datasets** — Provide any Hugging Face dataset ID to train on datasets you didn't record yourself.
+1 -1
View File
@@ -275,7 +275,7 @@ A converter aggregates perepisode files into larger shards and writes episode
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"
# Convert an existing v2.1 dataset hosted on the Hub:
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<HF_USER/DATASET_ID>
python -m lerobot.scripts.convert_dataset_v21_to_v30 --repo-id=<HF_USER/DATASET_ID>
```
**What it does**
+433
View File
@@ -0,0 +1,433 @@
# MolmoAct2 Policy
MolmoAct2 is the LeRobot policy implementation of
[MolmoAct2](https://allenai.org/blog/molmoact2), ported into the LeRobot
training, evaluation, checkpointing, and dataset interfaces for easier use with
LeRobot datasets.
This implementation currently supports training and evaluation for the regular
MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
not included in this LeRobot policy yet and is coming soon.
For the original MolmoAct2 training code used for the experiments reported in
the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
## Installation Requirements
Install LeRobot with the MolmoAct2 optional dependencies:
```bash
pip install -e ".[molmoact2]"
```
To run the models in this repository, you need an NVIDIA GPU. The measurements
below were taken on a single NVIDIA H100 80GB with bf16 model loading, LIBERO with two RGB cameras. MolmoAct2 rows use `chunk_size=10`, action dim 7
padded to `expected_max_action_dim=32`, and `num_flow_timesteps=8`. Training measurements use
`gradient_checkpointing=true` and include the forward pass, backward pass,
gradient clipping, optimizer step, and optimizer state allocation. Values are
peak GPU memory sampled with `nvidia-smi`. Leave a few GiB of headroom for
dataloader workers, CUDA context, and fragmentation.
Multi-GPU training through `accelerate` increases throughput and global batch
size, but this LeRobot port does not currently expose the original MolmoAct2
`fsdp_devices` model-parallel training path. The current training script has
not been tested for multi-node training.
| Mode | Peak Memory, bs=8 | Peak Memory, bs=16 | Peak Memory, bs=32 |
| ------------------------------------------------ | ----------------: | -----------------: | -----------------: |
| Inference, continuous, CUDA graph enabled (bs=1) | 12.1 GiB | - | - |
| Fine-tuning, action expert only, continuous | 16.5 GiB | 18.3 GiB | 21.4 GiB |
| Fine-tuning, LoRA VLM, both action modes | 20.2 GiB | 26.8 GiB | 41.3 GiB |
| Fine-tuning, full model, both action modes | 48.3 GiB | 49.8 GiB | 60.1 GiB |
The repo has been tested with Ubuntu 22.04.
## Usage
To use MolmoAct2 in a LeRobot training config, set:
```python
policy.type=molmoact2
```
## Training
MolmoAct2 can be fine-tuned from either the released MolmoAct2 Hugging Face
checkpoint format or from a checkpoint already saved by LeRobot. Both routes use
the same LeRobot training loop, dataset transforms, checkpoint saving, and
logging. The difference is only how the initial policy weights and processor
state are loaded.
### Training With Original MolmoAct2 Weight
Use `policy.checkpoint_path` when starting from a released MolmoAct2 checkpoint,
for example `allenai/MolmoAct2` or `allenai/MolmoAct2-LIBERO`. LeRobot will load
the original HF model files, then build its own policy processor from the
dataset metadata and the policy options below.
The command below shows full fine-tuning on the merged LIBERO dataset. It uses
bf16 model loading, 8 flow timesteps, LeRobot dataset statistics, image
augmentation, and LeRobot's checkpointing/logging path.
```bash
accelerate launch \
--num_processes=8 \
--mixed_precision=bf16 \
-m lerobot.scripts.lerobot_train \
--dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
--dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
--dataset.video_backend=pyav \
--dataset.image_transforms.enable=true \
--policy.type=molmoact2 \
--policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
--policy.device=cuda \
--policy.action_mode=both \
--policy.chunk_size=10 \
--policy.n_action_steps=10 \
--policy.setup_type="single franka robotic arm in libero" \
--policy.control_mode="delta end-effector pose" \
--policy.image_keys='["observation.images.image","observation.images.wrist_image"]' \
--policy.model_dtype=bfloat16 \
--policy.num_flow_timesteps=8 \
--policy.gradient_checkpointing=true \
--policy.freeze_embedding=true \
--policy.normalize_gripper=false \
--policy.enable_knowledge_insulation=false \
--policy.push_to_hub=false \
--wandb.enable=true \
--wandb.entity=<wandb_entity> \
--wandb.project=<wandb_project> \
--job_name=<job_name> \
--output_dir=outputs/<job_name> \
--steps=10000 \
--batch_size=32 \
--num_workers=4 \
--log_freq=20 \
--eval_freq=-1 \
--save_checkpoint=true \
--save_freq=2000
```
### Training With LeRobot MolmoAct2 Weight
Use `policy.path` when starting from a MolmoAct2 checkpoint that was saved by
LeRobot, either from a local `pretrained_model` directory or from the Hub. This
restores the saved LeRobot policy config, model weights, processor, and
normalization statistics. You can still override training-time options such as
`batch_size`, `steps`, LoRA flags, or `policy.action_mode`.
```bash
accelerate launch \
--num_processes=8 \
--mixed_precision=bf16 \
-m lerobot.scripts.lerobot_train \
--dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
--dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
--dataset.video_backend=pyav \
--dataset.image_transforms.enable=true \
--policy.path=/path/to/pretrained_model \
--policy.device=cuda \
--policy.action_mode=both \
--policy.chunk_size=10 \
--policy.n_action_steps=10 \
--policy.model_dtype=bfloat16 \
--policy.num_flow_timesteps=8 \
--policy.gradient_checkpointing=true \
--wandb.enable=true \
--wandb.entity=<wandb_entity> \
--wandb.project=<wandb_project> \
--job_name=<job_name> \
--output_dir=outputs/<job_name> \
--steps=10000 \
--batch_size=32 \
--num_workers=4 \
--log_freq=20 \
--eval_freq=-1 \
--save_checkpoint=true \
--save_freq=2000
```
### Common Practices
For fine-tuning on a comparatively small dataset, such as a single LIBERO suite
or a real-world dataset with less than 200 demonstrations, a global batch size of
16 to 32 is a good starting point. In these settings, `policy.enable_lora_vlm=true` or `policy.train_action_expert_only=true` is also a practical choice. In both
cases, we intentionally keep the action expert fully trainable, which we found
to be crucial for model performance. For larger fine-tuning datasets, larger
global batch sizes and full fine-tuning are usually preferred.
### Common Policy Options
- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint to initialize from.
Use this for released MolmoAct2 weights.
- `policy.path`: LeRobot checkpoint to initialize from. Use this for checkpoints
created by LeRobot training.
- `policy.action_mode`: training target, one of `continuous`, `discrete`, or
`both`. `both` trains the flow-matching action expert and the discrete
action-token loss.
- `policy.train_action_expert_only`: trains only parameters whose names contain
`action_expert`. It requires `policy.action_mode=continuous`.
- `policy.enable_lora_vlm`: enables LoRA on VLM linear layers. Use
`policy.enable_lora_action_expert=true` only if LoRA should also cover action
expert linear layers. When `policy.enable_lora_action_expert=false`, the
action expert base weights remain fully trainable while the VLM is trained
through LoRA adapters. When `policy.enable_lora_action_expert=true`, the
action expert is also adapter-tuned instead of fully fine-tuned.
- `policy.enable_knowledge_insulation`: when `true`, detaches action-expert
context K/V states before the action loss. The default is `false`.
- `policy.chunk_size`: action horizon used by the policy. For LIBERO we use
`10`. This LeRobot port overrides the loaded checkpoint's
`max_action_horizon` with this value.
- `policy.n_action_steps`: number of actions consumed from each predicted
chunk before querying the policy again. For LIBERO, set it to `chunk_size`.
- `policy.setup_type`: text inserted into the prompt to describe the robot and
scene, e.g. `single franka robotic arm in libero`. More examples are listed
in the `metadata_by_tag` entries of
[`norm_stats.json`](https://huggingface.co/allenai/MolmoAct2/blob/main/norm_stats.json).
- `policy.control_mode`: text inserted into the prompt to describe the action
space, e.g. `delta end-effector pose` or `absolute joint pose`.
- `policy.image_keys`: ordered LeRobot image observation keys passed to the
processor.
- `policy.model_dtype`: checkpoint/forward dtype, one of `float32`,
`bfloat16`, or `float16`. Use `bfloat16` for normal training.
- `policy.num_flow_timesteps`: number of flow-matching timesteps sampled per
example during training. We use `8` for fine-tuning.
- `policy.num_inference_steps`: optional override for continuous action
generation steps at inference time.
- `policy.gradient_checkpointing`: enables checkpointing in the VLM/action path
to reduce activation memory.
- `policy.freeze_embedding`: freezes input embeddings. The default is `true`.
- `policy.normalize_gripper`: controls whether gripper dimensions are included
in state/action quantile normalization. The default is `false`.
- `policy.normalize_language`: normalizes task strings before prompt
construction. The default is `true`.
- `policy.mask_action_dim_padding`: masks padded dimensions in the flow loss.
Released checkpoints use `policy.expected_max_action_dim=32`.
- `policy.max_sequence_length`: optional manual sequence cap. Leave unset to
infer it from images, state dimension, action dimension, action horizon, and
discrete-action mode.
### Learning Rates
MolmoAct2 uses parameter-group learning rates to match the original MolmoAct2
fine-tuning experiments.
- Full fine-tuning uses `policy.optimizer_lr=1e-5` for the VLM,
`policy.optimizer_vit_lr=5e-6` for the vision tower,
`policy.optimizer_connector_lr=5e-6` for image connector layers, and
`policy.optimizer_action_expert_lr=5e-5` for the action expert.
- LoRA VLM fine-tuning sets the VLM, vision, and connector LoRA parameter
groups to `5e-5` when `policy.enable_lora_vlm=true`. By default,
`policy.enable_lora_action_expert=false`, so the action expert is still fully
fine-tuned with `policy.optimizer_action_expert_lr`. If
`policy.enable_lora_action_expert=true`, the action expert is trained through
LoRA adapters instead.
- Action-expert-only fine-tuning trains only the action expert and uses
`policy.optimizer_action_expert_lr=5e-5`.
You can override the full fine-tuning and action-expert learning rates with
`policy.optimizer_lr`, `policy.optimizer_vit_lr`,
`policy.optimizer_connector_lr`, and `policy.optimizer_action_expert_lr`.
Scheduler settings can be changed with `policy.scheduler_warmup_steps`,
`policy.scheduler_decay_steps`, and `policy.scheduler_decay_lr`.
### Dataset Quantile Statistics
MolmoAct2 defaults to quantile normalization for state and action features. If
your dataset has not been converted with quantile statistics, you can add them
with:
```bash
python src/lerobot/scripts/augment_dataset_quantile_stats.py \
--repo-id=your_dataset
```
Alternatively, train MolmoAct2 with mean/std normalization:
```bash
--policy.normalization_mapping='{"ACTION": "MEAN_STD", "STATE": "MEAN_STD", "VISUAL": "IDENTITY"}'
```
## Evaluation
Evaluation also supports both LeRobot-saved checkpoints and original MolmoAct2
HF checkpoints. For LIBERO replication, keep the EGL rendering environment
fixed and use `policy.per_episode_seed=true`.
**Important:** We found that `num_steps_wait=10` does not reliably let the
LIBERO scene stabilize and can degrade measured success. All LIBERO evaluation
results reported here use `num_steps_wait=50`.
### Evaluation With LeRobot MolmoAct2 Weight
Use `policy.path` for a checkpoint saved by LeRobot. The saved processor and
normalization statistics are restored together with the model.
```bash
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
lerobot-eval \
--policy.path=allenai/MolmoAct2-LIBERO-LeRobot \
--policy.inference_action_mode=continuous \
--policy.model_dtype=bfloat16 \
--policy.use_amp=true \
--policy.enable_inference_cuda_graph=true \
--policy.device=cuda \
--policy.per_episode_seed=true \
--policy.eval_seed=1000 \
--env.type=libero \
--env.task=libero_10,libero_goal,libero_object,libero_spatial \
--env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
--eval.batch_size=1 \
--eval.n_episodes=50 \
--seed=1000
```
### Evaluation With Original MolmoAct2 Weight
You can evaluate a released Hugging Face checkpoint directly without first
converting it to a LeRobot checkpoint. In this case, set
`policy.checkpoint_path` to the HF model repo and provide `policy.norm_tag`.
For LIBERO, `policy.norm_tag=libero` loads the LIBERO action/state
normalization statistics, action horizon, prompt metadata, and image-key order
from the checkpoint's `norm_stats.json`.
To fully replicate the MolmoAct2 paper results with released Hugging Face
checkpoints, we recommend using the v0.5.1-pinned
[`allenai/lerobot` `molmoact2-hf-inference`](https://github.com/allenai/lerobot/tree/molmoact2-hf-inference)
branch. That branch matches the original evaluation settings used for the
reported numbers.
```bash
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
lerobot-eval \
--policy.type=molmoact2 \
--policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
--policy.norm_tag=libero \
--policy.inference_action_mode=continuous \
--policy.model_dtype=float32 \
--policy.use_amp=false \
--policy.enable_inference_cuda_graph=true \
--policy.device=cuda \
--policy.per_episode_seed=true \
--policy.eval_seed=1000 \
--env.type=libero \
--env.task=libero_goal \
--env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
--eval.batch_size=1 \
--eval.n_episodes=50 \
--seed=1000
```
Use `--env.task=libero_10,libero_goal,libero_object,libero_spatial` to run the
full LIBERO suite. The same command works for other released MolmoAct2
checkpoints as long as the requested `policy.norm_tag` exists in that
checkpoint's `norm_stats.json`.
### Common Evaluation Options
- `policy.inference_action_mode`: required for rollout. Use `continuous` for
flow-matching inference or `discrete` for action-token inference. It must be
compatible with the training-time `policy.action_mode` saved in the
checkpoint.
- `policy.path`: LeRobot checkpoint path or Hub repo. Use this for checkpoints
saved by LeRobot.
- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint path or Hub repo.
Use this with `policy.type=molmoact2` and `policy.norm_tag`.
- `policy.norm_tag`: selects normalization statistics, prompt metadata,
image-key order, and action horizon from the original checkpoint's
`norm_stats.json`. It is required for direct original-HF checkpoint
evaluation.
- `policy.model_dtype`: model load/forward dtype. Use `bfloat16` for normal
GPU evaluation. Use `float32` only when you explicitly want fp32 inference.
- `policy.use_amp`: runs the policy forward under autocast during eval. For
`model_dtype=bfloat16`, keep this enabled.
- `policy.enable_inference_cuda_graph`: enables the MolmoAct2 inference CUDA
graph path for faster repeated continuous-action rollout.
- `policy.per_episode_seed` and `policy.eval_seed`: make stochastic continuous
action generation deterministic per episode for replication.
- `env.task`: comma-separated LIBERO suites or a single suite. Use
`libero_10,libero_goal,libero_object,libero_spatial` for the full benchmark.
- `env.camera_name_mapping`: maps LIBERO camera names to the image keys expected
by the policy processor.
## Performance Results
### LIBERO Benchmark Results
MolmoAct2 has demonstrated strong performance on the LIBERO benchmark suite. To
compare and test its LeRobot implementation, we fine-tuned
[`allenai/MolmoAct2-LIBERO`](https://huggingface.co/allenai/MolmoAct2-LIBERO)
for an additional 10k steps on the LIBERO dataset with per-GPU batch size 32 on
8 H100 GPUs, then compared the results to the original MolmoAct2 reference
results.
The LeRobot fine-tuned checkpoint reported here is available at
[`allenai/MolmoAct2-LIBERO-LeRobot`](https://huggingface.co/allenai/MolmoAct2-LIBERO-LeRobot)
and was trained on
[`allenai/MolmoAct2-LIBERO-Dataset`](https://huggingface.co/datasets/allenai/MolmoAct2-LIBERO-Dataset).
| Benchmark | LeRobot Implementation | MolmoAct2 Original |
| -------------- | ---------------------: | -----------------: |
| LIBERO Spatial | 98.4% | 97.8% |
| LIBERO Object | 100.0% | 100.0% |
| LIBERO Goal | 98.0% | 97.8% |
| LIBERO 10 | 96.6% | 93.2% |
| Average | 98.25% | 97.20% |
These results demonstrate MolmoAct2's strong performance across diverse robotic
manipulation tasks. To reproduce them, follow the instructions in the LIBERO
evaluation section.
## Differences From the Original Implementation
This LeRobot port is intended to match MolmoAct2 behavior while using LeRobot's
dataset, training, evaluation, checkpoint, and logging infrastructure. The main
differences from the original training repository are:
- The original paper training stack loads the model in fp32 and trains under
mixed precision. This LeRobot port usually loads the checkpoint directly in
`policy.model_dtype=bfloat16` for lower memory use.
- The original repository uses its own FSDP/model-parallel training path. The
LeRobot port uses the standard LeRobot/Accelerate training path and has not
been tested for multi-node training.
- The original repository supports sequence packing. The LeRobot port trains on
one LeRobot sample per item and pads to an inferred fixed sequence budget.
- The LeRobot port follows LeRobot's optimizer, scheduler, checkpoint saving,
dataset transforms, image augmentation, and Weights & Biases logging
conventions.
- The original training path supports mixed action horizons by padding to
`max_action_horizon` and masking padded horizon slots in the action expert
self-attention. This is useful when training across datasets with different
control frequencies. The LeRobot port currently targets single-dataset
fine-tuning, so `policy.chunk_size` overrides the checkpoint
`max_action_horizon` and horizon masking is not implemented yet. Support for
this mixed-horizon path is planned.
## Citation
```bibtex
@misc{fang2026molmoact2actionreasoningmodels,
title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
year={2026},
eprint={2605.02881},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.02881},
}
```
## License
This model is licensed under Apache 2.0. It is intended for research and
educational use in accordance with
[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
+1 -1
View File
@@ -91,7 +91,7 @@ lerobot-train \
If your dataset is not converted with `quantiles`, you can convert it with the following command:
```bash
python src/lerobot/datasets/v30/augment_dataset_quantile_stats.py \
python src/lerobot/scripts/augment_dataset_quantile_stats.py \
--repo-id=your_dataset \
```
+39
View File
@@ -0,0 +1,39 @@
# MolmoAct2
This repository contains the LeRobot policy implementation of
[MolmoAct2](https://allenai.org/blog/molmoact2), ported into LeRobot for
training, evaluation, checkpointing, and dataset compatibility.
This implementation currently supports training and evaluation for the regular
MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
not included in this LeRobot policy yet and is coming soon.
For the original MolmoAct2 training code used for the experiments reported in
the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
## LIBERO Evaluation
Important: we found that `num_steps_wait=10` does not reliably let the LIBERO
scene stabilize and can degrade measured success. All LIBERO evaluation results
reported for this LeRobot implementation use `num_steps_wait=50`.
## Citation
```bibtex
@misc{fang2026molmoact2actionreasoningmodels,
title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
year={2026},
eprint={2605.02881},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.02881},
}
```
## License
This model is licensed under Apache 2.0. It is intended for research and
educational use in accordance with
[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
+39
View File
@@ -0,0 +1,39 @@
# VLA-JEPA
This repository contains the LeRobot port of **VLA-JEPA**, a Vision-Language-Action model that combines a Qwen3-VL language backbone with a self-supervised video world model (V-JEPA2) and a flow-matching DiT action head.
Converted from [ginwind/VLA-JEPA](https://huggingface.co/ginwind/VLA-JEPA).
---
## Architecture Overview
| Component | Module | Role |
| ----------------------- | --------------------------------- | ------------------------------------------------------- |
| **Qwen3-VL backbone** | `Qwen3VLInterface` | Fuses images + language instruction into context tokens |
| **DiT-B action head** | `VLAJEPAActionHead` | Flow-matching diffusion over the action chunk |
| **V-JEPA2 world model** | `ActionConditionedVideoPredictor` | Self-supervised video prediction loss (training only) |
At inference time only the Qwen backbone and action head are used; the world model is not needed.
---
## Citation
```bibtex
@misc{sun2026vlajepaenhancingvisionlanguageactionmodel,
title = {VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
author = {Jingwen Sun and Wenyao Zhang and Zekun Qi and Shaojie Ren and Zezhi Liu and Hanxin Zhu and Guangzhong Sun and Xin Jin and Zhibo Chen},
year = {2026},
eprint = {2602.10098},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2602.10098},
}
```
---
## License
Weights are distributed under the license terms of the original [ginwind/VLA-JEPA](https://huggingface.co/ginwind/VLA-JEPA) repository (**Apache 2.0 License**). The LeRobot integration code follows the **Apache 2.0 License**.
+1 -1
View File
@@ -300,7 +300,7 @@ This replaces the old episode-per-file structure with efficient, optimally-sized
If you have existing datasets in v2.1 format, use the migration tool:
```bash
python src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py \
python src/lerobot/scripts/convert_dataset_v21_to_v30.py \
--repo-id your_id/existing_dataset
```
+185
View File
@@ -0,0 +1,185 @@
# ROBOMETER
ROBOMETER is a **general-purpose video-language robotic reward model**. It predicts dense, frame-level task progress and frame-level success from a trajectory video and a task description.
**Paper**: [ROBOMETER: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons](https://arxiv.org/abs/2603.02115)
**Project**: [robometer.github.io](https://robometer.github.io/)
**Original code**: [github.com/robometer/robometer](https://github.com/robometer/robometer)
**Checkpoint**: [lerobot/Robometer-4B](https://huggingface.co/lerobot/Robometer-4B)
## Overview
ROBOMETER builds on `Qwen/Qwen3-VL-4B-Instruct` and adds three lightweight prediction heads:
- **Progress head**: predicts per-frame task progress in `[0, 1]`.
- **Success head**: predicts per-frame task success probability.
- **Preference head**: predicts which of two trajectories better completes the task during training.
The paper trains ROBOMETER with a composite objective:
```text
L = L_pref + L_prog + L_succ
```
The LeRobot integration is currently **inference-only**. It preserves the preference head so that the published `Robometer-4B` checkpoint loads without remapping, but `compute_reward()` queries the progress or success head only.
## What the LeRobot Integration Covers
- Standard `reward_model.type=robometer` configuration through LeRobot.
- Qwen3-VL image and text preprocessing through `RobometerEncoderProcessorStep`.
- LeRobot reward-model save/load APIs through `PreTrainedRewardModel`.
- Dense, frame-level progress and success predictions internally.
- A scalar reward through `compute_reward()` for downstream LeRobot reward-model usage.
This page focuses on using the published ROBOMETER checkpoint as a zero-shot reward model. Training ROBOMETER from scratch is outside the current LeRobot integration.
## Installation Requirements
1. Install LeRobot by following the [Installation Guide](./installation).
2. Install the ROBOMETER dependencies:
```bash
pip install -e ".[robometer]"
```
If you use `uv` directly from a source checkout:
```bash
uv sync --extra robometer
```
ROBOMETER uses a Qwen3-VL-4B backbone, so GPU inference is strongly recommended.
## Model Inputs and Outputs
ROBOMETER expects:
- A trajectory video or sequence of frames.
- A natural-language task description.
In LeRobot datasets, the preprocessor reads:
| Config field | Default | Meaning |
| ------------------------- | ------------------------ | ----------------------------------------------------- |
| `reward_model.image_key` | `observation.images.top` | Camera/video observation used by ROBOMETER |
| `reward_model.task_key` | `task` | Key in complementary data that stores the task string |
| `reward_model.max_frames` | `8` | Maximum number of frames passed to ROBOMETER |
The model predicts per-frame progress and success internally. The LeRobot reward API returns a scalar per sample:
- `reward_output="progress"` (default): return the last-frame progress, clamped to `[0, 1]`.
- `reward_output="success"`: return `1.0` if the last-frame success probability is above `success_threshold`, otherwise `0.0`.
## Usage
### Load the Reward Model Directly
```python
from lerobot.rewards.robometer import RobometerConfig, RobometerRewardModel
cfg = RobometerConfig(
pretrained_path="lerobot/Robometer-4B",
device="cuda",
reward_output="progress",
)
reward_model = RobometerRewardModel.from_pretrained(cfg.pretrained_path, config=cfg)
```
### Encode Frames and Compute a Reward
For a direct Python call, provide frames as `uint8` arrays with shape `(T, H, W, C)` and a task string:
```python
from lerobot.rewards.robometer.modeling_robometer import ROBOMETER_FEATURE_PREFIX
from lerobot.rewards.robometer.processor_robometer import RobometerEncoderProcessorStep
# frames: np.ndarray, shape (T, H, W, C), dtype uint8
# task: str
encoder = RobometerEncoderProcessorStep(
base_model_id=cfg.base_model_id,
use_multi_image=cfg.use_multi_image,
use_per_frame_progress_token=cfg.use_per_frame_progress_token,
max_frames=cfg.max_frames,
)
encoded = encoder.encode_samples([(frames, task)])
batch = {f"{ROBOMETER_FEATURE_PREFIX}{key}": value for key, value in encoded.items()}
reward = reward_model.compute_reward(batch)
```
`reward` is a tensor of shape `(batch_size,)`.
### Use the Reward Factory
You can also instantiate ROBOMETER through the reward factory:
```python
from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
cfg = make_reward_model_config(
"robometer",
pretrained_path="lerobot/Robometer-4B",
device="cuda",
image_key="observation.images.top",
)
reward_model = make_reward_model(cfg)
preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
```
The preprocessor writes Qwen-VL tensors under the `observation.robometer.*` namespace, and `compute_reward()` reads those encoded tensors.
## Configuration Notes
### Backbone and Vocabulary
The published checkpoint uses a Qwen3-VL-4B backbone. ROBOMETER adds five special tokens to the tokenizer in a fixed order:
```text
<|split_token|>
<|reward_token|>
<|pref_token|>
<|sim_token|>
<|prog_token|>
```
`<|prog_token|>` is inserted after each frame and is the hidden-state position used for per-frame progress and success prediction. `<|split_token|>` and `<|pref_token|>` are used by the paper's pairwise trajectory preference objective. `<|reward_token|>` and `<|sim_token|>` are preserved for checkpoint compatibility.
The LeRobot config stores a serialized `vlm_config` with the post-resize vocabulary so the model can reload from `config.json` without downloading the base Qwen weights first. For `Qwen/Qwen3-VL-4B-Instruct`, the tokenizer length is `151669`, and the five ROBOMETER tokens produce the checkpoint vocabulary size `151674`.
### Progress Prediction
In the published checkpoint, progress is discrete. The progress head outputs logits over `progress_discrete_bins=10` uniformly spaced bin centers in `[0, 1]`. LeRobot converts these logits into a continuous value by applying a softmax and taking the expectation over bin centers, matching the upstream ROBOMETER implementation.
### Success Prediction
The success head outputs raw logits per frame. LeRobot converts them to probabilities with `sigmoid`. When `reward_output="success"`, `compute_reward()` thresholds the last-frame success probability using `success_threshold`.
## Limitations
- The current LeRobot integration is inference-only; it does not implement ROBOMETER training or preference-pair training.
- `compute_reward()` returns a scalar per sample for the LeRobot reward-model API, even though ROBOMETER predicts per-frame progress and success internally.
- ROBOMETER is video-language based; it does not use privileged robot state such as contact forces or object poses.
## References
- [ROBOMETER project](https://robometer.github.io/)
- [ROBOMETER paper](https://arxiv.org/abs/2603.02115)
- [Original ROBOMETER code](https://github.com/robometer/robometer)
- [Published ROBOMETER-4B checkpoint](https://huggingface.co/lerobot/Robometer-4B)
- [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)
## Citation
```bibtex
@inproceedings{liang2026robometer,
title = {Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons},
author={Anthony Liang and Yigit Korkmaz and Jiahui Zhang and Minyoung Hwang and Abrar Anwar and Sidhant Kaushik and Aditya Shah and Alex S. Huang and Luke Zettlemoyer and Dieter Fox and Yu Xiang and Anqi Li and Andreea Bobu and Abhishek Gupta and Stephen Tu and Erdem Biyik and Jesse Zhang},
year={2026},
booktitle={Robotics: Science and Systems 2026},
}
```
## License
This LeRobot integration follows the **Apache 2.0 License** used by LeRobot. Check the upstream ROBOMETER code and model pages for the licenses of the original implementation and released checkpoints.
+8 -8
View File
@@ -97,22 +97,22 @@ Similarly for when recording an episode, it is recommended that you are logged i
Once you are logged in, you can run inference in your setup by doing:
```bash
lerobot-record \
lerobot-rollout \
--strategy.type=base \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \ # <- Use your port
--robot.id=my_blue_follower_arm \ # <- Use your robot id
--robot.cameras="{ front: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}" \ # <- Use your cameras
--dataset.single_task="Grasp a lego block and put it in the bin." \ # <- Use the same task description you used in your dataset recording
--dataset.repo_id=${HF_USER}/eval_DATASET_NAME_test \ # <- This will be the dataset name on HF Hub
--dataset.episode_time_s=50 \
--dataset.num_episodes=10 \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder.vcodec=auto \
--task="Grasp a lego block and put it in the bin." \ # <- Use the same task description you used in your dataset recording
# <- RTC optional, use when running on low power hardware \
# --inference.type=rtc \
# --inference.rtc.execution_horizon=10 \
# --inference.rtc.max_guidance_weight=10.0 \
# <- Teleop optional if you want to teleoperate in between episodes \
# --teleop.type=so100_leader \
# --teleop.port=/dev/ttyACM0 \
# --teleop.id=my_red_leader_arm \
# --display_data=true #optional use if you want to see the camera stream \
--policy.path=HF_USER/FINETUNE_MODEL_NAME # <- Use your fine-tuned model
```
+177
View File
@@ -0,0 +1,177 @@
# TOPReward
TOPReward is a **zero-shot reward model** that extracts token log-probabilities from an off-the-shelf vision-language model (VLM) as a robotic reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood that the instruction is true — no fine-tuning required.
**Paper**: [TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics](https://arxiv.org/abs/2602.19313)
**Project**: [topreward.github.io](https://topreward.github.io/webpage/)
**Original code**: [github.com/TOPReward/TOPReward](https://github.com/TOPReward/TOPReward)
**Default backbone**: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
## Overview
TOPReward asks a generic VLM how likely a task instruction is, **conditioned on the video** of a robot trying to complete that task. Concretely, given:
- A trajectory video (a sequence of frames).
- A task instruction (e.g. _"open the drawer"_).
it builds a chat prompt of the form
```text
<video>
"The above video shows a robot manipulation trajectory that completes the
following task: <instruction> Decide whether the above statement is True
or not. The answer is: True"
```
forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token — by default the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward.
Because the method only depends on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the label-masking logic. The processor owns the tokeniser and builds the full chat prompt (EO-1/Robometer pattern).
## What the LeRobot integration covers
- Standard `reward_model.type=topreward` configuration through LeRobot.
- VLM loading via the `transformers` `Qwen3VLForConditionalGeneration` API.
- Prompt assembly + tokenisation in the processor (matching upstream `QwenClient.compute_instruction_reward`).
- `compute_reward()` returns one scalar log-prob per sample.
- LeRobot reward-model save/load — `save_pretrained` writes only `config.json` (the VLM is identified by `vlm_name`).
- An offline labeling script that writes a `topreward_progress.parquet` (SARM-compatible schema) for RA-BC and overlay.
The current LeRobot port supports the **Qwen3-VL client only**. Other upstream clients (Gemini, OpenAI, Gemma, Molmo) can be added as follow-up extras.
## Installation Requirements
1. Install LeRobot following the [Installation Guide](./installation).
2. Install the TOPReward optional extra:
```bash
pip install -e ".[topreward]"
```
or, with `uv` from a source checkout:
```bash
uv sync --extra topreward
```
This pulls in `transformers`. The first time you run TOPReward, Hugging Face will also download the VLM weights from the Hub (~16 GB for Qwen3-VL-8B-Instruct). A GPU is strongly recommended.
## Model Inputs and Outputs
TOPReward expects:
- A trajectory video or sequence of frames.
- A natural-language task description.
In LeRobot datasets the preprocessor reads:
| Config field | Default | Meaning |
| ------------------------- | --------------------------- | --------------------------------------------- |
| `reward_model.image_key` | `observation.images.top` | Camera observation used by TOPReward |
| `reward_model.task_key` | `task` | Key in complementary data for the task string |
| `reward_model.max_frames` | `16` | Cap on frames per sample |
| `reward_model.fps` | `2.0` | Metadata passed to the Qwen video processor |
| `reward_model.vlm_name` | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM |
The model returns:
- `compute_reward(batch)`: one log-probability per sample. Higher = better task-video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
## Usage
### Load the reward model directly
```python
from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel
cfg = TOPRewardConfig(
vlm_name="Qwen/Qwen3-VL-8B-Instruct",
device="cuda",
)
reward_model = TOPRewardModel(cfg)
```
### Use the reward factory
```python
from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
cfg = make_reward_model_config(
"topreward",
vlm_name="Qwen/Qwen3-VL-8B-Instruct",
device="cuda",
image_key="observation.images.top",
)
reward_model = make_reward_model(cfg)
preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
```
The preprocessor tokenises the full prompt (video + prefix + instruction suffix), writes Qwen-VL tensors + `prompt_length` under `observation.topreward.*`. The model reads those tensors, label-masks based on `prompt_length`, and extracts the log-prob reward.
### Offline dataset labeling
Write a `topreward_progress.parquet` for RA-BC training and overlay videos:
```bash
# Sparse-dense (15 anchors per episode, matches upstream)
uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
--dataset-repo-id lerobot/libero_10_image \
--num-samples 15 \
--device cuda
```
Then render the progress overlay for any episode:
```bash
uv run examples/dataset/create_progress_videos.py \
--repo-id lerobot/libero_10_image \
--episode 0 \
--progress-file topreward_progress.parquet \
--gif
```
## Configuration Notes
### Prompt knobs
The default prompt mirrors the upstream paper:
```text
prompt_prefix = "The above video shows a robot manipulation trajectory that completes the following task: "
prompt_suffix_template = "{instruction} Decide whether the above statement is True or not. The answer is: True"
```
Both are exposed on `TOPRewardConfig` for ablation. The suffix template **must** contain `{instruction}`.
### Chat template
`add_chat_template=True` wraps the full prompt (including instruction) with the tokenizer's chat template before tokenisation. Default is `False`, matching the upstream paper's main experiments.
## Limitations
- The current LeRobot port is **inference-only and zero-shot**; `forward()` is not overridden and `is_trainable` returns `False`.
- Only the **Qwen3-VL family** is supported; other upstream clients are out of scope.
- TOPReward inherits the underlying VLM's biases.
## References
- [TOPReward project page](https://topreward.github.io/webpage/)
- [TOPReward paper](https://arxiv.org/abs/2602.19313)
- [Original TOPReward code](https://github.com/TOPReward/TOPReward)
- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
## Citation
```bibtex
@article{chen2026topreward,
title={TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
author={Chen, Shirui and Harrison, Cole and Lee, Ying-Chun and Yang, Angela Jin and
Ren, Zhongzheng and Ratliff, Lillian J and Duan, Jiafei and Fox, Dieter and
Krishna, Ranjay},
journal={arXiv preprint arXiv:2602.19313},
year={2026}
}
```
## License
The original TOPReward codebase is MIT-licensed. The LeRobot port follows the LeRobot Apache 2.0 license; the wrapped Qwen3-VL weights are subject to the original Qwen license.
+235
View File
@@ -0,0 +1,235 @@
# VLA-JEPA
This is the LeRobot port of **VLA-JEPA**, a Vision-Language-Action model that combines a Qwen3-VL language backbone with a self-supervised video world model (V-JEPA2) and a flow-matching DiT action head.
---
## Architecture Overview
VLA-JEPA has three main components:
| Component | Module | Role |
| ----------------------- | --------------------------------- | ------------------------------------------------------- |
| **Qwen3-VL backbone** | `Qwen3VLInterface` | Fuses images + language instruction into context tokens |
| **DiT-B action head** | `VLAJEPAActionHead` | Flow-matching diffusion over the action chunk |
| **V-JEPA2 world model** | `ActionConditionedVideoPredictor` | Self-supervised video prediction loss (training only) |
### Data flow
**Training:**
1. A video clip of `num_video_frames` frames is encoded by V-JEPA2 into per-frame patch tokens.
2. The Qwen3-VL backbone processes multi-view images + the task instruction and produces a sequence of context tokens that includes special action tokens (for world model conditioning) and embodied tokens.
3. The action head receives those context tokens as cross-attention keys/values and predicts a denoised action chunk via flow matching.
4. The world model predictor uses the action tokens extracted from Qwen to predict future V-JEPA2 frame embeddings; a regression loss on those predictions is added to the action loss.
**Inference:**
Only Qwen + the action head are used. The world model is not needed at inference time.
### Action head details
Available presets via `action_model_type`:
| Preset | Hidden dim | Heads | Head dim |
| ------- | ---------- | ----- | -------- |
| `DiT-B` | 768 | 12 | 64 |
| `DiT-L` | 1536 | 32 | 48 |
### World model details
The video predictor is a ViT-style transformer (`ActionConditionedVideoPredictor`) that takes:
- **Frame tokens**: V-JEPA2 patch embeddings projected to `predictor_embed_dim`
- **Action tokens**: Qwen action token embeddings projected to `predictor_embed_dim`
It uses block-causal attention so each temporal step can attend to all previous steps. The predictor's input `embed_dim` equals `num_views × video_encoder_hidden_size` (e.g. 2 views × 1024 = 2048 for the pretrained checkpoints).
---
## Pretrained Checkpoints
Three checkpoints are available directly inside the LeRobot org here: [`lerobot/VLA-JEPA`](https://huggingface.co/collections/lerobot/vla-jepa), converted from [ginwind/VLA-JEPA](https://huggingface.co/ginwind/VLA-JEPA):
| Checkpoint | Dataset | Cameras | World model | Action dim |
| ----------------------------- | ----------------- | ----------------------- | ----------- | ---------- |
| `lerobot/VLA-JEPA-LIBERO` | LIBERO-10 | 2 (agentview + wrist) | Enabled | 7 |
| `lerobot/VLA-JEPA-Pretrain` | DROID 1.0.1 | 2 (exterior left views) | Enabled | 7 |
| `lerobot/VLA-JEPA-SimplerEnv` | OXE Bridge / RT-1 | 1 (view duplicated ×2) | Enabled | 7 |
All checkpoints use `Qwen/Qwen3-VL-2B-Instruct` as the language backbone.
---
## Configuration
Key parameters in `VLAJEPAConfig`:
| Parameter | Default | Description |
| ------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `chunk_size` | 7 | Number of actions predicted per inference call |
| `n_action_steps` | 7 | Steps executed from the predicted chunk before re-planning |
| `num_video_frames` | 8 | Video clip length fed to the world model |
| `enable_world_model` | `True` | Whether to load and train the V-JEPA2 predictor |
| `world_model_loss_weight` | 0.1 | Weight of the JEPA prediction loss relative to the action loss |
| `num_inference_timesteps` | 4 | Euler integration steps for action denoising |
| `freeze_qwen` | `False` | Freeze the Qwen3-VL backbone and only train the action head |
| `reinit_modules` | `None` | Key prefixes allowed to be randomly re-initialised on load (for cross-embodiment transfer, see [Fine-tuning on a different embodiment](#fine-tuning-on-a-different-embodiment)) |
| `gripper_dim` | 6 | Index of the gripper dimension in the action vector (e.g. 6 for a 7-DoF arm with gripper as the last joint) |
| `gripper_threshold` | 0.5 | Threshold used by `pre_snap_gripper_action` and `binarize_gripper_action` to binarize the gripper dimension |
| `pre_snap_gripper_action` | `True` | Snap the gripper dim to {0, 1} before unnormalization. Set to `False` for robots without a binary gripper |
| `binarize_gripper_action` | `True` | Binarize the gripper dim to {-1, 1} after unnormalization. Set to `False` for robots without a binary gripper |
---
## Training
Number of training steps may vary based on dataset size and compute budget. The original paper pretrained for 50k on ssv2 + droid jointly, then additional 30k steps for LIBERO, but fewer steps may still yield good performance when fine-tuning from the provided pretrained checkpoints.
### Full training from scratch
```bash
lerobot-train \
policy.type=vla_jepa \
policy.repo_id=your_org/your_repo \
dataset.repo_id=your_org/your_dataset
```
### Fine-tuning from a pretrained checkpoint
```bash
lerobot-train \
--policy.path=lerobot/VLA-JEPA-Pretrain \
--policy.repo_id=your_org/your_repo \
--dataset.repo_id=your_org/your_dataset
```
If you want to freeze the Qwen backbone and only train the action head, set `policy.freeze_qwen=True`:
```bash
lerobot-train \
--policy.path=lerobot/VLA-JEPA-Pretrain \
--policy.repo_id=your_org/your_repo \
--policy.freeze_qwen=true \
--dataset.repo_id=your_org/your_dataset
```
### Fine-tuning on a different embodiment
When the target robot has a different action or state dimensionality than the pretrained checkpoint, the input/output projection layers of the action head will have mismatched shapes and cannot be loaded directly. `reinit_modules` lets you list the key prefixes that are allowed to mismatch — those layers are randomly re-initialised while every other weight is reused from the checkpoint. Any shape mismatch outside the listed prefixes raises an error.
The layers that depend on `action_dim` and `state_dim` are:
| Layer | Key prefix |
| ----------------------------------------- | ----------------------------------- |
| Action encoder (action_dim → inner_dim) | `model.action_model.action_encoder` |
| Action decoder (hidden_size → action_dim) | `model.action_model.action_decoder` |
| State encoder (state_dim → inner_dim) | `model.action_model.state_encoder` |
```bash
lerobot-train \
--policy.path=lerobot/VLA-JEPA-Pretrain \
--policy.repo_id=your_org/your_repo \
--policy.freeze_qwen=true \
--policy.reinit_modules='["model.action_model.action_encoder", "model.action_model.action_decoder", "model.action_model.state_encoder"]' \
--dataset.repo_id=your_org/your_dataset
```
If your robot has no proprioceptive state, omit `model.action_model.state_encoder` from the list.
### Reproducing the LIBERO results
**Training on LIBERO:**
starts the training from the Pretrain checkpoint, trains for 30k steps on the LIBERO dataset.
Original paper mentions training across 8 GPUs with a batch size of 32, meaning global batch size of 256.
```bash
lerobot-train \
--policy.path=lerobot/VLA-JEPA-Pretrain \
--policy.repo_id=your_org/your_repo \
--dataset.repo_id=HuggingFaceVLA/libero \
--steps=30000
```
**Evaluating the pretrained LIBERO-10 checkpoint:**
```bash
lerobot-eval \
--policy.path=lerobot/VLA-JEPA-LIBERO \
--env.type=libero \
--env.task=libero_spatial,libero_object,libero_goal,libero_10 \
--eval.n_episodes=10 \
--eval.batch_size=5
```
To evaluate a subset of tasks only:
```bash
lerobot-eval \
--policy.path=lerobot/VLA-JEPA-LIBERO \
--env.type=libero \
--env.task=libero_10 \
--env.task_ids='[0,1,2]' \
--eval.n_episodes=10 \
--eval.batch_size=5
```
**Expected results:**
| Suite | Episodes | Successes | Success Rate |
| -------------- | -------- | --------- | ------------ |
| libero_spatial | 100 | 93 | **95.0%** |
| libero_object | 100 | 100 | **100.0%** |
| libero_goal | 100 | 98 | **98.0%** |
| libero_10 | 100 | 96 | **93.0%** |
| **Overall** | **400** | **387** | **96.5%** |
---
## Fine-tuning on datasets with a different number of cameras
The pretrained world model predictor was trained with `embed_dim = jepa_tubelet_size × 1024` (default `jepa_tubelet_size=2`).
**Default behaviour — view padding / trimming (no action required)**
When fine-tuning from `VLA-JEPA-Pretrain` the model automatically adjusts the number of views fed to the world model to match `jepa_tubelet_size`:
- **Single-view datasets (e.g. BridgeV2):** the single-view latent is duplicated to produce a two-view world-model input, preserving the JEPA self-supervised signal without any weight mismatch.
- **>2-view datasets (e.g. DROID with 3 views):** all views are passed to the Qwen backbone (for richer context), but only the first `jepa_tubelet_size` views (one wrist + one third-person, following the configured view order) are used for the world model.
**Option 1 — Disable the world model**
Set `enable_world_model=False` to skip the JEPA loss entirely. Only the Qwen backbone and action head are loaded and trained. This is sufficient for good action performance.
```bash
lerobot-train \
--policy.path=lerobot/VLA-JEPA-Pretrain \
--policy.enable_world_model=false \
--policy.repo_id=your_org/your_repo \
--dataset.repo_id=your_org/single_camera_dataset
```
**Option 2 — Reinitialize the predictor input projection**
If you want to change `jepa_tubelet_size` to a value other than 2, load the checkpoint with `strict=False` and reinitialize `model.video_predictor.predictor_embed` for the new `embed_dim`. All other predictor block weights (attention, MLP, norm, output projection) are camera-count-agnostic and can be reused from the pretrained checkpoint.
---
## Citation
```bibtex
@misc{sun2026vlajepaenhancingvisionlanguageactionmodel,
title = {VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
author = {Jingwen Sun and Wenyao Zhang and Zekun Qi and Shaojie Ren and Zezhi Liu and Hanxin Zhu and Guangzhong Sun and Xin Jin and Zhibo Chen},
year = {2026},
eprint = {2602.10098},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2602.10098},
}
```
---
## License
Weights are distributed under the license terms of the original [ginwind/VLA-JEPA](https://huggingface.co/ginwind/VLA-JEPA) repository (**Apache 2.0 License**). The LeRobot integration code follows the **Apache 2.0 License**.
+109
View File
@@ -0,0 +1,109 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6-27B VLM).
Spawns one single-GPU ``h200`` job that:
1. installs ``lerobot`` from ``main`` plus the annotation extras,
2. boots one vllm server with Qwen3.6-27B (dense VLM),
3. runs the plan / interjections / vqa modules across the dataset
in free-form mode (each episode generates its own subtasks +
memory),
4. uploads the annotated dataset to ``--new_repo_id`` (when set)
or back to ``--repo_id``.
Usage:
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
Adjust ``CMD`` (dataset, model, hub repo) and ``flavor`` below for your
run. For larger datasets, scale to ``h200x4`` and raise
``--vlm.parallel_servers`` / ``--vlm.num_gpus`` to match.
"""
import os
from huggingface_hub import get_token, run_job
token = os.environ.get("HF_TOKEN") or get_token()
if not token:
raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
CMD = (
"apt-get update -qq && apt-get install -y -qq git ffmpeg && "
"pip install --no-deps "
"'lerobot @ git+https://github.com/huggingface/lerobot.git@main' && "
"pip install --upgrade-strategy only-if-needed "
"datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect "
"openai && "
"export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
"export VLLM_VIDEO_BACKEND=pyav && "
"lerobot-annotate "
"--repo_id=pepijn223/robocasa_pretrain_human300_v4 "
"--new_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated5 "
"--push_to_hub=true "
"--vlm.backend=openai "
"--vlm.model_id=Qwen/Qwen3.6-27B "
"--vlm.parallel_servers=1 "
"--vlm.num_gpus=1 "
'--vlm.serve_command="vllm serve Qwen/Qwen3.6-27B '
"--tensor-parallel-size 1 --max-model-len 32768 "
'--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
"--vlm.serve_ready_timeout_s=1800 "
"--vlm.client_concurrency=128 "
"--vlm.max_new_tokens=512 "
"--vlm.temperature=0.7 "
"--executor.episode_parallelism=16 "
"--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
"--vlm.camera_key=observation.images.robot0_agentview_right "
# Phase 1 — plan module (subtasks + memory).
# Embed decoded frames (not a file:// clip): if clip extraction fails,
# the video_url path silently sends no video and the VLM hallucinates.
"--plan.use_video_url=false "
"--plan.frames_per_second=1.0 "
# 32 frames ≈ 8-10k vision tokens, fits the 32768 context. Don't push
# toward 128 — that overflows the context (BadRequestError 400).
"--plan.max_video_frames=32 "
# Window long episodes into 32s chunks (constant 1 fps density) so they
# get more subtasks; per-window spans are merged + stitched. 0 disables.
"--plan.subtask_window_seconds=32 "
# RoboCasa: the dataset task string is authoritative (eval uses it), so
# keep it driving subtasks. ``always`` would throw it away and hallucinate.
"--plan.derive_task_from_video=off "
# No task augmentation: eval conditions on the exact task strings, so
# rephrasings are unused at best and harmful when they drift.
"--plan.n_task_rephrasings=0 "
# Keep subtask decomposition tight for atomic tasks.
"--plan.plan_max_steps=10 "
# Only subtasks + memory — skip the numbered "plan" rows. true re-enables.
"--plan.emit_plan=false "
# The describe->segment grounding pass (+1 VLM call/episode) is ON by
# default; pass --plan.subtask_describe_first=false to skip it.
# Phase 2 — interjections + speech.
"--interjections.max_interjections_per_episode=6 "
# Phase 4 — general VQA: disabled for this run.
"--vqa.enabled=false"
)
job = run_job(
image="vllm/vllm-openai:latest",
command=["bash", "-c", CMD],
flavor="h200",
secrets={"HF_TOKEN": token},
timeout="2h",
)
print(f"Job URL: {job.url}")
print(f"Job ID: {job.id}")
+37 -15
View File
@@ -15,10 +15,12 @@
# limitations under the License.
"""
Create MP4 (or GIF) videos with sarm_progress overlay for specified episodes.
Create MP4 (or GIF) videos with per-frame progress overlay for specified episodes.
Downloads datasets from HuggingFace, seeks directly into the episode segment
of the source video, draws a progress line on each frame, and writes the result.
The progress data is read from a parquet file that lives alongside the dataset
(configurable via ``--progress-file``).
Usage:
python examples/dataset/create_progress_videos.py \
@@ -56,22 +58,26 @@ SCORE_FONT_SCALE = 0.8
TASK_FONT_SCALE = 0.55
def download_episode_metadata(repo_id: str, episode: int) -> Path:
"""Download only the metadata and sarm_progress files for a dataset.
def download_episode_metadata(
repo_id: str, episode: int, progress_file: str = "sarm_progress.parquet"
) -> Path:
"""Download only the metadata and per-frame progress file for a dataset.
Args:
repo_id: HuggingFace dataset repository ID.
episode: Episode index (used for logging only; all meta is fetched).
progress_file: Filename of the per-frame progress parquet inside the
dataset repo.
Returns:
Local cache path for the downloaded snapshot.
"""
logging.info("[1/4] Downloading metadata for %s (episode %d) ...", repo_id, episode)
logging.info("[1/4] Downloading metadata + %s for %s (episode %d) ...", progress_file, repo_id, episode)
local_path = Path(
snapshot_download(
repo_id=repo_id,
repo_type="dataset",
allow_patterns=["meta/**", "sarm_progress.parquet"],
allow_patterns=["meta/**", progress_file],
ignore_patterns=["*.mp4"],
)
)
@@ -215,25 +221,28 @@ def download_video_file(repo_id: str, local_path: Path, video_rel: str) -> Path:
return video_path
def load_progress_data(local_path: Path, episode: int) -> np.ndarray | None:
"""Load sarm_progress values for an episode.
def load_progress_data(
local_path: Path, episode: int, progress_file: str = "sarm_progress.parquet"
) -> np.ndarray | None:
"""Load per-frame progress values for an episode.
Args:
local_path: Dataset cache root.
episode: Episode index.
progress_file: Filename of the per-frame progress parquet.
Returns:
Sorted (N, 2) array of (frame_index, progress), or None if unavailable.
"""
parquet_path = local_path / "sarm_progress.parquet"
parquet_path = local_path / progress_file
if not parquet_path.exists():
logging.warning("sarm_progress.parquet not found")
logging.warning("%s not found", progress_file)
return None
df = pd.read_parquet(parquet_path)
logging.info(" sarm_progress.parquet columns: %s", list(df.columns))
logging.info(" %s columns: %s", progress_file, list(df.columns))
episode_df = df[df["episode_index"] == episode].copy()
if episode_df.empty:
logging.warning("No sarm_progress rows for episode %d", episode)
logging.warning("No progress rows for episode %d in %s", episode, progress_file)
return None
episode_df = episode_df.sort_values("frame_index")
@@ -576,6 +585,7 @@ def process_dataset(
camera_key: str | None,
output_dir: Path,
create_gif: bool = False,
progress_file: str = "sarm_progress.parquet",
) -> Path | None:
"""Full pipeline: download, extract metadata, composite progress, write output.
@@ -585,6 +595,8 @@ def process_dataset(
camera_key: Camera key to use, or None for auto-selection.
output_dir: Directory to write output files.
create_gif: If True, also generate a GIF from the MP4.
progress_file: Filename of the per-frame progress parquet inside the
dataset repo.
Returns:
Path to the final output file, or None on failure.
@@ -592,7 +604,7 @@ def process_dataset(
safe_name = repo_id.replace("/", "_")
logging.info("Processing: %s | episode %d", repo_id, episode)
local_path = download_episode_metadata(repo_id, episode)
local_path = download_episode_metadata(repo_id, episode, progress_file)
logging.info(" Local cache: %s", local_path)
episode_meta = load_episode_meta(local_path, episode, camera_key)
@@ -600,9 +612,9 @@ def process_dataset(
video_path = download_video_file(repo_id, local_path, episode_meta["video_rel"])
progress_data = load_progress_data(local_path, episode)
progress_data = load_progress_data(local_path, episode, progress_file)
if progress_data is None:
logging.error("Could not load sarm_progress data. Skipping overlay.")
logging.error("Could not load progress data from %s. Skipping overlay.", progress_file)
return None
logging.info(" Progress frames: %d", len(progress_data))
@@ -627,7 +639,7 @@ def process_dataset(
def main() -> None:
parser = argparse.ArgumentParser(
description="Create MP4/GIF videos with sarm_progress overlay for dataset episodes."
description="Create MP4/GIF videos with per-frame progress overlay for dataset episodes."
)
parser.add_argument(
"--repo-id",
@@ -658,6 +670,15 @@ def main() -> None:
action="store_true",
help="Also generate a GIF from the MP4 output.",
)
parser.add_argument(
"--progress-file",
type=str,
default="sarm_progress.parquet",
help=(
"Filename of the per-frame progress parquet inside the dataset repo "
"(default: 'sarm_progress.parquet')."
),
)
args = parser.parse_args()
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
@@ -670,6 +691,7 @@ def main() -> None:
camera_key=args.camera_key,
output_dir=args.output_dir,
create_gif=args.gif,
progress_file=args.progress_file,
)
if result:
+49 -4
View File
@@ -85,6 +85,11 @@ dependencies = [
"termcolor>=2.4.0,<4.0.0",
"tqdm>=4.66.0,<5.0.0",
# Training utilities
# EMA of policy parameters (Diffusion Policy / pi05 style). Tiny
# pure-python dependency — preferred over a hand-rolled implementation.
"ema-pytorch>=0.7.7,<1.0.0",
# Build tools (required by opencv-python-headless on some platforms)
"cmake>=3.29.0.1,<4.2.0",
"setuptools>=71.0.0,<81.0.0",
@@ -138,8 +143,11 @@ dataset_viz = ["lerobot[dataset]", "lerobot[viz]"]
# Common
av-dep = ["av>=15.0.0,<16.0.0"]
pygame-dep = ["pygame>=2.5.1,<2.7.0"]
placo-dep = ["placo>=0.9.6,<0.9.17"]
# NOTE: 0.9.16 links against liburdfdom_sensor.so.4, which is unavailable on Ubuntu 24.04
# (noble ships urdfdom 3.x). Cap below 0.9.16 until system urdfdom 4.x is broadly available.
placo-dep = ["placo>=0.9.6,<0.9.16"]
transformers-dep = ["transformers>=5.4.0,<5.6.0"]
sentencepiece-dep = ["sentencepiece>=0.2.0,<0.3.0"] # FAST action tokenizer backend (pi052, pi0_fast)
grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
can-dep = ["python-can>=4.2.0,<5.0.0"]
peft-dep = ["peft>=0.18.0,<1.0.0"]
@@ -195,7 +203,8 @@ wallx = [
"torchdiffeq>=0.2.4,<0.3.0",
"lerobot[qwen-vl-utils-dep]",
]
pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]", "lerobot[sentencepiece-dep]"]
molmoact2 = ["lerobot[transformers-dep]", "lerobot[peft-dep]", "lerobot[scipy-dep]"]
smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0"]
multi_task_dit = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]"]
groot = [
@@ -209,14 +218,40 @@ groot = [
"flash-attn>=2.5.9,<3.0.0 ; sys_platform != 'darwin'"
]
sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
robometer = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]", "lerobot[peft-dep]"]
topreward = ["lerobot[transformers-dep]"]
xvla = ["lerobot[transformers-dep]"]
eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
vla_jepa = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]", "lerobot[qwen-vl-utils-dep]"]
# Features
async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]
# Annotation pipeline (lerobot-annotate). The only backend is ``openai``,
# which talks to any OpenAI-compatible server (``vllm serve`` /
# ``transformers serve`` / hosted). Distributed runs use Hugging Face Jobs
# (see examples/annotations/run_hf_job.py).
annotations = [
"lerobot[dataset]",
"lerobot[transformers-dep]",
"openai>=1.40,<2.0",
# ``vllm`` is intentionally NOT a hard dep: it pins an older torch, and
# uv's single unified lock would then cap ``torch`` for every extra
# (e.g. forcing 2.8 while ``torchcodec`` in [dataset] needs 2.11 -> ABI
# break in CI). The HF Jobs image (``vllm/vllm-openai``) provides vLLM;
# install it locally only if you run your own ``vllm serve``.
]
# Tool implementations under src/lerobot/tools/. Each tool's dependencies
# are isolated so adding a new tool doesn't bloat the base install.
# Currently only `say` (Kyutai pocket-tts; CPU-only, ~100M params).
tools = [
"pocket-tts>=1.0.0,<3.0.0",
"scipy>=1.11.0,<2.0.0", # SayTool.output_dir uses scipy.io.wavfile
]
# Development
dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
@@ -272,10 +307,12 @@ all = [
"lerobot[multi_task_dit]",
"lerobot[wallx]",
"lerobot[pi]",
"lerobot[molmoact2]",
"lerobot[smolvla]",
# "lerobot[groot]", TODO(Steven): Gr00t requires specific installation instructions for flash-attn
"lerobot[xvla]",
"lerobot[hilserl]",
"lerobot[vla_jepa]",
"lerobot[async]",
"lerobot[dev]",
"lerobot[test]",
@@ -286,6 +323,8 @@ all = [
"lerobot[libero]; sys_platform == 'linux'",
"lerobot[metaworld]",
"lerobot[sarm]",
"lerobot[robometer]",
"lerobot[topreward]",
"lerobot[peft]",
# "lerobot[unitree_g1]", TODO: Unitree requires specific installation instructions for unitree_sdk2
]
@@ -307,7 +346,10 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
lerobot-rollout="lerobot.scripts.lerobot_rollout:main"
# Interactive hierarchical-VLA runtime for PI052 (PaliGemma backbone).
lerobot-pi052-runtime="lerobot.scripts.lerobot_pi052_runtime:main"
# ---------------- Tool Configurations ----------------
@@ -325,7 +367,7 @@ torch = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
torchvision = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
[tool.setuptools.package-data]
lerobot = ["envs/*.json"]
lerobot = ["envs/*.json", "annotations/steerable_pipeline/prompts/*.txt"]
[tool.setuptools.packages.find]
where = ["src"]
@@ -401,8 +443,11 @@ default.extend-ignore-identifiers-re = [
"ein",
"thw",
"inpt",
"arange",
"is_compileable",
"ROBOTIS",
"OT_VALUE"
"OT_VALUE",
"VanderBilt"
]
# TODO: Uncomment when ready to use
+15
View File
@@ -0,0 +1,15 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,36 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Steerable annotation pipeline producing ``language_persistent`` and
``language_events`` columns for LeRobot datasets.
The pipeline is decomposed into three independently runnable modules whose
outputs are staged per-episode before a final parquet rewrite:
- :mod:`.modules.plan_subtasks_memory` (the ``plan`` module) — persistent styles
- :mod:`.modules.interjections_and_speech` (the ``interjections`` module) — event styles + speech
- :mod:`.modules.general_vqa` (the ``vqa`` module) — event-style VQA pairs
"""
from .config import AnnotationPipelineConfig
from .validator import StagingValidator, ValidationReport
from .writer import LanguageColumnsWriter
__all__ = [
"AnnotationPipelineConfig",
"LanguageColumnsWriter",
"StagingValidator",
"ValidationReport",
]
@@ -0,0 +1,196 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@dataclass
class PlanConfig:
"""``plan`` module: subtasks + plan + memory + task augmentation."""
enabled: bool = True
# ``task_aug`` rephrasings at t=0 (renderer rotates ${task} among them); 0 disables.
n_task_rephrasings: int = 10
# Derive the task from video instead of episode_task: off / if_short / always.
# Affects prompts only; ``meta/tasks.parquet`` is untouched.
derive_task_from_video: str = "if_short"
derive_task_min_words: int = 3
# Frames sampled uniformly, capped at max_video_frames — a hard context cap
# (~300 tokens/frame, so 32 fit a 32k VLM; 128 overflow).
frames_per_second: float = 1.0
max_video_frames: int = 32
# >0: split long episodes into windows of this length (constant fps density)
# instead of subsampling the whole episode; spans merged + stitched. 0 disables.
subtask_window_seconds: float = 0.0
min_subtask_seconds: float = 1.5
plan_max_steps: int = 8
# Narrate-only grounding pass before segmenting — best defense against subtasks
# invented from the task text (+1 VLM call/episode).
subtask_describe_first: bool = True
# Emit ``style="plan"`` rows at each boundary; False = subtasks + memory only.
emit_plan: bool = True
# (subtask spans are always stitched to a contiguous full-episode cover; not configurable.)
# Send a server-side ``video_url`` clip (at use_video_url_fps) instead of embedded frames.
use_video_url: bool = False
use_video_url_fps: float = 1.0
# Optional EgoMimic-style 5-axis task augmentation; replaces n_task_rephrasings.
task_aug_axes: TaskAugAxesConfig = field(default_factory=lambda: TaskAugAxesConfig())
@dataclass
class TaskAugAxesConfig:
"""5-axis t=0 task augmentation (EgoMimic-style): synonym / omit_arm /
omit_orientation / omit_grasp_method / combined. Replaces n_task_rephrasings
when enabled; each variant becomes a ``task_aug`` row. Axes with nothing to
omit emit fewer entries. Defaults (3+3+2+2+2) match EgoMimic."""
enabled: bool = False
synonym_paraphrase: int = 3
omit_arm: int = 3
omit_orientation: int = 2
omit_grasp_method: int = 2
combined_omissions: int = 2
@dataclass
class InterjectionsConfig:
"""``interjections`` module: interjections + paired speech."""
enabled: bool = True
# Each emits a paired (interjection, speech) row + a plan refresh at that ts.
max_interjections_per_episode: int = 3
interjection_min_t: float = 2.0
# Frame window centered on the timestamp so the VLM sees motion, not one frame.
interjection_window_seconds: float = 2.0
interjection_window_frames: int = 4
@dataclass
class VqaConfig:
"""``vqa`` module: general VQA."""
enabled: bool = True
vqa_emission_hz: float = 1.0
K: int = 1
"""Consecutive frames per emission tick. The VLM grounds on the FIRST frame,
so K>1 smears stale labels onto moved frames. Default 1 (no smear)."""
question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
# True: ground VQA only on --vlm.camera_key (default: every camera).
restrict_to_default_camera: bool = False
@dataclass
class VlmConfig:
"""Shared Qwen-VL client configuration."""
# Only ``openai`` (OpenAI-compatible vLLM server, auto-spawned when
# auto_serve=True); ``stub`` is for tests.
backend: str = "openai"
model_id: str = "Qwen/Qwen3.6-27B"
# OpenAI-compatible endpoint; ``EMPTY`` key works for local servers.
api_base: str = "http://localhost:8000/v1"
api_key: str = "EMPTY"
# Spawn a server if none answers api_base; False = fail fast on a remote.
auto_serve: bool = True
serve_port: int = 8000
# Override the auto-serve command; ``{port}`` substituted per replica.
serve_command: str | None = None
# Independent servers for round-robin routing (one per GPU). num_gpus=0 = one each.
parallel_servers: int = 1
num_gpus: int = 0
client_concurrency: int = 16
serve_ready_timeout_s: float = 600.0
max_new_tokens: int = 512
temperature: float = 0.2
# Auto-serve context length (None → 32768); other vLLM flags go in serve_command.
max_model_len: int | None = None
# Camera for keyframes; None → first ``observation.images.*`` key.
camera_key: str | None = None
# Forwarded as extra_body.chat_template_kwargs (e.g. {"enable_thinking": false}).
chat_template_kwargs: dict[str, Any] | None = None
@dataclass
class ExecutorConfig:
"""Executor settings (intra-process episode concurrency; distribution via HF Jobs)."""
# Episodes processed concurrently per phase; main knob for saturating the servers.
episode_parallelism: int = 16
@dataclass
class AnnotationPipelineConfig:
"""Top-level config for ``lerobot-annotate`` (rewrites data shards in place)."""
# Hub dataset: download source when ``root`` unset; push target when push_to_hub
# is on and ``new_repo_id`` unset.
repo_id: str | None = None
# Separate push target (matches the LeRobot edit tools). Unset → push in place.
new_repo_id: str | None = None
root: Path | None = None
# Defaults to ``<root>/.annotate_staging/``.
staging_dir: Path | None = None
seed: int = 1729
plan: PlanConfig = field(default_factory=PlanConfig)
interjections: InterjectionsConfig = field(default_factory=InterjectionsConfig)
vqa: VqaConfig = field(default_factory=VqaConfig)
vlm: VlmConfig = field(default_factory=VlmConfig)
executor: ExecutorConfig = field(default_factory=ExecutorConfig)
skip_validation: bool = False
only_episodes: tuple[int, ...] | None = None
# Keyframe decode backend. None → ffmpeg CLI (crash-/thread-safe; torchcodec
# SIGSEGVs under concurrent decode). Or ``"torchcodec"`` / ``"pyav"``.
video_backend: str | None = None
# Upload to the Hub (new_repo_id if set, else repo_id; one must be set).
push_to_hub: bool = False
push_private: bool = False
push_commit_message: str | None = None
def resolved_staging_dir(self, root: Path) -> Path:
return self.staging_dir if self.staging_dir is not None else root / ".annotate_staging"
@@ -0,0 +1,253 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""In-process executor that runs the annotation phases.
The executor runs **six phases** in dependency order:
phase 1: ``plan`` module (plan + subtasks + memory)
phase 2: ``interjections`` module (interjections + speech)
phase 3: ``plan`` plan-update pass — re-runs plan emission at every
interjection timestamp produced by phase 2
phase 4: ``vqa`` module (VQA)
phase 5: validator
phase 6: writer
Phase 3 is why the ``plan`` module must be re-entered after the
``interjections`` module — to refresh ``plan`` rows at interjection
timestamps.
Distributed execution is provided by Hugging Face Jobs (see
``examples/annotations/run_hf_job.py``); the runner inside the job
invokes ``lerobot-annotate`` which uses this in-process executor.
Episode-level concurrency is controlled by
``ExecutorConfig.episode_parallelism``.
"""
from __future__ import annotations
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path
from typing import Any
from .config import AnnotationPipelineConfig
from .reader import EpisodeRecord, iter_episodes
from .staging import EpisodeStaging
from .validator import StagingValidator
from .writer import LanguageColumnsWriter
logger = logging.getLogger(__name__)
@dataclass
class PhaseResult:
"""Summary of one pipeline phase across all episodes."""
name: str
episodes_processed: int
episodes_skipped: int
@dataclass
class PipelineRunSummary:
"""Aggregated result returned by :meth:`Executor.run`."""
phases: list[PhaseResult]
written_paths: list[Path]
validation_report: Any # ValidationReport, kept Any to avoid import cycle
@dataclass
class Executor:
"""Run all six phases over a dataset root in-process.
Episode-level concurrency comes from ``ExecutorConfig.episode_parallelism``
(a thread pool); cluster-level concurrency comes from running this
executor inside a Hugging Face Job. Tests construct the executor
directly with stub modules.
"""
config: AnnotationPipelineConfig
plan: Any # PlanSubtasksMemoryModule
interjections: Any # InterjectionsAndSpeechModule
vqa: Any # GeneralVqaModule
writer: LanguageColumnsWriter
validator: StagingValidator
def run(self, root: Path) -> PipelineRunSummary:
records = list(iter_episodes(root, only_episodes=self.config.only_episodes))
n = len(records)
if n == 0:
raise ValueError(f"No episodes found under {root}/data/")
print(f"[annotate] {n} episodes total", flush=True)
staging_dir = self.config.resolved_staging_dir(root)
staging_dir.mkdir(parents=True, exist_ok=True)
phases: list[PhaseResult] = []
# Phase 1: ``plan`` module (plan + subtasks + memory)
phases.append(self._run_module_phase("plan", records, staging_dir, self.plan))
# Phase 2: ``interjections`` module (interjections + speech). It
# reads the ``plan`` module's subtask rows from the same staging
# tree to ground the interjection prompt in the correct local subtask.
phases.append(self._run_module_phase("interjections", records, staging_dir, self.interjections))
# Phase 3: ``plan`` plan-update pass at interjection timestamps.
phases.append(self._run_plan_update_phase(records, staging_dir))
# Phase 4: ``vqa`` module (VQA)
phases.append(self._run_module_phase("vqa", records, staging_dir, self.vqa))
print("[annotate] running validator...", flush=True)
report = self.validator.validate(records, staging_dir)
if not report.ok and not self.config.skip_validation:
raise RuntimeError(f"Staging validation failed: {report.summary()}")
print(f"[annotate] validator: {report.summary()}", flush=True)
print(f"[annotate] writing parquet shards into {root}/data/...", flush=True)
written = self.writer.write_all(records, staging_dir, root)
print(f"[annotate] wrote {len(written)} shard(s); pipeline complete", flush=True)
# Keep meta/info.json aligned with the parquet schema we just wrote.
# Idempotent and additive: existing user metadata is preserved.
self._ensure_annotation_metadata_in_info(root)
return PipelineRunSummary(phases=phases, written_paths=written, validation_report=report)
@staticmethod
def _ensure_annotation_metadata_in_info(root: Path) -> None:
"""Write language features and canonical tools to ``meta/info.json``.
``LanguageColumnsWriter`` adds ``language_persistent`` and
``language_events`` to parquet shards. The metadata must advertise
those columns too, otherwise non-streaming ``LeRobotDataset`` loads
cast against the old schema and fail on the extra parquet columns.
"""
from lerobot.datasets.io_utils import load_info, write_info # noqa: PLC0415
from lerobot.datasets.language import SAY_TOOL_SCHEMA, language_feature_info # noqa: PLC0415
info_path = root / "meta" / "info.json"
if not info_path.exists():
return
try:
info = load_info(root)
except Exception as exc: # noqa: BLE001
print(f"[annotate] could not read {info_path}: {exc}", flush=True)
return
changed = False
merged_features = {**info.features, **language_feature_info()}
if merged_features != info.features:
info.features = merged_features
changed = True
existing = info.tools or []
names = {(t.get("function") or {}).get("name") for t in existing if isinstance(t, dict)}
if SAY_TOOL_SCHEMA["function"]["name"] not in names:
info.tools = [*existing, SAY_TOOL_SCHEMA]
changed = True
if changed:
write_info(info, root)
print(
"[annotate] meta/info.json: "
f"language_features={list(language_feature_info())}, "
f"tools={[t['function']['name'] for t in (info.tools or [])]}",
flush=True,
)
def _run_module_phase(
self,
name: str,
records: list[EpisodeRecord],
staging_dir: Path,
module: Any,
) -> PhaseResult:
if not module.enabled:
print(f"[annotate] phase={name} skipped (module disabled)", flush=True)
return PhaseResult(name=name, episodes_processed=0, episodes_skipped=len(records))
n = len(records)
parallelism = max(1, min(self.config.executor.episode_parallelism, n))
print(
f"[annotate] phase={name} starting on {n} episode(s) (parallelism={parallelism})",
flush=True,
)
t0 = time.time()
def _do(idx_record: tuple[int, EpisodeRecord]) -> tuple[int, int, float]:
i, record = idx_record
ep_start = time.time()
staging = EpisodeStaging(staging_dir, record.episode_index)
module.run_episode(record, staging)
return i, record.episode_index, time.time() - ep_start
processed = 0
if parallelism == 1:
for i, record in enumerate(records, 1):
_, ep_idx, elapsed = _do((i, record))
processed += 1
print(
f"[annotate] {name} episode {i}/{n} (idx={ep_idx}) done in {elapsed:.1f}s",
flush=True,
)
else:
with ThreadPoolExecutor(max_workers=parallelism) as pool:
futures = [pool.submit(_do, (i, r)) for i, r in enumerate(records, 1)]
for fut in as_completed(futures):
i, ep_idx, elapsed = fut.result()
processed += 1
print(
f"[annotate] {name} episode {processed}/{n} "
f"(idx={ep_idx}, submit_order={i}) done in {elapsed:.1f}s",
flush=True,
)
total = time.time() - t0
print(f"[annotate] phase={name} complete: {processed}/{n} in {total:.1f}s", flush=True)
return PhaseResult(name=name, episodes_processed=processed, episodes_skipped=0)
def _run_plan_update_phase( # noqa: PLR0915
self, records: list[EpisodeRecord], staging_dir: Path
) -> PhaseResult:
"""Re-emit ``plan`` rows at each timestamp the ``interjections`` module produced.
The ``plan`` module owns the prompt; the ``interjections`` module
produced the timestamps. This phase therefore calls back into the
``plan`` module with the interjection timestamps so its existing
prompt path is reused.
"""
if not self.plan.enabled or not self.interjections.enabled:
return PhaseResult(name="plan_update", episodes_processed=0, episodes_skipped=len(records))
processed = 0
for record in records:
staging = EpisodeStaging(staging_dir, record.episode_index)
interjection_rows = [
row for row in staging.read("interjections") if row.get("style") == "interjection"
]
interjection_times = [float(row["timestamp"]) for row in interjection_rows]
interjection_texts = [str(row.get("content") or "") for row in interjection_rows]
if interjection_times:
self.plan.run_plan_updates(record, staging, interjection_times, interjection_texts)
processed += 1
# Episodes without any interjections are skipped (no plan refresh
# needed); count them so the summary's processed+skipped == total.
return PhaseResult(
name="plan_update",
episodes_processed=processed,
episodes_skipped=len(records) - processed,
)
@@ -0,0 +1,498 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keyframe extraction for the annotation pipeline.
Modules attach decoded camera frames to their VLM prompts so the model can
ground subtask decomposition, interjection scenarios, and VQA in actual
visual content. The pipeline shares one provider across modules and one
episode at a time, with a small per-episode cache so multiple modules
querying the same timestamp pay decode cost once.
"""
from __future__ import annotations
import logging
import threading
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Protocol
import PIL.Image
import torch
from lerobot.datasets.video_utils import decode_video_frames
from .reader import EpisodeRecord
logger = logging.getLogger(__name__)
class FrameProvider(Protocol):
"""Decodes camera frames at episode-relative timestamps."""
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` feature keys this provider can decode."""
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
"""Return one decoded frame per timestamp from ``camera_key`` (or default).
Frames are ``torch.Tensor`` (``C, H, W`` uint8) — the shape
:func:`lerobot.datasets.video_utils.decode_video_frames` returns.
:func:`to_image_blocks` converts them to PIL only at the VLM-message
boundary.
Empty list if the camera is unavailable. ``camera_key=None`` falls back
to the provider's default camera so existing single-camera callers
(the ``plan`` and ``interjections`` modules) keep working unchanged.
"""
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` decoded frames covering the whole episode.
Sampling is uniform across the episode duration. Frames are
``torch.Tensor`` (``C, H, W`` uint8); :func:`to_video_block` wraps
them into one ``{"type":"video", "video":<list>}`` block for a
Qwen-VL-compatible model that pools temporally itself. Empty list if
no camera available.
"""
@dataclass
class _NullProvider:
"""No-op provider used when the dataset has no video keys or in tests."""
@property
def camera_keys(self) -> list[str]:
return []
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
return []
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
return []
def null_provider() -> FrameProvider:
return _NullProvider()
@dataclass
class VideoFrameProvider:
"""Decodes frames from the dataset's ``observation.images.*`` streams.
By default the *first* camera key is used for the ``plan`` module
(subtask decomposition) and the ``interjections`` module (interjection
scenarios) — those prompts care about *what is happening*, not which
angle. The ``vqa`` module instead iterates over every camera in
:attr:`camera_keys` so each frame's
grounded answer (bbox/keypoint/...) is tagged with the camera it was
grounded against.
``camera_key`` overrides the default-camera choice but does not restrict
:attr:`camera_keys`. Pass ``camera_key`` explicitly to ``frames_at`` /
``video_for_episode`` to read a non-default stream.
Caches up to ``cache_size`` decoded frames per process to keep
co-timestamped ``interjections`` + ``plan`` plan-update calls cheap.
"""
root: Path
camera_key: str | None = None
tolerance_s: float = 1e-2
cache_size: int = 256
# Keyframe decode backend. ``None`` uses the ffmpeg CLI — the
# concurrency- and crash-safe default for the pipeline's threaded
# decode. Set to ``"torchcodec"`` or ``"pyav"`` to pin an in-process
# decoder when the build is known thread-safe.
video_backend: str | None = None
_meta: Any = field(default=None, init=False, repr=False)
_cache: dict = field(default_factory=dict, init=False, repr=False)
_camera_keys: list[str] = field(default_factory=list, init=False, repr=False)
# Pipeline runs the three module phases under a ThreadPoolExecutor (see
# ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
# one-shot warn flag against concurrent updates from worker threads.
_lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
_warned_decode_fail: bool = field(default=False, init=False, repr=False)
def __post_init__(self) -> None:
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata # noqa: PLC0415
self._meta = LeRobotDatasetMetadata(repo_id="local", root=self.root)
# Only ``video_keys`` are decodable here: the clip/decode paths read
# ``videos/<key>/from_timestamp`` from episode metadata, which exists
# only for video-stored cameras. Image-stored cameras (also in
# ``camera_keys``) would KeyError, so restrict the list — and the
# default — to video keys.
keys = list(self._meta.video_keys)
# Last-resort fallback: if metadata didn't surface any video keys but
# the caller explicitly named a camera (``--vlm.camera_key=...``),
# trust them — the key is by definition known to exist on the dataset.
if not keys and self.camera_key:
keys = [self.camera_key]
self._camera_keys = keys
if self.camera_key is None:
self.camera_key = keys[0] if keys else None
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` keys available on this dataset."""
return list(self._camera_keys)
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
target = camera_key if camera_key is not None else self.camera_key
if not timestamps or target is None:
return []
out: list[Any] = []
misses: list[float] = []
miss_indices: list[int] = []
with self._lock:
for i, ts in enumerate(timestamps):
key = (record.episode_index, target, round(float(ts), 6))
cached = self._cache.get(key)
if cached is not None:
out.append(cached)
else:
out.append(None)
misses.append(float(ts))
miss_indices.append(i)
if misses:
decoded = self._decode(record.episode_index, misses, target)
# ``_decode`` returns exactly one frame per requested timestamp,
# or an empty list if decoding failed wholesale. A partial list
# would mean a frame/timestamp misalignment, so only pair them up
# when the counts match (``strict=True`` then guards regressions).
if len(decoded) == len(miss_indices):
with self._lock:
for i, frame in zip(miss_indices, decoded, strict=True):
out[i] = frame
key = (record.episode_index, target, round(float(timestamps[i]), 6))
if len(self._cache) >= self.cache_size:
self._cache.pop(next(iter(self._cache)))
self._cache[key] = frame
# filter out any None left over from decode failures
return [frame for frame in out if frame is not None]
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` frames uniformly sampled across the episode.
The whole episode duration is covered; the model picks subtask
boundaries from the temporal pooling it does internally. Frames are
``torch.Tensor`` (see :meth:`frames_at`).
"""
target = camera_key if camera_key is not None else self.camera_key
if max_frames <= 0 or target is None or not record.frame_timestamps:
return []
n_frames = min(max_frames, len(record.frame_timestamps))
if n_frames == len(record.frame_timestamps):
timestamps = list(record.frame_timestamps)
else:
t0 = record.frame_timestamps[0]
t_last = record.frame_timestamps[-1]
if t_last <= t0:
timestamps = [float(t0)] * n_frames
else:
step = (t_last - t0) / (n_frames - 1) if n_frames > 1 else 0.0
timestamps = [float(t0 + i * step) for i in range(n_frames)]
return self.frames_at(record, timestamps, camera_key=target)
def episode_clip_path(self, record: EpisodeRecord, cache_dir: Path) -> Path | None:
"""Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.
Returns ``None`` if the dataset has no video tracks. Skips
re-extract when the cached clip already exists. Re-encodes to
H.264 (libx264) so the resulting mp4 is decodable by every
downstream video processor — stream-copy would inherit the
source codec (often AV1 in modern LeRobot datasets), which
vllm's libav build cannot decode.
"""
import subprocess # noqa: PLC0415
if self.camera_key is None:
return None
cache_dir.mkdir(parents=True, exist_ok=True)
out_path = cache_dir / f"ep_{record.episode_index:06d}.mp4"
if out_path.exists() and out_path.stat().st_size > 0:
return out_path
ep = self._meta.episodes[record.episode_index]
from_timestamp = float(ep[f"videos/{self.camera_key}/from_timestamp"])
to_timestamp = float(ep[f"videos/{self.camera_key}/to_timestamp"])
src = self.root / self._meta.get_video_file_path(record.episode_index, self.camera_key)
cmd = [
"ffmpeg",
"-y",
"-loglevel",
"error",
"-ss",
f"{from_timestamp:.3f}",
"-to",
f"{to_timestamp:.3f}",
"-i",
str(src),
"-c:v",
"libx264",
"-preset",
"ultrafast",
"-crf",
"23",
"-pix_fmt",
"yuv420p",
"-an",
str(out_path),
]
try:
# ffmpeg is invoked by name via PATH lookup (the standard way to
# call the CLI); the arg list is fully controlled here, not shell.
subprocess.run(cmd, check=True, timeout=300) # nosec B607
except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
return None
return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
def _decode(self, episode_index: int, timestamps: list[float], camera_key: str) -> list[Any]:
"""Decode ``timestamps`` from the episode's video as ``(C, H, W)`` tensors.
Delegates to :func:`lerobot.datasets.video_utils.decode_video_frames`
(torchcodec by default, PyAV fallback) rather than a bespoke decoder.
Returns one frame per requested timestamp, or ``[]`` if decoding
failed wholesale — callers treat ``[]`` as "no frames available".
"""
ep = self._meta.episodes[episode_index]
from_timestamp = ep[f"videos/{camera_key}/from_timestamp"]
shifted = [from_timestamp + ts for ts in timestamps]
video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)
# Default to the ffmpeg CLI. The pipeline decodes under a 16-wide
# ThreadPoolExecutor and the in-process decoders are unsafe there:
# torchcodec is not thread-safe and SIGSEGVs under concurrent decode
# (a crash no try/except can catch), PyAV can likewise segfault on
# AV1, and lerobot's ``pyav`` backend routes through the removed
# ``torchvision.io.VideoReader``. ``_decode_frames_ffmpeg`` shells
# out per frame: each decode is an isolated child process, so it is
# both crash-safe and concurrency-safe. ``video_backend`` can pin
# ``torchcodec`` / ``pyav`` explicitly for callers that know their
# build is safe.
chain = [self.video_backend] if self.video_backend else ["ffmpeg"]
exc: Exception | None = None
for backend in chain:
try:
if backend == "ffmpeg":
return _decode_frames_ffmpeg(video_path, shifted)
if backend in ("pyav", "av"):
return _decode_frames_av(video_path, shifted)
# Stacked ``(N, C, H, W)`` uint8 tensor; one row per timestamp.
decoded = decode_video_frames(
video_path, shifted, self.tolerance_s, backend=backend, return_uint8=True
)
return list(decoded)
except Exception as e: # noqa: PERF203
exc = e
# Every backend raised. Log loudly the first time so a silent
# vqa-module no-op (every prompt skipped because frames_at returned
# []) is debuggable from the job log instead of post-hoc parquet
# inspection. Subsequent failures stay quiet.
with self._lock:
already_warned = self._warned_decode_fail
if not already_warned:
self._warned_decode_fail = True
if not already_warned:
logger.warning(
"VideoFrameProvider._decode failed for episode=%s camera=%s video_path=%s backends=%s: %s",
episode_index,
camera_key,
video_path,
chain,
exc,
exc_info=exc,
)
return []
def make_frame_provider(
root: Path, camera_key: str | None = None, video_backend: str | None = None
) -> FrameProvider:
"""Build a :class:`VideoFrameProvider` if videos are present, else null."""
try:
provider = VideoFrameProvider(root=root, camera_key=camera_key, video_backend=video_backend)
except Exception:
return null_provider()
if provider.camera_key is None:
return null_provider()
return provider
def _decode_frames_ffmpeg(video_path: Path, timestamps: list[float]) -> list[Any]:
"""Decode the frames nearest to ``timestamps`` via the ffmpeg CLI.
Runs one ``ffmpeg`` process per timestamp, seeking with ``-ss`` and
piping a single PNG to stdout. Unlike the in-process decoders this
survives a hostile container: a full ffmpeg build decodes AV1 (the codec
modern LeRobot datasets use) where torchcodec raises and PyAV can
SIGSEGV, and a crash stays isolated to the child process — a non-zero
exit is a catchable error, not a segfault of the whole job. Returns one
``(C, H, W)`` uint8 tensor per timestamp.
"""
import io # noqa: PLC0415
import subprocess # noqa: PLC0415
import numpy as np # noqa: PLC0415
frames: list[Any] = []
for ts in timestamps:
# ffmpeg invoked by name via PATH lookup; fully-controlled arg list, no shell.
proc = subprocess.run( # nosec B607
[
"ffmpeg",
"-nostdin",
"-loglevel",
"error",
"-ss",
f"{max(ts, 0.0):.3f}",
"-i",
str(video_path),
"-frames:v",
"1",
"-f",
"image2pipe",
"-vcodec",
"png",
"pipe:1",
],
capture_output=True,
check=True,
timeout=120,
)
if not proc.stdout:
raise RuntimeError(f"ffmpeg returned no frame for t={ts:.3f}s of {video_path}")
img = PIL.Image.open(io.BytesIO(proc.stdout)).convert("RGB")
frames.append(torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).contiguous())
return frames
def _decode_frames_av(video_path: Path, timestamps: list[float]) -> list[Any]:
"""Decode the frames nearest to ``timestamps`` using PyAV directly.
lerobot's ``decode_video_frames(backend="pyav")`` routes through
``torchvision.io.VideoReader``, removed in torchvision 0.23+. This helper
talks to the ``av`` package directly. Note PyAV can SIGSEGV on AV1
streams in some builds — prefer ``_decode_frames_ffmpeg`` as the default
fallback; this stays available behind ``video_backend="pyav"``. Returns
one ``(C, H, W)`` uint8 tensor per timestamp.
"""
import av # noqa: PLC0415
first_ts = min(timestamps)
last_ts = max(timestamps)
loaded_frames: list[torch.Tensor] = []
loaded_ts: list[float] = []
with av.open(str(video_path)) as container:
stream = container.streams.video[0]
# Seek to the keyframe at or before the first requested timestamp.
offset = max(int(first_ts / stream.time_base), 0) if stream.time_base else 0
container.seek(offset, stream=stream, backward=True, any_frame=False)
for idx, frame in enumerate(container.decode(stream)):
ts = frame.time
if ts is None:
ts = float(frame.pts * stream.time_base) if frame.pts is not None else float(idx)
loaded_ts.append(ts)
loaded_frames.append(
torch.from_numpy(frame.to_ndarray(format="rgb24")).permute(2, 0, 1).contiguous()
)
if ts >= last_ts:
break
if not loaded_frames:
raise RuntimeError(f"PyAV decoded no frames from {video_path}")
ts_tensor = torch.tensor(loaded_ts)
return [loaded_frames[int(torch.argmin((ts_tensor - q).abs()))] for q in timestamps]
def _frame_to_pil(frame: Any) -> Any:
"""Materialise a decoded frame as a ``PIL.Image`` for the VLM message.
Frames flow through the provider as ``torch.Tensor`` (``C, H, W`` uint8,
straight from :func:`decode_video_frames`); PIL is only created here, at
the VLM-message boundary, because the chat backends expect PIL images /
data URLs. Non-tensor inputs (e.g. test stubs) pass through untouched.
"""
if not isinstance(frame, torch.Tensor):
return frame
array = frame.detach().cpu()
if array.ndim == 3 and array.shape[0] in (1, 3):
array = array.permute(1, 2, 0) # (C, H, W) -> (H, W, C)
if array.shape[-1] == 1:
array = array.squeeze(-1)
return PIL.Image.fromarray(array.to(torch.uint8).numpy())
def to_image_blocks(frames: list[Any]) -> list[dict[str, Any]]:
"""Convert decoded frames to Qwen-VL-compatible image content blocks."""
return [{"type": "image", "image": _frame_to_pil(frame)} for frame in frames]
def to_video_block(frames: list[Any]) -> list[dict[str, Any]]:
"""Wrap a list of decoded frames as one Qwen-VL video block.
Returns ``[]`` when the list is empty, so the caller can splat the result
into a content array without a separate emptiness check.
"""
if not frames:
return []
return [{"type": "video", "video": [_frame_to_pil(frame) for frame in frames]}]
def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
"""Wrap a video file URL as one ``video_url`` block.
Used by the ``openai`` backend (transformers serve / vllm serve /
ktransformers serve), where the server handles frame sampling.
Returns ``[]`` when ``url`` is ``None`` so the caller can splat.
"""
if not url:
return []
return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]
@@ -0,0 +1,25 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .general_vqa import GeneralVqaModule
from .interjections_and_speech import InterjectionsAndSpeechModule
from .plan_subtasks_memory import PlanSubtasksMemoryModule
__all__ = [
"GeneralVqaModule",
"InterjectionsAndSpeechModule",
"PlanSubtasksMemoryModule",
]
@@ -0,0 +1,248 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""``vqa`` module: general VQA at a timed cadence.
Every ``1/hz`` seconds an emission tick fires; each tick anchors ``K``
consecutive frames, and every anchored frame gets its own VQA pair. Each
pair is grounded on that single anchor frame — there is no per-pair frame
window. For datasets with multiple cameras, every anchored frame produces
one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
generated against that camera's frame and stamped with the matching
``camera`` field on the emitted rows. The resolver disambiguates via
``camera=...``; recipes that consume VQA do so through one sub-recipe
per camera (see ``recipes/pi05_hirobot.yaml``).
Within a single (frame, camera) we still emit at most one ``(vqa, user)``
and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
Question types covered (per the plan's ``vqa`` table): bbox, keypoint,
count, attribute, spatial. The assistant's ``content`` is a JSON string
whose schema depends on the question type. Malformed JSON triggers one
retry inside :meth:`VlmClient.generate_json`.
"""
from __future__ import annotations
import json
import logging
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import VqaConfig
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord
from ..staging import EpisodeStaging
from ..validator import classify_vqa_answer
from ..vlm_client import VlmClient
def _emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
"""Return the relative frame indices to anchor VQA emissions to.
For each emission tick (every ``1/hz`` seconds), we anchor ``k``
consecutive frames starting at the tick. Ticks fall on the nearest
available source frame timestamp.
"""
if hz <= 0 or k <= 0 or not frame_timestamps:
return []
t0 = frame_timestamps[0]
t_last = frame_timestamps[-1]
period = 1.0 / hz
indices: list[int] = []
t = t0
while t <= t_last + 1e-9:
# find the index of the nearest frame to t
nearest_i = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
for offset in range(k):
j = nearest_i + offset
if j >= len(frame_timestamps):
break
if not indices or indices[-1] != j:
indices.append(j)
t += period
# dedupe while preserving order
seen: set[int] = set()
deduped: list[int] = []
for i in indices:
if i in seen:
continue
seen.add(i)
deduped.append(i)
return deduped
@dataclass
class GeneralVqaModule:
"""Emit grounded VQA pairs at a timed cadence."""
vlm: VlmClient
config: VqaConfig
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
_warned_no_camera: bool = field(default=False, init=False, repr=False)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
if not record.frame_timestamps:
staging.write("vqa", [])
return
rng = random.Random(f"{self.seed}:{record.episode_index}:vqa")
anchor_idx = _emission_anchor_indices(
record.frame_timestamps, self.config.vqa_emission_hz, self.config.K
)
cameras = self._target_cameras()
if not cameras:
# No camera available — emit nothing rather than producing
# untagged rows that would fail validation. Surface a loud one-
# time warning so this is never silently a no-op.
if not self._warned_no_camera:
logging.getLogger(__name__).warning(
"vqa module found no cameras on the frame provider — "
"every episode will emit zero VQA rows. Check that the "
"dataset declares observation.images.* features in "
"meta/info.json; passing --vlm.camera_key=<key> at the "
"CLI now also seeds the cameras list as a fallback."
)
self._warned_no_camera = True
staging.write("vqa", [])
return
# Build all messages first (one per (frame, camera)), then issue them
# as a single batched generate_json call so the client can fan them
# out concurrently.
per_call: list[tuple[float, str, str, list[dict[str, Any]]]] = []
for idx in anchor_idx:
ts = float(record.frame_timestamps[idx])
qtype = rng.choice(self.config.question_types)
for camera in cameras:
messages = self._build_messages(record, qtype, ts, camera)
# Skip cameras that decoded to zero frames at this ts: no point
# asking the VLM to ground a bbox without an image.
if not _has_image_block(messages):
continue
per_call.append((ts, camera, qtype, messages))
if not per_call:
staging.write("vqa", [])
return
results = self.vlm.generate_json([m for _, _, _, m in per_call])
rows: list[dict[str, Any]] = []
for (ts, camera, _qtype, _messages), result in zip(per_call, results, strict=True):
qa = self._postprocess(result)
if qa is None:
continue
question, answer = qa
rows.append(
{
"role": "user",
"content": question,
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
rows.append(
{
"role": "assistant",
"content": json.dumps(answer, sort_keys=True),
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
staging.write("vqa", rows)
def _target_cameras(self) -> list[str]:
"""Return the cameras the ``vqa`` module should iterate per anchored frame.
Defaults to every camera the provider exposes. Datasets with no
cameras (or test/null providers) yield an empty list, which makes
``run_episode`` a no-op.
When ``config.restrict_to_default_camera`` is set, VQA grounds on
only the provider's default camera (the single ``--vlm.camera_key``
stream), matching the plan / interjection modules so the whole
pipeline focuses on one view.
"""
all_cameras = list(getattr(self.frame_provider, "camera_keys", []) or [])
if getattr(self.config, "restrict_to_default_camera", False):
default = getattr(self.frame_provider, "camera_key", None)
if default and default in all_cameras:
return [default]
# ``restrict_to_default_camera`` is set but the configured default
# isn't one the provider exposes. Returning it anyway would make
# ``_decode`` raise a KeyError deep in frame extraction, so warn and
# fall through to every available camera instead.
if default:
logging.getLogger(__name__).warning(
"restrict_to_default_camera is set but camera_key=%r is not in the "
"provider's cameras %s; grounding VQA on all available cameras instead.",
default,
all_cameras,
)
return all_cameras
def _build_messages(
self,
record: EpisodeRecord,
question_type: str,
frame_timestamp: float,
camera_key: str,
) -> list[dict[str, Any]]:
prompt = load_prompt("vqa").format(
episode_task=record.episode_task,
question_type=question_type,
)
images = self.frame_provider.frames_at(record, [frame_timestamp], camera_key=camera_key)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
return [{"role": "user", "content": content}]
def _postprocess(self, result: Any) -> tuple[str, dict[str, Any]] | None:
if not isinstance(result, dict):
return None
question = result.get("question")
answer = result.get("answer")
if not isinstance(question, str) or not question.strip():
return None
if not isinstance(answer, dict):
return None
# The validator will enforce shape; here we just sanity-check that the
# answer matches *some* known shape so we can drop garbage early.
if classify_vqa_answer(answer) is None:
return None
return question.strip(), answer
def _has_image_block(messages: list[dict[str, Any]]) -> bool:
"""Return True if any user content block is a populated image block."""
for msg in messages:
content = msg.get("content")
if not isinstance(content, list):
continue
for block in content:
if isinstance(block, dict) and block.get("type") == "image":
return True
return False
@@ -0,0 +1,211 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""``interjections`` module: interjections + paired speech (EVENT styles + speech atoms).
Two sub-passes:
1. At ``t=0``, emit ONLY a speech tool-call atom (acknowledgement of the
canonical task). No interjection row — the canonical task is already the
user utterance from ``meta/tasks.parquet``.
2. For mid-episode interruptions, emit a co-timestamped pair:
{role:user, style:interjection, content:<text>}
speech atom (role:assistant, style:None, tool_calls=[say(...)])
Both rows go in ``language_events`` at the same timestamp.
The ``plan`` module's :meth:`run_plan_updates` reuses this module's
interjection timestamps to refresh the ``plan`` row at the same instant.
"""
from __future__ import annotations
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import InterjectionsConfig
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
from ..writer import speech_atom
@dataclass
class InterjectionsAndSpeechModule:
"""Generate task-start speech and mid-episode interjection/speech pairs."""
vlm: VlmClient
config: InterjectionsConfig
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
if record.frame_timestamps:
t0 = float(record.frame_timestamps[0])
initial = self._initial_speech(record)
if initial:
rows.append(speech_atom(t0, initial))
# Pull the ``plan`` module's subtask spans for this episode so the
# interjection prompt can ground itself in the actual current
# subtask at each chosen timestamp. The ``plan`` module ran first.
episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
subtask_spans = reconstruct_subtask_spans(staging.read("plan"), episode_end_t=episode_end_t)
rows.extend(self._mid_episode_interjections(record, subtask_spans))
staging.write("interjections", rows)
@staticmethod
def _subtask_at(spans: Sequence[dict[str, Any]], t: float) -> str | None:
current: str | None = None
for span in spans:
if float(span["start"]) <= t:
current = span.get("text")
else:
break
return current
def _initial_speech(self, record: EpisodeRecord) -> str | None:
prompt = load_prompt("interjections_initial_speech").format(
episode_task=record.episode_task,
)
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict) and isinstance(result.get("text"), str):
text = result["text"].strip()
if text:
return text
return None
def _mid_episode_interjections(
self,
record: EpisodeRecord,
subtask_spans: Sequence[dict[str, Any]],
) -> list[dict[str, Any]]:
"""Generate interjections aligned with the actual demo trajectory.
Teleop data is frozen — the robot already executed every step in
the video. A *counterfactual* interjection like "actually skip
the wipe" contradicts what then happens in the video, which is
what qwen36moe-10/11 surfaced as low-quality interjections.
Instead, anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask.
The robot's visible next behavior IS the interjection's effect,
so the training signal stays consistent: interjection text →
plan refresh → action stream all line up.
"""
if self.config.max_interjections_per_episode <= 0:
return []
if len(subtask_spans) < 2:
# Need at least one transition (subtask 0 → subtask 1).
return []
# Deterministic per-episode RNG so reruns are stable across SLURM jobs.
rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
# Boundaries: the start time of every subtask except the first
# (which is just t0 and is covered by the initial-task speech atom).
boundaries: list[tuple[float, str, str]] = []
for i in range(1, len(subtask_spans)):
ts = float(subtask_spans[i]["start"])
if ts < self.config.interjection_min_t:
continue
prev_text = (subtask_spans[i - 1].get("text") or "").strip()
next_text = (subtask_spans[i].get("text") or "").strip()
if not next_text:
continue
boundaries.append((ts, prev_text, next_text))
if not boundaries:
return []
n = min(self.config.max_interjections_per_episode, len(boundaries))
chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])
out: list[dict[str, Any]] = []
for t, prev_subtask, next_subtask in chosen:
t_snap = snap_to_frame(t, record.frame_timestamps)
# Window straddles the boundary so the VLM sees the end of the
# previous subtask and the start of the next one — same
# conditioning the policy will see at training time.
window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
prompt = load_prompt("interjections_interjection").format(
episode_task=record.episode_task,
prev_subtask=prev_subtask or "(starting from initial state)",
next_subtask=next_subtask,
timestamp=t_snap,
window_seconds=self.config.interjection_window_seconds,
)
images = self.frame_provider.frames_at(record, window_ts)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
messages = [{"role": "user", "content": content}]
result = self.vlm.generate_json([messages])[0]
if not isinstance(result, dict):
continue
interjection_text = result.get("interjection")
speech_text = result.get("speech")
if not isinstance(interjection_text, str) or not interjection_text.strip():
continue
if not isinstance(speech_text, str) or not speech_text.strip():
continue
out.append(
{
"role": "user",
"content": interjection_text.strip(),
"style": "interjection",
"timestamp": t_snap,
"tool_calls": None,
}
)
out.append(speech_atom(t_snap, speech_text.strip()))
return out
def _window_timestamps(self, t_anchor: float, frame_timestamps: Sequence[float]) -> list[float]:
"""Return a small set of frame timestamps centered on ``t_anchor``.
The window straddles the subtask boundary the interjection sits
on: roughly half the frames cover the end of the previous
subtask, half cover the start of the next one. The VLM therefore
sees BOTH what just finished AND what's about to start, which is
the conditioning we need to write a natural "now please do X"
request that matches the visible upcoming behavior.
"""
if not frame_timestamps:
return [t_anchor]
n = max(1, int(self.config.interjection_window_frames))
if n == 1:
return [t_anchor]
window = float(self.config.interjection_window_seconds)
step = window / max(1, n - 1)
# Center the window on the anchor so half lands before, half after.
start_offset = -window / 2.0
targets = [t_anchor + start_offset + step * i for i in range(n)]
first_ts = float(frame_timestamps[0])
last_ts = float(frame_timestamps[-1])
snapped: list[float] = []
seen: set[float] = set()
for tgt in targets:
clamped = min(last_ts, max(first_ts, tgt))
t = snap_to_frame(clamped, frame_timestamps)
if t not in seen:
seen.add(t)
snapped.append(t)
return snapped or [t_anchor]
@@ -0,0 +1,712 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""``plan`` module: subtask decomposition + plan + memory (PERSISTENT styles)."""
from __future__ import annotations
import logging
from collections.abc import Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from ..config import PlanConfig
from ..frames import (
FrameProvider,
VideoFrameProvider,
null_provider,
to_video_block,
to_video_url_block,
)
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
logger = logging.getLogger(__name__)
@dataclass
class PlanSubtasksMemoryModule:
"""Generate subtask spans, plan, and memory rows.
All output is persistent (lives in ``language_persistent``):
- ``subtask`` rows: one per span, stamped at the span's *start* timestamp
(snapped to an exact frame).
- ``plan`` rows: emitted at ``t=0``; refreshed at every interjection
timestamp via :meth:`run_plan_updates` (called by the executor after
the ``interjections`` module completes).
- ``memory`` rows: emitted at each subtask boundary (= subtask start
timestamp from the second subtask onward).
"""
vlm: VlmClient
config: PlanConfig
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
# Task driving every plan-module prompt: canonical episode_task, or a
# video-derived one when it's empty/placeholder (see derive_task_*).
effective_task = self._resolve_effective_task(record)
# task_aug rows at t=0: phrasings the renderer rotates ${task} through.
# Either the structured 5-axis taxonomy (task_aug_axes.enabled) or
# free-form n_task_rephrasings; the effective task is always emitted
# first so the rotation covers the source-of-truth phrasing.
t0 = float(record.frame_timestamps[0]) if record.frame_timestamps else 0.0
variants: list[str] | None = None
if self.config.task_aug_axes.enabled and effective_task:
variants = self._generate_task_aug_by_axes(effective_task, self.config.task_aug_axes)
elif self.config.n_task_rephrasings > 0 and effective_task:
variants = self._generate_task_rephrasings(effective_task, n=self.config.n_task_rephrasings)
if variants is not None:
rows.extend(self._task_aug_rows([effective_task, *variants], t0))
subtask_spans = self._generate_subtasks(record, task=effective_task)
# subtask rows
for span in subtask_spans:
rows.append(
{
"role": "assistant",
"content": span["text"],
"style": "subtask",
"timestamp": snap_to_frame(span["start"], record.frame_timestamps),
"tool_calls": None,
}
)
# Plan rows at every subtask boundary (incl. t=0). The plan is a
# numbered list of still-todo subtasks, so re-emitting at each
# boundary makes it shrink as work progresses — ${plan} at frame t is
# exactly what's left to do.
if self.config.emit_plan:
for span in subtask_spans:
boundary_t = snap_to_frame(span["start"], record.frame_timestamps)
plan_text = self._generate_plan(
record, subtask_spans, refresh_t=boundary_t, task=effective_task
)
if plan_text is not None:
rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": float(boundary_t),
"tool_calls": None,
}
)
# memory rows at every subtask boundary except the very first start
prior_memory = ""
for i, span in enumerate(subtask_spans[1:], start=1):
completed = subtask_spans[i - 1]["text"]
remaining = [s["text"] for s in subtask_spans[i:]]
mem_text = self._generate_memory(record, prior_memory, completed, remaining, task=effective_task)
if mem_text:
ts = snap_to_frame(span["start"], record.frame_timestamps)
rows.append(
{
"role": "assistant",
"content": mem_text,
"style": "memory",
"timestamp": ts,
"tool_calls": None,
}
)
prior_memory = mem_text
staging.write("plan", rows)
# ------------------------------------------------------------------
# Task derivation + rephrasings
# ------------------------------------------------------------------
_PLACEHOLDER_TASKS: frozenset[str] = frozenset(
{
"debug",
"test",
"tbd",
"todo",
"n/a",
"na",
"untitled",
"unnamed",
"default",
"placeholder",
}
)
def _resolve_effective_task(self, record: EpisodeRecord) -> str:
"""Decide which task string drives the ``plan`` module for this episode.
Returns the user-supplied ``record.episode_task`` unless
``derive_task_from_video`` says otherwise (see config docstring).
Falls back gracefully to the canonical task if video derivation
fails.
"""
canonical = (record.episode_task or "").strip()
mode = (self.config.derive_task_from_video or "off").strip().lower()
if mode == "always":
derived = self._derive_task_from_video(record)
return derived or canonical
if mode == "if_short" and self._task_seems_bad(canonical):
derived = self._derive_task_from_video(record)
if derived:
return derived
return canonical
def _task_seems_bad(self, task: str) -> bool:
if not task:
return True
if len(task.split()) < int(self.config.derive_task_min_words):
return True
return task.lower() in self._PLACEHOLDER_TASKS
@staticmethod
def _task_aug_rows(phrasings: Sequence[str], t0: float) -> list[dict[str, Any]]:
"""Build deduplicated ``task_aug`` rows (role=user) at ``t0``."""
seen: set[str] = set()
rows: list[dict[str, Any]] = []
for phrasing in phrasings:
key = phrasing.strip()
if not key or key in seen:
continue
seen.add(key)
rows.append(
{"role": "user", "content": key, "style": "task_aug", "timestamp": t0, "tool_calls": None}
)
return rows
# ------------------------------------------------------------------
# VLM call helpers — every plan-module prompt follows the same shape:
# build messages → single VLM call → pull a named field.
# ------------------------------------------------------------------
def _vlm_field(self, messages: list[dict[str, Any]], field: str) -> Any:
"""Run a single VLM call and return ``result[field]`` or ``None``.
Centralizes the ``vlm.generate_json([m])[0]`` + ``isinstance(dict)``
dance every prompt-call site needs.
"""
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict):
return result.get(field)
return None
@staticmethod
def _text_message(text: str) -> list[dict[str, Any]]:
"""One-shot text-only user message wrapped for ``generate_json``."""
return [{"role": "user", "content": [{"type": "text", "text": text}]}]
def _video_message(
self,
record: EpisodeRecord,
prompt: str,
window: tuple[float, float] | None = None,
) -> list[dict[str, Any]]:
"""User message combining the (optionally windowed) video block with ``prompt``."""
content = [*self._episode_video_block(record, window=window), {"type": "text", "text": prompt}]
return [{"role": "user", "content": content}]
def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
"""Ask the VLM "what is this video about" with no task hint at all."""
text = self._vlm_field(self._video_message(record, load_prompt("plan_video_task")), "task")
return text.strip() if isinstance(text, str) and text.strip() else None
def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
"""Generate ``n`` text-only paraphrases of ``base_task``."""
if n <= 0 or not base_task:
return []
prompt = load_prompt("plan_task_rephrasings").format(base_task=base_task, n=n)
raw = self._vlm_field(self._text_message(prompt), "rephrasings")
if not isinstance(raw, list):
return []
out = [item.strip().strip('"').strip("'") for item in raw if isinstance(item, str)]
return [s for s in out if s][:n]
# ------------------------------------------------------------------
# Structured 5-axis task augmentation (EgoMimic-style taxonomy)
# ------------------------------------------------------------------
def _generate_task_aug_by_axes(self, base_task: str, axes_cfg: Any) -> list[str]:
"""One VLM call → variants along the 5-axis taxonomy.
Variants from all axes are flattened into a single list (the
downstream pipeline doesn't need to know about the per-axis
bucketing — every variant becomes a ``task_aug`` row). Order
is preserved for reproducibility: synonym_paraphrase first,
then omit_arm, then omit_orientation, then omit_grasp_method,
then combined_omissions.
"""
if not base_task:
return []
prompt = load_prompt("plan_task_aug_axes").format(
base_task=base_task,
n_synonym=axes_cfg.synonym_paraphrase,
n_omit_arm=axes_cfg.omit_arm,
n_omit_orientation=axes_cfg.omit_orientation,
n_omit_grasp_method=axes_cfg.omit_grasp_method,
n_combined=axes_cfg.combined_omissions,
)
result = self.vlm.generate_json([self._text_message(prompt)])[0]
if not isinstance(result, dict):
return []
ordered_axes = (
"synonym_paraphrase",
"omit_arm",
"omit_orientation",
"omit_grasp_method",
"combined_omissions",
)
flat: list[str] = []
seen: set[str] = set()
for axis in ordered_axes:
entries = result.get(axis)
if not isinstance(entries, list):
continue
for item in entries:
if not isinstance(item, str):
continue
key = item.strip().strip('"').strip("'")
if not key or key in seen:
continue
seen.add(key)
flat.append(key)
return flat
def _episode_video_block(
self, record: EpisodeRecord, window: tuple[float, float] | None = None
) -> list[dict[str, Any]]:
"""Video block for the segmentation / describe prompts.
Always returns a block that actually carries the video. When
``use_video_url`` is set we try the server-side ``video_url``
path first, but if clip extraction fails we FALL BACK to
decoding + embedding frames rather than returning an empty
block — an empty block would leave the VLM with no visual
grounding at all and it would hallucinate subtasks purely from
the task text.
When ``window=(w0, w1)`` is given (windowed subtask generation,
``subtask_window_seconds > 0``), embed frames sampled at the FIXED
``frames_per_second`` rate within ``[w0, w1]`` — constant temporal
density regardless of episode length, so long episodes are split
into windows rather than subsampled to a sparse 32-frame whole-
episode view. The ``video_url`` path is skipped for windows (it is
a whole-episode clip). ``max_video_frames`` still caps each window
as a context-budget safety net.
"""
if not record.frame_timestamps:
return []
if window is not None:
w0, w1 = float(window[0]), float(window[1])
dur = max(0.0, w1 - w0)
n = max(1, int(round(dur * self.config.frames_per_second)) + 1)
n = min(n, self.config.max_video_frames)
if n <= 1 or dur <= 0.0:
timestamps = [0.5 * (w0 + w1)]
else:
step = dur / (n - 1)
timestamps = [w0 + i * step for i in range(n)]
return to_video_block(self.frame_provider.frames_at(record, timestamps))
if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
clip = self.frame_provider.episode_clip_path(record, cache_dir)
if clip is not None:
return to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
logger.warning(
"episode %d: video_url clip extraction failed — falling back to "
"embedded frames so the VLM still sees the demonstration",
record.episode_index,
)
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
target_count = max(1, int(round(episode_duration * self.config.frames_per_second)))
target_count = min(target_count, self.config.max_video_frames)
video_frames = self.frame_provider.video_for_episode(record, target_count)
return to_video_block(video_frames)
def run_plan_updates(
self,
record: EpisodeRecord,
staging: EpisodeStaging,
interjection_times: Sequence[float],
interjection_texts: Sequence[str] | None = None,
) -> None:
"""Append additional ``plan`` rows at every interjection timestamp.
Plans refresh ONLY on user interjections (event-driven). The
interjection text is forwarded into the prompt so the refreshed plan
reflects the user's correction.
"""
if not self.config.emit_plan:
return
existing = staging.read("plan")
# Pass the last frame timestamp so the final span is closed (else its
# end == start, zero duration, and a refresh inside it is missed).
episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
spans = reconstruct_subtask_spans(existing, episode_end_t=episode_end_t)
already_planned: set[float] = {float(r["timestamp"]) for r in existing if r.get("style") == "plan"}
new_rows = list(existing)
texts: list[str | None] = (
[None] * len(interjection_times)
if interjection_texts is None
else [str(t) if t else None for t in interjection_texts]
)
for raw_t, inter_text in zip(interjection_times, texts, strict=True):
t = snap_to_frame(raw_t, record.frame_timestamps)
if t in already_planned:
continue
already_planned.add(t)
plan_text = self._generate_plan(record, spans, refresh_t=t, interjection=inter_text)
if plan_text is not None:
new_rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": t,
"tool_calls": None,
}
)
staging.write("plan", new_rows)
def _generate_subtasks(self, record: EpisodeRecord, *, task: str | None = None) -> list[dict[str, Any]]:
"""Generate subtask spans, optionally via a multi-call quality chain.
Single call (default): watch video → emit subtask JSON.
Multi-call (opt-in, higher quality, more VLM calls):
1. ``subtask_describe_first`` — a grounding pass that narrates
ONLY what is visible (no JSON commitment to subtasks yet);
its description is injected into the segmentation prompt so
the model segments its own grounded observations instead of
pattern-matching the task text.
2. segmentation — emit subtask JSON (as before).
"""
if record.row_count == 0 or not record.frame_timestamps:
return []
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
effective_task = task if task is not None else record.episode_task
# ---- Windowed path (constant temporal density) ---------------
# If subtask_window_seconds > 0 and the episode exceeds one window,
# process fixed-length windows so the VLM always sees
# frames_per_second density; results are merged + stitched.
window_s = float(getattr(self.config, "subtask_window_seconds", 0.0) or 0.0)
if window_s > 0.0 and episode_duration > window_s:
return self._generate_subtasks_windowed(record, effective_task, window_s)
# ---- Pass 1 (optional): grounding description ----------------
observation_block = ""
if getattr(self.config, "subtask_describe_first", False):
description = self._describe_episode(record, effective_task)
if description:
observation_block = (
"You watched this video and described, chronologically, "
"ONLY what the robot actually does:\n"
f'"""{description}"""\n\n'
"Segment THAT grounded description (cross-checked against "
"the video) into atomic subtasks. Do not introduce any "
"action that is not in your description above.\n\n"
)
# ---- Pass 2: segmentation ------------------------------------
prompt = load_prompt("plan_subtasks").format(
episode_task=effective_task,
min_subtask_seconds=self.config.min_subtask_seconds,
max_steps=self.config.plan_max_steps,
episode_duration=f"{episode_duration:.3f}",
observation_block=observation_block,
)
spans = self._vlm_field(self._video_message(record, prompt), "subtasks")
cleaned = self._clean_spans(spans, record)
if not cleaned:
return []
# ---- Full-episode coverage stitch ----------------------------
# The VLM can start after t0 or leave gaps, so frames fall through
# with no active subtask. Always stitch into a contiguous
# [t0, t_last] cover.
cleaned = self._stitch_full_coverage(cleaned, record)
return cleaned
def _generate_subtasks_windowed(
self, record: EpisodeRecord, task: str, window_s: float
) -> list[dict[str, Any]]:
"""Subtask generation in fixed-length windows at constant fps.
Splits ``[t0, t_last]`` into consecutive windows of ``window_s``
seconds, runs the describe -> segment chain on each window's own
frames (sampled at ``frames_per_second``), offsets
each window's spans back to absolute episode time, then merges +
stitches into a contiguous whole-episode cover.
"""
t0 = float(record.frame_timestamps[0])
t_last = float(record.frame_timestamps[-1])
all_spans: list[dict[str, Any]] = []
w0 = t0
n_windows = 0
while w0 < t_last - 1e-6:
w1 = min(w0 + window_s, t_last)
all_spans.extend(self._subtasks_for_window(record, task, w0, w1))
n_windows += 1
w0 = w1
logger.info(
"episode %d: windowed subtask gen over %d window(s) of %.1fs -> %d raw spans",
record.episode_index,
n_windows,
window_s,
len(all_spans),
)
# Merge across windows: clamp to the absolute episode, sort, and
# frame-snap to distinct starts (handles any boundary collisions).
cleaned = self._clean_spans(all_spans, record)
if not cleaned:
return []
return self._stitch_full_coverage(cleaned, record)
def _subtasks_for_window(
self, record: EpisodeRecord, task: str, w0: float, w1: float
) -> list[dict[str, Any]]:
"""Run describe -> segment on one ``[w0, w1]`` window.
The model works in window-RELATIVE time ``[0, L]`` (it perceives
the window as a clip starting at 0); spans are offset back to
absolute ``[w0, w1]`` before returning.
"""
window = (w0, w1)
win_len = max(0.0, w1 - w0)
observation_block = ""
if getattr(self.config, "subtask_describe_first", False):
description = self._describe_episode(record, task, window=window)
if description:
observation_block = (
"You watched this video clip and described, chronologically, "
"ONLY what the robot actually does:\n"
f'"""{description}"""\n\n'
"Segment THAT grounded description (cross-checked against "
"the clip) into atomic subtasks. Do not introduce any "
"action that is not in your description above.\n\n"
)
prompt = load_prompt("plan_subtasks").format(
episode_task=task,
min_subtask_seconds=self.config.min_subtask_seconds,
max_steps=self.config.plan_max_steps,
episode_duration=f"{win_len:.3f}",
observation_block=observation_block,
)
spans = self._vlm_field(self._video_message(record, prompt, window=window), "subtasks")
# Window-relative clamp; no frame-snap dedupe yet (done on the
# merged absolute set).
cleaned = self._clean_spans(spans, record, bounds=(0.0, win_len), dedupe=False)
if not cleaned:
return []
# Offset window-relative spans back to absolute episode time.
for s in cleaned:
s["start"] = w0 + float(s["start"])
s["end"] = w0 + float(s["end"])
return cleaned
def _stitch_full_coverage(
self, spans: list[dict[str, Any]], record: EpisodeRecord
) -> list[dict[str, Any]]:
"""Make subtask spans tile the full episode with no gaps.
* The first subtask starts at the episode's first frame ``t0``
(any idle / approach before the first labelled action is folded
into it), so every early frame has an active subtask.
* Each subtask's ``end`` is snapped to the next subtask's
``start`` (gaps between spans are closed), and the final
subtask's ``end`` extends to the last frame ``t_last``.
Starts are otherwise left as the (already frame-snapped, distinct)
values the VLM produced — only the FIRST start is pulled
back to ``t0``, which can't collide with a later span because it
was already the earliest. Purely deterministic; runs after the
VLM passes.
"""
if not spans or not record.frame_timestamps:
return spans
t0 = float(record.frame_timestamps[0])
t_last = float(record.frame_timestamps[-1])
spans = sorted(spans, key=lambda s: float(s["start"]))
spans[0]["start"] = t0
for i in range(len(spans) - 1):
spans[i]["end"] = float(spans[i + 1]["start"])
spans[-1]["end"] = t_last
for s in spans:
if float(s["end"]) < float(s["start"]):
s["end"] = float(s["start"])
return spans
def _clean_spans(
self,
spans: Any,
record: EpisodeRecord,
bounds: tuple[float, float] | None = None,
dedupe: bool = True,
) -> list[dict[str, Any]]:
"""Clamp / sort / (optionally) dedupe raw VLM subtask spans into valid rows.
``bounds`` overrides the clamp range — pass the window's
``(w_lo, w_hi)`` when cleaning window-relative spans, or leave
``None`` to clamp to the whole episode ``[t0, t_last]``.
``dedupe`` runs the frame-snap distinct-start step; skip it for
window-relative spans (frame snapping is done once on the merged,
absolute-time set).
"""
if not spans:
return []
if bounds is not None:
lo, hi = float(bounds[0]), float(bounds[1])
else:
lo = record.frame_timestamps[0]
hi = record.frame_timestamps[-1]
cleaned: list[dict[str, Any]] = []
for span in spans:
try:
start = float(span["start"])
end = float(span["end"])
text = str(span["text"]).strip()
except (KeyError, ValueError, TypeError):
continue
start = max(lo, min(start, hi))
end = max(lo, min(end, hi))
if end < start:
start, end = end, start
if not text:
continue
cleaned.append({"text": text, "start": start, "end": end})
cleaned.sort(key=lambda s: s["start"])
if dedupe:
return self._dedupe_starts_to_distinct_frames(cleaned, record)
return cleaned
def _describe_episode(
self, record: EpisodeRecord, task: str, window: tuple[float, float] | None = None
) -> str:
"""Grounding pass: free-form chronological description of the (windowed) video."""
prompt = load_prompt("plan_subtask_describe").format(episode_task=task)
text = self._vlm_field(self._video_message(record, prompt, window=window), "description")
return text.strip() if isinstance(text, str) and text.strip() else ""
@staticmethod
def _dedupe_starts_to_distinct_frames(
spans: list[dict[str, Any]], record: EpisodeRecord
) -> list[dict[str, Any]]:
"""Bump same-frame subtask starts onto distinct frames.
Two consecutive VLM spans whose ``start`` rounds to the same
source frame (after :func:`snap_to_frame`) would otherwise emit
two ``style=subtask`` rows at the identical persistent
timestamp. The training-time renderer's ``active_at(t,
style=subtask)`` resolver can't disambiguate that and raises
``Ambiguous resolver for style='subtask'``.
Walk the (sorted-by-start) spans, snap each to its frame, and
if the snapped frame is already taken push the span onto the
next unused frame so both subtasks survive on distinct
timestamps. If the episode ends before a free frame is found,
the trailing span is dropped with a warning — better than
poisoning the render.
"""
if not spans:
return spans
frames = record.frame_timestamps
if not frames:
return spans
used: set[float] = set()
out: list[dict[str, Any]] = []
for span in spans:
ts = snap_to_frame(span["start"], frames)
if ts in used:
next_ts = next((f for f in frames if f > ts and f not in used), None)
if next_ts is None:
logger.warning(
"episode %d: subtask %r snapped to occupied frame "
"%.3f and no free later frame exists — dropping",
record.episode_index,
span.get("text"),
ts,
)
continue
ts = next_ts
used.add(ts)
new_span = {**span, "start": ts}
if float(new_span.get("end", ts)) < ts:
new_span["end"] = ts
out.append(new_span)
return out
def _generate_plan(
self,
record: EpisodeRecord, # noqa: ARG002 (kept for signature stability)
subtask_spans: Sequence[dict[str, Any]],
*,
refresh_t: float | None = None,
interjection: str | None = None, # noqa: ARG002
task: str | None = None, # noqa: ARG002
) -> str | None:
"""Deterministic plan = numbered list of *still-todo* subtasks.
No VLM call: a plain numbered list keeps the plan aligned with the
upcoming subtasks (the old VLM "compact hierarchical plan" prompt
cost a round-trip per episode/refresh and could diverge).
1. <subtask 1>
2. <subtask 2>
On a refresh at ``refresh_t`` (from ``run_plan_updates`` on
interjections, and ``run_episode`` at each boundary), only subtasks
starting at or after ``refresh_t`` are included — so it always
describes what's left.
"""
if not subtask_spans:
return None
remaining = [
s for s in subtask_spans if refresh_t is None or float(s.get("start", 0.0)) >= float(refresh_t)
]
if not remaining:
# Past the last subtask boundary on a late refresh — nothing
# left to plan; emit None so the caller skips the row.
return None
return "\n".join(f"{i}. {span.get('text', '').strip()}" for i, span in enumerate(remaining, start=1))
def _generate_memory(
self,
record: EpisodeRecord,
prior_memory: str,
completed: str,
remaining: Sequence[str],
*,
task: str | None = None,
) -> str:
prompt = load_prompt("plan_memory").format(
episode_task=(task if task is not None else record.episode_task),
prior_memory=prior_memory or "(none)",
completed_subtask=completed,
remaining_subtasks=", ".join(remaining) if remaining else "(none)",
)
memory = self._vlm_field(self._text_message(prompt), "memory")
return memory.strip() if isinstance(memory, str) else ""
@@ -0,0 +1,33 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prompt templates loaded as plain text.
One file per use site. Templates use ``str.format(**vars)`` substitution; we
intentionally avoid jinja2 here so the templates remain inspectable in
plain editors and roundtrip cleanly through ``ruff format``.
"""
from __future__ import annotations
from pathlib import Path
_DIR = Path(__file__).parent
def load(name: str) -> str:
"""Read prompt template ``name.txt`` from the ``prompts/`` directory."""
path = _DIR / f"{name}.txt"
return path.read_text(encoding="utf-8")
@@ -0,0 +1,12 @@
The user just asked the robot: "{episode_task}".
Generate a short verbal acknowledgement the robot would speak back before
beginning the task. Style: compact, confident, friendly.
Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
"OK, starting with the sponge.", "Got it.".
Prefer very short replies: "Got it.", "On it.", "OK."
Output strictly valid JSON:
{{ "text": "<the spoken acknowledgement>" }}
@@ -0,0 +1,46 @@
You are generating training data for a Hi Robot-style hierarchical
robot policy. The robot in this demonstration has ALREADY executed
every step shown in the video — we cannot retroactively change the
action stream. To keep training data consistent with the video, the
"interjection" must align with what the robot is *about to do next* in
the demonstration, framed as a natural mid-task user request.
The episode's overall task: "{episode_task}".
The images above show roughly {window_seconds:.1f} seconds straddling a
subtask boundary in the demonstration:
- Subtask the robot just finished: "{prev_subtask}"
- Subtask the robot is about to start: "{next_subtask}"
- Time into episode: {timestamp:.2f}s
Write ONE compact interjection the user would naturally say at this
moment to prompt / confirm / encourage the robot to do "{next_subtask}".
Keep it like a mid-task coaching cue, not a full instruction paragraph.
Also write the robot's compact verbal acknowledgement.
Hard rules:
- The interjection MUST be consistent with the next subtask. The user
cannot ask for something different from what the robot then does in
the video. If you're tempted to say "actually skip X" or "do Y
instead", DO NOT — those would contradict the demonstration.
- The interjection must reference an object, location, or action that
is plausible given the visible scene and the next subtask text.
- One short phrase or sentence each. Conversational, not robotic.
- Prefer direct cues: "{next_subtask}, please."; "Now {next_subtask}."
- Keep robot speech very short: "OK.", "On it.", "Doing that."
Style examples (vary the phrasing — don't reuse these verbatim):
- "Now go ahead and {next_subtask}."
- "Great, can you {next_subtask} next?"
- "{next_subtask}, please."
- "Before you continue, please {next_subtask}."
- "Looking good — {next_subtask} now."
- "Okay, {next_subtask}."
Output strictly valid JSON:
{{
"interjection": "<short cue from the user, asking for the next subtask>",
"speech": "<short robot acknowledgement>"
}}
@@ -0,0 +1,36 @@
You are updating the robot's compressed semantic memory at the boundary of
a completed subtask.
Reference (verbatim from MEM, Torne 2026):
"Remove or compress information in the language memory whenever
appropriate. Keep ONLY the minimal set of relevant information for future
task execution. Specific object attributes (colors, precise quantities of
each item) get discarded when their details won't affect subsequent
actions. Functional outcomes (where items went, how many) are preserved."
Episode task: "{episode_task}"
Previous memory: {prior_memory}
Just-completed subtask: "{completed_subtask}"
Remaining subtasks (for relevance judgement only): {remaining_subtasks}
Write the memory as a short FIRST-PERSON, PAST-TENSE narrative of what the
robot has accomplished so far — the running story it would tell itself.
Authoring rules:
- First person, past tense. Every sentence starts with "I": "I picked
up...", "I opened...", "I moved to...".
- One or two short sentences. Extend the previous memory with the
just-completed subtask; do not rewrite it from scratch.
- Keep WHAT happened (functional outcomes — where items went, how many),
drop HOW (grasp details, motions).
- Compress completed steps and drop object attributes (colors, exact
counts) once they no longer affect the remaining subtasks.
Example (MEM, Torne 2026):
Before: "I prepared the pot and got the potatoes, milk, and butter. I
moved to the drawer."
After: "I prepared the pot and got the ingredients. I opened the
drawer with the masher."
Output strictly valid JSON:
{{ "memory": "<one or two short first-person past-tense sentences>" }}
@@ -0,0 +1,27 @@
You are watching a teleoperated robot demonstration from a single
camera. The user asked the robot to: "{episode_task}"
This is an OBSERVATION pass. Watch the entire clip and describe, in
chronological order, ONLY what the robot physically does — the concrete
motions, approaches, contacts, grasps, releases, and relocations you can
actually SEE in the frames.
Hard rules:
- Describe only motion visible in the video. Do NOT use the task
instruction to guess steps that aren't shown. The instruction is the
goal; the video is ground truth.
- Do NOT segment into named subtasks yet and do NOT output JSON beyond
the single field below. Just narrate what happens.
- Give an approximate timestamp (in seconds) for each distinct event,
e.g. "0.0-1.4s: the base drives forward toward the stove".
- Do NOT invent objects, grasps, destinations, or steps. If the robot
only does one thing (e.g. it just navigates and the clip ends), say
exactly that and nothing more.
- Be concrete and literal. "the gripper closes on the mug" — not "the
robot prepares to make coffee".
Output strictly valid JSON:
{{
"description": "<chronological, timestamped description of ONLY what is visible>"
}}
@@ -0,0 +1,112 @@
You are labeling a teleoperated robot demonstration.
The user originally asked: "{episode_task}"
You are shown the entire demonstration as a single video. Watch the
whole clip, then segment it into a list of consecutive atomic subtasks
the robot performs.
{observation_block}GROUNDING — read this first, it overrides everything below:
- Label ONLY what the robot actually does in the video. Every subtask
you emit must correspond to motion you can SEE in specific frames.
- Do NOT invent, anticipate, or pad. If the robot only does one thing
(e.g. it just navigates to a location and the clip ends), emit
EXACTLY ONE subtask. Many demonstrations are a single atomic skill.
- ``max_steps`` below is a hard CEILING, not a target. Emitting fewer
subtasks than the ceiling is not just allowed, it is expected for
short / atomic demonstrations. One correct subtask is far better
than several invented ones.
- If the video does not clearly show the action implied by the task,
describe what you actually see — do NOT fabricate the task's steps
from the instruction text. The instruction tells you the goal; the
VIDEO is the ground truth for what happened.
Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
- Each subtask = one COMPOSITE atomic skill the low-level policy can
execute end-to-end. A "skill" bundles its own approach motion with
its terminal action — do NOT split the approach off as its own
subtask. The whole-arm policy already learns to reach as part of
every manipulation primitive.
- Write each subtask as an IMPERATIVE COMMAND, starting with one of
these verbs (extend only when none fits):
pick up <obj> — approach + grasp + lift in one subtask
put <obj> on/in <loc> — transport + release in one subtask
place <obj> on/in <loc> — synonym of "put"; pick one and stay consistent
push <obj> — contact + linear shove
pull <obj> — contact + linear retract
turn <knob/dial/handle> — rotary actuation
press <button> — single-press contact
open <drawer/door/lid> — full open motion
close <drawer/door/lid> — full close motion
pour <src> into <dst> — tilt + flow
insert <obj> into <slot>— alignment + push-fit
go to <loc> — ONLY when no grasp / actuation follows
(e.g. a pure relocation between phases).
If the next subtask grasps something at
that location, drop "go to ..." and just
write "pick up ..." instead.
- Forbidden ultra-fine splits — the VLM is NOT allowed to emit these
as standalone subtasks; fold them into the parent composite:
"move to X" → fold into "pick up X" (or whatever follows)
"reach for X" → fold into "pick up X"
"grasp X" → fold into "pick up X"
"lift X" → fold into "pick up X" (or "put X on Y" if it's
the transport phase of a place)
"release X" → fold into "put X on Y" (or "place X in Y")
- Keep it SHORT — a verb phrase, not a sentence. Drop articles
("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
detail (which hand, which grasp point) ONLY when it is needed to
disambiguate. Every subtask must begin with one of the verbs
above (no leading nouns, no "then", no "first").
- NEVER use third person. Never write "the robot", "the arm", "the
gripper moves", "it picks up" — the robot is implied. Command it,
do not describe it.
- Use the exact object nouns from the task above. If the task says
"cube", every subtask says "cube" — never switch to "block". If it
says "box", never switch to "bin"/"container". Keep vocabulary
consistent across the whole episode.
- Good: "pick up blue cube", "put blue cube in box", "open drawer",
"turn red knob", "press start button", "go to sink".
- Bad: "move to blue cube" (approach as its own subtask — forbidden,
must be folded into "pick up blue cube"); "the robot arm moves
towards the blue cube" (third person, too long); "carefully pick
up the cube" (adverb, article); "release the yellow block"
("block" when the task said "cube", and "release" must be folded
into a "put"/"place" subtask).
- Subtasks are non-overlapping and cover the full episode in order.
Choose the cut points yourself based on what you see in the video
(gripper open/close events, contact, regrasps, transitions).
- Each subtask spans at least {min_subtask_seconds} seconds. If a
candidate span would be shorter, merge it into its neighbour
rather than emitting it.
- Do not exceed {max_steps} subtasks total. Fewer, larger composites
are preferred over many micro-steps.
- Every subtask's [start_time, end_time] must lie within
[0.0, {episode_duration}] seconds.
SPECIAL CASES — verb disambiguation (each rule is narrowly visual and
fires ONLY on the spatial situation it names; it must not change how you
label any other situation):
- STACK vs PUT: if an object is placed ON TOP OF another specific object
(not on a flat table / shelf / counter), use "stack ... on ...", not
"put". "stack blue book on green book", NOT "put blue book on table".
- INSERT vs PUT: if an object goes INTO a fitted slot / hole / socket /
receptacle (push-fit), use "insert ... into ...", not "put".
- RETRIEVE/PICK-UP vs PUT (direction): watch the gripper. If it CLOSES
on the object and the object moves WITH the hand, it is "pick up" /
"retrieve" (object leaves its location). If the gripper OPENS and the
object stays where the hand left it, it is "put" / "place" (object
arrives at a location). Decide by which way the object moves, not by
where the hand ends up.
- POUR vs PUT: only use "pour" when the source is tilted and contents
flow out; moving a full container without tilting is "put"/"place".
Output strictly valid JSON of shape:
{{
"subtasks": [
{{"text": "<short imperative verb phrase>", "start": <float>, "end": <float>}},
...
]
}}
@@ -0,0 +1,67 @@
You are generating structured augmentations of a robot task instruction
for training a language-conditioned policy. Unlike free-form rephrasing,
your variants follow a NAMED 5-axis taxonomy — each axis omits or varies
a specific element of the task while preserving its meaning.
Original task: "{base_task}"
Produce variants along five named axes. Each axis has a target count.
The whole batch should expose the policy to maximum linguistic diversity
WITHOUT changing what the robot is supposed to do.
Axes and target counts:
synonym_paraphrase ({n_synonym}):
Different wording / verbs / sentence structure. ALL information
from the original task is preserved — same object, same arm
specification if present, same orientation if present, same grasp
if present.
omit_arm ({n_omit_arm}):
Drop the left/right/both arm specification from the task. Skip
entirely (emit 0 entries) if the original task does NOT mention an
arm. Do not invent an arm specification just to omit it.
omit_orientation ({n_omit_orientation}):
Drop orientation cues (upright, sideways, facing the user,
long-edge-first, etc.). Skip entirely if no orientation cue is
present in the original task.
omit_grasp_method ({n_omit_grasp_method}):
Drop the grip / grasp method specification (pinch, wrap, hold by
the rim, etc.). Skip entirely if no grasp method is mentioned.
combined_omissions ({n_combined}):
Combine TWO of the above omissions simultaneously (e.g. drop both
arm and orientation). Skip entirely if fewer than two of (arm,
orientation, grasp_method) appear in the original task.
Hard rules:
- Each variant MUST preserve the core action, the target object, AND
the goal / destination. Do not change which object is involved, where
it goes, or the high-level action. "Navigate to the stove" may become
"go to the stove" or "head over to the stove" — it must NEVER become
"wander around the kitchen", "explore the room", or anything that
drops or generalises the stove destination. If you cannot vary the
wording without changing the goal, emit fewer variants.
- Only the FIVE listed elements (wording, arm, orientation, grasp
method, or a combination) may be varied or omitted. The verb's
meaning, the object, and the destination are fixed.
- Each variant is plain prose, no markdown, no quotes, no list numbers.
- Each variant must be DISTINCT from every other variant in the entire
output, both within and across axes. Near-duplicates are not allowed.
- If an axis cannot reach its target count because the original task
lacks the omittable element, emit fewer entries — do NOT pad the
axis with paraphrases that belong to a different axis.
- Variants should not all start with verbs — vary sentence structure
(some imperative, some polite request, some question).
Output strictly valid JSON of shape:
{{
"synonym_paraphrase": ["<v1>", "<v2>", ...],
"omit_arm": ["<v1>", "<v2>", ...],
"omit_orientation": ["<v1>", ...],
"omit_grasp_method": ["<v1>", ...],
"combined_omissions": ["<v1>", ...]
}}
@@ -0,0 +1,32 @@
You are generating training data for a Hi Robot-style policy. We need
{n} alternative phrasings of the same robot task so the policy sees
diverse user prompts during training instead of the same canonical
string repeated every frame.
Original task:
"{base_task}"
Generate exactly {n} alternative phrasings of the same task. Vary:
- formality (casual / polite / curt)
- verbosity (mostly short imperative; occasional polite request)
- word choice (synonyms, different verbs)
- sentence structure (imperative / question / suggestion)
Hard rules:
- Each phrasing MUST preserve the exact meaning of the original task.
Do not change which object is involved, the destination, or the
action. Do not add extra steps. Do not invent new objects.
- Each phrasing must be a short phrase or sentence, plain prose, no
markdown, no quotes, no list numbers.
- Phrasings must be distinct — no near-duplicates.
- Output exactly {n} entries.
Output strictly valid JSON:
{{
"rephrasings": [
"<phrasing 1>",
"<phrasing 2>",
...
]
}}
@@ -0,0 +1,17 @@
The video above shows a robot manipulation episode in full. Look at
the entire video and describe in ONE concise sentence what the robot
is doing.
Rules:
- One sentence, in natural English, like a user instruction.
- Capture the goal of the demonstration, not low-level motions.
Example: "place the yellow cube into the red bin" — not "move the
end-effector down 5cm and close the gripper".
- 4 to 15 words. Plain prose, no markdown, no bullets, no quotes.
- Do not invent objects or actions that aren't visible.
- Do not output anything other than the JSON object below.
Output strictly valid JSON:
{{
"task": "<single concise sentence describing what the robot does in this video>"
}}
@@ -0,0 +1,32 @@
You are generating a frame-grounded visual question/answer pair for
chain-of-thought training. Reference: ECoT (Zawalski 2024) and Steerable
Policies — both train policies on grounded features such as bounding box
pixel coordinates, keypoints, counts, attributes, and spatial relations.
The frame shows a robot working on: "{episode_task}".
Question types and the EXACT answer JSON shape required for each:
bbox => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
"bbox": [x1, y1, x2, y2]}}, ...]}}
bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
ECoT example: "a white cup [124, 25, 176, 113]".
keypoint => {{"label": "<point>", "point_format": "xy",
"point": [x, y]}}
count => {{"label": "<obj>", "count": <int>,
"note": "<optional short note>"}}
attribute => {{"label": "<obj>", "attribute": "<color|shape|state|...>",
"value": "<observed value>"}}
spatial => {{"subject": "<obj>", "relation": "<left_of|right_of|on|in|"
"above|below|near>", "object": "<obj>"}}
Generate a question of type "{question_type}". Output strictly valid JSON:
{{
"question": "<short, frame-grounded question>",
"answer": <object whose shape matches the schema above>
}}
@@ -0,0 +1,216 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Datatrove-shaped reader.
The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
episode containing:
- ``episode_index``: int
- ``frame_timestamps``: tuple[float, ...]
- ``frame_indices``: tuple[int, ...]
- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
- ``data_path``: pathlib.Path of the source parquet shard
- ``frames_df``: pandas.DataFrame slice for the episode (only loaded on demand)
This shape lets each module operate per-episode without loading all parquet
rows into memory at once.
"""
from __future__ import annotations
from collections.abc import Iterator, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import pyarrow.parquet as pq
from lerobot.datasets.io_utils import load_tasks
from lerobot.datasets.utils import DEFAULT_TASKS_PATH
@dataclass
class EpisodeRecord:
"""Per-episode record yielded by the reader."""
episode_index: int
episode_task: str
frame_timestamps: tuple[float, ...]
frame_indices: tuple[int, ...]
data_path: Path
row_offset: int # row offset within the parquet file where this episode starts
row_count: int # number of rows for this episode
# Memoized parquet slice — populated on first ``frames_df()`` call so
# repeat queries from different modules don't re-read the whole shard.
_frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)
def frames_df(self): # type: ignore[no-untyped-def]
"""Lazy-load the pandas slice for this episode (memoized)."""
if self._frames_df_cache is None:
import pandas as pd # noqa: PLC0415 - deferred for optional dataset extra
table = pq.read_table(self.data_path)
df: pd.DataFrame = table.to_pandas()
self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
drop=True
)
return self._frames_df_cache
def reconstruct_subtask_spans(
rows: Sequence[dict[str, Any]],
*,
episode_end_t: float | None = None,
) -> list[dict[str, Any]]:
"""Turn ``style="subtask"`` rows into ``{text, start, end}`` spans.
Each span's ``end`` is the next span's ``start``. The final span's
``end`` defaults to its own ``start`` (zero-duration) — pass
``episode_end_t`` to extend it to the episode's last frame instead,
which is what downstream consumers (memory, interjection boundary
selection) expect.
Used by the ``plan`` module (plan-update pass) and the
``interjections`` module (interjection anchoring), which both need the
same span shape.
"""
sorted_rows = sorted(
(r for r in rows if r.get("style") == "subtask"),
key=lambda r: float(r["timestamp"]),
)
spans: list[dict[str, Any]] = []
for r in sorted_rows:
t = float(r["timestamp"])
if spans:
spans[-1]["end"] = t
spans.append({"text": r.get("content") or "", "start": t, "end": t})
if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
spans[-1]["end"] = float(episode_end_t)
return spans
def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
"""Snap an arbitrary float to the nearest exact source frame timestamp.
Modules use this when emitting event-style rows so the row's
timestamp matches a real parquet frame: event rows must land on an
exact frame, otherwise the per-frame event lookup the writer does
would never match them.
"""
if not frame_timestamps:
return float(t)
nearest = min(frame_timestamps, key=lambda f: abs(f - t))
return float(nearest)
def _load_tasks_lookup(root: Path) -> dict[int, str]:
"""Map ``task_index -> task`` from ``meta/tasks.parquet``.
Returns an empty dict when the file is absent — the task description is
derived later from the video if needed. Reuses the library-level
:func:`lerobot.datasets.io_utils.load_tasks`, which returns the tasks
frame indexed by task string with a ``task_index`` column.
"""
if not (root / DEFAULT_TASKS_PATH).exists():
return {}
tasks = load_tasks(root)
return {int(idx): str(task) for task, idx in zip(tasks.index, tasks["task_index"], strict=True)}
def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
"""Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
Episodes are yielded in ascending ``episode_index`` order. The reader does
not assume a specific chunk/file layout: it scans every ``*.parquet``
under ``data/`` and groups by ``episode_index``.
"""
tasks = _load_tasks_lookup(root)
data_dir = root / "data"
parquet_files = sorted(data_dir.rglob("*.parquet"))
only_set = set(only_episodes) if only_episodes is not None else None
for path in parquet_files:
yield from _iter_one_path(path, tasks, only_set)
def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
table = pq.read_table(path)
names = table.column_names
if "episode_index" not in names:
return
episode_col = table.column("episode_index").to_pylist()
timestamp_col = (
table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
)
frame_col = (
table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
)
task_col = table.column("task_index").to_pylist() if "task_index" in names else None
def _build(
ep: int,
start: int,
end: int,
task_idx: int | None,
ts_buf: list[float],
fi_buf: list[int],
) -> EpisodeRecord | None:
if only_set is not None and ep not in only_set:
return None
task = tasks.get(task_idx, "") if task_idx is not None else ""
return EpisodeRecord(
episode_index=ep,
episode_task=task,
frame_timestamps=tuple(ts_buf),
frame_indices=tuple(fi_buf),
data_path=path,
row_offset=start,
row_count=end - start,
)
cur_ep: int | None = None
start_offset = 0
ts_buf: list[float] = []
fi_buf: list[int] = []
cur_task_idx: int | None = None
for i, ep in enumerate(episode_col):
if cur_ep is None:
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
continue
if ep != cur_ep:
rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
else:
ts_buf.append(timestamp_col[i])
fi_buf.append(frame_col[i])
if cur_ep is not None:
rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
@@ -0,0 +1,92 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Per-episode staging.
Each module writes its raw output as a JSONL file under
``<staging_dir>/episode_{ep:06d}/<module>.jsonl``. The writer reads back this
staging tree and partitions rows into the two language columns.
JSONL is preferred over parquet here because the staging artifact is meant to
be human-inspectable, easy to diff between prompt iterations, and trivially
appended to. The final dataset format is parquet; staging is just an
intermediate.
"""
from __future__ import annotations
import json
from collections.abc import Iterable
from dataclasses import dataclass
from pathlib import Path
from typing import Any
ModuleName = str
_MODULES: tuple[ModuleName, ...] = (
"plan",
"interjections",
"vqa",
)
@dataclass
class EpisodeStaging:
"""Filesystem layout for a single episode's staged module outputs."""
root: Path
episode_index: int
@property
def episode_dir(self) -> Path:
return self.root / f"episode_{self.episode_index:06d}"
def path_for(self, module: ModuleName) -> Path:
if module not in _MODULES:
raise ValueError(f"Unknown module {module!r}; expected one of {_MODULES}")
return self.episode_dir / f"{module}.jsonl"
def write(self, module: ModuleName, rows: Iterable[dict[str, Any]]) -> Path:
path = self.path_for(module)
path.parent.mkdir(parents=True, exist_ok=True)
# Atomic replace: a crash mid-write would otherwise leave a
# half-written JSONL file that ``read()`` would then fail to
# parse. Write to a sibling .tmp and rename so the target path
# only ever points at a complete file.
tmp_path = path.with_suffix(path.suffix + ".tmp")
with tmp_path.open("w", encoding="utf-8") as f:
for row in rows:
f.write(json.dumps(row, ensure_ascii=False, sort_keys=True))
f.write("\n")
tmp_path.replace(path)
return path
def read(self, module: ModuleName) -> list[dict[str, Any]]:
path = self.path_for(module)
if not path.exists():
return []
out: list[dict[str, Any]] = []
with path.open(encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
out.append(json.loads(line))
return out
def read_all(self) -> dict[ModuleName, list[dict[str, Any]]]:
return {m: self.read(m) for m in _MODULES}
def has(self, module: ModuleName) -> bool:
return self.path_for(module).exists()
@@ -0,0 +1,332 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Pre-write validation against staged outputs.
Runs after all three modules have written their per-episode artifacts but
*before* the writer rewrites parquet shards. The validator never touches
parquet; it only inspects the staging tree and the source frame timestamps
exposed by :class:`EpisodeRecord`.
Checks (per the plan's "Intermediate staging and validation" section):
- exact timestamp alignment against source frame timestamps
- no orphan speech / interjection pairs
- plan / memory emission consistency (events have a paired persistent row)
- VQA assistant ``content`` is valid JSON (one of bbox / keypoint / count /
attribute / spatial)
- every row maps to its correct column under :func:`column_for_style`
"""
from __future__ import annotations
import json
import logging
from collections.abc import Iterable, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from lerobot.datasets.language import (
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
column_for_style,
is_view_dependent_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
@dataclass
class ValidationReport:
"""Outcome of one validation pass across all episodes."""
errors: list[str] = field(default_factory=list)
warnings: list[str] = field(default_factory=list)
episodes_checked: int = 0
@property
def ok(self) -> bool:
return not self.errors
def add_error(self, message: str) -> None:
self.errors.append(message)
def add_warning(self, message: str) -> None:
self.warnings.append(message)
def summary(self) -> str:
return f"checked={self.episodes_checked} errors={len(self.errors)} warnings={len(self.warnings)}"
VQA_ANSWER_SHAPES: dict[str, set[str]] = {
"bbox": {"detections"},
"keypoint": {"label", "point_format", "point"},
"count": {"label", "count"},
"attribute": {"label", "attribute", "value"},
"spatial": {"subject", "relation", "object"},
}
def classify_vqa_answer(payload: Any) -> str | None:
"""Best-effort classification of a VQA answer payload to a question type."""
if not isinstance(payload, dict):
return None
keys = set(payload.keys())
for kind, required in VQA_ANSWER_SHAPES.items():
if required.issubset(keys):
return kind
return None
@dataclass
class StagingValidator:
"""Walks the staging tree and produces a :class:`ValidationReport`."""
timestamp_atol: float = 0.0 # exact-match by default
dataset_camera_keys: tuple[str, ...] | None = None
"""Known ``observation.images.*`` keys on the dataset. When set, the
validator additionally enforces that every view-dependent row's
``camera`` field references one of these keys. Pass ``None`` (default)
to skip that cross-check (e.g. in unit tests with no real dataset)."""
def validate(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
) -> ValidationReport:
report = ValidationReport()
for record in records:
self._validate_episode(record, staging_dir, report)
report.episodes_checked += 1
return report
def _validate_episode(
self,
record: EpisodeRecord,
staging_dir: Path,
report: ValidationReport,
) -> None:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged = staging.read_all()
all_rows: list[dict[str, Any]] = []
for module_name, rows in staged.items():
for row in rows:
row = {**row, "_module": module_name}
all_rows.append(row)
frame_ts = set(record.frame_timestamps)
events: list[dict[str, Any]] = []
persistent: list[dict[str, Any]] = []
for row in all_rows:
self._check_column_routing(row, report, record.episode_index)
self._check_camera_field(row, report, record.episode_index, self.dataset_camera_keys)
# ``_check_column_routing`` already recorded any unknown-style error;
# don't let the same ``column_for_style`` lookup raise here uncaught.
try:
column = column_for_style(row.get("style"))
except ValueError:
continue
if column == LANGUAGE_PERSISTENT:
persistent.append(row)
else:
events.append(row)
for row in events:
self._check_event_timestamp_alignment(row, frame_ts, report, record.episode_index)
self._check_speech_interjection_pairs(events, report, record.episode_index)
self._check_plan_memory_consistency(persistent, events, report, record.episode_index)
self._check_vqa_json(events, report, record.episode_index)
self._check_vqa_uniqueness_per_frame_camera(events, report, record.episode_index)
def _check_camera_field(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
dataset_camera_keys: Sequence[str] | None,
) -> None:
"""Enforce the camera invariant + that the key matches the dataset's cameras."""
style = row.get("style")
camera = row.get("camera")
try:
validate_camera_field(style, camera)
except ValueError as exc:
report.add_error(f"ep={episode_index} module={row.get('_module')}: {exc}")
return
if is_view_dependent_style(style) and dataset_camera_keys and camera not in dataset_camera_keys:
report.add_error(
f"ep={episode_index} module={row.get('_module')}: camera {camera!r} on style "
f"{style!r} is not one of the dataset's video keys {sorted(dataset_camera_keys)!r}"
)
def _check_vqa_uniqueness_per_frame_camera(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
"""Ensure at most one (vqa, user) and one (vqa, assistant) per (t, camera)."""
counts: dict[tuple[float, str, str], int] = {}
for row in events:
if row.get("style") != "vqa":
continue
ts = row.get("timestamp")
camera = row.get("camera")
role = row.get("role")
if ts is None or camera is None or role is None:
continue # other validators flag these
key = (float(ts), str(camera), str(role))
counts[key] = counts.get(key, 0) + 1
for (ts, camera, role), n in counts.items():
if n > 1:
report.add_error(
f"ep={episode_index}: {n} duplicate vqa rows at t={ts} "
f"camera={camera!r} role={role!r}; expected at most one per (t, camera, role)"
)
def _check_column_routing(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
) -> None:
style = row.get("style")
module = row.get("_module")
try:
target_col = column_for_style(style)
except ValueError:
report.add_error(f"ep={episode_index} module={module}: unknown style {style!r}")
return
if module == "plan" and target_col != LANGUAGE_PERSISTENT:
report.add_error(
f"ep={episode_index} module=plan emitted style {style!r} that routes to {target_col} (must be persistent)"
)
if module in {"interjections", "vqa"} and target_col != LANGUAGE_EVENTS:
report.add_error(
f"ep={episode_index} module={module} emitted style {style!r} that routes to {target_col} (must be events)"
)
def _check_event_timestamp_alignment(
self,
row: dict[str, Any],
frame_ts: set[float],
report: ValidationReport,
episode_index: int,
) -> None:
ts = row.get("timestamp")
if ts is None:
report.add_error(f"ep={episode_index}: event row missing timestamp: {row!r}")
return
if self.timestamp_atol == 0.0:
if float(ts) not in frame_ts:
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} does not match any source frame timestamp"
)
else:
if not any(abs(float(ts) - f) <= self.timestamp_atol for f in frame_ts):
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} not within {self.timestamp_atol}s of any frame"
)
def _check_speech_interjection_pairs(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
speech_ts: dict[float, int] = {}
interjection_ts: dict[float, int] = {}
for row in events:
ts = row.get("timestamp")
if ts is None:
continue
ts_f = float(ts)
if row.get("style") is None and row.get("role") == "assistant":
speech_ts[ts_f] = speech_ts.get(ts_f, 0) + 1
if row.get("style") == "interjection":
interjection_ts[ts_f] = interjection_ts.get(ts_f, 0) + 1
for ts in interjection_ts:
if ts not in speech_ts:
report.add_error(f"ep={episode_index}: interjection at t={ts} has no paired speech atom")
def _check_plan_memory_consistency(
self,
persistent: Sequence[dict[str, Any]],
events: Sequence[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
plan_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "plan"})
memory_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "memory"})
subtask_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "subtask"})
interjection_ts = sorted(
{
float(r["timestamp"])
for r in events
if r.get("style") == "interjection" and r.get("timestamp") is not None
}
)
if persistent and not plan_ts:
report.add_warning(f"ep={episode_index}: persistent rows present but no plan emitted")
# every interjection should have a same-timestamp plan refresh
for ts in interjection_ts:
if ts not in set(plan_ts):
report.add_error(
f"ep={episode_index}: interjection at t={ts} has no co-timestamped plan update"
)
# memory should be emitted at subtask boundaries (subset relation)
if memory_ts and subtask_ts:
mem_set = set(memory_ts)
sub_set = set(subtask_ts)
stray = sorted(mem_set - sub_set)
if stray:
report.add_warning(f"ep={episode_index}: memory rows at {stray} not at any subtask boundary")
def _check_vqa_json(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
for row in events:
if row.get("style") != "vqa" or row.get("role") != "assistant":
continue
content = row.get("content")
if content is None:
report.add_error(
f"ep={episode_index}: VQA assistant row at t={row.get('timestamp')} has null content"
)
continue
try:
payload = json.loads(content)
except (TypeError, ValueError) as exc:
report.add_error(
f"ep={episode_index}: VQA assistant content not valid JSON at t={row.get('timestamp')}: {exc}"
)
continue
shape = classify_vqa_answer(payload)
if shape is None:
report.add_error(
f"ep={episode_index}: VQA assistant payload at t={row.get('timestamp')} does not match any known shape: keys={list(payload) if isinstance(payload, dict) else type(payload).__name__}"
)
@@ -0,0 +1,599 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared Qwen-VL client.
The pipeline uses a single shared VLM across modules. vLLM is preferred when
available (high throughput, JSON-guided decoding); transformers is the
fallback. A ``stub`` backend is used for unit tests so fixtures never call
into a real model.
The client speaks one method, :meth:`VlmClient.generate_json`, which:
- accepts a list of OpenAI/HF-style multimodal messages,
- requests JSON output from the server,
- batches requests transparently,
- and reprompts once on a JSON parse failure with an inline correction
message before raising.
"""
from __future__ import annotations
import atexit
import base64
import io
import json
import os
import shlex
import signal
import subprocess
import sys
import threading
import time
import urllib.request
from collections.abc import Callable, Sequence
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Protocol
from .config import VlmConfig
class VlmClient(Protocol):
"""Protocol every backend must implement."""
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
"""Generate one JSON-decoded response per messages list."""
@dataclass
class StubVlmClient:
"""Deterministic stub used in unit tests.
A test passes a callable that maps the *last user message text* (or, if
that is empty, the full message list) to a JSON-serializable response.
"""
responder: Callable[[Sequence[dict[str, Any]]], Any]
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
return [self.responder(list(messages)) for messages in messages_batch]
def _strip_to_json(text: str) -> Any:
text = text.strip()
# Strip <think>...</think> blocks (Qwen3 Thinking style)
while "<think>" in text and "</think>" in text:
start = text.find("<think>")
end = text.find("</think>", start) + len("</think>")
text = (text[:start] + text[end:]).strip()
# Strip ```json ... ``` fences from chat-tuned backbones
if text.startswith("```"):
first = text.find("\n")
last = text.rfind("```")
if first != -1 and last != -1 and last > first:
text = text[first + 1 : last].strip()
try:
return json.loads(text)
except (ValueError, json.JSONDecodeError):
pass
# Fall back to extracting the first balanced {...} block.
obj_text = _extract_first_json_object(text)
if obj_text is None:
raise json.JSONDecodeError("No JSON object found", text, 0)
return json.loads(obj_text)
def _extract_first_json_object(text: str) -> str | None:
"""Return the first balanced ``{...}`` substring, ignoring braces in
string literals. Returns ``None`` if no balanced block is found."""
start = text.find("{")
if start < 0:
return None
depth = 0
in_string = False
escape = False
for i in range(start, len(text)):
ch = text[i]
if escape:
escape = False
continue
if ch == "\\":
escape = True
continue
# Note: ``escape`` is always False here — the ``if escape`` branch
# above already handled and reset it.
if ch == '"':
in_string = not in_string
continue
if in_string:
continue
if ch == "{":
depth += 1
elif ch == "}":
depth -= 1
if depth == 0:
return text[start : i + 1]
return None
@dataclass
class _GenericTextClient:
"""Wraps any text-generation callable in JSON-mode + one-retry semantics."""
generate_text: Callable[[Sequence[Sequence[dict[str, Any]]], int, float], list[str]]
config: VlmConfig
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
max_tok = max_new_tokens if max_new_tokens is not None else self.config.max_new_tokens
temp = temperature if temperature is not None else self.config.temperature
raw = self.generate_text(messages_batch, max_tok, temp)
out: list[Any] = []
for messages, text in zip(messages_batch, raw, strict=True):
try:
out.append(_strip_to_json(text))
continue
except (ValueError, json.JSONDecodeError):
pass
retry = list(messages) + [
{"role": "assistant", "content": text},
{
"role": "user",
"content": (
"Your previous reply was not valid JSON. "
"Reply with strictly valid JSON, no prose, no fences."
),
},
]
retry_text = self.generate_text([retry], max_tok, temp)[0]
try:
out.append(_strip_to_json(retry_text))
except (ValueError, json.JSONDecodeError):
# After retry: log preview and return None instead of crashing
# the whole pipeline. Modules treat None as "skip".
preview = retry_text.strip().replace("\n", " ")[:200]
print(
f"[vlm] WARNING: failed to parse JSON after retry; preview: {preview!r}",
flush=True,
)
out.append(None)
return out
def make_vlm_client(config: VlmConfig) -> VlmClient:
"""Build the shared VLM client.
Only the ``openai`` backend is supported for now. The shipped workflow
is Hugging Face Jobs (``examples/annotations/run_hf_job.py``): it boots
a vLLM server inside the ``vllm/vllm-openai`` image and the pipeline
talks to it over the OpenAI-compatible API (``--vlm.backend=openai``,
optionally auto-spawning the server via ``auto_serve`` /
``serve_command``). The former in-process ``vllm`` / ``transformers``
backends were removed to keep the support surface to the HF Jobs path.
For ``stub``, construct :class:`StubVlmClient` directly with a responder
callable; it is rejected here to make accidental misuse obvious.
"""
if config.backend == "openai":
return _make_openai_client(config)
if config.backend == "stub":
raise ValueError(
"Use StubVlmClient(...) directly for the stub backend; make_vlm_client builds real clients."
)
if config.backend in {"vllm", "transformers"}:
raise ValueError(
f"backend={config.backend!r} (in-process local model) is not supported for now — "
"only backend='openai' (the Hugging Face Jobs flow) is. Run the pipeline via "
"examples/annotations/run_hf_job.py, which serves the model with vLLM in the "
"vllm/vllm-openai image and talks to it over the OpenAI-compatible API."
)
raise ValueError(f"Unknown VLM backend: {config.backend!r}")
def _make_openai_client(config: VlmConfig) -> VlmClient:
"""Backend that talks to any OpenAI-compatible server.
Compatible with ``vllm serve``, ``transformers serve``,
``ktransformers serve``, and hosted endpoints. By default the server
is expected to be already running. Set ``auto_serve=True`` to have
this client spawn one (default: ``transformers serve``), wait until
it's ready, and tear it down on process exit.
Image blocks ``{"type":"image", "image":<PIL.Image>}`` are
auto-converted to ``image_url`` data-URLs. Video blocks
``{"type":"video", "video":[<PIL>...]}`` are forwarded as
multi-frame ``video_url`` items where supported.
"""
try:
from openai import OpenAI # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError(
"openai package is required for backend='openai'. Install with `pip install openai`."
) from exc
api_base = config.api_base
api_key = config.api_key
auto_serve = config.auto_serve
api_bases: list[str] = [api_base]
print(
f"[lerobot-annotate] backend=openai model={config.model_id} "
f"api_base={api_base} auto_serve={auto_serve}",
flush=True,
)
if auto_serve:
if config.parallel_servers > 1:
print(
f"[lerobot-annotate] spawning {config.parallel_servers} parallel servers",
flush=True,
)
api_bases = _spawn_parallel_inference_servers(config)
elif _server_is_up(api_base):
print(f"[lerobot-annotate] reusing server already up at {api_base}", flush=True)
else:
print("[lerobot-annotate] no server reachable; spawning one", flush=True)
api_base = _spawn_inference_server(config)
api_bases = [api_base]
print(f"[lerobot-annotate] server ready at {api_base}", flush=True)
clients = [OpenAI(base_url=base, api_key=api_key) for base in api_bases]
# round-robin counter for parallel mode
rr_counter = {"i": 0}
# ``mm_processor_kwargs`` is a vllm-specific extra; transformers serve
# rejects it with HTTP 422. Send it only when explicitly opted in via
# an env var (e.g. ``LEROBOT_OPENAI_SEND_MM_KWARGS=1`` for vllm).
send_mm_kwargs = os.environ.get("LEROBOT_OPENAI_SEND_MM_KWARGS", "").lower() in {"1", "true", "yes"}
rr_lock = threading.Lock()
def _one_call(messages: Sequence[dict[str, Any]], max_tok: int, temp: float) -> str:
api_messages, mm_kwargs = _to_openai_messages(messages)
kwargs: dict[str, Any] = {
"model": config.model_id,
"messages": api_messages,
"max_tokens": max_tok,
"temperature": temp,
}
extra_body: dict[str, Any] = {}
if send_mm_kwargs and mm_kwargs:
extra_body["mm_processor_kwargs"] = {**mm_kwargs, "do_sample_frames": True}
if config.chat_template_kwargs:
extra_body["chat_template_kwargs"] = config.chat_template_kwargs
if extra_body:
kwargs["extra_body"] = extra_body
with rr_lock:
chosen = clients[rr_counter["i"] % len(clients)]
rr_counter["i"] += 1
response = chosen.chat.completions.create(**kwargs)
return response.choices[0].message.content or ""
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
if len(batch) <= 1 or config.client_concurrency <= 1:
return [_one_call(messages, max_tok, temp) for messages in batch]
# Parallel fan-out — vllm batches these on the server side.
max_workers = min(config.client_concurrency, len(batch))
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = [pool.submit(_one_call, messages, max_tok, temp) for messages in batch]
return [f.result() for f in futures]
return _GenericTextClient(_gen, config)
def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
"""Spawn ``config.parallel_servers`` independent vllm replicas.
Each replica:
- is pinned to a single GPU via ``CUDA_VISIBLE_DEVICES``
- listens on ``serve_port + i``
- is shut down via the same atexit hook as the single-server path
Returns the list of ``api_base`` URLs the client should round-robin
across.
"""
n = config.parallel_servers
api_bases: list[str] = []
procs: list[subprocess.Popen] = []
ready_events: list[threading.Event] = []
# Multiple readiness signals — uvicorn's own banner is suppressed at
# ``--uvicorn-log-level warning``, so we also accept vllm's own
# "Starting vLLM API server" line and the route-listing line. The
# HTTP probe below is the ultimate fallback.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
# Single lock for all server-stream threads so multibyte chars from
# different servers don't interleave and tear UTF-8 sequences.
print_lock = threading.Lock()
base_cmd = config.serve_command or (
f"vllm serve {shlex.quote(config.model_id)} "
f"--tensor-parallel-size 1 "
f"--max-model-len {config.max_model_len or 32768} "
f"--uvicorn-log-level warning"
)
num_gpus = config.num_gpus if config.num_gpus > 0 else n
for i in range(n):
port = config.serve_port + i
gpu = i % num_gpus
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(gpu)
cmd = base_cmd.replace("{port}", str(port)) if "{port}" in base_cmd else f"{base_cmd} --port {port}"
api_base = f"http://localhost:{port}/v1"
api_bases.append(api_base)
print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
env=env,
)
procs.append(proc)
ready = threading.Event()
ready_events.append(ready)
def _stream(idx: int, p: subprocess.Popen, ev: threading.Event) -> None:
# Read whole lines and emit each line atomically under the
# shared print_lock so output from N servers stays readable.
assert p.stdout is not None
for line in iter(p.stdout.readline, ""):
with print_lock:
sys.stdout.write(f"[server-{idx}] {line}")
if not line.endswith(("\n", "\r")):
sys.stdout.write("\n")
sys.stdout.flush()
if any(m in line for m in ready_markers):
ev.set()
threading.Thread(target=_stream, args=(i, proc, ready), daemon=True).start()
def _probe(idx: int, base: str, ev: threading.Event, p: subprocess.Popen) -> None:
while not ev.is_set() and p.poll() is None:
if _server_is_up(base):
print(f"[server-{idx}] ready (http probe)", flush=True)
ev.set()
return
time.sleep(2)
threading.Thread(target=_probe, args=(i, api_base, ready, proc), daemon=True).start()
def _shutdown() -> None:
for i, p in enumerate(procs):
if p.poll() is None:
print(f"[server-{i}] stopping pid={p.pid}", flush=True)
p.send_signal(signal.SIGINT)
for p in procs:
try:
p.wait(timeout=15)
except subprocess.TimeoutExpired:
p.kill()
p.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while any(not ev.is_set() for ev in ready_events) and time.monotonic() < deadline:
for i, p in enumerate(procs):
if p.poll() is not None:
raise RuntimeError(
f"[server-{i}] inference server exited unexpectedly with rc={p.returncode}"
)
time.sleep(2)
if any(not ev.is_set() for ev in ready_events):
raise RuntimeError(f"[server] not all replicas became ready within {config.serve_ready_timeout_s}s")
print(f"[lerobot-annotate] all {n} servers ready: {api_bases}", flush=True)
return api_bases
def _server_is_up(api_base: str) -> bool:
"""Return True if ``api_base/models`` answers 200 within 2 seconds."""
url = api_base.rstrip("/") + "/models"
# ``api_base`` is the user-configured local-server URL we just spawned
# or the user passed in via ``--vlm.api_base``; the bandit B310 warning
# is for arbitrary user-controlled URLs with file:/ schemes which
# cannot reach this code path.
try:
with urllib.request.urlopen(url, timeout=2) as resp: # noqa: S310 # nosec B310
return resp.status == 200
except Exception: # noqa: BLE001
return False
def _spawn_inference_server(config: VlmConfig) -> str:
"""Spawn ``transformers serve`` (or ``serve_command``), wait until it
accepts ``/v1/models``, and register a shutdown hook.
Streams the server's stdout/stderr to the parent terminal in
real-time on a background thread so users can see model-load
progress and errors as they happen.
Returns the full ``api_base`` URL the OpenAI client should use.
"""
cmd = config.serve_command
if not cmd:
cmd = (
f"transformers serve {shlex.quote(config.model_id)} "
f"--port {config.serve_port} --continuous-batching"
)
api_base = f"http://localhost:{config.serve_port}/v1"
print(f"[server] launching: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
)
# Watch the server output for the uvicorn readiness banner. This is
# more reliable than polling /v1/models because transformers serve
# rescans its cache on every model-list request, which can exceed
# the urllib timeout and trigger an infinite probe loop.
ready_event = threading.Event()
# See _spawn_parallel_inference_servers for why we accept these.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
def _probe() -> None:
while not ready_event.is_set() and proc.poll() is None:
if _server_is_up(api_base):
print("[server] ready (http probe)", flush=True)
ready_event.set()
return
time.sleep(2)
threading.Thread(target=_probe, daemon=True).start()
def _stream_output() -> None:
# Read raw chunks instead of iterating lines so tqdm progress
# bars (which overwrite using \r) flush in real time.
assert proc.stdout is not None
buf = ""
prefix_started = False
while True:
ch = proc.stdout.read(1)
if ch == "":
# process exited; flush any tail
if buf:
sys.stdout.write(buf)
sys.stdout.flush()
return
if not prefix_started:
sys.stdout.write("[server] ")
prefix_started = True
sys.stdout.write(ch)
sys.stdout.flush()
buf += ch
if ch in ("\n", "\r"):
if any(marker in buf for marker in ready_markers):
ready_event.set()
buf = ""
prefix_started = False
threading.Thread(target=_stream_output, daemon=True).start()
def _shutdown() -> None:
if proc.poll() is None:
print(f"[server] stopping pid={proc.pid}", flush=True)
proc.send_signal(signal.SIGINT)
try:
proc.wait(timeout=15)
except subprocess.TimeoutExpired:
proc.kill()
proc.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while time.monotonic() < deadline:
if proc.poll() is not None:
raise RuntimeError(
f"[server] inference server exited unexpectedly with rc={proc.returncode}. "
f"See [server] log lines above for the cause."
)
if ready_event.wait(timeout=2):
return api_base
proc.terminate()
raise RuntimeError(f"[server] did not become ready within {config.serve_ready_timeout_s}s")
def _to_openai_messages(
messages: Sequence[dict[str, Any]],
) -> tuple[list[dict[str, Any]], dict[str, Any]]:
"""Convert internal messages to OpenAI chat format.
Returns ``(api_messages, mm_kwargs)``. Multimodal-processor kwargs
(``fps`` from ``video_url`` blocks) are extracted out so the caller
can pass them via ``extra_body.mm_processor_kwargs`` rather than
inside the content blocks (which transformers serve rejects).
File-URL video blocks are inlined as base64 data URLs.
"""
out_messages: list[dict[str, Any]] = []
mm_kwargs: dict[str, Any] = {}
for message in messages:
content = message.get("content")
if not isinstance(content, list):
out_messages.append({"role": message["role"], "content": content})
continue
out_blocks: list[dict[str, Any]] = []
for block in content:
block_type = block.get("type") if isinstance(block, dict) else None
if block_type == "text":
out_blocks.append({"type": "text", "text": block.get("text", "")})
elif block_type == "image":
out_blocks.append(
{"type": "image_url", "image_url": {"url": _pil_to_data_url(block["image"])}}
)
elif block_type == "video":
frames = block.get("video", [])
for img in frames:
out_blocks.append({"type": "image_url", "image_url": {"url": _pil_to_data_url(img)}})
elif block_type == "video_url":
video_url = dict(block["video_url"])
url = video_url.get("url", "")
if url.startswith("file://"):
video_url["url"] = _file_to_data_url(url[len("file://") :])
out_blocks.append({"type": "video_url", "video_url": video_url})
fps = block.get("fps")
if fps is not None:
mm_kwargs["fps"] = fps
else:
out_blocks.append(block)
out_messages.append({"role": message["role"], "content": out_blocks})
return out_messages, mm_kwargs
def _file_to_data_url(path: str) -> str:
"""Read a local video file and return a base64 ``data:video/mp4`` URL."""
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("ascii")
return f"data:video/mp4;base64,{b64}"
def _pil_to_data_url(image: Any) -> str:
"""Encode a PIL.Image as a base64 data URL."""
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode("ascii")
return f"data:image/png;base64,{b64}"
@@ -0,0 +1,341 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Final parquet rewrite.
For every episode the writer:
1. reads the staged module outputs,
2. partitions them into a persistent slice (PERSISTENT_STYLES) and an event
slice (EVENT_ONLY_STYLES + style=None tool-call atoms),
3. sorts each slice deterministically,
4. broadcasts the persistent slice across every frame in the episode,
5. for each frame, materializes the sublist of event rows whose timestamp
exactly equals that frame's timestamp,
6. drops the legacy ``subtask_index`` column,
7. writes the parquet shard back in place.
The writer does NOT add a dataset-level ``tools`` column. Tool *calls* are
emitted per-row via the existing ``tool_calls`` field on the v3.1 row
struct for every speech atom. The tool *schema* (the description
of the ``say`` function and its parameters) is a fixed code constant —
``SAY_TOOL_SCHEMA`` below — and downstream chat-template consumers import
it directly rather than reading a redundant per-row column.
Invariants enforced here (and re-checked by the validator):
- per-episode persistent slice is byte-identical across every frame;
- ``language_events`` rows on a frame all have ``timestamp == frame_ts``
(timestamps come straight from the source parquet — never recomputed);
- every row passes ``column_for_style(style)``.
"""
from __future__ import annotations
import logging
from collections import defaultdict
from collections.abc import Sequence
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import pyarrow as pa
import pyarrow.parquet as pq
from lerobot.datasets.language import (
EVENT_ONLY_STYLES,
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
column_for_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
# Tool schema constants live in lerobot.datasets.language — single
# source of truth. Re-exported here so existing imports
# (``from lerobot.annotations.steerable_pipeline.writer import SAY_TOOL_SCHEMA``)
# keep working.
from lerobot.datasets.language import DEFAULT_TOOLS, SAY_TOOL_SCHEMA # noqa: F401, E402
def _row_persistent_sort_key(row: dict[str, Any]) -> tuple:
return (float(row["timestamp"]), row.get("style") or "", row.get("role") or "")
def _row_event_sort_key(row: dict[str, Any]) -> tuple:
# events are bucketed per-frame, but within a frame we still want determinism
return (
row.get("style") or "",
row.get("role") or "",
row.get("camera") or "",
)
def _normalize_row(row: dict[str, Any], style: str | None, *, with_timestamp: bool) -> dict[str, Any]:
"""Coerce a staged row into the language-column struct shape.
Key order matches ``PERSISTENT_ROW_FIELDS`` / ``EVENT_ROW_FIELDS`` — the
writer infers the parquet struct schema from insertion order, so
``timestamp`` (persistent rows only) sits between ``style`` and ``camera``.
"""
camera = row.get("camera")
validate_camera_field(style, camera)
out: dict[str, Any] = {
"role": str(row["role"]),
"content": None if row.get("content") is None else str(row["content"]),
"style": style,
}
if with_timestamp:
out["timestamp"] = float(row["timestamp"])
out["camera"] = None if camera is None else str(camera)
out["tool_calls"] = _normalize_tool_calls(row.get("tool_calls"))
return out
def _normalize_persistent_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the persistent column's struct shape."""
style = row.get("style")
if style not in PERSISTENT_STYLES:
raise ValueError(
f"persistent slice contains row with non-persistent style {style!r}; "
"row would be misrouted under column_for_style()"
)
if "timestamp" not in row:
raise ValueError(f"persistent row missing timestamp: {row!r}")
if "role" not in row:
# Friendly error from the writer instead of a raw KeyError below;
# the validator doesn't check ``role`` yet.
raise ValueError(f"persistent row missing role: {row!r}")
return _normalize_row(row, style, with_timestamp=True)
def _normalize_event_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the event column's struct shape (no timestamp)."""
style = row.get("style")
if style is not None and style not in EVENT_ONLY_STYLES:
raise ValueError(
f"event slice contains row with style {style!r}; expected None or one of {EVENT_ONLY_STYLES}"
)
if column_for_style(style) != LANGUAGE_EVENTS:
raise ValueError(f"event row with style {style!r} would not route to language_events")
if "role" not in row:
raise ValueError(f"event row missing role: {row!r}")
return _normalize_row(row, style, with_timestamp=False)
def _normalize_tool_calls(value: Any) -> list[Any] | None:
if value is None:
return None
if not isinstance(value, list):
raise ValueError(f"tool_calls must be a list or None, got {type(value).__name__}")
return list(value)
def _validate_atom_invariants(row: dict[str, Any]) -> None:
"""At-least-one of content/tool_calls; style=None implies tool_calls."""
has_content = row.get("content") is not None
has_tools = row.get("tool_calls") is not None
if not (has_content or has_tools):
raise ValueError(f"row has neither content nor tool_calls: {row!r}")
if row.get("style") is None and not has_tools:
raise ValueError(f"style=None requires tool_calls: {row!r}")
def _validate_speech_atom(row: dict[str, Any]) -> None:
"""Speech atoms: role=assistant, style=None, content=None, say tool call."""
if row.get("style") is not None:
return # not a speech atom
if row.get("role") != "assistant":
raise ValueError(f"speech atom must have role=assistant: {row!r}")
if row.get("content") is not None:
raise ValueError(f"speech atom must have content=null: {row!r}")
tool_calls = row.get("tool_calls")
if not tool_calls or not isinstance(tool_calls, list):
raise ValueError(f"speech atom must have non-empty tool_calls list: {row!r}")
first = tool_calls[0]
if not isinstance(first, dict):
raise ValueError(f"speech atom tool_calls[0] must be a dict: {row!r}")
if first.get("type") != "function":
raise ValueError(f"speech atom tool_calls[0].type must be 'function': {row!r}")
fn = first.get("function") or {}
if fn.get("name") != "say":
raise ValueError(f"speech atom tool_calls[0].function.name must be 'say': {row!r}")
args = fn.get("arguments") or {}
if not isinstance(args, dict) or "text" not in args or not isinstance(args["text"], str):
raise ValueError(f"speech atom must carry 'text' string in arguments: {row!r}")
@dataclass
class LanguageColumnsWriter:
"""Rewrite ``data/chunk-*/file-*.parquet`` with the two language columns."""
drop_existing_subtask_index: bool = True
def write_all(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> list[Path]:
episodes_by_path: dict[Path, list[EpisodeRecord]] = defaultdict(list)
for record in records:
episodes_by_path[record.data_path].append(record)
written: list[Path] = []
for path, eps in episodes_by_path.items():
self._rewrite_one(path, eps, staging_dir, root)
written.append(path)
return written
def _rewrite_one(
self,
path: Path,
episodes: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> None:
table = pq.read_table(path)
n_rows = table.num_rows
# Ensure we cover every episode in the file. Episodes that don't have
# staging artifacts are passed through with empty annotation lists —
# this keeps the writer idempotent and safe for partial reruns.
staged_per_ep: dict[int, dict[str, list[dict[str, Any]]]] = {}
for record in episodes:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged_per_ep[record.episode_index] = staging.read_all()
persistent_by_ep: dict[int, list[dict[str, Any]]] = {}
events_by_ep_ts: dict[int, dict[float, list[dict[str, Any]]]] = {}
for ep_index, ep_staged in staged_per_ep.items():
persistent_rows: list[dict[str, Any]] = []
event_rows: list[dict[str, Any]] = [] # carry timestamp until bucketed
for _module_name, rows in ep_staged.items():
for row in rows:
style = row.get("style")
if column_for_style(style) == LANGUAGE_PERSISTENT:
persistent_rows.append(row)
else:
event_rows.append(row)
persistent_rows.sort(key=_row_persistent_sort_key)
normalized_persistent = []
for r in persistent_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
normalized_persistent.append(_normalize_persistent_row(r))
persistent_by_ep[ep_index] = normalized_persistent
buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
for r in event_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
ts = float(r["timestamp"])
buckets[ts].append(_normalize_event_row(r))
for ts in list(buckets.keys()):
buckets[ts].sort(key=_row_event_sort_key)
events_by_ep_ts[ep_index] = buckets
episode_col = (
table.column("episode_index").to_pylist() if "episode_index" in table.column_names else None
)
ts_col = table.column("timestamp").to_pylist() if "timestamp" in table.column_names else None
if episode_col is None or ts_col is None:
raise ValueError(f"{path} is missing 'episode_index' or 'timestamp' — required by the writer.")
per_row_persistent: list[list[dict[str, Any]]] = []
per_row_events: list[list[dict[str, Any]]] = []
for i in range(n_rows):
ep = episode_col[i]
ts = float(ts_col[i])
per_row_persistent.append(persistent_by_ep.get(ep, []))
buckets = events_by_ep_ts.get(ep, {})
per_row_events.append(buckets.get(ts, []))
new_table = self._materialize_table(
table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
)
# Atomic replace: write to a sibling tmp path and rename so a crash
# mid-write can't leave a half-written shard that ``pq.read_table``
# would then fail to open. ``Path.replace`` is atomic on POSIX +
# Windows when source and target sit on the same filesystem.
tmp_path = path.with_suffix(path.suffix + ".tmp")
pq.write_table(new_table, tmp_path)
tmp_path.replace(path)
def _materialize_table(
self,
table: pa.Table,
persistent: list[list[dict[str, Any]]],
events: list[list[dict[str, Any]]],
*,
drop_old: bool,
) -> pa.Table:
cols = []
names = []
for name in table.column_names:
if drop_old and name == "subtask_index":
continue
if name in (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS):
continue # we'll re-add canonical versions
# Strip any legacy ``tools`` column previously emitted by older
# writers — the schema no longer uses it (constant lives in
# SAY_TOOL_SCHEMA / DEFAULT_TOOLS).
if name == "tools":
continue
cols.append(table.column(name))
names.append(name)
# We let pyarrow infer struct/list schema rather than passing the
# canonical type from `lerobot.datasets.language` directly: that type
# uses `pa.json_()` for the `tool_calls` element type, which
# `pa.array(..., type=...)` cannot materialize from Python lists on
# current pyarrow versions. The inferred schema round-trips through
# parquet and `LeRobotDataset` correctly — `tests/datasets/test_language.py`
# exercises the same flow.
persistent_arr = pa.array(persistent)
events_arr = pa.array(events)
cols.extend([persistent_arr, events_arr])
names.extend([LANGUAGE_PERSISTENT, LANGUAGE_EVENTS])
return pa.Table.from_arrays(cols, names=names)
def speech_atom(timestamp: float, text: str) -> dict[str, Any]:
"""Build a canonical speech tool-call atom for the events column."""
return {
"role": "assistant",
"content": None,
"style": None,
"timestamp": float(timestamp),
"camera": None,
"tool_calls": [
{
"type": "function",
"function": {
"name": "say",
"arguments": {"text": text},
},
}
],
}
+70
View File
@@ -18,6 +18,7 @@ from __future__ import annotations
# Utilities
########################################################################################
import logging
import time
import traceback
from contextlib import nullcontext
from copy import copy
@@ -243,3 +244,72 @@ def sanity_check_dataset_robot_compatibility(
raise ValueError(
"Dataset metadata compatibility check failed with mismatches:\n" + "\n".join(mismatches)
)
########################################################################################
# Teleoperator smooth handover helpers
# NOTE(Maxime): These functions use minimal type hints to maintain compatibility with utils
# being a root module.
########################################################################################
def teleop_supports_feedback(teleop) -> bool:
"""Return True when the teleop can receive position feedback (is actuated).
Actuated teleops (e.g. SO-101, OpenArmMini) have non-empty ``feedback_features``
and expose ``enable_torque`` / ``disable_torque`` motor-control methods.
TODO(Maxime): See if it is possible to unify this interface across teleops instead of duck-typing.
"""
return (
bool(teleop.feedback_features)
and hasattr(teleop, "disable_torque")
and hasattr(teleop, "enable_torque")
)
def teleop_smooth_move_to(teleop, target_pos: dict, duration_s: float = 2.0, fps: int = 30) -> None:
"""Smoothly move an actuated teleop to ``target_pos`` via linear interpolation.
Requires the teleoperator to support feedback (i.e. have non-empty
``feedback_features`` and implement ``disable_torque`` / ``enable_torque``).
``target_pos`` is expected to be in the teleop's action/feedback key space.
For homogeneous setups (e.g. SO-101 leader + SO-101 follower) this matches
the robot action key space directly.
TODO(Maxime): This blocks up to ``duration_s`` seconds; during this time the
follower robot does not receive new actions, which could be an issue on LeKiwi.
"""
teleop.enable_torque()
current = teleop.get_action()
steps = max(int(duration_s * fps), 1)
for step in range(steps + 1):
t = step / steps
interp = {
k: current[k] * (1 - t) + target_pos[k] * t if k in target_pos else current[k] for k in current
}
teleop.send_feedback(interp)
time.sleep(1 / fps)
def follower_smooth_move_to(
robot, current: dict, target: dict, duration_s: float = 1.0, fps: int = 30
) -> None:
"""Smoothly move the follower robot from ``current`` to ``target`` action.
Used when the teleop is non-actuated: instead of driving the leader arm to
the follower, the follower is brought to the teleop's current pose so the
robot meets the operator's hand rather than jumping to it on the first frame.
Both ``current`` and ``target`` must be in the robot action key space
(i.e. the output of ``robot_action_processor``).
"""
steps = max(int(duration_s * fps), 1)
for step in range(steps + 1):
t = step / steps
interp = {k: current[k] * (1 - t) + target[k] * t if k in target else current[k] for k in current}
robot.send_action(interp)
time.sleep(1 / fps)
+146
View File
@@ -205,3 +205,149 @@ class WandBLogger:
wandb_video = self._wandb.Video(video_path, fps=self.env_fps, format="mp4")
self._wandb.log({f"{mode}/video": wandb_video}, step=step)
def log_training_examples(
self,
batch: dict,
step: int,
*,
camera_keys: list[str],
n_samples: int = 4,
policy=None,
predict_actions: bool = False,
mode: str = "train",
) -> None:
"""Push a ``wandb.Table`` of training-example rows for the current batch.
Each row is one batch element with:
* one ``wandb.Image`` column per camera in ``camera_keys`` (CHW or
HWC, uint8 or float in [0,1] — auto-detected),
* any text fields present in the batch (``task`` / ``subtask`` /
``memory`` / ``instruction``),
* ground-truth action first/last frame (the action chunk's
endpoints — gives a quick sense of trajectory direction),
* if ``predict_actions=True`` and ``policy`` is supplied, the model's
``predict_action_chunk`` first/last frame alongside.
This is opt-in via ``--wandb.log_examples_freq=N`` on the CLI; the
training loop calls it once every N steps. Cheap to keep on: with
N=4 samples and 3 cameras you upload 12 small PNGs per dump and (if
enabled) run one extra inference forward pass.
"""
import logging # noqa: PLC0415
import numpy as np # noqa: PLC0415
import torch # noqa: PLC0415
if mode not in {"train", "eval"}:
raise ValueError(mode)
# Batch size — first tensor-like value wins.
bsz = next(
(int(v.shape[0]) for v in batch.values() if hasattr(v, "shape") and v.ndim > 0),
None,
)
if not bsz:
return
n = min(int(n_samples), bsz)
# Optional predicted-action forward pass on the first n samples.
pred_actions: np.ndarray | None = None
if predict_actions and policy is not None:
was_training = policy.training
try:
policy.eval()
sub_batch = {}
for k, v in batch.items():
if isinstance(v, torch.Tensor):
sub_batch[k] = v[:n]
elif isinstance(v, (list, tuple)):
sub_batch[k] = list(v[:n])
else:
sub_batch[k] = v
with torch.no_grad():
pred = policy.predict_action_chunk(sub_batch)
pred_actions = pred.detach().cpu().float().numpy()
except Exception as exc: # noqa: BLE001
logging.warning(
"log_training_examples: predict_action_chunk failed (%s) — "
"skipping predicted-action columns",
exc,
)
pred_actions = None
finally:
if was_training:
policy.train()
present_cameras = [c for c in camera_keys if c in batch]
text_keys = [k for k in ("task", "subtask", "memory", "instruction") if k in batch]
columns = ["sample"]
columns.extend(c.removeprefix("observation.images.") or c for c in present_cameras)
columns.extend(text_keys)
columns.append("gt_action_first")
columns.append("gt_action_last")
if pred_actions is not None:
columns.append("pred_action_first")
columns.append("pred_action_last")
table = self._wandb.Table(columns=columns)
def _to_uint8_hwc(t: torch.Tensor) -> np.ndarray:
# Strip an outer time dim if present: (T, C, H, W) -> first frame.
if t.ndim == 4:
t = t[0]
# CHW -> HWC.
if t.ndim == 3 and t.shape[0] in (1, 3, 4) and t.shape[-1] not in (1, 3, 4):
t = t.permute(1, 2, 0)
arr = t.detach().cpu().float().numpy()
if arr.size and float(arr.max()) <= 1.5:
arr = arr * 255.0
return np.clip(arr, 0, 255).astype(np.uint8)
def _action_endpoints(a: torch.Tensor) -> tuple[str, str]:
arr = a.detach().cpu().float().numpy()
if arr.ndim == 2: # (T, D)
return (
str(np.round(arr[0], 3).tolist()),
str(np.round(arr[-1], 3).tolist()),
)
if arr.ndim == 1:
rounded = np.round(arr, 3).tolist()
return (str(rounded), str(rounded))
return (str(arr.tolist()), str(arr.tolist()))
for i in range(n):
row: list = [i]
for cam in present_cameras:
try:
row.append(self._wandb.Image(_to_uint8_hwc(batch[cam][i])))
except Exception as exc: # noqa: BLE001
logging.warning(
"log_training_examples: camera %s sample %d failed (%s)",
cam,
i,
exc,
)
row.append(None)
for tk in text_keys:
v = batch[tk]
if isinstance(v, (list, tuple)):
row.append(str(v[i]) if i < len(v) else "")
else:
row.append(str(v))
action = batch.get("action")
if isinstance(action, torch.Tensor) and action.ndim >= 1:
first, last = _action_endpoints(action[i])
row.append(first)
row.append(last)
else:
row.append("")
row.append("")
if pred_actions is not None:
p = torch.from_numpy(pred_actions[i])
pfirst, plast = _action_endpoints(p)
row.append(pfirst)
row.append(plast)
table.add_data(*row)
self._wandb.log({f"{mode}/examples": table}, step=step)
+2 -2
View File
@@ -41,8 +41,8 @@ class DatasetRecordConfig:
video: bool = True
# Upload dataset to Hugging Face hub.
push_to_hub: bool = True
# Upload on private repository on the Hugging Face hub.
private: bool = False
# If True, upload as private; if None, defer to the org default on the Hub (only affects orgs).
private: bool | None = None
# Add tags to your dataset on the hub.
tags: list[str] | None = None
# Number of subprocesses handling the saving of frames as PNG. Set to 0 to use threads only;
+66
View File
@@ -62,6 +62,72 @@ class WandBConfig:
run_id: str | None = None
mode: str | None = None # Allowed values: 'online', 'offline' 'disabled'. Defaults to 'online'
add_tags: bool = True # If True, save configuration as tags in the WandB run.
# Periodic training-example dump (independent of ``log_freq``). When > 0,
# every ``log_examples_freq`` steps the trainer pushes a ``wandb.Table``
# with one row per sampled batch element containing each camera view
# (rendered as ``wandb.Image``), any text fields present in the batch
# (``task`` / ``subtask`` / ``memory`` / ``instruction``), and the
# ground-truth action chunk's first + last frames. Defaults to 5000 — set
# to 0 to disable. Only fires when ``enable=True``, so runs without wandb
# are unaffected.
log_examples_freq: int = 5000
# Number of batch elements to include in each example dump.
log_examples_n: int = 4
# If True (default), also run ``policy.predict_action_chunk`` on the logged
# samples (in eval mode, no_grad) and add predicted vs ground-truth action
# columns to the table. Costs one extra forward pass per dump — negligible
# at the 5k-step default cadence. Set to ``False`` if your policy doesn't
# implement ``predict_action_chunk`` or you want to skip the extra forward.
log_examples_predict_actions: bool = True
@dataclass
class EMAConfig:
"""Exponential Moving Average of trainable policy parameters.
Diffusion / flow-matching policies (Diffusion Policy, π0/π0.5,
pi052) benefit substantially from averaging late-training
parameter oscillations — see Chi et al. 2023 §V.D. The official
JAX openpi trainer ships EMA with ``ema_decay=0.99`` (default) and
``0.999`` for its pi05_libero config; the openpi PyTorch port
explicitly lists EMA as unsupported, and LeRobot main inherited
that gap. Enabling this flag plugs ema-pytorch
(https://github.com/lucidrains/ema-pytorch) into the LeRobot
training loop with a shadow ``nn.Module`` clone of the policy.
Cost: 1× model params in fp32 shadow (~13 GB for pi052's 3.3B
params) + one elementwise update per training step (~1% step time).
Off by default (opt-in): EMA is only beneficial for flow-matching /
diffusion policies (pi0/pi05/pi052), and the fp32 shadow copy is pure
overhead for other policies (e.g. VLA-JEPA). Set ``--ema.enable=true``
to turn it on (the pi05/pi052 training recipes do this). openpi (JAX)
ships EMA on for every config; enable it explicitly to match that.
"""
enable: bool = False
# Target EMA decay β in θ_ema ← β·θ_ema + (1-β)·θ_live (passed to
# ema-pytorch as ``beta``).
# 0.999 — last ~1000 steps; pi05_libero default in openpi
# 0.99 — last ~100 steps; openpi top-level default
# 0.75 — very fast EMA (Diffusion Policy original setting)
# 0.9999 — very slow EMA (long classification runs)
decay: float = 0.99
# Skip the first N calls to ``ema.update()``; during this window
# the shadow is just a hard copy of the live weights (no averaging).
# Lets early-training rapid changes settle before averaging begins.
# Maps to ema-pytorch's ``update_after_step`` (NOT a smooth decay
# ramp like older lerobot EMA implementations).
warmup_steps: int = 0
# When True, the periodic eval block uses the EMA shadow model
# directly (``ema.ema_model``) instead of the live policy. Standard
# practice for diffusion-style policies — eval scores are usually
# 13% higher than the live policy at the same step.
use_for_eval: bool = True
# When True, the periodic wandb training-example dump uses the EMA
# shadow for the optional predicted-action columns (so what you see
# in W&B matches eval behavior).
use_for_wandb_examples: bool = True
@dataclass
+8 -3
View File
@@ -255,8 +255,7 @@ def extract_path_fields_from_config(config_path: str, path_fields: list[str]) ->
remaining = config_data[field]
if remaining:
_config_yaml_overrides[field] = _flatten_to_cli_args(remaining)
else:
del config_data[field]
del config_data[field]
modified = True
if not modified:
@@ -311,7 +310,13 @@ def wrap(config_path: Path | None = None) -> Callable[[F], F]:
cli_args = filter_arg("config_path", cli_args)
cfg = argtype.from_pretrained(config_path_cli, cli_args=cli_args)
else:
cfg = draccus.parse(config_class=argtype, config_path=config_path, args=cli_args)
if config_path_cli:
cli_args = filter_arg("config_path", cli_args)
cfg = draccus.parse(
config_class=argtype,
config_path=config_path_cli or config_path,
args=cli_args,
)
response = fn(cfg, *args, **kwargs)
return response
+18 -3
View File
@@ -147,7 +147,16 @@ class TrainingRecipe:
return cls.from_dict(data)
def _validate_message_recipe(self) -> None:
"""Ensure every templated binding is known and at least one turn is a target."""
"""Ensure every templated binding is known and the recipe supervises something.
A recipe is valid if it has at least one of:
* a ``target: true`` assistant turn (drives text-CE supervision), or
* a ``stream: low_level`` turn (drives flow / action supervision via
``predict_actions=True``, even when no assistant turn is targeted —
e.g. π0.5-style ``low_level_execution`` where the action expert
conditions on a user-only ``${subtask}`` prompt).
"""
assert self.messages is not None
known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}
@@ -156,8 +165,14 @@ class TrainingRecipe:
if missing:
raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")
if not any(turn.target for turn in self.messages):
raise ValueError("Message recipes must contain at least one target turn.")
has_target = any(turn.target for turn in self.messages)
has_low_level = any(turn.stream == "low_level" for turn in self.messages)
if not (has_target or has_low_level):
raise ValueError(
"Message recipes must contain at least one supervised turn — "
"either ``target: true`` (text CE) or ``stream: low_level`` "
"(flow/action loss)."
)
def _validate_blend_recipe(self) -> None:
"""Ensure each blend component is a non-empty, weighted message recipe."""
@@ -0,0 +1,68 @@
# subtask_mem_vqa_speech — Hi-Robot blend + memory + spoken responses.
#
# Superset of subtasks_vqa.yaml. Keeps the core subtask + action + VQA
# training, and adds two text-supervised tasks:
#
# high_level_subtask — predict the subtask from the task.
# low_level_execution — flow loss with [images, subtask, state].
# memory_update — compress progress into a memory note.
# user_interjection_response — reply to a user interjection with a
# spoken `say` tool call (no plan, no
# subtask text — just the spoken reply).
# ask_vqa_{top,wrist} — camera-grounded VQA.
#
# Plan is intentionally left out — memory is the only persistent
# high-level state here, keeping the prompt short.
#
# Requires the dataset to carry `memory`, `interjection` and `say`-tool
# annotations (the annotation pipeline's memory + interjection modules)
# in addition to `subtask` and `vqa`. Sub-recipes whose `if_present`
# bindings are missing simply don't render for that sample, so a
# dataset without interjections still trains the rest of the blend.
#
# Tool-call note: the `say` tool call on the interjection-response turn
# is flattened to a `<say>...</say>` text marker by the tokenizer step
# (`_flatten_say_tool_calls`) so the LM head learns to emit exactly the
# marker the runtime parses back (`_split_plan_and_say`).
blend:
high_level_subtask:
weight: 0.30
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
low_level_execution:
weight: 0.55
messages:
# The action expert is conditioned on the SUBTASK — at inference
# `HighLevelSubtaskFwd` generates it via the LM head and feeds it
# here. `stream: low_level` flips `predict_actions=True` so the
# flow loss fires; no text-CE target (subtask prediction is owned
# by `high_level_subtask`).
- {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
memory_update:
# At inference, `MemoryUpdateFwd` is triggered only on
# `subtask_change` events (sparse). Training densely with
# `active_at` — i.e. on every frame inside a subtask interval,
# not just the boundary frame — supervises the same
# (prior_memory, completed_subtask) → current_memory mapping
# against varied observations within the interval. The model
# learns a stateless transformation; the *when* to emit lives in
# the inference trigger, not the model. Annotations only exist
# for ~1% of frames as boundary events, so `emitted_at` would
# waste 99% of the blend draws (and silently leak them into a
# task-conditioned fallback); `active_at` lifts the renderable
# rate to ~87% on this dataset.
weight: 0.15
bindings:
prior_memory: "nth_prev(style=memory, offset=1)"
current_memory: "active_at(t, style=memory)"
completed_subtask: "nth_prev(style=subtask, offset=1)"
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
- {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
- {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
@@ -0,0 +1,99 @@
# subtask_mem_vqa_robocasa — Hi-Robot blend tuned for RoboCasa cameras.
#
# Same supervision as ``subtask_mem.yaml`` (subtask + memory) plus
# camera-grounded VQA across the three RoboCasa camera keys produced
# by ``slurm_build_robocasa_composite_seen.py``:
#
# observation.images.robot0_agentview_left (left scene view)
# observation.images.robot0_agentview_right (right scene view)
# observation.images.robot0_eye_in_hand (wrist)
#
# The annotation pipeline (``examples/annotations/run_hf_job.py``) emits
# VQA per camera, so each anchor frame produces three (user, assistant)
# rows tagged with their source camera. Each VQA sub-recipe consumes
# the rows for one camera via ``camera=...`` resolver bindings.
#
# Spatial VQA targets (bbox / point) are rewritten from JSON to
# PaliGemma ``<locDDDD>`` tokens by ``_messages_vqa_to_loc`` —
# ``register_paligemma_loc_tokens`` already collapses them to single
# detection-vocab ids so the LM head learns the pretrained pointing /
# detection prior, not a 7-piece BPE salad.
#
# Interjections / spoken responses are intentionally absent — the
# annotation job runs with ``--interjections.enabled=false``.
blend:
high_level_subtask:
weight: 0.25
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
low_level_execution:
weight: 0.45
messages:
# Action expert is conditioned on the SUBTASK; at inference the
# high-level loop generates it via the LM head and feeds it here.
# ``stream: low_level`` flips ``predict_actions=True`` so the flow
# loss fires; subtask CE is owned by ``high_level_subtask``.
- {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
memory_update:
# Trained densely with ``active_at`` — every frame inside a subtask
# interval — so the (prior_memory, completed_subtask) → current_memory
# mapping is supervised against varied observations. The *when* to
# emit lives in the inference trigger (subtask_change), not the
# model. See ``subtask_mem.yaml`` for the long version of this note.
weight: 0.15
bindings:
prior_memory: "nth_prev(style=memory, offset=1)"
current_memory: "active_at(t, style=memory)"
completed_subtask: "nth_prev(style=subtask, offset=1)"
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
- {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
- {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
ask_vqa_agentview_left:
weight: 0.05
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_agentview_left)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_agentview_left)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.robot0_agentview_left}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_agentview_right:
weight: 0.05
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_agentview_right)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_agentview_right)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.robot0_agentview_right}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_wrist:
weight: 0.05
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_eye_in_hand)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_eye_in_hand)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.robot0_eye_in_hand}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
@@ -0,0 +1,114 @@
# subtask_mem_vqa_speech — Hi-Robot blend + memory + spoken responses.
#
# Superset of subtasks_vqa.yaml. Keeps the core subtask + action + VQA
# training, and adds two text-supervised tasks:
#
# high_level_subtask — predict the subtask from the task.
# low_level_execution — flow loss with [images, subtask, state].
# memory_update — compress progress into a memory note.
# user_interjection_response — reply to a user interjection with a
# spoken `say` tool call (no plan, no
# subtask text — just the spoken reply).
# ask_vqa_{top,wrist} — camera-grounded VQA.
#
# Plan is intentionally left out — memory is the only persistent
# high-level state here, keeping the prompt short.
#
# Requires the dataset to carry `memory`, `interjection` and `say`-tool
# annotations (the annotation pipeline's memory + interjection modules)
# in addition to `subtask` and `vqa`. Sub-recipes whose `if_present`
# bindings are missing simply don't render for that sample, so a
# dataset without interjections still trains the rest of the blend.
#
# Tool-call note: the `say` tool call on the interjection-response turn
# is flattened to a `<say>...</say>` text marker by the tokenizer step
# (`_flatten_say_tool_calls`) so the LM head learns to emit exactly the
# marker the runtime parses back (`_split_plan_and_say`).
blend:
high_level_subtask:
weight: 0.25
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
low_level_execution:
weight: 0.40
messages:
# The action expert is conditioned on the SUBTASK — at inference
# `HighLevelSubtaskFwd` generates it via the LM head and feeds it
# here. `stream: low_level` flips `predict_actions=True` so the
# flow loss fires; no text-CE target (subtask prediction is owned
# by `high_level_subtask`).
- {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
memory_update:
# At inference, `MemoryUpdateFwd` is triggered only on
# `subtask_change` events (sparse). Training densely with
# `active_at` — i.e. on every frame inside a subtask interval,
# not just the boundary frame — supervises the same
# (prior_memory, completed_subtask) → current_memory mapping
# against varied observations within the interval. The model
# learns a stateless transformation; the *when* to emit lives in
# the inference trigger, not the model. Annotations only exist
# for ~1% of frames as boundary events, so `emitted_at` would
# waste 99% of the blend draws (and silently leak them into the
# task-conditioned fallback); `active_at` lifts the renderable
# rate to ~87% on Hi-Robot-style datasets.
weight: 0.10
bindings:
prior_memory: "nth_prev(style=memory, offset=1)"
current_memory: "active_at(t, style=memory)"
completed_subtask: "nth_prev(style=subtask, offset=1)"
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
- {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
- {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
user_interjection_response:
weight: 0.10
bindings:
interjection: "emitted_at(t, style=interjection)"
speech: "emitted_at(t, role=assistant, tool_name=say)"
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
# Spoken reply only: the assistant turn carries no text content,
# just a `say` tool call (`tool_calls_from: speech`). The chat
# tokenizer flattens it to a `<say>...</say>` marker, so the
# supervised target trains the model to respond to an
# interjection with a spoken acknowledgement.
- {role: assistant, stream: high_level, target: true, if_present: speech, tool_calls_from: speech}
# VQA is view-dependent — each camera gets its own sub-recipe so the
# resolver disambiguates via `camera=...`. Camera keys match
# subtasks_vqa.yaml (`front` + `wrist`); adjust to your dataset.
ask_vqa_top:
weight: 0.075
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.front}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_wrist:
weight: 0.075
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.wrist}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
@@ -0,0 +1,61 @@
# subtasks_vqa — Hi-Robot blend for PI052 (PaliGemma backbone).
#
# Trains two things only: subtasks and VQA. Plan and memory are
# intentionally left out — keeps the prompt short and the training
# surface small. The fuller blend with memory + spoken replies is
# ``subtask_mem_vqa_speech.yaml``.
#
# high_level_subtask — predict the subtask from the task.
# low_level_execution — flow loss with [images, subtask, state].
# ask_vqa_{top,wrist} — camera-grounded VQA.
#
# PI052's text tokenizer renders these messages as plain
# ``Role: content`` text (PaliGemma is not chat-pretrained).
blend:
high_level_subtask:
weight: 0.40
messages:
- {role: user, content: "${task}", stream: high_level}
- {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
low_level_execution:
weight: 0.40
messages:
# The action expert is conditioned on the SUBTASK — at inference
# the high-level loop (``HighLevelSubtaskFwd``) generates the
# subtask via the LM head and feeds it here. The action expert's
# prefix is [images, subtask, state]. ``stream: low_level`` flips
# ``predict_actions=True`` so the flow loss fires; no text-CE
# target here (subtask prediction is owned by
# ``high_level_subtask``).
- {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
ask_vqa_top:
weight: 0.10
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.front}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
ask_vqa_wrist:
weight: 0.10
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- {type: image, feature: observation.images.wrist}
- {type: text, text: "${vqa_query}"}
- {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
+19 -2
View File
@@ -30,7 +30,7 @@ from lerobot.utils.hub import HubMixin
from lerobot.utils.sample_weighting import SampleWeightingConfig
from . import parser
from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
from .default import DatasetConfig, EMAConfig, EvalConfig, PeftConfig, WandBConfig
from .policies import PreTrainedConfig
from .rewards import RewardModelConfig
@@ -111,9 +111,20 @@ class TrainPipelineConfig(HubMixin):
scheduler: LRSchedulerConfig | None = None
eval: EvalConfig = field(default_factory=EvalConfig)
wandb: WandBConfig = field(default_factory=WandBConfig)
ema: EMAConfig = field(default_factory=EMAConfig)
peft: PeftConfig | None = None
# Sample weighting configuration (e.g., for RA-BC training)
# VQA oversampling. When set (a fraction in (0, 1)), the training
# dataloader uses a WeightedEpisodeAwareSampler that draws frames
# carrying a `vqa` language annotation often enough that they make
# up roughly this fraction of the training stream. VQA annotations
# are typically sparse, so without this they are underrepresented.
# `None` (default) keeps uniform episode-aware sampling.
vqa_target_fraction: float | None = None
# Sample weighting configuration (e.g., for RA-BC training). Old
# inline ``use_rabc`` / ``rabc_*`` params are migrated to this
# field by ``_migrate_legacy_rabc_keys`` above.
sample_weighting: SampleWeightingConfig | None = None
# Rename map for the observation to override the image and state keys
@@ -177,6 +188,12 @@ class TrainPipelineConfig(HubMixin):
)
active_cfg = self.trainable_config
if self.rename_map and active_cfg.pretrained_path is None:
raise ValueError(
"`rename_map` requires a pretrained policy checkpoint. "
"Fresh initialization derives feature names from the current dataset, so no rename is applied."
)
if not self.job_name:
if self.env is None:
self.job_name = f"{active_cfg.type}"
+15 -2
View File
@@ -35,7 +35,6 @@ from .dataset_tools import (
remove_feature,
split_dataset,
)
from .factory import make_dataset, resolve_delta_timestamps
from .image_writer import safe_stop_image_writer
from .io_utils import load_episodes, write_stats
from .language import (
@@ -50,11 +49,24 @@ from .lerobot_dataset import LeRobotDataset
from .multi_dataset import MultiLeRobotDataset
from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
from .pyav_utils import check_video_encoder_parameters_pyav, detect_available_encoders_pyav
from .sampler import EpisodeAwareSampler
from .sampler import EpisodeAwareSampler, WeightedEpisodeAwareSampler
from .streaming_dataset import StreamingLeRobotDataset
from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
from .video_utils import VideoEncodingManager
def make_dataset(*args, **kwargs):
from .factory import make_dataset as _make_dataset
return _make_dataset(*args, **kwargs)
def resolve_delta_timestamps(*args, **kwargs):
from .factory import resolve_delta_timestamps as _resolve_delta_timestamps
return _resolve_delta_timestamps(*args, **kwargs)
# NOTE: Low-level I/O functions (cast_stats_to_numpy, get_parquet_file_size_in_mb, etc.)
# and legacy migration constants are intentionally NOT re-exported here.
# Import directly: ``from lerobot.datasets.io_utils import ...``
@@ -65,6 +77,7 @@ __all__ = [
"DEFAULT_QUANTILES",
"EVENT_ONLY_STYLES",
"EpisodeAwareSampler",
"WeightedEpisodeAwareSampler",
"LANGUAGE_EVENTS",
"LANGUAGE_PERSISTENT",
"LeRobotDataset",
+43
View File
@@ -126,10 +126,53 @@ class DatasetReader:
def _load_hf_dataset(self) -> datasets.Dataset:
"""hf_dataset contains all the observations, states, actions, rewards, etc."""
features = get_hf_features_from_features(self._meta.features)
# Datasets annotated with the PR1 language columns may have been
# written without registering those columns in ``meta/info.json``
# (e.g. they predate ``CODEBASE_VERSION="v3.1"`` and were
# back-filled by ``lerobot-annotate``). Probe a single parquet
# shard and graft the column features on so the strict
# ``Dataset.from_parquet`` cast doesn't fail with
# ``column names don't match``.
features = self._extend_features_with_language_columns(features)
hf_dataset = load_nested_dataset(self.root / "data", features=features, episodes=self.episodes)
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
def _extend_features_with_language_columns(
self, features: datasets.Features
) -> datasets.Features:
"""Add ``language_persistent`` / ``language_events`` to ``features``
when the underlying parquet shards declare them but the metadata
doesn't. No-op when neither column is present or both are
already registered.
"""
# Find any one parquet to peek at; bail if there are none yet
# (the dataset will fail later for an unrelated reason and we
# want that error to surface as-is).
try:
sample = next((self.root / "data").glob("*/*.parquet"))
except StopIteration:
return features
from pyarrow import parquet as _pq # noqa: PLC0415
schema_names = set(_pq.read_schema(sample).names)
from .language import ( # noqa: PLC0415
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
language_events_column_feature,
language_persistent_column_feature,
)
extra: dict[str, object] = {}
if LANGUAGE_PERSISTENT in schema_names and LANGUAGE_PERSISTENT not in features:
extra[LANGUAGE_PERSISTENT] = language_persistent_column_feature()
if LANGUAGE_EVENTS in schema_names and LANGUAGE_EVENTS not in features:
extra[LANGUAGE_EVENTS] = language_events_column_feature()
if not extra:
return features
return datasets.Features({**features, **extra})
def _check_cached_episodes_sufficient(self) -> bool:
"""Check if the cached dataset contains all requested episodes and their video files."""
if self.hf_dataset is None or len(self.hf_dataset) == 0:
+89 -3
View File
@@ -170,6 +170,29 @@ def render_sample(
"""
persistent_rows = _normalize_rows(persistent or [])
event_rows = _normalize_rows(events or [])
# VQA-priority routing. A ``vqa`` annotation is sparse and
# view-dependent; the plain weighted blend would (a) waste a draw
# whenever it picks an ``ask_vqa*`` sub-recipe for a frame that has
# no VQA, and (b) silently drop a VQA-annotated frame whenever it
# picks a non-VQA sub-recipe. So: if the blend has ``ask_vqa*``
# sub-recipes and *this* frame carries one of their VQA bindings,
# render VQA here regardless of the weighted draw. That makes VQA's
# recipe-side training share equal the VQA-annotation density (the
# maximum reachable without a dataset-level oversampling sampler).
if recipe.blend is not None:
vqa_rendered = _render_vqa_if_present(
recipe,
persistent=persistent_rows,
events=event_rows,
t=t,
sample_idx=sample_idx,
task=task,
dataset_ctx=dataset_ctx,
)
if vqa_rendered is not None:
return vqa_rendered
selected_recipe = _select_recipe(recipe, sample_idx)
bindings = _resolve_bindings(
selected_recipe,
@@ -183,6 +206,59 @@ def render_sample(
return _render_message_recipe(selected_recipe, bindings)
def _render_vqa_if_present(
recipe: TrainingRecipe,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow],
t: float,
sample_idx: int,
task: str | None,
dataset_ctx: Any | None,
) -> RenderedMessages | None:
"""Render an ``ask_vqa*`` sub-recipe iff this frame carries a VQA
annotation; otherwise return ``None`` so the caller falls back to the
normal weighted blend.
When several VQA sub-recipes resolve (e.g. a frame annotated for more
than one camera), one is chosen deterministically by relative weight.
"""
assert recipe.blend is not None
renderable: list[tuple[float, RenderedMessages]] = []
for name, component in recipe.blend.items():
if not name.startswith("ask_vqa"):
continue
bindings = _resolve_bindings(
component,
persistent=persistent,
events=events,
t=t,
sample_idx=sample_idx,
task=task,
dataset_ctx=dataset_ctx,
)
rendered = _render_message_recipe(component, bindings)
if rendered is not None:
renderable.append((float(component.weight or 0.0), rendered))
if not renderable:
return None
if len(renderable) == 1:
return renderable[0][1]
# Multiple cameras have a VQA for this frame — deterministic pick by
# relative weight (fall back to a uniform draw if all weights are 0).
total = sum(w for w, _ in renderable) or float(len(renderable))
digest = hashlib.blake2b(f"vqa:{sample_idx}".encode(), digest_size=8).digest()
draw = int.from_bytes(digest, "big") / 2**64 * total
cumulative = 0.0
for w, rendered in renderable:
cumulative += w or (total / len(renderable))
if draw < cumulative:
return rendered
return renderable[-1][1]
def _select_recipe(recipe: TrainingRecipe, sample_idx: int) -> TrainingRecipe:
"""Pick a deterministic blend component for ``sample_idx`` (or return ``recipe``)."""
if recipe.blend is None:
@@ -346,7 +422,15 @@ def _render_message_recipe(
if turn.target:
target_indices.append(message_idx)
if not target_indices:
# A render is meaningful if it supervises *something*: either a
# text-CE target turn, or a ``low_level`` stream turn (flow / action
# supervision — e.g. the flow-only ``low_level_execution`` recipe,
# ``user(${subtask})`` with ``stream: low_level`` and no target).
# Without this, a flow-only recipe renders to ``None`` every time
# the blend draws it → ``predict_actions`` is never True → the
# action expert never receives a flow loss.
has_low_level = any(stream == "low_level" for stream in streams)
if not target_indices and not has_low_level:
return None
rendered = {
@@ -403,8 +487,10 @@ def _validate_rendered(rendered: RenderedMessages) -> None:
if len(streams) != len(messages):
raise ValueError("message_streams must be aligned with messages.")
if not target_indices:
raise ValueError("Rendered samples must contain at least one target message.")
# Valid iff it supervises something: a text-CE target turn OR a
# ``low_level`` stream turn (flow / action supervision).
if not target_indices and not any(s == "low_level" for s in streams):
raise ValueError("Rendered samples must contain a target message or a low_level-stream message.")
for idx in target_indices:
if idx < 0 or idx >= len(messages):
raise ValueError(f"Target message index {idx} is out of bounds.")
+3 -2
View File
@@ -524,7 +524,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
license: str | None = "apache-2.0",
tag_version: bool = True,
push_videos: bool = True,
private: bool = False,
private: bool | None = None,
allow_patterns: list[str] | str | None = None,
upload_large_folder: bool = False,
**card_kwargs,
@@ -543,7 +543,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
tag_version: If ``True``, create a Git tag for the current codebase
version.
push_videos: If ``False``, skip uploading the ``videos/`` directory.
private: If ``True``, create a private repository.
private: If ``True``, create a private repository. If ``None``
(default), defer to the org default on the Hub (only affects orgs).
allow_patterns: Glob pattern(s) restricting which files to upload.
upload_large_folder: If ``True``, use ``upload_large_folder`` instead
of ``upload_folder`` for very large datasets.
+63
View File
@@ -84,3 +84,66 @@ class EpisodeAwareSampler:
def __len__(self) -> int:
return len(self.indices)
class WeightedEpisodeAwareSampler(EpisodeAwareSampler):
"""``EpisodeAwareSampler`` that draws frames *with replacement* in
proportion to per-frame weights.
Used to oversample frames carrying a sparse annotation (e.g. a VQA
question) so the policy sees them more often than their natural
dataset density. One epoch still yields ``len(self.indices)``
samples — the weights only change the *composition* of the stream,
not its length. Each epoch re-draws, so the oversampled subset
varies run to run.
"""
def __init__(
self,
dataset_from_indices: list[int],
dataset_to_indices: list[int],
frame_weights,
*,
episode_indices_to_use: list | None = None,
drop_n_first_frames: int = 0,
drop_n_last_frames: int = 0,
):
"""
Args:
dataset_from_indices: Episode start indices (see ``EpisodeAwareSampler``).
dataset_to_indices: Episode end indices.
frame_weights: 1-D sequence/tensor of non-negative weights, one per
dataset frame (length == total dataset frames). Higher weight ⇒
that frame is sampled more often.
episode_indices_to_use / drop_n_first_frames / drop_n_last_frames:
Same meaning as ``EpisodeAwareSampler`` — the episode-boundary
frame filtering is applied first, then weighting is restricted
to the surviving frames.
"""
super().__init__(
dataset_from_indices,
dataset_to_indices,
episode_indices_to_use=episode_indices_to_use,
drop_n_first_frames=drop_n_first_frames,
drop_n_last_frames=drop_n_last_frames,
shuffle=False,
)
weights = torch.as_tensor(frame_weights, dtype=torch.double).flatten()
idx = torch.tensor(self.indices, dtype=torch.long)
if weights.numel() <= int(idx.max()):
raise ValueError(
f"frame_weights has {weights.numel()} entries but the sampler "
f"references frame index {int(idx.max())}."
)
selected = weights[idx]
if not torch.isfinite(selected).all() or bool((selected < 0).any()):
raise ValueError("frame_weights must be finite and non-negative.")
if float(selected.sum()) <= 0.0:
# All surviving frames have zero weight — fall back to uniform.
selected = torch.ones_like(selected)
self._weights = selected
def __iter__(self) -> Iterator[int]:
picks = torch.multinomial(self._weights, num_samples=len(self.indices), replacement=True)
for i in picks.tolist():
yield self.indices[i]
+17 -10
View File
@@ -366,17 +366,24 @@ def get_safe_version(repo_id: str, version: str | packaging.version.Version) ->
hub_versions = get_repo_versions(repo_id)
if not hub_versions:
raise RevisionNotFoundError(
f"""Your dataset must be tagged with a codebase version.
Assuming _version_ is the codebase_version value in the info.json, you can run this:
```python
from huggingface_hub import HfApi
hub_api = HfApi()
hub_api.create_tag("{repo_id}", tag="_version_", repo_type="dataset")
```
"""
msg = (
f"Repo {repo_id!r} has no codebase-version tags. The dataset "
f"either doesn't exist on the Hub yet, or it was uploaded "
f"without a ``v3.x``-style tag. To tag an existing dataset run:\n"
f" from huggingface_hub import HfApi\n"
f" HfApi().create_tag({repo_id!r}, tag='v3.0', repo_type='dataset', exist_ok=True)"
)
# ``RevisionNotFoundError`` extends ``HfHubHTTPError`` whose
# ``__init__`` indexes ``response.headers`` unconditionally on
# current ``huggingface_hub`` versions. Constructing it without
# a real ``Response`` object crashes with either
# ``TypeError: missing 1 required keyword-only argument`` (old
# builds) or ``AttributeError: 'NoneType' object has no attribute
# 'headers'`` (new builds). Skip that path entirely — this isn't
# really an HTTP error, it's a configuration issue — and raise a
# plain ``RuntimeError`` so the message actually reaches the
# caller.
raise RuntimeError(msg)
if target_version in hub_versions:
return f"v{target_version}"
+13 -10
View File
@@ -33,8 +33,8 @@ logger = logging.getLogger(__name__)
# Dimensions for the flat action/state vectors used by the LeRobot wrapper.
# These correspond to the PandaOmron robot in RoboCasa365.
OBS_STATE_DIM = 16 # base_pos(3) + base_quat(4) + ee_pos_rel(3) + ee_quat_rel(4) + gripper_qpos(2)
ACTION_DIM = 12 # base_motion(4) + control_mode(1) + ee_pos(3) + ee_rot(3) + gripper(1)
OBS_STATE_DIM = 16 # ee_pos_rel(3) + ee_quat_rel(4) + base_pos(3) + base_quat(4) + gripper_qpos(2)
ACTION_DIM = 12 # ee_pos(3) + ee_rot(3) + gripper(1) + base_motion(4) + control_mode(1)
ACTION_LOW = -1.0
ACTION_HIGH = 1.0
@@ -101,14 +101,15 @@ def _resolve_tasks(task: str) -> tuple[list[str], str | None]:
def convert_action(flat_action: np.ndarray) -> dict[str, Any]:
"""Split a flat (12,) action vector into a RoboCasa action dict.
Layout: base_motion(4) + control_mode(1) + ee_pos(3) + ee_rot(3) + gripper(1)
Layout (openpi / robocasa.utils.env_utils.convert_action order):
ee_pos(3) + ee_rot(3) + gripper(1) + base_motion(4) + control_mode(1)
"""
return {
"action.base_motion": flat_action[0:4],
"action.control_mode": flat_action[4:5],
"action.end_effector_position": flat_action[5:8],
"action.end_effector_rotation": flat_action[8:11],
"action.gripper_close": flat_action[11:12],
"action.end_effector_position": flat_action[0:3],
"action.end_effector_rotation": flat_action[3:6],
"action.gripper_close": flat_action[6:7],
"action.base_motion": flat_action[7:11],
"action.control_mode": flat_action[11:12],
}
@@ -230,12 +231,14 @@ class RoboCasaEnv(gym.Env):
return {"pixels": images}
# `state.*` keys come from PandaOmronKeyConverter inside the wrapper.
# openpi state order: ee first, then base, then gripper (matches the
# openpi robocasa pipeline / examples/robocasa/main.py state layout).
agent_pos = np.concatenate(
[
raw_obs.get("state.base_position", np.zeros(3)),
raw_obs.get("state.base_rotation", np.zeros(4)),
raw_obs.get("state.end_effector_position_relative", np.zeros(3)),
raw_obs.get("state.end_effector_rotation_relative", np.zeros(4)),
raw_obs.get("state.base_position", np.zeros(3)),
raw_obs.get("state.base_rotation", np.zeros(4)),
raw_obs.get("state.gripper_qpos", np.zeros(2)),
],
axis=-1,
+17 -3
View File
@@ -18,12 +18,25 @@ from typing import TYPE_CHECKING
import numpy as np
from lerobot.utils.import_utils import _placo_available, require_package
from lerobot.utils.import_utils import require_package
if TYPE_CHECKING or _placo_available:
_placo_runtime_error: ImportError | None = None
if TYPE_CHECKING:
import placo # type: ignore[import-not-found]
else:
placo = None
try:
import placo # type: ignore[import-not-found]
except ImportError as _placo_import_err:
placo = None
_placo_runtime_error = _placo_import_err
def _raise_if_placo_unusable() -> None:
if placo is None and _placo_runtime_error is not None:
raise ImportError(
f"placo is installed but failed to import: {_placo_runtime_error!s}"
) from _placo_runtime_error
class RobotKinematics:
@@ -44,6 +57,7 @@ class RobotKinematics:
joint_names (list[str] | None): List of joint names to use for the kinematics solver
"""
require_package("placo", extra="placo-dep")
_raise_if_placo_unusable()
self.robot = placo.RobotWrapper(urdf_path)
self.solver = placo.KinematicsSolver(self.robot)
+100 -18
View File
@@ -43,6 +43,7 @@ from .tables import (
CAN_CMD_SET_ZERO,
DEFAULT_BAUDRATE,
DEFAULT_TIMEOUT_MS,
HANDSHAKE_TIMEOUT_S,
MODEL_RESOLUTION,
MOTOR_LIMIT_PARAMS,
NORMALIZED_DATA,
@@ -215,14 +216,16 @@ class RobstrideMotorsBus(MotorsBusBase):
self._is_connected = False
raise ConnectionError(f"Failed to connect to CAN bus: {e}") from e
def _query_status_via_clear_fault(self, motor: NameOrID) -> tuple[bool, can.Message | None]:
def _query_status_via_clear_fault(
self, motor: NameOrID, timeout: float = RUNNING_TIMEOUT
) -> tuple[bool, can.Message | None]:
motor_name = self._get_motor_name(motor)
motor_id = self._get_motor_id(motor_name)
recv_id = self._get_motor_recv_id(motor_name)
data = [0xFF] * 7 + [CAN_CMD_CLEAR_FAULT]
msg = can.Message(arbitration_id=motor_id, data=data, is_extended_id=False)
self._bus().send(msg)
return self._recv_status_via_clear_fault(expected_recv_id=recv_id)
return self._recv_status_via_clear_fault(expected_recv_id=recv_id, timeout=timeout)
def _recv_status_via_clear_fault(
self, expected_recv_id: int | None = None, timeout: float = RUNNING_TIMEOUT
@@ -280,7 +283,7 @@ class RobstrideMotorsBus(MotorsBusBase):
faulted_motors = []
for motor_name in self.motors:
has_fault, msg = self._query_status_via_clear_fault(motor_name)
has_fault, msg = self._query_status_via_clear_fault(motor_name, timeout=HANDSHAKE_TIMEOUT_S)
if msg is None:
missing_motors.append(motor_name)
elif has_fault:
@@ -505,6 +508,87 @@ class RobstrideMotorsBus(MotorsBusBase):
return responses
def _recv_all_messages_until_quiet(
self,
*,
timeout: float = RUNNING_TIMEOUT,
max_messages: int = 4096,
) -> list[can.Message]:
"""
Receive frames until the bus goes quiet.
Args:
timeout: Poll timeout used for each recv() call. Collection stops
when one recv() times out (quiet gap).
max_messages: Safety cap to prevent unbounded loops.
"""
out: list[can.Message] = []
max_messages = max(1, max_messages)
timeout = max(0.0, timeout)
try:
while len(out) < max_messages:
msg = self._bus().recv(timeout=timeout)
if msg is None:
break
out.append(msg)
except (can.CanError, OSError) as e:
logger.debug(f"Error draining CAN RX queue on {self.port}: {e}")
return out
def _process_feedback_messages(self, messages: list[can.Message]) -> set[int]:
"""
Decode all received feedback frames and update cached motor states.
Returns:
Set of payload recv_ids that were successfully mapped to motors.
"""
processed_recv_ids: set[int] = set()
for msg in messages:
if len(msg.data) < 1:
logger.debug(
f"Dropping short CAN frame on {self.port} "
f"(arb=0x{int(msg.arbitration_id):02X}, data={bytes(msg.data).hex()})"
)
continue
recv_id = int(msg.data[0])
motor_name = self._recv_id_to_motor.get(recv_id)
if motor_name is None:
logger.debug(
f"Unmapped CAN frame on {self.port} "
f"(arb=0x{int(msg.arbitration_id):02X}, recv_id=0x{recv_id:02X}, data={bytes(msg.data).hex()})"
)
continue
self._process_response(motor_name, msg)
processed_recv_ids.add(recv_id)
return processed_recv_ids
def flush_rx_queue(self, poll_timeout_s: float = 0.0005, max_messages: int = 4096) -> int:
"""
Drain pending RX frames from the CAN interface.
This is used by higher-level controllers to drop stale feedback before issuing
a fresh read cycle, so subsequent state reads are based on most recent replies.
It should also be called once when a controller instance is created/connected,
to clear residual frames left on the interface from previous sessions.
"""
drained = 0
poll_timeout_s = max(0.0, poll_timeout_s)
max_messages = max(1, max_messages)
try:
while drained < max_messages:
msg = self._bus().recv(timeout=poll_timeout_s)
if msg is None:
break
drained += 1
except (can.CanError, OSError) as e:
logger.debug(f"Failed to flush CAN RX queue on {self.port}: {e}")
return drained
def _speed_control(
self,
motor: NameOrID,
@@ -644,11 +728,14 @@ class RobstrideMotorsBus(MotorsBusBase):
msg = can.Message(arbitration_id=motor_id, data=data, is_extended_id=False)
self._bus().send(msg)
recv_id_to_motor[self._get_motor_recv_id(motor)] = motor_name
# Read every feedback frame until RX goes quiet, then decode all of them.
# This avoids dropping useful frames when responses from different motors interleave.
messages = self._recv_all_messages_until_quiet()
processed_recv_ids = self._process_feedback_messages(messages)
responses = self._recv_all_responses(list(recv_id_to_motor.keys()), timeout=RUNNING_TIMEOUT)
for recv_id, motor_name in recv_id_to_motor.items():
if msg := responses.get(recv_id):
self._process_response(motor_name, msg)
if recv_id not in processed_recv_ids:
logger.warning(f"Packet drop: {motor_name} (ID: 0x{recv_id:02X}). Using last known state.")
def _float_to_uint(self, x: float, x_min: float, x_max: float, bits: int) -> int:
"""Convert float to unsigned integer for CAN transmission."""
@@ -711,7 +798,10 @@ class RobstrideMotorsBus(MotorsBusBase):
try:
self._decode_motor_state(msg.data)
except Exception as e:
logger.warning(f"Failed to decode response from {motor}: {e}")
logger.warning(
f"Failed to decode response from {motor} "
f"(arb=0x{int(msg.arbitration_id):02X}, data={bytes(msg.data).hex()}): {e}"
)
def _get_cached_value(self, motor: str, data_name: str) -> Value:
"""Retrieve a specific value from the state cache."""
@@ -848,20 +938,12 @@ class RobstrideMotorsBus(MotorsBusBase):
self._bus().send(msg)
updated_motors.append(motor)
expected_recv_ids = [self._get_motor_recv_id(motor) for motor in updated_motors]
responses = self._recv_all_responses(expected_recv_ids, timeout=RUNNING_TIMEOUT)
for response in responses.values():
payload_motor_name = self._recv_id_to_motor.get(response.data[0])
if payload_motor_name is not None:
self._process_response(payload_motor_name, response)
else:
# Fallback: still attempt to decode based on payload byte0 mapping.
self._decode_motor_state(response.data)
messages = self._recv_all_messages_until_quiet()
processed_recv_ids = self._process_feedback_messages(messages)
for motor in updated_motors:
recv_id = self._get_motor_recv_id(motor)
if recv_id not in responses:
if recv_id not in processed_recv_ids:
logger.warning(f"Packet drop: {motor} (ID: 0x{recv_id:02X}). Using last known state.")
def read_calibration(self) -> dict[str, MotorCalibration]:
+2 -1
View File
@@ -114,7 +114,8 @@ CAN_CMD_SAVE_PARAM = 0xAA
CAN_PARAM_ID = 0x7FF
RUNNING_TIMEOUT = 0.001
RUNNING_TIMEOUT = 0.003
HANDSHAKE_TIMEOUT_S = 0.05
PARAM_TIMEOUT = 0.01
STATE_CACHE_TTL_S = 0.02
+2
View File
@@ -104,6 +104,8 @@ class AdamWConfig(OptimizerConfig):
eps: float = 1e-8
weight_decay: float = 1e-2
grad_clip_norm: float = 10.0
foreach: bool | None = None
fused: bool | None = None
def build(self, params: OptimizerParams) -> torch.optim.Optimizer:
kwargs = asdict(self)
+4
View File
@@ -20,10 +20,12 @@ from .eo1.configuration_eo1 import EO1Config as EO1Config
from .factory import get_policy_class, make_policy, make_policy_config, make_pre_post_processors
from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig as GaussianActorConfig
from .groot.configuration_groot import GrootConfig as GrootConfig
from .molmoact2.configuration_molmoact2 import MolmoAct2Config as MolmoAct2Config
from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
from .pi0.configuration_pi0 import PI0Config as PI0Config
from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
from .pi05.configuration_pi05 import PI05Config as PI05Config
from .pi052.configuration_pi052 import PI052Config as PI052Config
from .pretrained import PreTrainedPolicy as PreTrainedPolicy
from .smolvla.configuration_smolvla import SmolVLAConfig as SmolVLAConfig
from .tdmpc.configuration_tdmpc import TDMPCConfig as TDMPCConfig
@@ -43,10 +45,12 @@ __all__ = [
"EO1Config",
"GaussianActorConfig",
"GrootConfig",
"MolmoAct2Config",
"MultiTaskDiTConfig",
"PI0Config",
"PI0FastConfig",
"PI05Config",
"PI052Config",
"SmolVLAConfig",
"TDMPCConfig",
"VQBeTConfig",
+167 -3
View File
@@ -49,6 +49,7 @@ from .diffusion.configuration_diffusion import DiffusionConfig
from .eo1.configuration_eo1 import EO1Config
from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig
from .groot.configuration_groot import GrootConfig
from .molmoact2.configuration_molmoact2 import MolmoAct2Config
from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
from .pi0.configuration_pi0 import PI0Config
from .pi05.configuration_pi05 import PI05Config
@@ -56,11 +57,85 @@ from .pretrained import PreTrainedPolicy
from .smolvla.configuration_smolvla import SmolVLAConfig
from .tdmpc.configuration_tdmpc import TDMPCConfig
from .utils import validate_visual_features_consistency
from .vla_jepa.configuration_vla_jepa import VLAJEPAConfig
from .vqbet.configuration_vqbet import VQBeTConfig
from .wall_x.configuration_wall_x import WallXConfig
from .xvla.configuration_xvla import XVLAConfig
def _restore_pi052_pretrained_state(
preprocessor: PolicyProcessorPipeline,
postprocessor: PolicyProcessorPipeline,
pretrained_path: str,
) -> None:
"""Transplant saved stateful blobs from a pi052 checkpoint into fresh pipelines.
pi052's preprocessor includes steps whose constructor args don't
JSON-roundtrip (``RenderMessagesStep.recipe`` is a Python object,
``ActionTokenizerProcessorStep.action_tokenizer_name`` is a
fitted-tokenizer path that may not exist at eval time). We rebuild
those pipelines fresh from ``config.recipe_path`` and then walk
over the saved ``policy_{pre,post}processor.json`` files to find
each step's ``state_file`` reference and load the bytes back into
the corresponding fresh step. Today that's only the
NormalizerProcessorStep / UnnormalizerProcessorStep (the action /
state quantile stats), but the loop is generic so any future
stateful step picks up its blob automatically.
Pairing is by ``registry_name`` AND position so a benign reorder
on the saved side surfaces a warning rather than silently feeding
the wrong tensors into the wrong step.
"""
import json # noqa: PLC0415
import logging # noqa: PLC0415
from pathlib import Path # noqa: PLC0415
from safetensors.torch import load_file # noqa: PLC0415
base = Path(pretrained_path)
if not base.exists():
return
log = logging.getLogger(__name__)
for pipeline, config_filename in [
(preprocessor, f"{POLICY_PREPROCESSOR_DEFAULT_NAME}.json"),
(postprocessor, f"{POLICY_POSTPROCESSOR_DEFAULT_NAME}.json"),
]:
config_path = base / config_filename
if not config_path.exists():
continue
saved = json.loads(config_path.read_text())
for idx, (saved_step, fresh_step) in enumerate(
zip(saved.get("steps", []), pipeline.steps, strict=False)
):
state_file = saved_step.get("state_file")
if not state_file:
continue
saved_name = saved_step.get("registry_name")
fresh_name = getattr(type(fresh_step), "_registry_name", None)
if saved_name and fresh_name and saved_name != fresh_name:
log.warning(
"PI052 state restore: %s step %d registry name mismatch "
"(saved=%s, fresh=%s); skipping %s",
config_filename, idx, saved_name, fresh_name, state_file,
)
continue
state_path = base / state_file
if not state_path.exists():
log.warning(
"PI052 state restore: %s missing at %s; %s left at fresh init",
state_file, base, fresh_name,
)
continue
fresh_step.load_state_dict(load_file(str(state_path)))
log.info(
"PI052 state restore: loaded %s into %s (step %d)",
state_file, fresh_name, idx,
)
def _reconnect_relative_absolute_steps(
preprocessor: PolicyProcessorPipeline, postprocessor: PolicyProcessorPipeline
) -> None:
@@ -88,7 +163,8 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
Args:
name: The name of the policy. Supported names are "tdmpc", "diffusion", "act",
"multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x".
"multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x",
"molmoact2".
Returns:
The policy class corresponding to the given name.
@@ -127,6 +203,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from .pi05.modeling_pi05 import PI05Policy
return PI05Policy
elif name == "pi052":
from .pi052.modeling_pi052 import PI052Policy
return PI052Policy
elif name == "gaussian_actor":
from .gaussian_actor.modeling_gaussian_actor import GaussianActorPolicy
@@ -151,6 +231,14 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from .eo1.modeling_eo1 import EO1Policy
return EO1Policy
elif name == "molmoact2":
from .molmoact2.modeling_molmoact2 import MolmoAct2Policy
return MolmoAct2Policy
elif name == "vla_jepa":
from .vla_jepa.modeling_vla_jepa import VLAJEPAPolicy
return VLAJEPAPolicy
else:
try:
return _get_policy_cls_from_policy_name(name=name)
@@ -167,8 +255,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
Args:
policy_type: The type of the policy. Supported types include "tdmpc",
"multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "gaussian_actor",
"smolvla", "wall_x".
"multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05",
"pi052", "gaussian_actor", "smolvla", "wall_x", "molmoact2".
**kwargs: Keyword arguments to be passed to the configuration class constructor.
Returns:
@@ -191,6 +279,10 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
return PI0Config(**kwargs)
elif policy_type == "pi05":
return PI05Config(**kwargs)
elif policy_type == "pi052":
from .pi052.configuration_pi052 import PI052Config
return PI052Config(**kwargs)
elif policy_type == "gaussian_actor":
return GaussianActorConfig(**kwargs)
elif policy_type == "smolvla":
@@ -203,6 +295,10 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
return WallXConfig(**kwargs)
elif policy_type == "eo1":
return EO1Config(**kwargs)
elif policy_type == "molmoact2":
return MolmoAct2Config(**kwargs)
elif policy_type == "vla_jepa":
return VLAJEPAConfig(**kwargs)
else:
try:
config_cls = PreTrainedConfig.get_choice_class(policy_type)
@@ -231,6 +327,13 @@ class ProcessorConfigKwargs(TypedDict, total=False):
preprocessor_overrides: dict[str, Any] | None
postprocessor_overrides: dict[str, Any] | None
dataset_stats: dict[str, dict[str, torch.Tensor]] | None
# Optional: HF Hub repo id of the dataset the policy is being
# trained on. Used by policies that auto-fit pieces of their
# preprocessing (e.g. pi052's FAST action tokenizer per
# Pertsch et al. 2025 [64], π0.5 §III.C). When omitted, those
# policies fall back to their universal pre-fitted tokenizers.
dataset_repo_id: str | None
dataset_meta: Any | None
def make_pre_post_processors(
@@ -263,6 +366,29 @@ def make_pre_post_processors(
NotImplementedError: If a processor factory is not implemented for the given
policy configuration type.
"""
if pretrained_path and getattr(policy_cfg, "type", None) == "pi052":
# pi052 pipelines don't roundtrip through the saved
# ``policy_preprocessor.json``: ``RenderMessagesStep`` holds a
# Python ``TrainingRecipe`` (not JSON-serializable; saved as
# ``{}``) and ``ActionTokenizerProcessorStep`` saves a host-only
# FAST tokenizer path. Generic ``from_pretrained`` then dies
# with ``RenderMessagesStep.__init__() missing 1 required
# positional argument: 'recipe'`` (job 22164494).
#
# Mirror ``lerobot_pi052_runtime``'s bootstrap: build pipelines
# fresh from ``config.recipe_path`` and transplant the saved
# stateful blobs (normalizer stats) from the checkpoint dir.
from .pi052.processor_pi052 import make_pi052_pre_post_processors
preprocessor, postprocessor = make_pi052_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
dataset_repo_id=kwargs.get("dataset_repo_id"),
)
_restore_pi052_pretrained_state(preprocessor, postprocessor, pretrained_path)
_reconnect_relative_absolute_steps(preprocessor, postprocessor)
return preprocessor, postprocessor
if pretrained_path:
# TODO(Steven): Temporary patch, implement correctly the processors for Gr00t
if isinstance(policy_cfg, GrootConfig):
@@ -357,6 +483,22 @@ def make_pre_post_processors(
dataset_stats=kwargs.get("dataset_stats"),
)
elif policy_cfg.type == "pi052":
# NOTE: PI052Config subclasses PI05Config, so this branch MUST
# come before the PI05Config isinstance check below (otherwise
# pi052 would silently pick up π0.5's processor).
from .pi052.processor_pi052 import make_pi052_pre_post_processors
processors = make_pi052_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
# ``dataset_repo_id`` flows in via kwargs when FAST CE is
# enabled — the train loop sets it from ``--dataset.repo_id``.
# When ``None``, ``make_pi052_pre_post_processors`` skips
# the auto-fit and uses the universal tokenizer.
dataset_repo_id=kwargs.get("dataset_repo_id"),
)
elif isinstance(policy_cfg, PI05Config):
from .pi05.processor_pi05 import make_pi05_pre_post_processors
@@ -406,6 +548,7 @@ def make_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, EO1Config):
from .eo1.processor_eo1 import make_eo1_pre_post_processors
@@ -414,6 +557,23 @@ def make_pre_post_processors(
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, MolmoAct2Config):
from .molmoact2.processor_molmoact2 import make_molmoact2_pre_post_processors
processors = make_molmoact2_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
dataset_meta=kwargs.get("dataset_meta"),
)
elif isinstance(policy_cfg, VLAJEPAConfig):
from .vla_jepa.processor_vla_jepa import make_vla_jepa_pre_post_processors
processors = make_vla_jepa_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
else:
try:
processors = _make_processors_from_policy_config(
@@ -499,6 +659,10 @@ def make_policy(
action_names = ds_meta.features.get(ACTION, {}).get("names")
if action_names is not None:
cfg.action_feature_names = list(action_names)
if ds_meta is not None:
set_dataset_feature_metadata = getattr(cfg, "set_dataset_feature_metadata", None)
if callable(set_dataset_feature_metadata):
set_dataset_feature_metadata(ds_meta.features)
kwargs["config"] = cfg
@@ -60,6 +60,7 @@ class Eagle25VLPreTrainedModel(PreTrainedModel):
"SiglipEncoderLayer",
]
_skip_keys_device_placement = "past_key_values"
_supports_flash_attn = True
_supports_flash_attn_2 = True
_supports_cache_class = True
_supports_static_cache = True
@@ -124,7 +124,6 @@ class Eagle25VLProcessor(ProcessorMixin):
"videos_kwargs",
"text_kwargs",
]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
+14 -9
View File
@@ -14,7 +14,7 @@
# limitations under the License.
from pathlib import Path
from typing import TYPE_CHECKING
from typing import TYPE_CHECKING, Any
import numpy as np
import torch
@@ -26,9 +26,14 @@ from lerobot.utils.import_utils import _transformers_available
# Conditional import for type checking and lazy loading
if TYPE_CHECKING or _transformers_available:
from huggingface_hub.dataclasses import strict
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel
from transformers.feature_extraction_utils import BatchFeature
else:
def strict(cls):
return cls
AutoConfig = None
AutoModel = None
PretrainedConfig = object
@@ -176,16 +181,16 @@ N_COLOR_CHANNELS = 3
class GR00TN15Config(PretrainedConfig):
model_type = "gr00t_n1_5"
backbone_cfg: dict
action_head_cfg: dict
action_horizon: int
action_dim: int
backbone_cfg: dict[str, Any] | None = None
action_head_cfg: dict[str, Any] | None = None
action_horizon: int = 0
action_dim: int = 0
compute_dtype: str = "float32"
def __init__(self, **kwargs):
super().__init__(**kwargs)
for key, value in kwargs.items():
setattr(self, key, value)
def __post_init__(self, **kwargs):
self.backbone_cfg = {} if self.backbone_cfg is None else self.backbone_cfg
self.action_head_cfg = {} if self.action_head_cfg is None else self.action_head_cfg
super().__post_init__(**kwargs)
# real model
@@ -206,7 +206,11 @@ def _build_eagle_processor(tokenizer_assets_repo: str = DEFAULT_TOKENIZER_ASSETS
"Vendor files are copied during model creation. Create the policy/model first, "
"or call ensure_eagle_cache_ready() before building processors."
)
proc = AutoProcessor.from_pretrained(str(cache_dir), trust_remote_code=True, use_fast=True)
proc = AutoProcessor.from_pretrained(
str(cache_dir),
trust_remote_code=True,
fix_mistral_regex=False,
)
proc.tokenizer.padding_side = "left"
return proc
+1
View File
@@ -0,0 +1 @@
../../../../docs/source/policy_molmoact2_README.md
@@ -0,0 +1,21 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .configuration_molmoact2 import MolmoAct2Config
from .modeling_molmoact2 import MolmoAct2Policy
from .processor_molmoact2 import make_molmoact2_pre_post_processors
__all__ = ["MolmoAct2Config", "MolmoAct2Policy", "make_molmoact2_pre_post_processors"]
@@ -0,0 +1,519 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import json
import math
import os
from contextlib import suppress
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from huggingface_hub import snapshot_download
from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature, PreTrainedConfig
from lerobot.optim import (
AdamWConfig,
CosineDecayWithWarmupSchedulerConfig,
LRSchedulerConfig,
OptimizerConfig,
)
from lerobot.utils.constants import ACTION, OBS_STATE
from ..rtc.configuration_rtc import RTCConfig
MOLMOACT2_DEFAULT_NUM_IMAGES = 2
MOLMOACT2_IMAGE_TOKENS_PER_IMAGE = 196
MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET = 80
MOLMOACT2_TASK_TOKEN_BUDGET = 32
MOLMOACT2_SEQUENCE_LENGTH_MARGIN = 32
MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE = 64
MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS = 4
MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP = 6
MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM = 0.95
def _hf_token() -> str | None:
return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
def _resolve_checkpoint_location(
checkpoint_path: str,
*,
revision: str | None = None,
force_download: bool = False,
) -> str:
checkpoint_path = str(checkpoint_path or "").strip()
if not checkpoint_path:
raise ValueError("MolmoAct2 policy requires `checkpoint_path`.")
local_path = Path(checkpoint_path).expanduser()
if local_path.exists():
return str(local_path)
return snapshot_download(
repo_id=checkpoint_path,
repo_type="model",
revision=revision,
force_download=force_download,
ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
token=_hf_token(),
)
def _load_hf_norm_metadata_for_tag(
checkpoint_path: str,
*,
revision: str | None,
force_download: bool,
norm_tag: str | None,
) -> dict[str, Any]:
norm_tag = str(norm_tag or "").strip()
if not norm_tag:
return {}
checkpoint_location = Path(
_resolve_checkpoint_location(
checkpoint_path,
revision=revision,
force_download=force_download,
)
)
norm_stats_filename = "norm_stats.json"
config_path = checkpoint_location / "config.json"
if config_path.exists():
with suppress(OSError, json.JSONDecodeError):
norm_stats_filename = str(
json.loads(config_path.read_text()).get("norm_stats_filename") or norm_stats_filename
)
stats_path = checkpoint_location / norm_stats_filename
if not stats_path.exists():
raise FileNotFoundError(
f"MolmoAct2 HF checkpoint is missing {norm_stats_filename!r}; cannot resolve norm_tag={norm_tag!r}."
)
payload = json.loads(stats_path.read_text())
metadata_by_tag = payload.get("metadata_by_tag")
if not isinstance(metadata_by_tag, dict):
raise ValueError(f"MolmoAct2 norm stats file {stats_path} has no metadata_by_tag mapping.")
metadata = metadata_by_tag.get(norm_tag)
if not isinstance(metadata, dict):
available = sorted(str(tag) for tag in metadata_by_tag)
raise ValueError(f"Unknown MolmoAct2 norm_tag={norm_tag!r}. Available tags: {available}.")
return metadata
@LRSchedulerConfig.register_subclass("molmoact2_cosine_decay_with_warmup")
@dataclass
class MolmoAct2CosineDecayWithWarmupSchedulerConfig(CosineDecayWithWarmupSchedulerConfig):
"""MolmoAct2-local cosine scheduler with optional decay-step auto-match.
LeRobot's generic cosine scheduler keeps an explicit integer decay length.
For MolmoAct2, leaving num_decay_steps unset means "decay across this run's
training steps"; build() is the first point where num_training_steps is known.
"""
num_decay_steps: int | None
def build(self, optimizer, num_training_steps: int):
return CosineDecayWithWarmupSchedulerConfig(
peak_lr=self.peak_lr,
decay_lr=self.decay_lr,
num_warmup_steps=self.num_warmup_steps,
num_decay_steps=num_training_steps if self.num_decay_steps is None else self.num_decay_steps,
).build(optimizer, num_training_steps=num_training_steps)
def _round_up(value: int, multiple: int) -> int:
return int(math.ceil(value / multiple) * multiple)
def infer_molmoact2_max_sequence_length(
*,
num_images: int,
state_dim: int,
action_dim: int,
action_horizon: int,
include_discrete_action: bool,
) -> int:
"""Infer the padded text/image sequence cap from MolmoAct2's fixed token layout."""
if num_images < 1:
num_images = MOLMOACT2_DEFAULT_NUM_IMAGES
if state_dim < 0:
state_dim = 0
if action_dim < 1:
action_dim = 1
if action_horizon < 1:
action_horizon = 1
image_tokens = num_images * MOLMOACT2_IMAGE_TOKENS_PER_IMAGE
prompt_tokens = (
MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET
+ MOLMOACT2_TASK_TOKEN_BUDGET
+ state_dim
+ MOLMOACT2_SEQUENCE_LENGTH_MARGIN
)
action_tokens = 0
if include_discrete_action:
action_tokens_per_step = max(
MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP,
math.ceil(action_dim * MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM),
)
action_tokens = MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS + action_horizon * action_tokens_per_step
return _round_up(
image_tokens + prompt_tokens + action_tokens,
MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE,
)
@PreTrainedConfig.register_subclass("molmoact2")
@dataclass
class MolmoAct2Config(PreTrainedConfig):
"""MolmoAct2 policy backed by the converted HF checkpoint implementation."""
checkpoint_path: str = "allenai/MolmoAct2"
checkpoint_revision: str | None = None
checkpoint_force_download: bool = False
n_obs_steps: int = 1
chunk_size: int = 30
n_action_steps: int = 30
action_mode: str = "both"
inference_action_mode: str | None = None
discrete_action_tokenizer: str = "allenai/MolmoAct2-FAST-Tokenizer"
discrete_generation_max_steps: int | None = None
norm_tag: str | None = None
setup_type: str = ""
control_mode: str = ""
image_keys: list[str] = field(default_factory=list)
normalize_language: bool = True
add_setup_tokens: bool = True
add_control_tokens: bool = True
normalize_gripper: bool = False
num_state_tokens: int = 256
# Leave unset for the default MolmoAct2 sequence budget inferred from the fixed
# image/prompt/state/action token layout. Override only for unusual long prompts.
max_sequence_length: int | None = None
# Fixed by released MolmoAct2 checkpoints. We validate this at model load.
expected_max_action_dim: int = 32
# Flow-matching training knobs copied from the original MolmoAct2 training path.
num_flow_timesteps: int = 8
flow_matching_cutoff: float = 1.0
flow_matching_time_offset: float = 0.001
flow_matching_time_scale: float = 0.999
flow_matching_beta_alpha: float = 1.0
flow_matching_beta_beta: float = 1.5
num_inference_steps: int | None = None
mask_action_dim_padding: bool = True
enable_inference_cuda_graph: bool = True
# MolmoAct2-local eval option. When enabled, stochastic continuous action
# generation uses a rollout-local generator derived from eval_seed.
per_episode_seed: bool = False
eval_seed: int | None = None
rtc_config: RTCConfig | None = None
# Default is full finetuning with gradients from the action expert flowing into the VLM.
enable_lora_vlm: bool = False
lora_rank: int = 64
lora_alpha: int = 16
lora_dropout: float = 0.05
lora_bias: str = "none"
enable_lora_action_expert: bool = False
enable_knowledge_insulation: bool = False
freeze_embedding: bool = True
train_action_expert_only: bool = False
gradient_checkpointing: bool = False
model_dtype: str = "bfloat16"
softmax_auxiliary_loss: bool = True
softmax_auxiliary_loss_scale: float = 1e-4
discrete_loss_token_weighting: str = "root_subsegments_root_tokens"
optimizer_lr: float = 1e-5
optimizer_vit_lr: float = 5e-6
optimizer_connector_lr: float = 5e-6
optimizer_action_expert_lr: float = 5e-5
optimizer_betas: tuple[float, float] = (0.9, 0.95)
optimizer_eps: float = 1e-6
optimizer_weight_decay: float = 0.0
optimizer_grad_clip_norm: float = 1.0
scheduler_warmup_steps: int = 200
scheduler_decay_steps: int | None = None
scheduler_decay_lr: float = 1e-6
normalization_mapping: dict[str, NormalizationMode] = field(
default_factory=lambda: {
"VISUAL": NormalizationMode.IDENTITY,
"STATE": NormalizationMode.QUANTILES,
"ACTION": NormalizationMode.QUANTILES,
}
)
input_features: dict[str, PolicyFeature] = field(default_factory=dict)
output_features: dict[str, PolicyFeature] = field(default_factory=dict)
dataset_feature_names: dict[str, Any] = field(default_factory=dict)
def __post_init__(self) -> None:
super().__post_init__()
if self.action_mode not in {"continuous", "discrete", "both"}:
raise ValueError(
f"Unsupported action_mode={self.action_mode!r}. "
"Expected one of {'continuous', 'discrete', 'both'}."
)
if self.inference_action_mode not in {None, "continuous", "discrete"}:
raise ValueError(
f"Unsupported inference_action_mode={self.inference_action_mode!r}. "
"Expected one of {None, 'continuous', 'discrete'}."
)
if self.inference_action_mode == "continuous" and self.action_mode == "discrete":
raise ValueError("MolmoAct2 action_mode='discrete' cannot run continuous inference.")
if self.inference_action_mode == "discrete" and self.action_mode == "continuous":
raise ValueError("MolmoAct2 action_mode='continuous' cannot run discrete inference.")
if self.train_action_expert_only and self.action_mode != "continuous":
raise ValueError("MolmoAct2 train_action_expert_only requires action_mode='continuous'.")
if self.train_action_expert_only and self.enable_lora_vlm:
raise ValueError("MolmoAct2 train_action_expert_only is incompatible with enable_lora_vlm.")
if self.enable_lora_action_expert and not self.enable_lora_vlm:
raise ValueError("MolmoAct2 enable_lora_action_expert requires enable_lora_vlm.")
if self.chunk_size < 1:
raise ValueError(f"chunk_size must be >= 1, got {self.chunk_size}.")
if self.n_action_steps < 1:
raise ValueError(f"n_action_steps must be >= 1, got {self.n_action_steps}.")
if self.n_action_steps > self.chunk_size:
raise ValueError(
f"n_action_steps ({self.n_action_steps}) cannot exceed chunk_size ({self.chunk_size})."
)
if self.expected_max_action_dim != 32:
raise ValueError("MolmoAct2 released checkpoints use expected_max_action_dim=32.")
if self.model_dtype not in {"float32", "bfloat16", "float16"}:
raise ValueError(
f"Unsupported model_dtype={self.model_dtype!r}. Expected 'float32', 'bfloat16', or 'float16'."
)
if self.lora_rank < 1:
raise ValueError(f"lora_rank must be >= 1, got {self.lora_rank}.")
if self.lora_alpha < 1:
raise ValueError(f"lora_alpha must be >= 1, got {self.lora_alpha}.")
if not 0 <= self.lora_dropout <= 1:
raise ValueError(f"lora_dropout must be in [0, 1], got {self.lora_dropout}.")
if self.lora_bias not in {"none", "all", "lora_only"}:
raise ValueError(
f"Unsupported lora_bias={self.lora_bias!r}. Expected one of 'none', 'all', or 'lora_only'."
)
if self.discrete_loss_token_weighting not in {
"none",
"token",
"root_tokens",
"root_subsegments",
"root_subsegments_root_tokens",
}:
raise ValueError(
f"Unsupported discrete_loss_token_weighting={self.discrete_loss_token_weighting!r}."
)
if self.discrete_generation_max_steps is not None and self.discrete_generation_max_steps < 1:
raise ValueError(
f"discrete_generation_max_steps must be >= 1 or None, got {self.discrete_generation_max_steps}."
)
if self.max_sequence_length is not None and self.max_sequence_length < 1:
raise ValueError(f"max_sequence_length must be >= 1 or None, got {self.max_sequence_length}.")
def inferred_max_sequence_length(
self,
*,
num_images: int | None = None,
state_dim: int | None = None,
action_dim: int | None = None,
action_horizon: int | None = None,
include_discrete_action: bool | None = None,
) -> int:
if self.max_sequence_length is not None:
return int(self.max_sequence_length)
if num_images is None:
num_images = len(self.image_keys) or len(self.image_features) or MOLMOACT2_DEFAULT_NUM_IMAGES
if state_dim is None:
state_feature = self.robot_state_feature
state_dim = int(state_feature.shape[0]) if state_feature is not None else 0
if action_dim is None:
action_feature = self.action_feature
action_dim = (
int(action_feature.shape[0]) if action_feature is not None else self.expected_max_action_dim
)
if action_horizon is None:
action_horizon = self.chunk_size
if include_discrete_action is None:
include_discrete_action = self.action_mode in {"discrete", "both"}
return infer_molmoact2_max_sequence_length(
num_images=int(num_images),
state_dim=int(state_dim),
action_dim=int(action_dim),
action_horizon=int(action_horizon),
include_discrete_action=bool(include_discrete_action),
)
@property
def observation_delta_indices(self) -> None:
return None
@property
def action_delta_indices(self) -> list[int]:
return list(range(self.chunk_size))
@property
def reward_delta_indices(self) -> None:
return None
def get_optimizer_preset(self) -> OptimizerConfig:
return AdamWConfig(
lr=self.optimizer_lr,
betas=self.optimizer_betas,
eps=self.optimizer_eps,
weight_decay=self.optimizer_weight_decay,
grad_clip_norm=self.optimizer_grad_clip_norm,
)
def get_scheduler_preset(self) -> LRSchedulerConfig | None:
return MolmoAct2CosineDecayWithWarmupSchedulerConfig(
peak_lr=self.optimizer_lr,
decay_lr=self.scheduler_decay_lr,
num_warmup_steps=self.scheduler_warmup_steps,
num_decay_steps=self.scheduler_decay_steps,
)
def set_dataset_feature_metadata(self, features: dict[str, Any]) -> None:
self.dataset_feature_names = {}
for key in (ACTION, OBS_STATE):
feature = features.get(key) if isinstance(features, dict) else None
if isinstance(feature, dict) and feature.get("names") is not None:
self.dataset_feature_names[key] = feature["names"]
def validate_features(self) -> None:
"""Validate and set up MolmoAct2 input and output features."""
image_features = [key for key, feat in self.input_features.items() if feat.type == FeatureType.VISUAL]
if not image_features:
raise ValueError(
"MolmoAct2 policy requires at least one visual input feature. "
"No features of type FeatureType.VISUAL found in input_features."
)
if OBS_STATE not in self.input_features:
state_feature = PolicyFeature(
type=FeatureType.STATE,
shape=(0,),
)
self.input_features[OBS_STATE] = state_feature
if ACTION not in self.output_features:
action_feature = PolicyFeature(
type=FeatureType.ACTION,
shape=(self.expected_max_action_dim,),
)
self.output_features[ACTION] = action_feature
def apply_norm_tag_metadata(self) -> None:
if not str(self.norm_tag or "").strip():
return
metadata = _load_hf_norm_metadata_for_tag(
self.checkpoint_path,
revision=self.checkpoint_revision,
force_download=bool(self.checkpoint_force_download),
norm_tag=self.norm_tag,
)
if metadata.get("action_horizon") is not None:
self.chunk_size = int(metadata["action_horizon"])
if metadata.get("n_action_steps") is not None:
self.n_action_steps = int(metadata["n_action_steps"])
if not self.setup_type and metadata.get("setup_type") is not None:
self.setup_type = str(metadata["setup_type"])
if not self.control_mode and metadata.get("control_mode") is not None:
self.control_mode = str(metadata["control_mode"])
def saved_policy_action_mode(self) -> str | None:
pretrained_path = getattr(self, "pretrained_path", None)
if pretrained_path is None:
return None
config_path = Path(pretrained_path) / "config.json"
if not config_path.exists():
return None
try:
mode = json.loads(config_path.read_text()).get("action_mode")
except (OSError, json.JSONDecodeError):
return None
if mode in {"continuous", "discrete", "both"}:
return str(mode)
return None
def training_action_mode(self, saved_policy_action_mode: str | None = None) -> str:
return saved_policy_action_mode or self.action_mode
def validate_inference_action_mode(self, saved_policy_action_mode: str | None = None) -> None:
requested_mode = self.inference_action_mode
if requested_mode is None:
return
training_mode = self.training_action_mode(saved_policy_action_mode)
if requested_mode == "continuous" and training_mode == "discrete":
raise ValueError(
"MolmoAct2 checkpoint was trained with action_mode='discrete' and cannot run "
"continuous inference."
)
if requested_mode == "discrete" and training_mode == "continuous":
raise ValueError(
"MolmoAct2 checkpoint was trained with action_mode='continuous' and cannot run "
"discrete inference. Train with action_mode='both' or action_mode='discrete' first."
)
def validate_checkpoint_action_mode(
self,
checkpoint_action_mode: str,
*,
has_action_expert: bool,
) -> None:
if self.action_mode == "both" and checkpoint_action_mode != "both":
raise ValueError(
f"action_mode='both' requires checkpoint action_mode='both', got {checkpoint_action_mode!r}."
)
if self.action_mode == "discrete" and checkpoint_action_mode not in {"discrete", "both"}:
raise ValueError(
f"action_mode='discrete' requires checkpoint action_mode in {{'discrete', 'both'}}, "
f"got {checkpoint_action_mode!r}."
)
if self.action_mode in {"continuous", "both"} and not has_action_expert:
raise ValueError("Continuous MolmoAct2 training requires an action expert checkpoint.")
def resolve_inference_action_mode(
self,
requested_mode: str | None,
saved_policy_action_mode: str | None = None,
) -> str:
training_mode = self.training_action_mode(saved_policy_action_mode)
if requested_mode is None:
requested_mode = self.inference_action_mode
if requested_mode is None:
raise ValueError(
"MolmoAct2 inference requires `inference_action_mode` to be set explicitly "
"to either 'continuous' or 'discrete'."
)
if requested_mode not in {"continuous", "discrete"}:
raise ValueError("MolmoAct2 inference_action_mode must be either 'continuous' or 'discrete'.")
if requested_mode == "continuous" and training_mode == "discrete":
raise ValueError("MolmoAct2 action_mode='discrete' checkpoint cannot run continuous inference.")
if requested_mode == "discrete" and training_mode == "continuous":
raise ValueError("MolmoAct2 action_mode='continuous' checkpoint cannot run discrete inference.")
return requested_mode
@@ -0,0 +1,17 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa
@@ -0,0 +1,237 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa
import logging
import os
from pathlib import Path
from typing import ClassVar
import numpy as np
from tokenizers import ByteLevelBPETokenizer
from tokenizers.trainers import BpeTrainer
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from transformers.processing_utils import ProcessorMixin
def _hf_token() -> str | None:
return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
def _resolve_tokenizer_location(
tokenizer_path: str,
*,
revision: str | None = None,
force_download: bool = False,
) -> str:
local_path = Path(str(tokenizer_path)).expanduser()
if local_path.exists():
return str(local_path)
return snapshot_download(
repo_id=str(tokenizer_path),
repo_type="model",
revision=revision,
force_download=force_download,
ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
token=_hf_token(),
)
class UniversalActionProcessor(ProcessorMixin):
attributes: ClassVar[list[str]] = ["tokenizer"]
tokenizer_class: str = "AutoTokenizer"
def __init__(
self,
tokenizer: PreTrainedTokenizerFast,
scale: float = 10,
vocab_size: int = 1024,
min_token: int = 0,
*,
action_dim: int | None = None,
time_horizon: int | None = None,
):
self.scale = scale
self.vocab_size = vocab_size
self.min_token = min_token
# Action horizon and dimension needed during decoding. These can be specified
# in three ways (in order of priority):
# 1. passed in as kwargs to decode()
# 2. in the constructor
# 3. cached from the last time decode() was called
self.time_horizon = time_horizon
self.action_dim = action_dim
self.called_time_horizon = time_horizon
self.called_action_dim = action_dim
super().__init__(tokenizer)
self.bpe_tokenizer = self.tokenizer
def __call__(self, action_chunk: np.array) -> np.array:
from scipy.fft import dct
assert action_chunk.ndim <= 3, "Only 3 dimensions supported: [batch, timesteps, action_dim]"
if action_chunk.ndim == 2:
action_chunk = action_chunk[None, ...]
# Cache the time horizon and action dimension for decoding
self.called_time_horizon = action_chunk.shape[-2]
self.called_action_dim = action_chunk.shape[-1]
dct_coeff = dct(action_chunk, axis=1, norm="ortho")
dct_coeff = np.around(dct_coeff * self.scale)
tokens = []
for elem in dct_coeff:
token_str = "".join(map(chr, np.maximum(elem.flatten() - self.min_token, 0).astype(int)))
tokens.append(self.bpe_tokenizer(token_str)["input_ids"])
return tokens
def decode(
self,
tokens: list[list[int]],
*,
time_horizon: int | None = None,
action_dim: int | None = None,
) -> np.array:
from scipy.fft import idct
self.time_horizon = time_horizon or self.time_horizon or self.called_time_horizon
self.action_dim = action_dim or self.action_dim or self.called_action_dim
# Cache the time horizon and action dimension for the next call
self.called_time_horizon = self.time_horizon
self.called_action_dim = self.action_dim
assert self.time_horizon is not None and self.action_dim is not None, (
"Tokenizer not initialized, call encode() once or pass in time_horizon and action_dim."
)
decoded_actions = []
for token in tokens:
try:
decoded_tokens = self.bpe_tokenizer.decode(token)
decoded_dct_coeff = np.array(list(map(ord, decoded_tokens))) + self.min_token
decoded_dct_coeff = decoded_dct_coeff.reshape(-1, self.action_dim)
assert decoded_dct_coeff.shape == (
self.time_horizon,
self.action_dim,
), (
f"Decoded DCT coefficients have shape {decoded_dct_coeff.shape}, expected ({self.time_horizon}, {self.action_dim})"
)
except Exception as e:
print(f"Error decoding tokens: {e}")
print(f"Tokens: {token}")
decoded_dct_coeff = np.zeros((self.time_horizon, self.action_dim))
decoded_actions.append(idct(decoded_dct_coeff / self.scale, axis=0, norm="ortho"))
return np.stack(decoded_actions)
@classmethod
def fit(
cls,
action_data: list[np.array],
scale: float = 10,
vocab_size: int = 1024,
*,
time_horizon: int | None = None,
action_dim: int | None = None,
) -> "UniversalActionProcessor":
from scipy.fft import dct
# Run DCT over all inputs
dct_tokens = [dct(a, axis=0, norm="ortho").flatten() for a in action_data]
# Quantize and find min token
max_token = int(np.around(np.concatenate(dct_tokens) * scale).max())
min_token = int(np.around(np.concatenate(dct_tokens) * scale).min())
min_vocab_size = max_token - min_token
assert min_vocab_size <= vocab_size, (
f"Vocab size {vocab_size} is too small for the range of tokens {min_vocab_size}"
)
if min_vocab_size + 100 > vocab_size:
logging.warning(
f"Initial alphabet size {min_vocab_size} is almost as large as the vocab"
f"size {vocab_size}, consider increasing vocab size"
)
# Make token iterator for BPE training
def _token_iter():
for tokens in dct_tokens:
rounded_tokens = np.around(tokens * scale) - min_token
rounded_tokens = rounded_tokens.astype(int)
string = "".join(map(chr, rounded_tokens))
yield string
# Train BPE tokenizer
bpe = ByteLevelBPETokenizer()
# Set up the entire range of possible tokens as the initial alphabet
alphabet = [chr(i) for i in range(max_token - min_token + 1)]
trainer = BpeTrainer(
vocab_size=vocab_size,
min_frequency=2,
show_progress=True,
special_tokens=[],
initial_alphabet=alphabet,
max_token_length=10000,
)
# Train the inner tokenizer (don't use ByteLevelBPETokenizer.train_from_iterator()
# because it doesn't support custom alphabets)
bpe._tokenizer.train_from_iterator(_token_iter(), trainer=trainer)
return cls(
PreTrainedTokenizerFast(tokenizer_object=bpe, clean_up_tokenization_spaces=False),
scale=scale,
vocab_size=vocab_size,
min_token=min_token,
time_horizon=time_horizon,
action_dim=action_dim,
)
@classmethod
def from_pretrained_local(
cls,
pretrained_model_name_or_path: str,
*,
revision: str | None = None,
force_download: bool = False,
) -> "UniversalActionProcessor":
location = Path(
_resolve_tokenizer_location(
pretrained_model_name_or_path,
revision=revision,
force_download=force_download,
)
)
processor_config = {}
processor_config_path = location / "processor_config.json"
if processor_config_path.exists():
import json
processor_config = json.loads(processor_config_path.read_text())
tokenizer = PreTrainedTokenizerFast.from_pretrained(str(location))
return cls(
tokenizer,
scale=processor_config.get("scale", 10),
vocab_size=processor_config.get("vocab_size", 1024),
min_token=processor_config.get("min_token", 0),
action_dim=processor_config.get("action_dim"),
time_horizon=processor_config.get("time_horizon"),
)
@@ -0,0 +1,553 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa
"""
MolmoAct2 configuration
"""
from typing import Optional, Any
from transformers import PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation
from transformers.utils import logging
logger = logging.get_logger(__name__)
class MolmoAct2VitConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MolmoAct2VisionTransformer`].
It is used to instantiate a `MolmoAct2VisionTransformer` according to the specified arguments,
defining the model architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Example:
```python
>>> from transformers import MolmoAct2VitConfig, MolmoAct2VisionTransformer
>>> # Initializing a MolmoAct2VitConfig
>>> configuration = MolmoAct2VitConfig()
>>> # Initializing a MolmoAct2VisionTransformer (with random weights)
>>> model = MolmoAct2VisionTransformer(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "molmoact2"
base_config_key = "vit_config"
def __init__(
self,
hidden_size: int = 1152,
intermediate_size: int = 4304,
num_hidden_layers: int = 27,
num_attention_heads: int = 16,
num_key_value_heads: int = 16,
head_dim: int = 72,
hidden_act: str = "gelu_pytorch_tanh",
layer_norm_eps: float = 1e-6,
image_default_input_size: tuple[int, int] = (378, 378),
image_patch_size: int = 14,
image_num_pos: int = 577,
attention_dropout: float = 0.0,
residual_dropout: float = 0.0,
initializer_range: float = 0.02,
float32_attention: bool = True,
attn_implementation: str = "eager",
**kwargs,
):
self.attn_implementation = attn_implementation
super().__init__(attn_implementation=attn_implementation, **kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.head_dim = head_dim
self.hidden_act = hidden_act
self.layer_norm_eps = layer_norm_eps
self.image_default_input_size = image_default_input_size
self.image_patch_size = image_patch_size
self.image_num_pos = image_num_pos
self.attention_dropout = attention_dropout
self.residual_dropout = residual_dropout
self.initializer_range = initializer_range
self.float32_attention = float32_attention
@property
def image_num_patch(self):
h, w = self.image_default_input_size
return h // self.image_patch_size, w // self.image_patch_size
class MolmoAct2AdapterConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of MolmoAct2Adapter. With MolmoAct2VitConfig,
It is used to instantiate an MolmoAct2VisionBackbone according to the specified arguments,
defining the model architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Example:
```python
>>> from transformers import MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2VisionBackbone
>>> # Initializing a MolmoAct2VitConfig and a MolmoAct2AdapterConfig
>>> vit_config = MolmoAct2VitConfig()
>>> adapter_config = MolmoPoolingConfig()
>>> # Initializing a MolmoAct2VisionBackbone (with random weights)
>>> model = MolmoAct2VisionBackbone(vit_config, adapter_config)
>>> # Accessing the model configuration
>>> vit_configuration = model.vit_config
>>> adapter_configuration = model.adapter_config
```"""
model_type = "molmoact2"
base_config_key = "adapter_config"
def __init__(
self,
vit_layers: tuple = (-3, -9),
pooling_attention_mask: bool = False,
hidden_size: int = 1152,
num_attention_heads: int = 16,
num_key_value_heads: int = 16,
head_dim: int = 72,
float32_attention: bool = True,
attention_dropout: float = 0.0,
residual_dropout: float = 0.0,
hidden_act: str = "silu",
intermediate_size: int = 18944,
text_hidden_size: int = 3584,
image_feature_dropout: float = 0.0,
initializer_range: float = 0.02,
attn_implementation: str = "eager",
**kwargs,
):
self.attn_implementation = attn_implementation
super().__init__(attn_implementation=attn_implementation, **kwargs)
self.vit_layers = vit_layers
self.pooling_attention_mask = pooling_attention_mask
self.hidden_size = hidden_size
self.num_attention_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.head_dim = head_dim
self.float32_attention = float32_attention
self.attention_dropout = attention_dropout
self.residual_dropout = residual_dropout
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.text_hidden_size = text_hidden_size
self.image_feature_dropout = image_feature_dropout
self.initializer_range = initializer_range
class MolmoAct2TextConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MolmoAct2TextModel`]. It is used to instantiate a
`MolmoAct2TextModel` according to the specified arguments, defining the model architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Example:
```python
>>> from transformers import MolmoAct2TextConfig, MolmoAct2TextModel
>>> # Initializing a MolmoAct2TextConfig
>>> configuration = MolmoAct2TextConfig()
>>> # Initializing a MolmoAct2TextModel (with random weights)
>>> model = MolmoAct2TextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "molmoact2_text"
base_config_key = "text_config"
keys_to_ignore_at_inference = ["past_key_values"]
base_model_tp_plan = {
"blocks.*.self_attn.att_proj": "colwise",
"blocks.*.self_attn.attn_out": "rowwise",
"blocks.*.mlp.ff_proj": "colwise",
"blocks.*.mlp.ff_out": "rowwise",
}
base_model_pp_plan = {
"wte": (["input_ids"], ["inputs_embeds"]),
"blocks": (["hidden_states", "attention_mask"], ["hidden_states"]),
"ln_f": (["hidden_states"], ["hidden_states"]),
}
def __init__(
self,
hidden_size: int = 3584,
num_attention_heads: int = 28,
num_key_value_heads: int | None = 4,
head_dim: int = 128,
vocab_size: int = 152064,
additional_vocab_size: int = 128,
qkv_bias: bool = True,
num_hidden_layers: int = 48,
intermediate_size: int = 18944,
hidden_act: str = "silu",
embedding_dropout: float = 0.0,
attention_dropout: float = 0.0,
residual_dropout: float = 0.0,
max_position_embeddings: int = 4096,
rope_theta: float = 1000000.0,
rope_scaling: dict[str, Any] = None,
rope_scaling_layers: list[int] | None = None,
use_qk_norm: bool = False,
qk_norm_type: str = "olmo",
layer_norm_eps: int = 1e-6,
norm_after: bool = False,
initializer_range: float = 0.02,
use_cache=True,
tie_word_embeddings=False,
attn_implementation: str = "eager",
**kwargs,
):
self.attn_implementation = attn_implementation
super().__init__(
tie_word_embeddings=tie_word_embeddings, attn_implementation=attn_implementation, **kwargs
)
self.hidden_size = hidden_size
self.num_attention_heads = num_attention_heads
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.head_dim = head_dim
self.vocab_size = vocab_size
self.additional_vocab_size = additional_vocab_size
self.qkv_bias = qkv_bias
self.num_hidden_layers = num_hidden_layers
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.embedding_dropout = embedding_dropout
self.attention_dropout = attention_dropout
self.residual_dropout = residual_dropout
self.max_position_embeddings = max_position_embeddings
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self.rope_scaling_layers = rope_scaling_layers
self.use_qk_norm = use_qk_norm
self.qk_norm_type = qk_norm_type
self.layer_norm_eps = layer_norm_eps
self.norm_after = norm_after
self.initializer_range = initializer_range
self.use_cache = use_cache
# Validate the correctness of rotary position embeddings parameters
rope_config_validation(self)
class MolmoAct2ActionExpertConfig(PretrainedConfig):
r"""Configuration for the MolmoAct2 modern action expert."""
model_type = "molmoact2_action_expert"
base_config_key = "action_expert_config"
def __init__(
self,
max_action_horizon: int = 32,
max_action_dim: int = 32,
hidden_size: int = 1024,
num_layers: int = 32,
num_heads: int = 16,
mlp_ratio: float = 8.0 / 3.0,
ffn_multiple_of: int = 256,
timestep_embed_dim: int = 256,
dropout: float = 0.0,
attn_dropout: float = 0.0,
context_layer_norm: bool = True,
qk_norm: bool = True,
qk_norm_eps: float = 1e-6,
rope: bool = True,
causal_attn: bool = False,
**kwargs,
):
super().__init__(**kwargs)
self.max_action_horizon = max_action_horizon
self.max_action_dim = max_action_dim
self.hidden_size = hidden_size
self.num_layers = num_layers
self.num_heads = num_heads
self.mlp_ratio = mlp_ratio
self.ffn_multiple_of = ffn_multiple_of
self.timestep_embed_dim = timestep_embed_dim
self.dropout = dropout
self.attn_dropout = attn_dropout
self.context_layer_norm = context_layer_norm
self.qk_norm = qk_norm
self.qk_norm_eps = qk_norm_eps
self.rope = rope
self.causal_attn = causal_attn
def to_dict(self):
output = super().to_dict()
# These are derived from the parent MolmoAct2Config for HF exports. Keeping
# them out of the public nested config avoids duplicated sources of truth.
output.pop("max_action_horizon", None)
output.pop("max_action_dim", None)
return output
class MolmoAct2Config(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MolmoAct2ForConditionalGeneration`].
It is used to instantiate an MolmoAct2 model according to the specified arguments, defining the model architecture.
Example:
```python
>>> from transformers import MolmoAct2Config, MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2TextConfig
>>> # Initializing a MolmoAct2VitConfig
>>> vit_config = MolmoAct2VitConfig()
>>> # Initializing a MolmoAct2AdapterConfig
>>> adapter_config = MolmoAct2AdapterConfig()
>>> # Initializing a MolmoAct2TextConfig
>>> text_config = MolmoAct2TextConfig()
>>> # Initializing a MolmoAct2Config
>>> configuration = MolmoAct2Config(
>>> vit_config=vit_config,
>>> adapter_config=adapter_config,
>>> text_config=text_config,
>>> image_start_token_id=151936,
>>> image_end_token_id=151937,
>>> image_patch_id=151938,
>>> image_col_id=151939,
>>> low_res_image_start_token_id=151940,
>>> image_low_res_id=151942,
>>> frame_start_token_id=151943,
>>> frame_end_token_id=151944,
>>> )
>>> # Initializing a model
>>> model = MolmoAct2ForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "molmoact2"
sub_configs = {
"text_config": MolmoAct2TextConfig,
"vit_config": MolmoAct2VitConfig,
"adapter_config": MolmoAct2AdapterConfig,
"action_expert_config": MolmoAct2ActionExpertConfig,
}
def __init__(
self,
vit_config: MolmoAct2VitConfig = None,
adapter_config: MolmoAct2AdapterConfig = None,
text_config: MolmoAct2TextConfig = None,
action_expert_config: MolmoAct2ActionExpertConfig = None,
image_start_token_id: int = None,
low_res_image_start_token_id: int = None,
image_end_token_id: int = None,
image_low_res_id: int = None,
image_patch_id: int = None,
image_col_id: int = None,
frame_start_token_id: int = None,
frame_end_token_id: int = None,
use_frame_special_tokens: bool = True,
initializer_range: float = 0.02,
add_action_expert: bool = True,
max_action_dim: int = 32,
max_action_horizon: int = 30,
n_obs_steps: int = 30,
action_mode: str = "both",
state_format: str = "discrete",
flow_matching_num_steps: int = 10,
flow_matching_cutoff: float = 1.0,
flow_matching_time_offset: float = 0.001,
flow_matching_time_scale: float = 0.999,
flow_matching_beta_alpha: float = 1.0,
flow_matching_beta_beta: float = 1.5,
mask_action_dim_padding: bool = True,
enable_depth_reasoning: bool = False,
depth_mode: int = 2,
num_depth_codes: int = 100,
action_expert_depth_gate: bool = False,
action_expert_depth_gate_per_layer: bool = False,
action_expert_depth_gate_init_bias: float = -4.0,
action_output_token_id: int = None,
action_start_token_id: int = None,
action_end_token_id: int = None,
action_token_start_id: int = None,
num_action_tokens: int = 0,
depth_output_token_id: int = None,
depth_start_token_id: int = None,
depth_end_token_id: int = None,
depth_token_start_id: int = None,
num_depth_tokens: int = 0,
state_start_token_id: int = None,
state_end_token_id: int = None,
state_token_start_id: int = None,
num_state_tokens: int = 0,
add_setup_tokens: bool = True,
add_control_tokens: bool = True,
norm_stats_filename: str = "norm_stats.json",
**kwargs,
):
super().__init__(**kwargs)
if vit_config is None:
self.vit_config = MolmoAct2VitConfig()
elif isinstance(vit_config, dict):
self.vit_config = MolmoAct2VitConfig(**vit_config)
else:
self.vit_config = vit_config
if adapter_config is None:
self.adapter_config = MolmoAct2AdapterConfig()
elif isinstance(adapter_config, dict):
self.adapter_config = MolmoAct2AdapterConfig(**adapter_config)
else:
self.adapter_config = adapter_config
if text_config is None:
self.text_config = MolmoAct2TextConfig()
elif isinstance(text_config, dict):
self.text_config = MolmoAct2TextConfig(**text_config)
else:
self.text_config = text_config
self.add_action_expert = bool(add_action_expert)
if not self.add_action_expert:
self.action_expert_config = None
elif action_expert_config is None:
self.action_expert_config = MolmoAct2ActionExpertConfig(
max_action_horizon=max_action_horizon,
max_action_dim=max_action_dim,
num_layers=self.text_config.num_hidden_layers,
)
elif isinstance(action_expert_config, dict):
self.action_expert_config = MolmoAct2ActionExpertConfig(**action_expert_config)
else:
self.action_expert_config = action_expert_config
if self.add_action_expert:
self.action_expert_config.max_action_dim = int(max_action_dim)
self.action_expert_config.max_action_horizon = int(max_action_horizon)
self._validate_release_action_config(
state_format=state_format,
)
self.image_start_token_id = image_start_token_id
self.low_res_image_start_token_id = low_res_image_start_token_id
self.image_end_token_id = image_end_token_id
self.image_low_res_id = image_low_res_id
self.image_high_res_id = image_patch_id
self.image_patch_id = image_patch_id
self.image_col_id = image_col_id
self.frame_start_token_id = frame_start_token_id
self.frame_end_token_id = frame_end_token_id
self.use_frame_special_tokens = use_frame_special_tokens
self.initializer_range = initializer_range
self.max_action_dim = max_action_dim
self.max_action_horizon = max_action_horizon
self.n_obs_steps = n_obs_steps
self.action_mode = action_mode
self.state_format = state_format
self.flow_matching_num_steps = flow_matching_num_steps
self.flow_matching_cutoff = flow_matching_cutoff
self.flow_matching_time_offset = flow_matching_time_offset
self.flow_matching_time_scale = flow_matching_time_scale
self.flow_matching_beta_alpha = flow_matching_beta_alpha
self.flow_matching_beta_beta = flow_matching_beta_beta
self.mask_action_dim_padding = mask_action_dim_padding
self.enable_depth_reasoning = enable_depth_reasoning
self.depth_mode = depth_mode
self.num_depth_codes = num_depth_codes
self.action_expert_depth_gate = action_expert_depth_gate
self.action_expert_depth_gate_per_layer = action_expert_depth_gate_per_layer
self.action_expert_depth_gate_init_bias = action_expert_depth_gate_init_bias
self.action_output_token_id = action_output_token_id
self.action_start_token_id = action_start_token_id
self.action_end_token_id = action_end_token_id
self.action_token_start_id = action_token_start_id
self.num_action_tokens = num_action_tokens
self.depth_output_token_id = depth_output_token_id
self.depth_start_token_id = depth_start_token_id
self.depth_end_token_id = depth_end_token_id
self.depth_token_start_id = depth_token_start_id
self.num_depth_tokens = num_depth_tokens
self.state_start_token_id = state_start_token_id
self.state_end_token_id = state_end_token_id
self.state_token_start_id = state_token_start_id
self.num_state_tokens = num_state_tokens
self.add_setup_tokens = add_setup_tokens
self.add_control_tokens = add_control_tokens
self.norm_stats_filename = norm_stats_filename
@staticmethod
def _validate_release_action_config(
*,
state_format: str,
) -> None:
if state_format != "discrete":
raise ValueError("MolmoAct2 HF export supports only state_format='discrete'.")
@property
def image_num_patch(self):
assert self.vit_config is not None
return self.vit_config.image_num_patch
@property
def num_attention_heads(self):
return self.text_config.num_attention_heads
@property
def num_key_value_heads(self):
return self.text_config.num_key_value_heads
@property
def head_dim(self):
return self.text_config.head_dim
@property
def num_hidden_layers(self):
return self.text_config.num_hidden_layers
@property
def hidden_size(self):
return self.text_config.hidden_size
@property
def vocab_size(self):
return self.text_config.vocab_size
@property
def max_position_embeddings(self):
return self.text_config.max_position_embeddings
MolmoAct2VitConfig.register_for_auto_class()
MolmoAct2AdapterConfig.register_for_auto_class()
MolmoAct2TextConfig.register_for_auto_class()
MolmoAct2ActionExpertConfig.register_for_auto_class()
MolmoAct2Config.register_for_auto_class()
@@ -0,0 +1,564 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa
"""Image processor class for MolmoAct2"""
from typing import Optional, Union
import numpy as np
import einops
import torch
import torchvision.transforms
from transformers.image_utils import (
IMAGENET_STANDARD_MEAN,
IMAGENET_STANDARD_STD,
ImageInput,
PILImageResampling,
make_flat_list_of_images,
valid_images,
to_numpy_array,
)
from transformers.image_transforms import convert_to_rgb
from transformers.processing_utils import ImagesKwargs
from transformers.image_processing_utils import BaseImageProcessor, get_size_dict
from transformers.utils import logging
from transformers.feature_extraction_utils import BatchFeature
from transformers.utils import TensorType, logging
logger = logging.get_logger(__name__)
def normalize_image(
image: np.ndarray,
image_mean: list[float],
image_std: list[float],
) -> np.ndarray:
if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
image -= np.array(image_mean, dtype=np.float32)[None, None, :]
image /= np.array(image_std, dtype=np.float32)[None, None, :]
return image
def resize_image(
image: np.ndarray,
desired_output_size: list[int],
resample: PILImageResampling,
) -> np.ndarray:
image = torch.permute(torch.from_numpy(image), [2, 0, 1])
dtype = image.dtype
if torch.is_floating_point(image):
in_min = 0.0
in_max = 1.0
resized = torchvision.transforms.Resize(
desired_output_size,
resample,
antialias=False,
)(image)
resized = torch.clip(resized, 0.0, 1.0).to(dtype)
else:
assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
image.dtype
)
in_min = 0.0
in_max = 255.0
resized = torchvision.transforms.Resize(
desired_output_size,
resample,
antialias=False,
)(image)
resized = torch.clip(resized, 0, 255).to(dtype)
resized = resized.to(torch.float32)
resized = (resized - in_min) / (in_max - in_min)
resized = torch.permute(resized, [1, 2, 0]).numpy()
return resized
def select_tiling(h, w, patch_size, max_num_crops):
"""Divide in image of size [w, h] in up to max_num_patches of size patch_size"""
original_size = np.stack([h, w]) # [1, 2]
original_res = h * w
tilings = []
for i in range(1, max_num_crops + 1):
for j in range(1, max_num_crops + 1):
if i * j <= max_num_crops:
tilings.append((i, j))
# sort so argmin and argmax favour smaller tilings in the event of a tie
tilings.sort(key=lambda x: (x[0] * x[1], x[0]))
candidate_tilings = np.array(tilings, dtype=np.int32) # [n_resolutions, 2]
candidate_resolutions = candidate_tilings * patch_size # [n_resolutions, 2]
# How much we would need to scale the image to fit exactly in each tiling
original_size = np.stack([h, w], dtype=np.float32) # [1, 2]
# The original size can be zero in rare cases if the image is smaller than the margin
# In those cases letting the scale become infinite means the tiling is based on the
# other side, or falls back to the smallest tiling
with np.errstate(divide="ignore"):
required_scale_d = (candidate_resolutions.astype(np.float32) / original_size,)
required_scale = np.min(required_scale_d, axis=-1, keepdims=True) # [n_resolutions, 1]
if np.all(required_scale < 1):
# We are forced to downscale, so try to minimize the amount of downscaling
ix = np.argmax(required_scale)
else:
# Pick the resolution that required the least upscaling so that it most closely fits the image
required_scale = np.where(required_scale < 1.0, 10e9, required_scale)
ix = np.argmin(required_scale)
return candidate_tilings[ix]
def build_resized_image(
image: np.ndarray,
base_image_input_size: list[int],
resample: PILImageResampling,
image_mean: list[float],
image_std: list[float],
image_patch_size: int,
) -> tuple[np.ndarray, np.ndarray]:
resized = resize_image(
image,
base_image_input_size,
resample,
)
resized = normalize_image(resized, image_mean, image_std)
if len(resized.shape) == 3:
resized = np.expand_dims(resized, 0)
crop_patch_w = base_image_input_size[1] // image_patch_size
crop_patch_h = base_image_input_size[0] // image_patch_size
resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
return resized, resize_idx
def build_overlapping_crops(
image: np.ndarray,
max_crops: int,
overlap_margins: list[int],
base_image_input_size: list[int],
resample: PILImageResampling,
image_mean: list[float],
image_std: list[float],
image_patch_size: int,
) -> tuple[np.ndarray, np.ndarray]:
"""Decompose an image into a set of overlapping crops
:return crop_arr: [n_crops, h, w, 3] The crops
:return patch_idx: [overlap_patch_h, overlap_patch_w] For each patch in the resized image
the crops were extracted from, what patch in `crop_arr` it corresponds to
"""
original_image_h, original_image_w = image.shape[:2]
crop_size = base_image_input_size[0]
assert base_image_input_size[0] == base_image_input_size[1]
left_margin, right_margin = overlap_margins
total_margin_pixels = image_patch_size * (right_margin + left_margin) # pixels removed per dim
crop_patches = base_image_input_size[0] // image_patch_size # patches per crop dim
crop_window_patches = crop_patches - (right_margin + left_margin) # usable patches
crop_window_size = crop_window_patches * image_patch_size
crop_patch_w = base_image_input_size[1] // image_patch_size
crop_patch_h = base_image_input_size[0] // image_patch_size
original_image_h, original_image_w = image.shape[:2]
crop_size = base_image_input_size[0]
# Decide how to tile the image, to account for the overlap margins we compute the tiling
# as if we had an image without the margins and were using a crop size without the margins
tiling = select_tiling(
original_image_h - total_margin_pixels,
original_image_w - total_margin_pixels,
crop_window_size,
max_crops,
)
src = resize_image(
image,
[
tiling[0] * crop_window_size + total_margin_pixels,
tiling[1] * crop_window_size + total_margin_pixels,
],
resample,
)
src = normalize_image(src, image_mean, image_std)
# Now we have to split the image into crops, and track what patches came from
# where in `patch_idx_arr`
n_crops = tiling[0] * tiling[1]
crop_arr = np.zeros([n_crops, crop_size, crop_size, 3], dtype=src.dtype)
patch_idx_arr = np.zeros([n_crops, crop_patch_h, crop_patch_w], dtype=np.int32)
on_crop = 0
for i in range(tiling[0]):
# Slide over `src` by `crop_window_size` steps, but extract crops of size `crops_size`
# which results in overlapping crop windows
y0 = i * crop_window_size
for j in range(tiling[1]):
x0 = j * crop_window_size
crop_arr[on_crop] = src[y0 : y0 + crop_size, x0 : x0 + crop_size]
patch_idx = np.arange(crop_patch_w * crop_patch_h).reshape(crop_patch_h, crop_patch_w)
patch_idx += on_crop * crop_patch_h * crop_patch_w
# Mask out idx that are in the overlap region
if i != 0:
patch_idx[:left_margin, :] = -1
if j != 0:
patch_idx[:, :left_margin] = -1
if i != tiling[0] - 1:
patch_idx[-right_margin:, :] = -1
if j != tiling[1] - 1:
patch_idx[:, -right_margin:] = -1
patch_idx_arr[on_crop] = patch_idx
on_crop += 1
# `patch_idx_arr` is ordered crop-by-crop, here we transpose `patch_idx_arr`
# so it is ordered left-to-right order
patch_idx_arr = np.reshape(patch_idx_arr, [tiling[0], tiling[1], crop_patch_h, crop_patch_w])
patch_idx_arr = np.transpose(patch_idx_arr, [0, 2, 1, 3])
patch_idx_arr = np.reshape(patch_idx_arr, [-1])
# Now get the parts not in the overlap region, so it should map each patch in `src`
# to the correct patch it should come from in `crop_arr`
patch_idx_arr = patch_idx_arr[patch_idx_arr >= 0].reshape(
src.shape[0] // image_patch_size,
src.shape[1] // image_patch_size,
)
return crop_arr, patch_idx_arr
def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
"""Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
if len(array.shape) == 3:
n_crops, h, w = array.shape
h_patches = h // patch_size
w_patches = w // patch_size
array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
array = np.transpose(array, [0, 1, 3, 2, 4])
array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
return array
else:
n_crops, h, w, c = array.shape
h_patches = h // patch_size
w_patches = w // patch_size
array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
array = np.transpose(array, [0, 1, 3, 2, 4, 5])
array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
return array
def arange_for_pooling(
idx_arr: np.ndarray,
pool_h: int,
pool_w: int,
) -> np.ndarray:
h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
idx_arr = np.pad(
idx_arr,
[[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
mode="constant",
constant_values=-1,
)
return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
def image_to_patches_and_grids(
image: np.ndarray,
max_crops: int,
overlap_margins: list[int],
base_image_input_size: list[int],
resample: PILImageResampling,
image_mean: list[float],
image_std: list[float],
image_patch_size: int,
image_pooling_w: int,
image_pooling_h: int,
crop_mode: str = "overlap-and-resize-c2",
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
:return image_grids, the shape of each (low-res, high-res) image after pooling
:return crops, the image crops to processes with the ViT
:return pooled_patch_idx, for each patch_id tokens in `image_tokens`, the indices of the
patches in `crops` to pool for that token, masked with -1
"""
if isinstance(base_image_input_size, int):
base_image_input_size = (base_image_input_size, base_image_input_size)
base_image_input_d = image_patch_size
pooling_w = image_pooling_w
pooling_h = image_pooling_h
crop_patch_w = base_image_input_size[1] // base_image_input_d
crop_patch_h = base_image_input_size[0] // base_image_input_d
if crop_mode == "resize":
resized, resize_idx = build_resized_image(
image,
base_image_input_size,
resample,
image_mean,
image_std,
image_patch_size,
)
resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
resized_h, resized_w = resize_idx.shape[:2]
resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
image_grid = [np.array([resized_h, resized_w, 0, 0])]
return (
np.stack(image_grid, 0),
batch_pixels_to_patches(resized, image_patch_size),
resize_idx,
)
if crop_mode not in {"overlap-and-resize-c2", "overlap-and-resize"}:
raise ValueError(f"Unsupported MolmoAct2 image crop_mode {crop_mode!r}.")
crop_arr, patch_idx_arr = build_overlapping_crops(
image,
max_crops,
overlap_margins,
base_image_input_size,
resample,
image_mean,
image_std,
image_patch_size,
)
pooling_idx = arange_for_pooling(patch_idx_arr, pooling_h, pooling_w)
h, w = pooling_idx.shape[:2]
pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
# Finally do the same for the global image
resized, resize_idx = build_resized_image(
image,
base_image_input_size,
resample,
image_mean,
image_std,
image_patch_size,
)
crop_arr = np.concatenate([resized, crop_arr], 0)
resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
resized_h, resized_w = resize_idx.shape[:2]
resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
# Global image goes first, so the order of patches in previous crops gets increased
pooling_idx = np.where(pooling_idx >= 0, pooling_idx + crop_patch_h * crop_patch_w, -1)
pooling_idx = np.concatenate([resize_idx, pooling_idx])
image_grid = [np.array([resized_h, resized_w, h, w])]
return (np.stack(image_grid, 0), batch_pixels_to_patches(crop_arr, image_patch_size), pooling_idx)
class MolmoAct2ImagesKwargs(ImagesKwargs, total=False):
max_crops: int | None
overlap_margins: list[int] | None
crop_mode: str | None
patch_size: int | None
pooling_size: list[int] | None
class MolmoAct2ImageProcessor(BaseImageProcessor):
r"""
Constructs a MolmoAct2 image processor that preprocesses images for the model.
Args:
size (`dict[str, int]` *optional*, defaults to `{"height": 378, "width": 378}`):
Size of the image after resizing.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
Resampling filter to use when resizing the image.
image_mean (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
image_std (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
max_crops (`int`, *optional*, defaults to `8`):
Maximum number of crops to use per image.
overlap_margins (`list[int]`, *optional*, defaults to `[4, 4]`):
Overlap margins to use.
patch_size (`int`, *optional*, defaults to 14):
The spatial patch size of the vision encoder.
pooling_size (`list[int]`, *optional*, defaults to `[2, 2]`):
The pooling size of the vision adapter.
"""
model_input_names = ["pixel_values", "image_token_pooling", "image_grids", "image_num_crops"]
def __init__(
self,
size: dict[str, int] | None = None,
resample: PILImageResampling = PILImageResampling.BILINEAR,
image_mean: float | list[float] | None = None,
image_std: float | list[float] | None = None,
do_convert_rgb: bool = True,
max_crops: int = 8,
overlap_margins: list[int] = [4, 4],
crop_mode: str = "overlap-and-resize-c2",
patch_size: int = 14,
pooling_size: list[int] = [2, 2],
**kwargs,
) -> None:
super().__init__(**kwargs)
size = size if size is not None else {"height": 378, "width": 378}
size = get_size_dict(size, default_to_square=True)
self.size = size
self.resample = resample
self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
self.do_convert_rgb = do_convert_rgb
self.max_crops = max_crops
self.overlap_margins = overlap_margins
self.crop_mode = crop_mode
self.patch_size = patch_size
self.pooling_size = pooling_size
def preprocess(
self,
images: ImageInput,
size: dict[str, int] | None = None,
resample: PILImageResampling | None = None,
image_mean: float | list[float] | None = None,
image_std: float | list[float] | None = None,
do_convert_rgb: bool | None = None,
max_crops: int | None = None,
overlap_margins: list[int] | None = None,
crop_mode: str | None = None,
patch_size: int | None = None,
pooling_size: list[int] | None = None,
return_tensors: str | TensorType | None = None,
**kwargs,
) -> BatchFeature:
"""
Args:
images (`ImageInput`):
Image to preprocess.
size (`dict[str, int]`, *optional*, defaults to `self.size`):
Size of the image after resizing.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
Resampling filter to use when resizing the image. This can be one of the enum `PILImageResampling`. Only
has an effect if `do_resize` is set to `True`.
image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
`True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
max_crops (`int`, *optional*, defaults to `self.max_crops`):
Maximum number of crops to use per image.
overlap_margins (`list[int]`, *optional*, defaults to `self.overlap_margins`):
Overlap margins to use.
patch_size (`int`, *optional*, defaults to `self.patch_size`):
The spatial patch size of the vision encoder.
pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
The pooling size of the vision adapter.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
Returns:
A `BatchFeature` containing the following keys:
- `pixel_values`: The preprocessed images.
- `image_token_pooling`: The indices of the patches in `crops` to pool for each token in `image_tokens`.
- `image_grids`: The image grids.
- `image_num_crops`: The number of crops for each image.
"""
if size is not None:
if "height" not in size or "width" not in size:
raise ValueError("size must contain 'height' and 'width' keys.")
else:
size = {**self.size}
base_image_input_size = [size["height"], size["width"]]
resample = resample or self.resample
image_mean = image_mean or self.image_mean
image_std = image_std or self.image_std
do_convert_rgb = do_convert_rgb or self.do_convert_rgb
max_crops = max_crops or self.max_crops
overlap_margins = overlap_margins or self.overlap_margins
crop_mode = crop_mode or self.crop_mode
patch_size = patch_size or self.patch_size
pooling_size = pooling_size or self.pooling_size
image_pooling_h, image_pooling_w = pooling_size
if images is not None:
images = self.fetch_images(images)
images = make_flat_list_of_images(images)
if images is not None and not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
data = {}
if images is not None:
batch_grids = []
batch_crops = []
batch_pooled_patches_idx = []
batch_num_crops = []
for image in images:
image_grid, crops, pooled_idx = image_to_patches_and_grids(
image,
max_crops,
overlap_margins,
base_image_input_size,
resample,
image_mean,
image_std,
patch_size,
image_pooling_w,
image_pooling_h,
crop_mode,
)
batch_grids.append(image_grid)
batch_crops.append(crops)
batch_pooled_patches_idx.append(pooled_idx)
batch_num_crops.append(crops.shape[0])
pixel_values = np.concatenate(batch_crops, 0)
image_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
image_grids = np.concatenate(batch_grids, 0)
image_num_crops = np.array(batch_num_crops)
data.update(
pixel_values=pixel_values,
image_token_pooling=image_token_pooling,
image_grids=image_grids,
image_num_crops=image_num_crops,
)
return BatchFeature(data, tensor_type=return_tensors)
MolmoAct2ImageProcessor.register_for_auto_class()
@@ -0,0 +1,748 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa
"""Inference utilities for MolmoAct2"""
from dataclasses import dataclass
from typing import Any, Optional, Tuple
from collections.abc import Iterable, Sequence
import torch
from torch.nn import functional as F
from transformers.cache_utils import Cache
from transformers.configuration_utils import PretrainedConfig
@dataclass
class _ActionFlowInputs:
trajectory: torch.Tensor
context: Any
modulations: Sequence[Any]
action_dim_is_pad: torch.Tensor | None
@dataclass
class _ActionFlowCudaGraph:
key: tuple[Any, ...]
graph: torch.cuda.CUDAGraph
static_inputs: _ActionFlowInputs
output: torch.Tensor
@dataclass
class _DepthDecodeCudaGraphLayerStage:
residual: torch.Tensor
query: torch.Tensor
key: torch.Tensor
value: torch.Tensor
@dataclass
class _DepthDecodeCudaGraphPostStage:
graph: torch.cuda.CUDAGraph
attn_context: torch.Tensor
@dataclass
class _DepthDecodeCudaGraph:
cache_key: tuple[Any, ...]
pre_graph: torch.cuda.CUDAGraph
token_ids: torch.Tensor
cos: torch.Tensor
sin: torch.Tensor
positions: torch.Tensor
stages: Sequence[_DepthDecodeCudaGraphLayerStage]
post_graphs: Sequence[_DepthDecodeCudaGraphPostStage]
output: torch.Tensor
@dataclass
class _DepthDecodeCudaGraphSpec:
eligible: bool
cache_key_prefix: tuple[Any, ...]
num_hidden_layers: int
head_dim: int
num_attention_heads: int
def _cache_seq_len_int(past_key_values: Cache | None) -> int:
if past_key_values is None:
return 0
seq_len = past_key_values.get_seq_length()
if torch.is_tensor(seq_len):
return int(seq_len.item())
return int(seq_len)
def _cache_max_len_int(past_key_values: Cache | None) -> int:
if past_key_values is None:
return -1
max_len = past_key_values.get_max_cache_shape()
if torch.is_tensor(max_len):
return int(max_len.item())
return int(max_len)
def _iter_cache_key_values(
past_key_values: Cache,
) -> Iterable[tuple[torch.Tensor | None, torch.Tensor | None]]:
layers = getattr(past_key_values, "layers", None)
if layers is not None:
for layer in layers:
yield getattr(layer, "keys", None), getattr(layer, "values", None)
return
for layer in past_key_values:
yield layer[0], layer[1]
class _DepthDecodeStaticLayerCache:
is_compileable = False
is_sliding = False
def __init__(self, max_cache_len: int) -> None:
self.max_cache_len = int(max_cache_len)
self.cumulative_length = 0
self.keys: torch.Tensor | None = None
self.values: torch.Tensor | None = None
def _allocate(self, key_states: torch.Tensor, value_states: torch.Tensor) -> None:
bsz, n_heads = key_states.shape[:2]
self.keys = torch.empty(
(bsz, n_heads, self.max_cache_len, key_states.shape[-1]),
dtype=key_states.dtype,
device=key_states.device,
)
self.values = torch.empty(
(bsz, n_heads, self.max_cache_len, value_states.shape[-1]),
dtype=value_states.dtype,
device=value_states.device,
)
def update(
self,
key_states: torch.Tensor,
value_states: torch.Tensor,
*args,
**kwargs,
) -> tuple[torch.Tensor, torch.Tensor]:
if self.keys is None:
self._allocate(key_states, value_states)
start = self.cumulative_length
end = start + key_states.shape[-2]
if end > self.max_cache_len:
raise RuntimeError(f"KV cache length {end} exceeds max_cache_len={self.max_cache_len}.")
self.keys[:, :, start:end, :].copy_(key_states)
self.values[:, :, start:end, :].copy_(value_states)
self.cumulative_length = end
return self.keys[:, :, :end, :], self.values[:, :, :end, :]
def get_seq_length(self) -> int:
return self.cumulative_length
def get_max_cache_shape(self) -> int:
return -1
def reset(self) -> None:
self.cumulative_length = 0
class _DepthDecodeStaticCache(Cache):
def __init__(self, config: PretrainedConfig, max_cache_len: int) -> None:
text_config = config.get_text_config(decoder=True)
super().__init__(
layers=[
_DepthDecodeStaticLayerCache(max_cache_len=max_cache_len)
for _ in range(text_config.num_hidden_layers)
]
)
def get_seq_length(self, layer_idx: int = 0) -> int:
return self.layers[layer_idx].get_seq_length()
def get_max_cache_shape(self, layer_idx: int = 0) -> int:
return self.layers[layer_idx].get_max_cache_shape()
def reset(self) -> None:
for layer in self.layers:
layer.reset()
class ActionCudaGraphManager:
def __init__(self, model: Any) -> None:
self.model = model
self.enabled = True
self.action_flow_graph: _ActionFlowCudaGraph | None = None
def set_enabled(self, enabled: bool) -> None:
self.enabled = bool(enabled)
def can_use_action_flow(self, inputs: _ActionFlowInputs) -> bool:
action_model = self.model
if not self.enabled:
return False
if action_model.training or action_model._require_action_expert().training:
return False
if inputs.trajectory.device.type != "cuda":
return False
def all_on_cuda():
yield inputs.trajectory
for k, v in inputs.context.kv_contexts:
yield k
yield v
for t in (
inputs.context.cross_mask,
inputs.context.self_mask,
inputs.context.valid_action,
inputs.action_dim_is_pad,
):
if t is not None:
yield t
if inputs.context.rope_cache is not None:
yield from inputs.context.rope_cache
for step in inputs.modulations:
yield step.conditioning
for block_modulation in step.block_modulations:
yield from block_modulation
yield from step.final_modulation
return all(t.device.type == "cuda" for t in all_on_cuda())
def run_action_flow(
self,
inputs: _ActionFlowInputs,
steps: int,
run_loop,
) -> torch.Tensor:
key = _cuda_graph_key(inputs, steps)
cache = self.action_flow_graph
if cache is None or cache.key != key:
static_inputs = _clone_static_inputs(inputs)
graph, output = _capture_cuda_graph(
lambda: run_loop(static_inputs, steps),
inputs.trajectory.device,
after_warmup=lambda: static_inputs.trajectory.copy_(inputs.trajectory),
)
cache = _ActionFlowCudaGraph(
key=key,
graph=graph,
static_inputs=static_inputs,
output=output,
)
self.action_flow_graph = cache
else:
_copy_inputs_(cache.static_inputs, inputs)
cache.graph.replay()
return cache.output.clone()
class DepthDecodeCudaGraphManager:
def __init__(self, model: Any) -> None:
self.model = model
self.backbone = model.model
self.enabled = True
self.graph: _DepthDecodeCudaGraph | None = None
self.graph_spec: _DepthDecodeCudaGraphSpec | None = None
def set_enabled(self, enabled: bool) -> None:
self.enabled = bool(enabled)
def make_static_cache(self, max_cache_len: int) -> _DepthDecodeStaticCache:
return _DepthDecodeStaticCache(
config=self.model.config.text_config,
max_cache_len=max_cache_len,
)
def _depth_decode_spec(self) -> _DepthDecodeCudaGraphSpec:
static = self.graph_spec
if static is None:
cfg = self.backbone.transformer.config
rotary_emb = getattr(self.backbone.transformer, "rotary_emb", None)
static = _DepthDecodeCudaGraphSpec(
eligible=(
not cfg.norm_after
and cfg.rope_scaling_layers is None
and getattr(rotary_emb, "rope_type", None) == "default"
and cfg._attn_implementation == "sdpa"
),
cache_key_prefix=(
cfg.hidden_size,
cfg.num_attention_heads,
cfg.num_key_value_heads,
cfg.head_dim,
cfg.num_hidden_layers,
cfg.use_qk_norm,
cfg.qk_norm_type,
cfg._attn_implementation,
),
num_hidden_layers=cfg.num_hidden_layers,
head_dim=cfg.head_dim,
num_attention_heads=cfg.num_attention_heads,
)
self.graph_spec = static
return static
def can_use(
self,
next_input_ids: torch.Tensor,
*,
past_key_values: Cache,
attention_bias: torch.Tensor,
) -> bool:
if not self.enabled or self.model.training or self.backbone.transformer.training:
return False
if next_input_ids.device.type != "cuda":
return False
if next_input_ids.ndim != 2 or next_input_ids.shape[0] != 1 or next_input_ids.shape[1] != 1:
return False
if not isinstance(past_key_values, _DepthDecodeStaticCache):
return False
if not torch.is_tensor(attention_bias) or attention_bias.device != next_input_ids.device:
return False
return self._depth_decode_spec().eligible
def _depth_decode_key(
self,
next_input_ids: torch.Tensor,
attention_bias: torch.Tensor,
) -> tuple[Any, ...]:
device = next_input_ids.device
return (
self._depth_decode_spec().cache_key_prefix,
device.type,
device.index,
self.model.lm_head.weight.dtype,
attention_bias.shape[-1],
)
def _select_depth_decode_rope(self, cos: torch.Tensor, sin: torch.Tensor, *, past_length: int) -> None:
emb = self.backbone.transformer.rotary_emb
cos.copy_(emb._pos_cos_cache[0, :, past_length : past_length + 1, :])
sin.copy_(emb._pos_sin_cache[0, :, past_length : past_length + 1, :])
def _depth_decode_pre_layer(
self,
layer_idx: int,
hidden_states: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
block = self.backbone.transformer.blocks[layer_idx]
attention = block.self_attn
residual = hidden_states
hidden_states = block.attn_norm(hidden_states)
input_shape = hidden_states.shape[:-1]
hidden_shape = (*input_shape, -1, attention.head_dim)
qkv = attention.att_proj(hidden_states)
query_states, key_states, value_states = qkv.split(attention.fused_dims, dim=-1)
value_states = value_states.view(hidden_shape)
apply_qk_norm = attention.q_norm is not None and attention.k_norm is not None
norm_after_view = apply_qk_norm and attention.qk_norm_type == "qwen3"
if apply_qk_norm and not norm_after_view:
query_states = attention.q_norm(query_states)
key_states = attention.k_norm(key_states)
query_states = query_states.view(hidden_shape)
key_states = key_states.view(hidden_shape)
if norm_after_view:
query_states = attention.q_norm(query_states)
key_states = attention.k_norm(key_states)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
query_states, key_states = _apply_rotary_pos_emb(query_states, key_states, cos, sin)
return residual, query_states, key_states, value_states
def _depth_decode_pre0(
self,
token_ids: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
inputs_embeds = self.model._embed_base_tokens(token_ids)
return self._depth_decode_pre_layer(0, inputs_embeds, cos, sin)
def _depth_decode_post_layer(
self,
layer_idx: int,
residual: torch.Tensor,
attn_context: torch.Tensor,
) -> torch.Tensor:
block = self.backbone.transformer.blocks[layer_idx]
attention = block.self_attn
input_shape = residual.shape[:-1]
attn_output = attn_context.reshape(*input_shape, -1).contiguous()
attn_output = attention.attn_out(attn_output)
hidden_states = residual + block.dropout(attn_output)
residual = hidden_states
hidden_states = block.ff_norm(hidden_states)
hidden_states = block.mlp(hidden_states)
hidden_states = residual + block.dropout(hidden_states)
return hidden_states
def _depth_decode_post_and_pre_next(
self,
layer_idx: int,
residual: torch.Tensor,
attn_context: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
return self._depth_decode_pre_layer(layer_idx + 1, hidden_states, cos, sin)
def _depth_decode_last_post(
self,
layer_idx: int,
residual: torch.Tensor,
attn_context: torch.Tensor,
) -> torch.Tensor:
hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
return self.backbone.transformer.ln_f(hidden_states)
def _build_depth_decode_graph(
self,
next_input_ids: torch.Tensor,
*,
past_length: int,
attention_bias: torch.Tensor,
) -> _DepthDecodeCudaGraph:
text_config = self.backbone.transformer.config
device = next_input_ids.device
dtype = self.model.lm_head.weight.dtype
static = self._depth_decode_spec()
num_layers = static.num_hidden_layers
head_dim = static.head_dim
max_cache_len = int(attention_bias.shape[-1])
max_rope_len = max(int(text_config.max_position_embeddings or 0), max_cache_len)
self.backbone.transformer.prepare_rope_cache(device=device, max_seq_len=max_rope_len)
token_ids = torch.empty((1, 1), device=device, dtype=torch.long)
cos = torch.empty((1, 1, head_dim), device=device, dtype=dtype)
sin = torch.empty_like(cos)
positions = torch.arange(max_cache_len, device=device, dtype=torch.long)
context_shape = (1, 1, static.num_attention_heads, head_dim)
token_ids.copy_(next_input_ids)
self._select_depth_decode_rope(cos, sin, past_length=past_length)
pre_graph, pre_output = _capture_cuda_graph(
lambda: self._depth_decode_pre0(token_ids, cos, sin),
device,
)
stages = [_DepthDecodeCudaGraphLayerStage(*pre_output)]
post_graphs = []
for layer_idx in range(num_layers - 1):
stage = stages[-1]
attn_context = torch.empty(context_shape, device=device, dtype=dtype)
graph, output = _capture_cuda_graph(
lambda layer_idx=layer_idx, stage=stage, attn_context=attn_context: (
self._depth_decode_post_and_pre_next(
layer_idx,
stage.residual,
attn_context,
cos,
sin,
)
),
device,
)
post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=graph, attn_context=attn_context))
stages.append(_DepthDecodeCudaGraphLayerStage(*output))
last_stage = stages[-1]
last_attn_context = torch.empty(context_shape, device=device, dtype=dtype)
last_graph, last_output = _capture_cuda_graph(
lambda: self._depth_decode_last_post(
num_layers - 1,
last_stage.residual,
last_attn_context,
),
device,
)
post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=last_graph, attn_context=last_attn_context))
return _DepthDecodeCudaGraph(
cache_key=self._depth_decode_key(next_input_ids, attention_bias),
pre_graph=pre_graph,
token_ids=token_ids,
cos=cos,
sin=sin,
positions=positions,
stages=tuple(stages),
post_graphs=tuple(post_graphs),
output=last_output,
)
def _get_depth_decode_graph(
self,
next_input_ids: torch.Tensor,
*,
past_length: int,
attention_bias: torch.Tensor,
) -> _DepthDecodeCudaGraph:
key = self._depth_decode_key(next_input_ids, attention_bias)
decode_graph = self.graph
if decode_graph is None or decode_graph.cache_key != key:
decode_graph = self._build_depth_decode_graph(
next_input_ids,
past_length=past_length,
attention_bias=attention_bias,
)
self.graph = decode_graph
else:
decode_graph.token_ids.copy_(next_input_ids)
self._select_depth_decode_rope(decode_graph.cos, decode_graph.sin, past_length=past_length)
return decode_graph
def _run_depth_decode_attention_core(
self,
layer_idx: int,
stage: _DepthDecodeCudaGraphLayerStage,
*,
past_key_values: Cache,
attention_bias: torch.Tensor,
cache_position: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) -> torch.Tensor:
attention = self.backbone.transformer.blocks[layer_idx].self_attn
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
key_states, value_states = past_key_values.update(
stage.key,
stage.value,
layer_idx,
cache_kwargs,
)
key_states = _repeat_kv(key_states, attention.num_key_value_groups)
value_states = _repeat_kv(value_states, attention.num_key_value_groups)
attn_output = F.scaled_dot_product_attention(
stage.query,
key_states,
value_states,
attn_mask=attention_bias,
dropout_p=0.0,
is_causal=False,
)
return attn_output.transpose(1, 2)
def run(
self,
next_input_ids: torch.Tensor,
*,
past_key_values: Cache,
attention_bias: torch.Tensor,
past_length: int,
) -> tuple[torch.Tensor, Cache]:
end = past_length + 1
decode_graph = self._get_depth_decode_graph(
next_input_ids,
past_length=past_length,
attention_bias=attention_bias,
)
cache_position = decode_graph.positions[past_length:end]
attention_bias_q = attention_bias[:, :, past_length:end, :end]
decode_graph.pre_graph.replay()
for layer_idx, post_graph in enumerate(decode_graph.post_graphs):
attn_context = self._run_depth_decode_attention_core(
layer_idx,
decode_graph.stages[layer_idx],
past_key_values=past_key_values,
attention_bias=attention_bias_q,
cache_position=cache_position,
cos=decode_graph.cos,
sin=decode_graph.sin,
)
post_graph.attn_context.copy_(attn_context)
post_graph.graph.replay()
return decode_graph.output, past_key_values
def _cuda_graph_tensor_signature(
tensor: torch.Tensor | None,
) -> tuple[Any, ...] | None:
if tensor is None:
return None
return (
tuple(tensor.shape),
tuple(tensor.stride()),
str(tensor.dtype),
str(tensor.device),
)
def _cuda_graph_context_signature(context: Any) -> tuple[Any, ...]:
sig = _cuda_graph_tensor_signature
return (
tuple((sig(k), sig(v)) for k, v in context.kv_contexts),
sig(context.cross_mask),
sig(context.self_mask),
sig(context.valid_action),
None if context.rope_cache is None else tuple(sig(t) for t in context.rope_cache),
)
def _cuda_graph_modulation_signature(modulations: Sequence[Any]) -> tuple[Any, ...]:
sig = _cuda_graph_tensor_signature
return tuple(
(
sig(step.conditioning),
tuple(tuple(sig(t) for t in block_modulation) for block_modulation in step.block_modulations),
tuple(sig(t) for t in step.final_modulation),
)
for step in modulations
)
def _cuda_graph_key(inputs: _ActionFlowInputs, steps: int) -> tuple[Any, ...]:
sig = _cuda_graph_tensor_signature
return (
sig(inputs.trajectory),
_cuda_graph_context_signature(inputs.context),
_cuda_graph_modulation_signature(inputs.modulations),
sig(inputs.action_dim_is_pad),
int(steps),
)
def _clone_static_tensor(tensor: torch.Tensor | None) -> torch.Tensor | None:
if tensor is None:
return None
static = torch.empty_strided(
tuple(tensor.shape),
tuple(tensor.stride()),
device=tensor.device,
dtype=tensor.dtype,
)
static.copy_(tensor)
return static
def _clone_static_context(context: Any) -> Any:
rope_cache = None
if context.rope_cache is not None:
rope_cache = tuple(_clone_static_tensor(t) for t in context.rope_cache)
return context.__class__(
kv_contexts=tuple((_clone_static_tensor(k), _clone_static_tensor(v)) for k, v in context.kv_contexts),
cross_mask=_clone_static_tensor(context.cross_mask),
self_mask=_clone_static_tensor(context.self_mask),
valid_action=_clone_static_tensor(context.valid_action),
rope_cache=rope_cache,
)
def _clone_static_modulations(modulations: Sequence[Any]) -> Sequence[Any]:
return tuple(
step.__class__(
conditioning=_clone_static_tensor(step.conditioning),
block_modulations=tuple(
tuple(_clone_static_tensor(t) for t in block_modulation)
for block_modulation in step.block_modulations
),
final_modulation=tuple(_clone_static_tensor(t) for t in step.final_modulation),
)
for step in modulations
)
def _clone_static_inputs(inputs: _ActionFlowInputs) -> _ActionFlowInputs:
return _ActionFlowInputs(
trajectory=_clone_static_tensor(inputs.trajectory),
context=_clone_static_context(inputs.context),
modulations=_clone_static_modulations(inputs.modulations),
action_dim_is_pad=_clone_static_tensor(inputs.action_dim_is_pad),
)
def _copy_context_(dst: Any, src: Any) -> None:
for (dst_k, dst_v), (src_k, src_v) in zip(dst.kv_contexts, src.kv_contexts):
dst_k.copy_(src_k)
dst_v.copy_(src_v)
if src.cross_mask is not None:
dst.cross_mask.copy_(src.cross_mask)
if src.self_mask is not None:
dst.self_mask.copy_(src.self_mask)
if src.valid_action is not None:
dst.valid_action.copy_(src.valid_action)
if src.rope_cache is not None:
for dst_tensor, src_tensor in zip(dst.rope_cache, src.rope_cache):
dst_tensor.copy_(src_tensor)
def _copy_inputs_(dst: _ActionFlowInputs, src: _ActionFlowInputs) -> None:
dst.trajectory.copy_(src.trajectory)
_copy_context_(dst.context, src.context)
if src.action_dim_is_pad is not None:
dst.action_dim_is_pad.copy_(src.action_dim_is_pad)
def _rotate_half(x: torch.Tensor) -> torch.Tensor:
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)
def _apply_rotary_pos_emb(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
unsqueeze_dim: int = 1,
) -> tuple[torch.Tensor, torch.Tensor]:
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (_rotate_half(q) * sin)
k_embed = (k * cos) + (_rotate_half(k) * sin)
return q_embed, k_embed
def _repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
def _capture_cuda_graph(
fn,
device: torch.device,
*,
after_warmup=None,
) -> tuple[torch.cuda.CUDAGraph, Any]:
warmup_stream = torch.cuda.Stream(device=device)
warmup_stream.wait_stream(torch.cuda.current_stream(device))
with torch.cuda.stream(warmup_stream):
fn()
torch.cuda.current_stream(device).wait_stream(warmup_stream)
if after_warmup is not None:
after_warmup()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
output = fn()
return graph, output
File diff suppressed because it is too large Load Diff
@@ -0,0 +1,431 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa
"""
Processor class for MolmoAct2.
"""
from typing import Optional, Union
import dataclasses
import numpy as np
from transformers.image_utils import ImageInput
from transformers.video_utils import VideoInput
from transformers.processing_utils import (
Unpack,
ProcessingKwargs,
ProcessorMixin,
)
from transformers.feature_extraction_utils import BatchFeature
from transformers.tokenization_utils_base import TextInput, PreTokenizedInput
from transformers.utils import logging
from transformers import AutoTokenizer
from .image_processing_molmoact2 import MolmoAct2ImagesKwargs, MolmoAct2ImageProcessor
from .video_processing_molmoact2 import MolmoAct2VideoProcessorKwargs, MolmoAct2VideoProcessor
logger = logging.get_logger(__name__)
# Special tokens, these should be present in any tokenizer we use since the preprocessor uses them
IMAGE_PATCH_TOKEN = f"<im_patch>" # Where to insert high-res tokens
IMAGE_LOW_RES_TOKEN = f"<im_low>" # Where to insert low-res tokens
IM_START_TOKEN = f"<im_start>"
LOW_RES_IMAGE_START_TOKEN = f"<low_res_im_start>"
FRAME_START_TOKEN = f"<frame_start>"
IM_END_TOKEN = f"<im_end>"
FRAME_END_TOKEN = f"<frame_end>"
IM_COL_TOKEN = f"<im_col>"
IMAGE_PROMPT = "<|image|>"
VIDEO_PROMPT = "<|video|>"
IMAGE_TOKENS = [
IMAGE_PATCH_TOKEN,
IM_COL_TOKEN,
IM_START_TOKEN,
LOW_RES_IMAGE_START_TOKEN,
FRAME_START_TOKEN,
IM_END_TOKEN,
FRAME_END_TOKEN,
IMAGE_LOW_RES_TOKEN,
]
class MolmoAct2ProcessorKwargs(ProcessingKwargs, total=False):
"""MolmoAct2 processor kwargs"""
images_kwargs: MolmoAct2ImagesKwargs
videos_kwargs: MolmoAct2VideoProcessorKwargs
_defaults = {
"text_kwargs": {
"padding": False,
"return_mm_token_type_ids": True,
},
"videos_kwargs": {"return_metadata": True},
}
class MolmoAct2Processor(ProcessorMixin):
attributes = ["image_processor", "video_processor", "tokenizer"]
optional_attributes = [
"chat_template",
"time_mode",
"image_use_col_tokens",
"use_single_crop_col_tokens",
"use_single_crop_start_token",
"video_use_col_tokens",
"use_frame_special_tokens",
]
image_processor_class = "AutoImageProcessor"
video_processor_class = "AutoVideoProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor: MolmoAct2ImageProcessor = None,
video_processor: MolmoAct2VideoProcessor = None,
tokenizer: AutoTokenizer = None,
chat_template: str | None = None,
image_use_col_tokens: bool | None = True,
use_single_crop_col_tokens: bool | None = None,
use_single_crop_start_token: bool | None = True,
video_use_col_tokens: bool | None = False,
use_frame_special_tokens: bool | None = True,
**kwargs,
) -> None:
super().__init__(
image_processor,
video_processor,
tokenizer,
chat_template=chat_template,
)
self.image_use_col_tokens = image_use_col_tokens
self.use_single_crop_col_tokens = use_single_crop_col_tokens
self.use_single_crop_start_token = use_single_crop_start_token
self.video_use_col_tokens = video_use_col_tokens
self.use_frame_special_tokens = use_frame_special_tokens
self.image_placeholder_token = IMAGE_PROMPT
self.video_placeholder_token = VIDEO_PROMPT
self.image_token_ids = [tokenizer.convert_tokens_to_ids(token) for token in IMAGE_TOKENS]
def get_image_tokens(self, image_grid: np.ndarray):
resized_h, resized_w, height, width = image_grid
if int(height) == 0 or int(width) == 0:
per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
use_single_crop_col_tokens = (
self.image_use_col_tokens
if self.use_single_crop_col_tokens is None
else self.use_single_crop_col_tokens
)
if use_single_crop_col_tokens:
per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
joint = [
[IM_START_TOKEN],
np.tile(per_row, [resized_h]),
[IM_END_TOKEN],
]
return np.concatenate(joint)
per_row = np.full(width, IMAGE_PATCH_TOKEN)
if self.image_use_col_tokens:
per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
joint = [
[IM_START_TOKEN],
np.tile(per_row, [height]),
[IM_END_TOKEN],
]
per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
use_single_crop_col_tokens = (
self.image_use_col_tokens
if self.use_single_crop_col_tokens is None
else self.use_single_crop_col_tokens
)
image_start_token = LOW_RES_IMAGE_START_TOKEN if self.use_single_crop_start_token else IM_START_TOKEN
if use_single_crop_col_tokens:
per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
joint = [
[image_start_token],
np.tile(per_row, [resized_h]),
[IM_END_TOKEN],
] + joint
return np.concatenate(joint)
def get_video_string(
self,
video_grid: np.ndarray,
timestamps: np.ndarray,
):
if self.use_frame_special_tokens:
start_token_id = FRAME_START_TOKEN
end_token_id = FRAME_END_TOKEN
else:
start_token_id = IM_START_TOKEN
end_token_id = IM_END_TOKEN
num_frames, h, w = video_grid
video_string: str = ""
for frame_idx, frame_time in enumerate(timestamps):
# `per-frame-compact` time mode
prev_space = " " if frame_idx > 0 else ""
frame_prefix = prev_space + f"{frame_time:.1f} " # explicit whitespace before/after image tokens
video_string += frame_prefix
per_row = np.full(w, IMAGE_PATCH_TOKEN)
if self.video_use_col_tokens:
per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
extra_tokens = np.tile(per_row, [h])
video_tokens = [
[start_token_id],
extra_tokens,
[end_token_id],
]
video_string += "".join(np.concatenate(video_tokens, 0))
return video_string
def insert_bos(
self,
input_ids: np.ndarray,
attention_mask: np.ndarray,
bos_token_id: int,
pad_token_id: int,
):
"""
Args:
input_ids: [B, S] array with left padding
attention_mask: [B, S] array (0 for pad, 1 for valid)
bos_token_id: int
pad_token_id: int
Returns:
input_ids_out: [B, S] or [B, S+1] array with bos inserted if needed
attention_mask_out: same shape as input_ids_out
"""
need_to_expand = len(input_ids.shape) == 1
if need_to_expand:
input_ids = input_ids[None, :]
attention_mask = attention_mask[None, :]
B, S = input_ids.shape
# Handle zero-length sequence
if S == 0:
new_input_ids = np.full((B, 1), bos_token_id, dtype=input_ids.dtype)
new_attention_mask = np.ones((B, 1), dtype=attention_mask.dtype)
if need_to_expand:
new_input_ids = new_input_ids[0]
new_attention_mask = new_attention_mask[0]
return new_input_ids, new_attention_mask
first_valid_index = (attention_mask == 1).argmax(axis=-1) # [B]
bos_already_present = np.all(input_ids[np.arange(B), first_valid_index] == bos_token_id)
if bos_already_present:
if need_to_expand:
input_ids = input_ids[0]
attention_mask = attention_mask[0]
return input_ids, attention_mask
else:
new_input_ids = np.full((B, S + 1), pad_token_id, dtype=input_ids.dtype)
new_attention_mask = np.zeros((B, S + 1), dtype=attention_mask.dtype)
src_idx = np.tile(np.arange(S), (B, 1)) # [B, S]
valid_mask = src_idx >= first_valid_index[:, None] # [B, S]
tgt_idx = src_idx + 1 # shit right
batch_idx = np.tile(np.arange(B)[:, None], (1, S)) # [B, S]
# flatten valid_positions
flat_vals = input_ids[valid_mask]
flat_batch = batch_idx[valid_mask]
flat_tgt = tgt_idx[valid_mask]
new_input_ids[flat_batch, flat_tgt] = flat_vals
new_attention_mask[flat_batch, flat_tgt] = 1
insert_pos = first_valid_index
new_input_ids[np.arange(B), insert_pos] = bos_token_id
new_attention_mask[np.arange(B), insert_pos] = 1
if need_to_expand:
new_input_ids = new_input_ids[0]
new_attention_mask = new_attention_mask[0]
return new_input_ids, new_attention_mask
def __call__(
self,
text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
images: ImageInput = None,
videos: VideoInput = None,
**kwargs: Unpack[MolmoAct2ProcessorKwargs],
) -> BatchFeature:
"""
Args:
text (`str`, `list[str]`, `list[list[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
videos (`dict[str, Any]` or `list[dict[str, Any]]`):
The video or batch of videos to be prepared. Each video can be a dictionary with the following keys:
- `"frames"`: `np.ndarray` of shape (T, H, W, 3)
- `"timestamps"`: `np.ndarray` of shape (T,)
- `"sampled_fps"`: `float` (optional)
- `"sampling_augmentation"`: `str` (optional)
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
`BatchFeature`: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not `None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **image_token_pooling** -- Indices of the patches in `image_grids` to pool for each token in `image_tokens`.
Returned when `images` is not `None`.
- **image_grids** -- Grids of images. Returned when `images` is not `None`.
- **image_num_crops** -- Number of crops for each image. Returned when `images` is not `None`.
- **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
- **video_token_pooling** -- Indices of the patches in `video_grids` to pool for each token in `video_tokens`.
Returned when `videos` is not `None`.
- **video_grids** -- Grids of videos. Returned when `videos` is not `None`.
"""
output_kwargs = self._merge_kwargs(
MolmoAct2ProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
if images is not None:
image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
image_grids = image_inputs["image_grids"]
else:
image_inputs = {}
image_grids = None
if videos is not None:
videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
video_grids = videos_inputs["video_grids"]
# If user has not requested video metadata, pop it
if "return_metadata" not in kwargs:
video_metadata = videos_inputs.pop("video_metadata")
else:
video_metadata = videos_inputs["video_metadata"]
else:
videos_inputs = {}
video_grids = None
if not isinstance(text, list):
text = [text]
text = text.copy() # below lines change text in-place
if image_grids is not None:
index = 0
for i in range(len(text)):
num_images = text[i].count(self.image_placeholder_token)
image_grids_i = image_grids[index : index + num_images]
for image_grid in image_grids_i:
image_tokens = self.get_image_tokens(image_grid)
image_string = "".join(image_tokens)
text[i] = text[i].replace(self.image_placeholder_token, image_string, 1)
index += num_images
if video_grids is not None:
index = 0
for i in range(len(text)):
num_videos = text[i].count(self.video_placeholder_token)
assert num_videos in {0, 1}, "At most one video is supported for now"
video_grids_i = video_grids[index : index + num_videos]
metadata_i = video_metadata[index : index + num_videos]
for video_grid, metadata in zip(video_grids_i, metadata_i):
video_string = self.get_video_string(
video_grid,
metadata.timestamps,
)
text[i] = text[i].replace(self.video_placeholder_token, video_string, 1)
index += num_videos
return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", False)
text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
input_ids = text_inputs["input_ids"]
attention_mask = text_inputs["attention_mask"]
input_ids = np.array(input_ids)
attention_mask = np.array(attention_mask)
bos = self.tokenizer.bos_token_id or self.tokenizer.eos_token_id
input_ids, attention_mask = self.insert_bos(
input_ids, attention_mask, bos, self.tokenizer.pad_token_id
)
if return_mm_token_type_ids:
image_tokens = np.array(self.image_token_ids).astype(input_ids.dtype)
token_type_ids = np.any(input_ids[:, :, None] == image_tokens[None, None, :], axis=-1)
text_inputs["token_type_ids"] = token_type_ids.tolist()
text_inputs["input_ids"] = input_ids.tolist()
text_inputs["attention_mask"] = attention_mask.tolist()
return BatchFeature(
data={**text_inputs, **image_inputs, **videos_inputs},
tensor_type=return_tensors,
)
def post_process_image_text_to_text(
self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
):
"""
Post-process the output of the model to decode the text.
Args:
generated_outputs (`torch.Tensor` or `np.ndarray`):
The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
or `(sequence_length,)`.
skip_special_tokens (`bool`, *optional*, defaults to `True`):
Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
**kwargs:
Additional arguments to be passed to the tokenizer's `batch_decode method`.
Returns:
`list[str]`: The decoded text.
"""
return self.tokenizer.batch_decode(
generated_outputs,
skip_special_tokens=skip_special_tokens,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs,
)
MolmoAct2Processor.register_for_auto_class()
@@ -0,0 +1,997 @@
#!/usr/bin/env python
# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ruff: noqa
"""Video processor class for MolmoAct2"""
from functools import partial
import os
import warnings
from contextlib import redirect_stdout
from io import BytesIO
from urllib.parse import urlparse
from typing import Optional, Union
from collections.abc import Callable
import numpy as np
import requests
import einops
import torch
import torchvision.transforms
from transformers.image_utils import (
IMAGENET_STANDARD_MEAN,
IMAGENET_STANDARD_STD,
ImageInput,
PILImageResampling,
SizeDict,
validate_kwargs,
)
from transformers.video_utils import (
VideoInput,
is_valid_video,
make_batched_videos,
make_batched_metadata,
VideoMetadata,
)
from transformers.processing_utils import Unpack, VideosKwargs
from transformers.video_processing_utils import BaseVideoProcessor
from transformers.utils import logging
from transformers.feature_extraction_utils import BatchFeature
from transformers.utils import (
is_av_available,
is_decord_available,
is_torchcodec_available,
is_yt_dlp_available,
TensorType,
logging,
to_numpy,
)
logger = logging.get_logger(__name__)
MAX_VIDEO_FPS = 8
def normalize_image(
image: np.ndarray,
image_mean: list[float],
image_std: list[float],
) -> np.ndarray:
if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
image -= np.array(image_mean, dtype=np.float32)[None, None, :]
image /= np.array(image_std, dtype=np.float32)[None, None, :]
return image
def resize_image(
image: np.ndarray,
desired_output_size: list[int],
resample: PILImageResampling,
) -> np.ndarray:
if len(image.shape) == 3:
is_video = False
image = torch.permute(torch.from_numpy(image), [2, 0, 1])
else:
is_video = True
image = torch.permute(torch.from_numpy(image), [0, 3, 1, 2])
dtype = image.dtype
if torch.is_floating_point(image):
in_min = 0.0
in_max = 1.0
resized = torchvision.transforms.Resize(
desired_output_size,
resample,
antialias=False,
)(image)
resized = torch.clip(resized, 0.0, 1.0).to(dtype)
else:
assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
image.dtype
)
in_min = 0.0
in_max = 255.0
resized = torchvision.transforms.Resize(
desired_output_size,
resample,
antialias=False,
)(image)
resized = torch.clip(resized, 0, 255).to(dtype)
resized = resized.to(torch.float32)
resized = (resized - in_min) / (in_max - in_min)
if is_video:
resized = torch.permute(resized, [0, 2, 3, 1]).numpy()
else:
resized = torch.permute(resized, [1, 2, 0]).numpy()
return resized
def build_resized_image(
image: np.ndarray,
base_image_input_size: list[int],
resample: PILImageResampling,
image_mean: list[float],
image_std: list[float],
image_patch_size: int,
) -> tuple[np.ndarray, np.ndarray]:
resized = resize_image(
image,
base_image_input_size,
resample,
)
resized = normalize_image(resized, image_mean, image_std)
if len(resized.shape) == 3:
resized = np.expand_dims(resized, 0)
crop_patch_w = base_image_input_size[1] // image_patch_size
crop_patch_h = base_image_input_size[0] // image_patch_size
resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
return resized, resize_idx
def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
"""Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
if len(array.shape) == 3:
n_crops, h, w = array.shape
h_patches = h // patch_size
w_patches = w // patch_size
array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
array = np.transpose(array, [0, 1, 3, 2, 4])
array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
return array
else:
n_crops, h, w, c = array.shape
h_patches = h // patch_size
w_patches = w // patch_size
array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
array = np.transpose(array, [0, 1, 3, 2, 4, 5])
array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
return array
def arange_for_pooling(
idx_arr: np.ndarray,
pool_h: int,
pool_w: int,
) -> np.ndarray:
h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
idx_arr = np.pad(
idx_arr,
[[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
mode="constant",
constant_values=-1,
)
return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
def image_to_patches_and_grids(
image: ImageInput,
base_image_input_size: list[int],
resample: PILImageResampling,
image_mean: list[float],
image_std: list[float],
image_patch_size: int,
image_pooling_w: int,
image_pooling_h: int,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
:return image_grids, the shape of each image after pooling
:return crops, the image crops to processes with the ViT
:return pooled_patch_idx, for each patch_id tokens in `image_tokens`, the indices of the
patches in `crops` to pool for that token, masked with -1
"""
if isinstance(base_image_input_size, int):
base_image_input_size = (base_image_input_size, base_image_input_size)
pooling_w = image_pooling_w
pooling_h = image_pooling_h
resized, resize_idx = build_resized_image(
image,
base_image_input_size,
resample,
image_mean,
image_std,
image_patch_size,
)
pooling_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
h, w = pooling_idx.shape[:2]
pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
image_grid = [h, w]
return (
image_grid,
batch_pixels_to_patches(resized, image_patch_size),
pooling_idx,
)
def get_candidate_target_fps(
video_fps: int | float,
sampling_fps: int | float,
max_fps: int | float = MAX_VIDEO_FPS,
) -> list[float]:
"""
Return the subset of `video_fps` factors that remain multiples of `sampling_fps`.
Examples:
>>> get_candidate_target_fps(video_fps=6, sampling_fps=2)
[2, 6]
>>> get_candidate_target_fps(video_fps=5, sampling_fps=1)
[1, 5]
>>> get_candidate_target_fps(video_fps=2, sampling_fps=2)
[2]
>>> get_candidate_target_fps(video_fps=5, sampling_fps=2)
Traceback (most recent call last):
...
ValueError: sampling_fps=2 must divide video_fps=5 to produce consistent frame steps.
"""
video_fps = int(video_fps)
sampling_fps = int(sampling_fps)
max_fps = int(max_fps)
if sampling_fps is None:
raise ValueError("sampling_fps must be provided")
if video_fps <= 0 or sampling_fps <= 0:
raise ValueError(f"video_fps and sampling_fps must be positive (got {video_fps}, {sampling_fps})")
if video_fps % sampling_fps != 0:
raise ValueError(f"sampling_fps={sampling_fps} must divide video_fps={video_fps}.")
candidates = []
for candidate in range(sampling_fps, video_fps + 1, sampling_fps):
if candidate > max_fps:
break
if video_fps % candidate == 0:
candidates.append(float(candidate))
return candidates
def read_video_decord(
video_path,
sample_timestamps_fn: Callable,
**kwargs,
) -> np.ndarray:
"""
Decode a video using the Decord backend.
Args:
video_path (`str`):
Path to the video file.
sample_timestamps_fn (`Callable`):
A callable function that will return timestamps at which the video should be sampled.
Returns:
tuple[`np.array`, `VideoMetadata`]: A tuple containing:
- Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
- `VideoMetadata` object.
"""
# Lazy import from decord
import importlib
decord = importlib.import_module("decord")
vr = decord.VideoReader(uri=video_path, ctx=decord.cpu(0)) # decord has problems with gpu
video_fps = vr.get_avg_fps()
total_num_frames = len(vr)
time_stamps = vr.get_frame_timestamp(list(range(len(vr))))
duration = time_stamps[-1][1] - time_stamps[0][0]
metadata = VideoMetadata(
total_num_frames=int(total_num_frames),
fps=float(video_fps),
duration=float(duration),
video_backend="decord",
)
target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
target_timestamps = np.array(target_timestamps)
offset = time_stamps[0, 0]
ix = np.searchsorted(time_stamps[:, 1], target_timestamps + offset, side="right")
ix = np.minimum(ix, len(time_stamps) - 1)
video = vr.get_batch(ix).asnumpy()
metadata.update(
{
"frames_indices": target_timestamps * video_fps,
"height": video.shape[1],
"width": video.shape[2],
}
)
return video, metadata
def read_video_torchcodec(
video_path,
sample_timestamps_fn: Callable,
**kwargs,
) -> np.ndarray:
"""
Decode a video using torchcodec decoder.
Args:
video_path (`str`):
Path to the video file.
sample_timestamps_fn (`Callable`):
A callable function that will return timestamps at which the video should be sampled.
Returns:
tuple[`np.array`, `VideoMetadata`]: A tuple containing:
- Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
- `VideoMetadata` object.
"""
# Lazy import torchcodec
import importlib
torchcodec = importlib.import_module("torchcodec")
decoder = torchcodec.decoders.VideoDecoder(
video_path,
# Interestingly `exact` mode takes less than approximate when we load the whole video
seek_mode="exact",
# Allow FFmpeg decide on the number of threads for efficiency
num_ffmpeg_threads=0,
)
# If the first frame starts at > 0, we effectively clip the video starting at that time
# since (most) video players would also skip to that time
time_offset = decoder.metadata.begin_stream_seconds_from_content
# Note this duration does assume we started playing at `time_offset`
duration = decoder.metadata.duration_seconds
metadata = VideoMetadata(
total_num_frames=decoder.metadata.num_frames,
fps=decoder.metadata.average_fps,
duration=duration,
video_backend="torchcodec",
height=decoder.metadata.height,
width=decoder.metadata.width,
)
target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
# Floating point/rounding issues might cause `target_timestamps` to be very slightly
# out-of-bounds, to handle this we sanity check then clip them
assert all(x >= 0 for x in target_timestamps)
assert all(x < duration + 1e-6 for x in target_timestamps)
# 1e-6 padding since torchcodec can throw out-of-bounds errors even if you ask for the
# exact boundary value, we should still get the first/last frame anyway
max_timestamp = decoder.metadata.end_stream_seconds_from_content - 1e-6
min_timestamp = decoder.metadata.begin_stream_seconds_from_content + 1e-6
# Note we avoid using numpy ops here to reduce floating precision issues
timestamps = [x + time_offset for x in target_timestamps]
timestamps = [max(min_timestamp, min(max_timestamp, x)) for x in timestamps]
video = (
decoder.get_frames_played_at(timestamps).data.numpy().transpose(0, 2, 3, 1)
) # Convert to THWC format
target_timestamps = np.array(target_timestamps)
metadata.frames_indices = target_timestamps * metadata.fps
return video, metadata
def read_video_pyav(
video_path,
sample_timestamps_fn: Callable,
**kwargs,
) -> np.ndarray:
"""
Decode a video using the PyAV backend.
Args:
video_path (`str`):
Path to the video file.
sample_timestamps_fn (`Callable`):
A callable function that will return timestamps at which the video should be sampled.
Returns:
tuple[`np.array`, `VideoMetadata`]: A tuple containing:
- Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
- `VideoMetadata` object.
"""
# Lazy import torchcodec
import importlib
av = importlib.import_module("av")
with av.open(video_path) as container:
video_stream = container.streams.video[0]
fps = video_stream.average_rate or video_stream.guessed_rate
it = container.decode(video=0)
frames = list(it)
stream = container.streams.video[0]
start = frames[0].pts * stream.time_base
container_end = stream.duration
if container_end is not None:
container_end *= stream.time_base
if container_end is None or container_end < frames[-1].pts:
# Some problem with stream duration, so use the frame PTS directly
# and guess the duration of the last frame
end = frames[-1].pts * stream.time_base + 1 / fps
else:
end = container_end
duration = float(end - start)
metadata = VideoMetadata(
total_num_frames=len(frames),
fps=float(fps),
duration=float(duration),
video_backend="pyav",
height=video_stream.height,
width=video_stream.width,
)
target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
offset = float(start)
target_timestamps = np.array(target_timestamps)
end_time_stamps = np.array([float(frame.pts * stream.time_base) for frame in frames[1:]] + [duration])
indices = np.searchsorted(end_time_stamps, target_timestamps + offset, side="right")
indices = np.minimum(indices, len(end_time_stamps) - 1)
video = np.stack(
[frames[i].to_ndarray(format="rgb24", channel_last=True) for i in indices],
axis=0,
)
metadata.frames_indices = target_timestamps * fps
return video, metadata
VIDEO_DECODERS = {
"decord": read_video_decord,
"torchcodec": read_video_torchcodec,
"pyav": read_video_pyav,
}
def load_video(
video: VideoInput,
backend: str = "decord",
sample_timestamps_fn: Callable | None = None,
**kwargs,
):
"""
Loads `video` to a numpy array.
Args:
video (`VideoInput`):
The video to convert to the numpy array format. Can be a link to video or local path.
backend (`str`, *optional*, defaults to `"decord"`):
The backend to use when loading the video. Can be any of ["decord", "pyav", ""torchcodec"]. Defaults to "decord".
sample_timestamps_fn (`Callable`):
A callable function that will return timestamps at which the video should be sampled.
"""
# Early exit if provided an array or `PIL` frames
if not isinstance(video, str):
metadata = [None] * len(video)
return video, metadata
if urlparse(video).netloc in ["www.youtube.com", "youtube.com"]:
if not is_yt_dlp_available():
raise ImportError("To load a video from YouTube url you have to install `yt_dlp` first.")
# Lazy import from yt_dlp
import importlib
yt_dlp = importlib.import_module("yt_dlp")
buffer = BytesIO()
with redirect_stdout(buffer), yt_dlp.YoutubeDL() as f:
f.download([video])
bytes_obj = buffer.getvalue()
file_obj = BytesIO(bytes_obj)
elif video.startswith("http://") or video.startswith("https://"):
file_obj = BytesIO(requests.get(video, timeout=10).content)
elif os.path.isfile(video):
file_obj = video
else:
raise TypeError(
"Incorrect format used for video. Should be an url linking to an video or a local path."
)
# can also load with decord, but not cv2/torchvision
# both will fail in case of url links
video_is_url = video.startswith("http://") or video.startswith("https://")
if video_is_url and backend == "opencv":
raise ValueError("If you are trying to load a video from URL, you cannot use 'opencv' as backend")
if (
(not is_decord_available() and backend == "decord")
or (not is_torchcodec_available() and backend == "torchcodec")
or (not is_av_available() and backend == "pyav")
):
raise ImportError(
f"You chose backend={backend} for loading the video but the required library is not found in your environment "
f"Make sure to install {backend} before loading the video."
)
video_decoder = VIDEO_DECODERS[backend]
video, metadata = video_decoder(file_obj, sample_timestamps_fn, **kwargs)
return video, metadata
def get_target_fps(
video_fps: float,
max_frames: int,
total_frames: int,
frame_sample_mode: str,
candidate_target_fps: tuple[float],
) -> float:
"""
Get the target fps that best spans the video and has the most frames sampled
"""
num_frames_sampled = 0
selected_target_fps = None
for target_fps in candidate_target_fps:
step_size = max(int(video_fps / target_fps), 1)
num_frames_sampled_at_fps = int(total_frames / step_size)
if num_frames_sampled == 0:
if "uniform" in frame_sample_mode:
if num_frames_sampled_at_fps > max_frames:
break
selected_target_fps = target_fps
num_frames_sampled = num_frames_sampled_at_fps
else:
# the candidate sampling fps increases so frame count can't decrease
assert num_frames_sampled <= num_frames_sampled_at_fps
if num_frames_sampled_at_fps > max_frames:
# choose the sampling fps that spans the video
continue
elif num_frames_sampled_at_fps > num_frames_sampled:
# both are less than max_frames, choose the one with higher density of frames sampled
selected_target_fps = target_fps
num_frames_sampled = num_frames_sampled_at_fps
return selected_target_fps
def get_frame_times_and_chosen_fps(selected_target_fps, total_frames, max_frames, video_fps):
if selected_target_fps is None:
frame_indices = np.linspace(0, total_frames, max_frames, endpoint=False, dtype=int)
else:
step_size = max(int(video_fps / selected_target_fps), 1)
frame_indices = np.arange(0, total_frames, step_size)
if len(frame_indices) > max_frames:
frame_indices = frame_indices[:max_frames]
return selected_target_fps, frame_indices
class MolmoAct2VideoProcessorKwargs(VideosKwargs, total=False):
patch_size: int | None
pooling_size: list[int] | None
frame_sample_mode: str | None
max_fps: int | None
sampling_fps: int | None
class MolmoAct2VideoProcessor(BaseVideoProcessor):
resample = PILImageResampling.BILINEAR
size = {"height": 378, "width": 378}
image_mean = IMAGENET_STANDARD_MEAN
image_std = IMAGENET_STANDARD_STD
do_resize = True
do_rescale = True
do_normalize = True
do_convert_rgb = True
patch_size = 14
pooling_size = [3, 3]
do_sample_frames = True
frame_sample_mode = "uniform_last_frame"
max_fps = 2
sampling_fps = 2
valid_kwargs = MolmoAct2VideoProcessorKwargs
model_input_names = ["pixel_values_videos", "video_token_pooling", "video_grids"]
def __init__(self, **kwargs: Unpack[MolmoAct2VideoProcessorKwargs]):
super().__init__(**kwargs)
if self.size is not None and (
self.size.get("height", None) is None or self.size.get("width", None) is None
):
raise ValueError("size must contain 'height' and 'width' keys.")
def _further_process_kwargs(
self,
size: SizeDict | None = None,
**kwargs,
) -> dict:
"""
Update kwargs that need further processing before being validated
Can be overridden by subclasses to customize the processing of kwargs.
"""
if size is not None and ("height" not in size or "width" not in size):
raise ValueError("size must contain 'height' and 'width' keys.")
return super()._further_process_kwargs(size=size, **kwargs)
def sample_times(
self,
metadata: VideoMetadata,
frame_sample_mode: str,
num_frames: int,
max_fps: int | None = None,
sampling_fps: int | None = None,
**kwargs,
) -> np.ndarray:
"""
Time-based sampling if an array video is passed
Args:
metadata (`VideoMetadata`):
Metadata of the video containing information about total duration, fps and total number of frames.
frame_sample_mode (`str`, *optional*):
Mode to sample frames. Defaults to `self.frame_sample_mode`.
num_frames (`int`, *optional*):
Maximum number of frames to sample. Defaults to `self.num_frames`.
man_fps (`int`, *optional*):
Maximum frames per second to sample.
sampling_fps (`int`, *optional*):
Sampling frames per second. Defaults to `self.sampling_fps`.
Used when `frame_sample_mode` is `"fps"`.
"""
frame_sample_mode = frame_sample_mode or self.frame_sample_mode
num_frames = num_frames or self.num_frames
sampling_fps = sampling_fps or self.sampling_fps
duration = metadata.duration or metadata.total_num_frames / metadata.fps
if frame_sample_mode == "fps":
candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
# Try larger and larger FPSs until we hit one that can't span the video
target_fps = candidate_target_fps[0]
for candidate_fps in candidate_target_fps[1:]:
if num_frames / candidate_fps < duration:
break
target_fps = candidate_fps
times = np.arange(0, num_frames) / target_fps
times = times[times < duration]
return times
elif frame_sample_mode == "uniform_last_frame":
if max_fps is not None:
max_duration = (num_frames - 1) / max_fps # -1 to include the last frame
if max_duration < duration:
times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
else:
times = np.arange(0.0, stop=duration, step=1 / max_fps)
times = np.concatenate([times, [duration]], axis=0)
assert len(times) <= num_frames
else:
times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
return times
else:
raise NotImplementedError(frame_sample_mode)
def sample_frames(
self,
metadata: VideoMetadata,
frame_sample_mode: str | None = None,
num_frames: int | None = None,
max_fps: int | None = None,
sampling_fps: int | None = None,
**kwargs,
) -> np.ndarray:
"""
Frame-based sampling if an array video is passed
Args:
metadata (`VideoMetadata`):
Metadata of the video containing information about total duration, fps and total number of frames.
frame_sample_mode (`str`, *optional*):
Mode to sample frames. Defaults to `self.frame_sample_mode`.
num_frames (`int`, *optional*):
Maximum number of frames to sample. Defaults to `self.num_frames`.
max_fps (`int`, *optional*):
Maximum frames per second to sample.
sampling_fps (`int`, *optional*):
Sampling frames per second. Defaults to `self.sampling_fps`.
Used when `frame_sample_mode` is `"fps"`.
"""
frame_sample_mode = frame_sample_mode or self.frame_sample_mode
num_frames = num_frames or self.num_frames
sampling_fps = sampling_fps or self.sampling_fps
total_num_frames = metadata.total_num_frames
if frame_sample_mode == "uniform_last_frame" and max_fps is not None:
duration = total_num_frames / metadata.fps
if total_num_frames <= 2:
return np.arange(total_num_frames).astype(int)
if duration > (num_frames - 1) / max_fps: # -1 to include the last frame
# uniform fallback
indices = np.linspace(
0,
total_num_frames - 1,
num=min(num_frames, total_num_frames),
endpoint=True,
).astype(int)
return indices
else:
float_indices = np.arange(
0.0,
stop=total_num_frames - 1,
step=float(metadata.fps / max_fps),
)
if np.round(float_indices[-1]) != total_num_frames - 1:
float_indices = np.concatenate([float_indices, [total_num_frames - 1]], axis=0)
indices = np.round(float_indices).astype(int)
assert indices[-1] < total_num_frames
assert len(float_indices) <= num_frames
return indices
elif frame_sample_mode == "uniform_last_frame":
indices = np.linspace(
0,
total_num_frames - 1,
num=min(num_frames, total_num_frames),
endpoint=True,
).astype(int)
return indices
elif frame_sample_mode == "fps":
candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
selected_target_fps = get_target_fps(
metadata.fps,
num_frames,
total_num_frames,
frame_sample_mode,
candidate_target_fps,
)
_, indices = get_frame_times_and_chosen_fps(
selected_target_fps,
total_num_frames,
num_frames,
metadata.fps,
)
return indices
else:
raise NotImplementedError(frame_sample_mode)
def fetch_videos(self, video_url_or_urls: str | list[str] | list[list[str]], sample_timestamps_fn=None):
"""
Convert a single or a list of urls into the corresponding `np.array` objects.
If a single url is passed, the return value will be a single object. If a list is passed a list of objects is
returned.
"""
if (not is_decord_available()) and (not is_torchcodec_available()) and (not is_av_available()):
raise ImportError(
"MolmoAct2VideoProcessor requires `decord`, `torchcodec`, or `av` to be installed."
)
if is_decord_available():
backend = "decord"
elif is_torchcodec_available():
warnings.warn(
"`decord` is not installed and cannot be used to decode the video by default. "
"Falling back to `torchcodec`."
)
backend = "torchcodec"
else:
warnings.warn(
"`decord` is not installed and cannot be used to decode the video by default. "
"Falling back to `PyAV`."
)
backend = "pyav"
if isinstance(video_url_or_urls, list):
return list(
zip(
*[
self.fetch_videos(x, sample_timestamps_fn=sample_timestamps_fn)
for x in video_url_or_urls
]
)
)
else:
return load_video(video_url_or_urls, backend=backend, sample_timestamps_fn=sample_timestamps_fn)
def _decode_and_sample_videos(
self,
videos: VideoInput,
video_metadata: VideoMetadata | dict,
do_sample_frames: bool | None = None,
sample_indices_fn: Callable | None = None,
sample_timestamps_fn: Callable | None = None,
):
"""
Decode input videos and sample frames if needed.
"""
videos = make_batched_videos(videos)
video_metadata = make_batched_metadata(videos, video_metadata=video_metadata)
# Framed-based sampling if an array video is passed
# Otherwise, time-based sampling with decoding
if is_valid_video(videos[0]) and do_sample_frames:
assert video_metadata[0].fps is not None, "FPS must be provided for video input"
sampled_videos = []
sampled_metadata = []
for video, metadata in zip(videos, video_metadata):
indices = sample_indices_fn(metadata=metadata)
metadata.frames_indices = indices
sampled_videos.append(video[indices])
sampled_metadata.append(metadata)
videos = sampled_videos
video_metadata = sampled_metadata
elif not is_valid_video(videos[0]):
if sample_indices_fn is None:
logger.warning(
"do_sample_frames is False, but video array is not provided: "
"Will decode the video and sample frames using MolmoAct2's default sampling mode"
)
if isinstance(videos[0], list):
raise ValueError("A list of images is not supported for video input!")
else:
videos, video_metadata = self.fetch_videos(videos, sample_timestamps_fn=sample_timestamps_fn)
return videos, video_metadata
def _prepare_input_videos(
self,
videos: VideoInput,
**kwargs,
) -> list[np.ndarray]:
processed_videos = [to_numpy(video) for video in videos]
return processed_videos
def preprocess(
self,
videos: VideoInput,
**kwargs: Unpack[MolmoAct2VideoProcessorKwargs],
) -> BatchFeature:
validate_kwargs(
captured_kwargs=kwargs.keys(),
valid_processor_keys=list(self.valid_kwargs.__annotations__.keys()) + ["return_tensors"],
)
# Set default kwargs from self. This ensures that if a kwarg is not provided
# by the user, it gets its default value from the instance, or is set to None.
for kwarg_name in self.valid_kwargs.__annotations__:
kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))
do_sample_frames = kwargs.pop("do_sample_frames")
video_metadata = kwargs.pop("video_metadata")
sample_indices_fn = partial(self.sample_frames, **kwargs) if do_sample_frames else None
sample_timestamps_fn = partial(self.sample_times, **kwargs)
videos, video_metadata = self._decode_and_sample_videos(
videos,
video_metadata=video_metadata,
do_sample_frames=do_sample_frames,
sample_indices_fn=sample_indices_fn,
sample_timestamps_fn=sample_timestamps_fn,
)
videos = self._prepare_input_videos(videos=videos)
kwargs = self._further_process_kwargs(**kwargs)
return_metadata = kwargs.pop("return_metadata")
preprocessed_videos = self._preprocess(videos=videos, **kwargs)
if return_metadata:
preprocessed_videos["video_metadata"] = video_metadata
return preprocessed_videos
def _preprocess(
self,
videos: list[np.ndarray],
size: SizeDict | None = None,
resample: PILImageResampling | None = None,
image_mean: float | list[float] | None = None,
image_std: float | list[float] | None = None,
do_convert_rgb: bool | None = None,
patch_size: int | None = None,
pooling_size: list[int] | None = None,
return_tensors: str | TensorType | None = None,
**kwargs,
) -> BatchFeature:
"""
Preprocess a video for the model.
Args:
videos (`VideoInput`):
Video to preprocess.
size (`SizeDict`, *optional*, defaults to `self.size`):
Size of the image after resizing.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
Resampling filter to use when resizing the image. This can be one of the enum `PILImageResampling`. Only
has an effect if `do_resize` is set to `True`.
image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
`True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
patch_size (`int`, *optional*, defaults to `self.patch_size`):
The spatial patch size of the vision encoder.
pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
The pooling size of the vision adapter.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
Returns:
A `BatchFeature` containing the following keys:
- `pixel_values_videos`: The preprocessed videos.
- `video_token_pooling`: The indices of the patches in `crops` to pool for each token in `video_tokens`.
- `video_grids`: The video grids.
"""
if size.height is None or size.width is None:
raise ValueError("size must contain 'height' and 'width' keys.")
base_image_input_size = [size.height, size.width]
resample = resample or self.resample
image_mean = image_mean or self.image_mean
image_std = image_std or self.image_std
do_convert_rgb = do_convert_rgb or self.do_convert_rgb
patch_size = patch_size or self.patch_size
pooling_size = pooling_size or self.pooling_size
image_pooling_h, image_pooling_w = pooling_size
batch_grids = []
batch_crops = []
batch_pooled_patches_idx = []
for video in videos:
all_crops = []
pooled_patches_idx = []
for frame in video:
image_grid, crops, pooled_idx = image_to_patches_and_grids(
frame,
base_image_input_size,
resample,
image_mean,
image_std,
patch_size,
image_pooling_w,
image_pooling_h,
)
offset = sum(np.prod(x.shape[:2]) for x in all_crops)
pooled_idx_with_offset = np.where(pooled_idx >= 0, pooled_idx + offset, pooled_idx)
pooled_patches_idx.append(pooled_idx_with_offset)
all_crops.append(crops)
video_grid = np.array([len(video), image_grid[0], image_grid[1]])
all_crops = np.concatenate(all_crops, 0)
pooled_patches_idx = np.concatenate(pooled_patches_idx, 0)
batch_grids.append(video_grid)
batch_crops.append(all_crops)
batch_pooled_patches_idx.append(pooled_patches_idx)
video_grids = np.stack(batch_grids, 0)
pixel_values_videos = np.concatenate(batch_crops, 0)
video_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
data = dict(
pixel_values_videos=pixel_values_videos,
video_token_pooling=video_token_pooling,
video_grids=video_grids,
)
return BatchFeature(data, tensor_type=return_tensors)
MolmoAct2VideoProcessor.register_for_auto_class()
File diff suppressed because it is too large Load Diff
File diff suppressed because it is too large Load Diff
+34 -22
View File
@@ -15,7 +15,6 @@
# limitations under the License.
import builtins
import copy
import logging
import math
from collections import deque
@@ -30,6 +29,7 @@ from lerobot.utils.import_utils import _transformers_available, require_package
# Conditional import for type checking and lazy loading
if TYPE_CHECKING or _transformers_available:
from transformers.cache_utils import DynamicCache
from transformers.models.auto import CONFIG_MAPPING
from transformers.models.gemma import modeling_gemma
@@ -41,6 +41,7 @@ if TYPE_CHECKING or _transformers_available:
)
else:
CONFIG_MAPPING = None
DynamicCache = None
modeling_gemma = None
PiGemmaForCausalLM = None
_gated_residual = None
@@ -141,6 +142,15 @@ def make_att_2d_masks(pad_masks, att_masks): # see openpi `make_att_2d_masks` (
return att_2d_masks & pad_2d_masks
def clone_past_key_values(past_key_values):
"""Clone the DynamicCache returned by prefix prefill for compiled denoising."""
return DynamicCache(
tuple(
(keys.clone(), values.clone(), sliding_window) for keys, values, sliding_window in past_key_values
)
)
def pad_vector(vector, new_dim):
"""Pad the last dimension of a vector to new_dim with zeros.
@@ -227,16 +237,13 @@ def resize_with_pad_torch( # see openpi `resize_with_pad_torch` (exact copy)
# Define the complete layer computation function for gradient checkpointing
def compute_layer_complete(
layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
):
models = [paligemma.model.language_model, gemma_expert.model]
def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_cond, layers, rotary_emb):
query_states = []
key_states = []
value_states = []
gates = []
for i, hidden_states in enumerate(inputs_embeds):
layer = models[i].layers[layer_idx]
layer = layers[i]
hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
gates.append(gate)
input_shape = hidden_states.shape[:-1]
@@ -258,15 +265,16 @@ def compute_layer_complete(
device=query_states.device,
dtype=query_states.dtype,
)
cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
cos, sin = rotary_emb(dummy_tensor, position_ids)
query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
query_states, key_states, cos, sin, unsqueeze_dim=1
)
batch_size = query_states.shape[0]
scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
paligemma_layer = layers[0]
scaling = paligemma_layer.self_attn.scaling
# Attention computation
att_output, _ = modeling_gemma.eager_attention_forward(
paligemma.model.language_model.layers[layer_idx].self_attn,
paligemma_layer.self_attn,
query_states,
key_states,
value_states,
@@ -274,13 +282,13 @@ def compute_layer_complete(
scaling,
)
# Get head_dim from the current layer, not from the model
head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
head_dim = paligemma_layer.self_attn.head_dim
att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
# Process layer outputs
outputs_embeds = []
start_pos = 0
for i, hidden_states in enumerate(inputs_embeds):
layer = models[i].layers[layer_idx]
layer = layers[i]
end_pos = start_pos + hidden_states.shape[1]
if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
@@ -488,8 +496,9 @@ class PaliGemmaWithExpertModel(
prefix_output = None
prefix_past_key_values = None
else:
models = [self.paligemma.model.language_model, self.gemma_expert.model]
num_layers = self.paligemma.config.text_config.num_hidden_layers
paligemma_layers = self.paligemma.model.language_model.layers
gemma_expert_layers = self.gemma_expert.model.layers
rotary_emb = self.paligemma.model.language_model.rotary_emb
# Check if gradient checkpointing is enabled for any of the models
use_gradient_checkpointing = (
@@ -499,36 +508,39 @@ class PaliGemmaWithExpertModel(
) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)
# Process all layers with gradient checkpointing if enabled
for layer_idx in range(num_layers):
for layers in zip(paligemma_layers, gemma_expert_layers, strict=True):
if use_gradient_checkpointing:
inputs_embeds = torch.utils.checkpoint.checkpoint(
compute_layer_complete,
layer_idx,
inputs_embeds,
attention_mask,
position_ids,
adarms_cond,
use_reentrant=False,
preserve_rng_state=False,
paligemma=self.paligemma,
gemma_expert=self.gemma_expert,
layers=layers,
rotary_emb=rotary_emb,
)
else:
inputs_embeds = compute_layer_complete(
layer_idx,
inputs_embeds,
attention_mask,
position_ids,
adarms_cond,
paligemma=self.paligemma,
gemma_expert=self.gemma_expert,
layers=layers,
rotary_emb=rotary_emb,
)
# final norm
final_norms = (
self.paligemma.model.language_model.norm,
self.gemma_expert.model.norm,
)
def compute_final_norms(inputs_embeds, adarms_cond):
outputs_embeds = []
for i, hidden_states in enumerate(inputs_embeds):
out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
out_emb, _ = layernorm_forward(final_norms[i], hidden_states, adarms_cond[i])
outputs_embeds.append(out_emb)
return outputs_embeds
@@ -907,7 +919,7 @@ class PI0Pytorch(nn.Module): # see openpi `PI0Pytorch`
full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager" # noqa: SLF001
past_key_values = copy.deepcopy(past_key_values)
past_key_values = clone_past_key_values(past_key_values)
outputs_embeds, _ = self.paligemma_with_expert.forward(
attention_mask=full_att_2d_masks_4d,
position_ids=position_ids,
+34 -22
View File
@@ -15,7 +15,6 @@
# limitations under the License.
import builtins
import copy
import logging
import math
from collections import deque
@@ -30,6 +29,7 @@ from lerobot.utils.import_utils import _transformers_available, require_package
# Conditional import for type checking and lazy loading
if TYPE_CHECKING or _transformers_available:
from transformers.cache_utils import DynamicCache
from transformers.models.auto import CONFIG_MAPPING
from transformers.models.gemma import modeling_gemma
@@ -41,6 +41,7 @@ if TYPE_CHECKING or _transformers_available:
)
else:
CONFIG_MAPPING = None
DynamicCache = None
modeling_gemma = None
PiGemmaForCausalLM = None
_gated_residual = None
@@ -138,6 +139,15 @@ def make_att_2d_masks(pad_masks, att_masks): # see openpi `make_att_2d_masks` (
return att_2d_masks & pad_2d_masks
def clone_past_key_values(past_key_values):
"""Clone the DynamicCache returned by prefix prefill for compiled denoising."""
return DynamicCache(
tuple(
(keys.clone(), values.clone(), sliding_window) for keys, values, sliding_window in past_key_values
)
)
def pad_vector(vector, new_dim):
"""Pad the last dimension of a vector to new_dim with zeros.
@@ -224,16 +234,13 @@ def resize_with_pad_torch( # see openpi `resize_with_pad_torch` (exact copy)
# Define the complete layer computation function for gradient checkpointing
def compute_layer_complete(
layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
):
models = [paligemma.model.language_model, gemma_expert.model]
def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_cond, layers, rotary_emb):
query_states = []
key_states = []
value_states = []
gates = []
for i, hidden_states in enumerate(inputs_embeds):
layer = models[i].layers[layer_idx]
layer = layers[i]
hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
gates.append(gate)
input_shape = hidden_states.shape[:-1]
@@ -255,15 +262,16 @@ def compute_layer_complete(
device=query_states.device,
dtype=query_states.dtype,
)
cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
cos, sin = rotary_emb(dummy_tensor, position_ids)
query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
query_states, key_states, cos, sin, unsqueeze_dim=1
)
batch_size = query_states.shape[0]
scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
paligemma_layer = layers[0]
scaling = paligemma_layer.self_attn.scaling
# Attention computation
att_output, _ = modeling_gemma.eager_attention_forward(
paligemma.model.language_model.layers[layer_idx].self_attn,
paligemma_layer.self_attn,
query_states,
key_states,
value_states,
@@ -271,13 +279,13 @@ def compute_layer_complete(
scaling,
)
# Get head_dim from the current layer, not from the model
head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
head_dim = paligemma_layer.self_attn.head_dim
att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
# Process layer outputs
outputs_embeds = []
start_pos = 0
for i, hidden_states in enumerate(inputs_embeds):
layer = models[i].layers[layer_idx]
layer = layers[i]
end_pos = start_pos + hidden_states.shape[1]
if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
@@ -485,8 +493,9 @@ class PaliGemmaWithExpertModel(
prefix_output = None
prefix_past_key_values = None
else:
models = [self.paligemma.model.language_model, self.gemma_expert.model]
num_layers = self.paligemma.config.text_config.num_hidden_layers
paligemma_layers = self.paligemma.model.language_model.layers
gemma_expert_layers = self.gemma_expert.model.layers
rotary_emb = self.paligemma.model.language_model.rotary_emb
# Check if gradient checkpointing is enabled for any of the models
use_gradient_checkpointing = (
@@ -496,36 +505,39 @@ class PaliGemmaWithExpertModel(
) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)
# Process all layers with gradient checkpointing if enabled
for layer_idx in range(num_layers):
for layers in zip(paligemma_layers, gemma_expert_layers, strict=True):
if use_gradient_checkpointing:
inputs_embeds = torch.utils.checkpoint.checkpoint(
compute_layer_complete,
layer_idx,
inputs_embeds,
attention_mask,
position_ids,
adarms_cond,
use_reentrant=False,
preserve_rng_state=False,
paligemma=self.paligemma,
gemma_expert=self.gemma_expert,
layers=layers,
rotary_emb=rotary_emb,
)
else:
inputs_embeds = compute_layer_complete(
layer_idx,
inputs_embeds,
attention_mask,
position_ids,
adarms_cond,
paligemma=self.paligemma,
gemma_expert=self.gemma_expert,
layers=layers,
rotary_emb=rotary_emb,
)
# final norm
final_norms = (
self.paligemma.model.language_model.norm,
self.gemma_expert.model.norm,
)
def compute_final_norms(inputs_embeds, adarms_cond):
outputs_embeds = []
for i, hidden_states in enumerate(inputs_embeds):
out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
out_emb, _ = layernorm_forward(final_norms[i], hidden_states, adarms_cond[i])
outputs_embeds.append(out_emb)
return outputs_embeds
@@ -880,7 +892,7 @@ class PI05Pytorch(nn.Module): # see openpi `PI0Pytorch`
full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager" # noqa: SLF001
past_key_values = copy.deepcopy(past_key_values)
past_key_values = clone_past_key_values(past_key_values)
outputs_embeds, _ = self.paligemma_with_expert.forward(
attention_mask=full_att_2d_masks_4d,
position_ids=position_ids,
+42
View File
@@ -0,0 +1,42 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""π0.5 v2 — full reproduction of the π0.5 paper's hierarchical
inference recipe on lerobot.
Extends :class:`lerobot.policies.pi05.PI05Policy` with:
* recipe-driven training (PR 1's :class:`RenderMessagesStep`),
* PaliGemma ``lm_head`` cross-entropy on supervised subtask spans
(the "high-level subtask prediction" of the paper, §IV.D),
* AR text generation at inference (:meth:`PI052Policy.select_message`),
* per-component prompt dropout (Pi 0.7 §V.E) for regularising the
text head against missing context at inference.
See ``src/lerobot/configs/recipes/subtasks_vqa.yaml`` for the
canonical training recipe and
``examples/training/pi052_hirobot.slurm`` for the launcher.
"""
from .configuration_pi052 import PI052Config
from .modeling_pi052 import PI052Policy
from .processor_pi052 import make_pi052_pre_post_processors
from .text_processor_pi052 import PI052TextTokenizerStep
__all__ = [
"PI052Config",
"PI052Policy",
"PI052TextTokenizerStep",
"make_pi052_pre_post_processors",
]
@@ -0,0 +1,235 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""π0.5 v2 (with text head) — reproduction of the π0.5 paper's
hierarchical inference recipe.
Same architecture as the existing ``PI05Policy`` (PaliGemma 2B VLM +
~300M Gemma action expert, joint training with FAST tokens during
pre-train and flow matching during post-train), but with the
PaliGemma ``lm_head`` re-enabled so the same model can be supervised
to predict both:
* **subtask strings** at the high level (cross-entropy on the LM
head), and
* **action chunks** at the low level (flow matching on the
action-expert tokens).
This is the dual-head co-training pattern from the paper:
L = H(x, f_θ_text) + α * ω - a - f_θ_action(a_τ, o, )²
with α = 10.0 per § IV.D of arxiv:2504.16054. The π0.5 model splits
inference into a text-prediction step followed by an action-prediction
step, which the multi-rate ``PI052Runtime`` (in
``lerobot.policies.pi052.inference``) drives at separate rates.
"""
from dataclasses import dataclass
from lerobot.configs import PreTrainedConfig
from lerobot.optim.optimizers import AdamWConfig
from ..pi05.configuration_pi05 import PI05Config
@PreTrainedConfig.register_subclass("pi052")
@dataclass
class PI052Config(PI05Config):
"""π0.5 with the PaliGemma LM head re-enabled for subtask prediction.
Recipe-driven dual-head training: the flow head supervises actions,
the LM head supervises subtask / plan / memory / VQA text. The
flow:text loss split is the milder 5:1 (see ``flow_loss_weight``).
"""
# Recipe / language stack ---------------------------------------------
recipe_path: str | None = "recipes/subtasks_vqa.yaml"
"""Path (absolute or relative to ``src/lerobot/configs/``) to a
``TrainingRecipe`` YAML. Defaults to the canonical Hi-Robot blend
shipped alongside this policy. Set to ``None`` to disable recipe
rendering and fall back to π0.5's single-task ``Task: ... Action:``
prompt path (unannotated datasets keep working that way)."""
apply_chat_template: bool = False
"""PaliGemma is *not* chat-pretrained — its tokenizer doesn't ship a
chat template, so we don't apply one. The recipe renderer's output
is concatenated as a plain prefix + assistant suffix instead,
mirroring how the π0.5 paper's high-level inference samples text
auto-regressively after the prefix."""
# Loss weights --------------------------------------------------------
# Paper §IV.D uses α=10 between the flow and text terms, assuming
# text is a rare auxiliary task. With the recipe stack the flow-only
# `low_level` branch fires on a large share of samples, so α=10
# swamps the LM head and collapses generation into degenerate
# repetition. We use the milder 5:1 split here.
text_loss_weight: float = 1.0
"""Weight on the LM-head cross-entropy term. Set to ``0`` to disable
text training entirely (reverts to flow-only / π0.5 behaviour)."""
flow_loss_weight: float = 5.0
"""Weight on the action-expert flow-matching term. ``5.0`` — a milder
flow:text split than the paper's α=10, since the flow-only
``low_level`` recipe already gives the action expert frequent
gradient. Lower it further if the LM head still underfits."""
# Backbone training ---------------------------------------------------
unfreeze_lm_head: bool = True
"""Whether to keep the PaliGemma ``lm_head`` unfrozen for fine-tuning.
The existing ``PI05Policy`` zeroes / freezes the head on load
because it never reads from it. Must be ``True`` for π0.5-style
hierarchical inference."""
# Per-component prompt dropout (Pi0.7 §V.E) ---------------------------
# Randomly drop non-target context messages so the LM head learns
# to handle missing /
# stale plan / memory at inference. Defaults to 0.0 so behaviour
# is identical until explicitly enabled.
plan_dropout_prob: float = 0.0
memory_dropout_prob: float = 0.0
subtask_dropout_prob: float = 0.0
# FAST discrete-action supervision — paper §III.B-C ------------------
# When enabled, actions are *also* tokenised via the FAST tokenizer
# ("physical-intelligence/fast") and supervised with cross-entropy
# on the PaliGemma LM head — exactly as in the paper's pre-training
# objective (Eq. 1 mixes FAST CE + flow MSE + subtask CE). The
# ActionTokenizerProcessorStep is wired into the preprocessor
# pipeline when this flag is set; the loss is computed in
# PI052Policy.forward.
enable_fast_action_loss: bool = True
"""If True, tokenise actions with the FAST tokenizer and add a
cross-entropy loss on the LM head. On by default to match the
π0.5 paper's three-loss objective (text CE + FAST CE + flow MSE,
§III.B-C Eq. 1). Set to False if you only want the
post-training-style flow + text recipe."""
action_tokenizer_name: str = "physical-intelligence/fast"
"""HF identifier for the FAST action tokenizer."""
max_action_tokens: int = 256
"""Maximum number of FAST tokens per action chunk."""
fast_skip_tokens: int = 128
"""Number of low-vocab tokens the FAST tokenizer skips to avoid
collisions with PaliGemma's text vocabulary."""
fast_action_loss_weight: float = 1.0
"""Weight on the FAST-action-token CE loss. Paper §III.C uses 1.0."""
auto_fit_fast_tokenizer: bool = False
"""If True, the processor factory checks ``fast_tokenizer_cache_dir``
for a previously-fitted tokenizer keyed on ``(dataset_repo_id,
base_tokenizer_name, fit_samples)``. On cache miss, it loads
``action_tokenizer_name`` as a base, samples
``fast_tokenizer_fit_samples`` action chunks from the dataset, runs
``.fit()``, saves the result, and uses *that* fitted path as the
actual tokenizer. Pertsch et al. 2025 (FAST paper [64], π0.5 §III.C)
explicitly recommend per-dataset fitting for best compression.
Off by default because the fit requires a separate pre-training
pass over the dataset (~1-2 min on a medium dataset) and depends
on the FAST tokenizer snapshot having a ``.fit()`` method. Opt in
when you want paper-faithful compression; leave off to fall back
on the universal ``physical-intelligence/fast`` codebook."""
fast_tokenizer_cache_dir: str = "~/.cache/lerobot/fast_tokenizers"
"""Where fitted FAST tokenizers are stored. ``~`` expands."""
fast_tokenizer_fit_samples: int = 1024
"""Number of action chunks to sample for the fit. The FAST paper uses
a few thousand; 1024 is a reasonable default for medium datasets."""
# Knowledge insulation — paper §III.B --------------------------------
# When enabled, gradients from the action expert's flow loss are
# blocked from flowing back into the VLM's K/V projections. This
# prevents the action loss from over-fitting the language backbone
# to robot-specific features. Implemented in ``modeling_pi052`` as
# a per-instance monkey-patch on ``paligemma_with_expert.forward``
# that splits queries into VLM and action halves and ``.detach()``-s
# the VLM K/V tensors used in the action-half's attention.
knowledge_insulation: bool = False
"""If True, route every transformer layer through the KI
attention path that blocks actionVLM gradient flow on K/V."""
# Learning-rate defaults --------------------------------------------
# pi052 inherits π0.5's openpi-validated optimizer config (peak LR
# 2.5e-5, cosine→2.5e-6, 1k warmup, AdamW (0.9, 0.95), wd=0.01,
# grad_clip=1.0). The only place pi052 needs to diverge from pi05
# is the LM-head LR multiplier: pi05 has no text supervision so the
# head doesn't get gradients; pi052 always has text supervision
# (subtask / memory / VQA) via the recipe, and under KI the LM head
# only sees gradients on ~3045% of the batch (the text-CE mask
# share of the recipe). Under aggressive cosine decay this is too
# weak to keep the head pinned, so it drifts back toward PaliGemma's
# pretrained ``<loc>`` first-token bias. 5x is the documented fix
# (see ``PI05Config.lm_head_lr_scale`` docstring); the wiring is
# already in ``PI05Policy.get_optim_params`` — it splits the LM head
# + tied ``embed_tokens`` into their own param group while sharing
# the same cosine lambda, so the 5x ratio is preserved across decay.
lm_head_lr_scale: float = 5.0
# PaLM-style z-loss on text CE. Penalises the log-partition function
# ``z = log Σ exp(logits)`` drifting away from zero — without it, large-
# vocab models (PaliGemma is 257k) can let ``logsumexp`` grow unbounded
# while CE stays low, because a uniform additive logit bias cancels in
# softmax. PaLM appendix B / Chinchilla report z-loss is essential for
# stable large-vocab CE; it especially helps under ``lm_head_lr_scale=
# 5.0`` which amplifies drift risk on the LM head. ``1e-4`` is the
# commonly cited weight; set 0 to disable entirely.
text_ce_z_loss_weight: float = 1e-4
# Liger Triton kernels (rope + geglu + layer_norm) are now patched
# unconditionally at model build time — see ``_enable_hf_kernels``
# in ``modeling_pi052``. The patch is process-global, idempotent
# and degrades gracefully if ``liger-kernel`` is missing. Measured
# at -4.5% step time on H100 (bench job 22161421); peak memory
# unchanged. ``fused_linear_cross_entropy`` ships separately via
# ``_shifted_lin_ce`` / ``_fast_lin_ce``.
use_hf_kernels: bool = True
"""Deprecated. Liger HF kernels are patched unconditionally by
``_enable_hf_kernels`` this field is retained as a no-op for
backward compatibility with checkpoints saved before commit
d70c8104 (which still serialize ``use_hf_kernels: true`` into
``config.json``). Loading those configs would otherwise raise
``DecodingError: The fields use_hf_kernels are not valid for
PI052Config`` (job 22164492). Remove in a future major bump."""
# Optimizer foreach/fused. pi052 carries these locally because the shared
# PI05Config (kept identical to upstream main) does not define them; the
# checkpoints we train serialize both keys into config.json, so they must
# be valid PI052Config fields and flow into the AdamW preset below.
optimizer_foreach: bool | None = False
optimizer_fused: bool | None = True
def get_optimizer_preset(self) -> AdamWConfig:
return AdamWConfig(
lr=self.optimizer_lr,
betas=self.optimizer_betas,
eps=self.optimizer_eps,
weight_decay=self.optimizer_weight_decay,
grad_clip_norm=self.optimizer_grad_clip_norm,
foreach=self.optimizer_foreach,
fused=self.optimizer_fused,
)
def __post_init__(self) -> None:
super().__post_init__()
# Backbone needs gradients flowing through the text head when
# we're training it. Override the π0.5 default
# (``train_expert_only=True``) unless the user explicitly opts
# out of text training via ``text_loss_weight=0``.
if self.text_loss_weight > 0 and self.unfreeze_lm_head:
self.train_expert_only = False
@@ -0,0 +1,304 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dataset-specific FAST action tokenizer fitting.
The published ``physical-intelligence/fast`` tokenizer is a *universal*
codebook fitted on a heterogeneous mix of robot datasets. Per Pertsch
et al. 2025 (the FAST paper, [64] in the π0.5 paper) and §III.C of
π0.5 itself, the recommended practice is to **finetune the tokenizer on
your specific dataset's action distribution** before training the
policy same way one would adapt a language tokenizer to a domain
corpus. Without this finetune step, action sequences from your robot
may require more tokens per chunk than necessary, lowering effective
compression and slowing convergence of the action-CE loss.
This module provides a single utility, :func:`fit_fast_tokenizer`,
that does the finetune. The training entry point invokes it
automatically when the policy's ``enable_fast_action_loss`` and
``auto_fit_fast_tokenizer`` flags are both ``True`` and no cached
fitted tokenizer is found at ``fast_tokenizer_cache_dir``.
The fitted tokenizer is saved to
``{cache_dir}/{dataset_hash}_{base_hash}/`` so successive training
runs over the same dataset re-use it.
"""
from __future__ import annotations
import hashlib
import logging
import os
import time
from pathlib import Path
import numpy as np
logger = logging.getLogger(__name__)
# Marker file the cache-hit check looks for. ``ProcessorMixin.save_pretrained``
# writes ``processor_config.json`` (NOT ``preprocessor_config.json`` —
# that's the image / feature-extractor convention). Centralised here so
# the cache-hit check and the rank-N readiness wait agree on the same
# sentinel.
_CACHE_SENTINEL = "processor_config.json"
def _dataset_signature(
dataset_repo_id: str,
base_tokenizer_name: str,
n_samples: int,
chunk_size: int,
) -> str:
"""Deterministic short hash for naming the cache directory.
Keys on (dataset, base tokenizer, sample count, chunk size) so any
of those changing re-runs the fit. ``chunk_size`` matters because
the tokenizer is fit on chunks of that length.
"""
h = hashlib.sha256()
h.update(dataset_repo_id.encode("utf-8"))
h.update(b"\0")
h.update(base_tokenizer_name.encode("utf-8"))
h.update(b"\0")
h.update(str(n_samples).encode("utf-8"))
h.update(b"\0")
h.update(str(chunk_size).encode("utf-8"))
return h.hexdigest()[:16]
def fit_fast_tokenizer(
*,
dataset_repo_id: str,
cache_dir: str | Path,
base_tokenizer_name: str = "physical-intelligence/fast",
n_samples: int = 1024,
chunk_size: int = 50,
seed: int = 42,
) -> str:
"""Fit a FAST tokenizer on a LeRobot dataset's action distribution.
Args:
dataset_repo_id: HF Hub repo id of the LeRobotDataset to fit on.
cache_dir: Directory under which to save (and look up) fitted
tokenizers. The actual save path is
``{cache_dir}/{signature}``.
base_tokenizer_name: HF identifier for the base FAST tokenizer
to finetune from. ``physical-intelligence/fast`` is the
universal one.
n_samples: Number of action chunks to sample for the fit. The
FAST paper uses a few thousand; ``1024`` is a good default
for medium datasets.
chunk_size: Length of each action chunk (matches
``policy.chunk_size``). The FAST tokenizer is fit on
sequences of this length.
seed: RNG seed for sample selection.
Returns:
The local path to the fitted tokenizer. Passed directly to
``--policy.action_tokenizer_name`` for the training run.
Raises:
ImportError: If the ``transformers`` library doesn't expose
``AutoProcessor`` or the FAST tokenizer doesn't have a
``.fit()`` method (then you're on an older FAST snapshot —
update to the current published model).
FileNotFoundError: If the dataset can't be loaded.
"""
cache_dir = Path(cache_dir)
sig = _dataset_signature(dataset_repo_id, base_tokenizer_name, n_samples, chunk_size)
out_dir = cache_dir / sig
if out_dir.exists() and (out_dir / _CACHE_SENTINEL).exists():
logger.info(
"FAST tokenizer cache hit: %s — re-using fitted tokenizer for "
"dataset=%s base=%s n_samples=%d",
out_dir, dataset_repo_id, base_tokenizer_name, n_samples,
)
return str(out_dir)
# DDP-safe fit: only the (local) main process actually fits + saves;
# other ranks poll the cache sentinel until the leader is done.
# Without this guard, all N ranks fit concurrently and race on
# ``save_pretrained`` + ``AutoProcessor.from_pretrained`` (the latter
# copies ``processing_action_tokenizer.py`` into ``HF_MODULES_CACHE``
# and compiles a ``.pyc`` — concurrent writers occasionally produce
# a stale / partial ``.pyc`` and the subsequent ``from .. import
# UniversalActionProcessor`` raises ``AttributeError``.
is_leader = (
int(os.environ.get("RANK", "0")) == 0
and int(os.environ.get("LOCAL_RANK", "0")) == 0
)
if not is_leader:
timeout_s = 1800.0 # 30 min — covers ~1024-sample fits on cold caches
start = time.monotonic()
while not (out_dir / _CACHE_SENTINEL).exists():
if time.monotonic() - start > timeout_s:
raise RuntimeError(
f"FAST tokenizer fit: non-leader rank timed out after "
f"{timeout_s:.0f}s waiting for {out_dir / _CACHE_SENTINEL}. "
"Leader rank likely crashed during the fit."
)
time.sleep(2.0)
logger.info("FAST tokenizer ready (leader populated cache): %s", out_dir)
return str(out_dir)
logger.info(
"FAST tokenizer cache miss — fitting on dataset=%s "
"base=%s n_samples=%d chunk_size=%d%s",
dataset_repo_id, base_tokenizer_name, n_samples, chunk_size, out_dir,
)
from transformers import AutoProcessor # noqa: PLC0415
from lerobot.datasets.lerobot_dataset import LeRobotDataset # noqa: PLC0415
# Stream a single episode's worth of action chunks at a time so
# we don't blow memory on huge datasets. Random episode +
# random start offset gives a reasonable spread.
#
# Actions are read straight from the underlying HF dataset's
# ``action`` *column* — never via ``ds[i]``. ``ds[i]`` builds a full
# training item (delta-timestamp expansion + video decode + image
# transforms); a single bad video frame would then throw and, since
# the failure was swallowed at debug level, silently starve the fit
# of every chunk. The action column carries no video, so reading it
# directly is both faster and immune to decode errors.
rng = np.random.default_rng(seed)
actions_buf: list[np.ndarray] = []
# Resolve the dataset's data parquet shards directly, sidestepping
# ``LeRobotDataset(repo_id, episodes=[N])`` which on v3-format
# datasets routes through HF datasets'' split lookup and raises
# ``ValueError: Instruction "train" corresponds to no data!`` for
# every episode (job 22182985 looped through 13,293 skipped episodes
# for ~2.5 h before NCCL killed it). Reading the ``action`` column
# straight from the parquet shards is also faster: each per-episode
# ``LeRobotDataset`` instantiation re-parses every meta file.
from huggingface_hub import snapshot_download # noqa: PLC0415
import pyarrow as _pa # noqa: PLC0415
import pyarrow.parquet as _pq # noqa: PLC0415
snap = Path(snapshot_download(repo_id=dataset_repo_id, repo_type="dataset"))
data_files = sorted((snap / "data").glob("chunk-*/file-*.parquet"))
if not data_files:
raise RuntimeError(
f"FAST fit: no ``data/chunk-*/file-*.parquet`` shards found under {snap!s}."
)
# Read just the (episode_index, action) columns once across all
# shards. This is the same pattern used elsewhere in the codebase
# for whole-dataset audits and stays under ~2 GB even on 32 k-episode
# / 29 M-frame datasets because the action column is a fixed-length
# float vector.
tables = [_pq.read_table(f, columns=["episode_index", "action"]) for f in data_files]
table = _pa.concat_tables(tables)
eps = table["episode_index"].to_numpy()
acts_col = table["action"]
# ``action`` may be a fixed-shape ListArray or a 2-D NumericArray;
# ``to_numpy(zero_copy_only=False)`` produces an object array of
# 1-D NumPy actions either way, which we stack into (N, D).
try:
acts = np.stack(acts_col.to_numpy(zero_copy_only=False)).astype(np.float32)
except Exception: # noqa: BLE001
# Fallback path for nested-list types: flatten via to_pylist().
acts = np.asarray(acts_col.to_pylist(), dtype=np.float32)
if acts.ndim != 2:
raise RuntimeError(
f"FAST fit: expected ``action`` rows to be 1-D vectors; got shape {acts.shape}."
)
# Episode index → slice (start, stop) into ``acts`` along axis 0.
# ``eps`` is monotonically increasing within each parquet shard but
# we make no assumption across shards — sort once and group.
order = np.argsort(eps, kind="stable")
eps_sorted = eps[order]
boundaries = np.searchsorted(eps_sorted, np.arange(int(eps_sorted.max()) + 2))
ep_to_slice: dict[int, tuple[int, int]] = {
int(ep): (int(boundaries[ep]), int(boundaries[ep + 1]))
for ep in range(len(boundaries) - 1)
if boundaries[ep] < boundaries[ep + 1]
}
num_episodes = len(ep_to_slice)
# ``acts`` is in original (un-sorted-by-episode) row order; reorder
# so per-episode slices are contiguous.
acts = acts[order]
samples_per_episode = max(1, n_samples // max(num_episodes, 1))
collected = 0
eps_visited = 0
short_episodes = 0
ep_indices = list(ep_to_slice.keys())
for ep_idx in rng.permutation(ep_indices):
if collected >= n_samples:
break
start, stop = ep_to_slice[int(ep_idx)]
ep_actions = acts[start:stop]
if ep_actions.shape[0] < chunk_size:
short_episodes += 1
continue
starts = rng.integers(0, ep_actions.shape[0] - chunk_size + 1, size=samples_per_episode)
for s in starts:
actions_buf.append(ep_actions[int(s) : int(s) + chunk_size])
collected += 1
if collected >= n_samples:
break
eps_visited += 1
if not actions_buf:
raise RuntimeError(
f"FAST fit collected zero action chunks from {dataset_repo_id!r}: "
f"all {num_episodes} episodes were shorter than chunk_size="
f"{chunk_size} ({short_episodes} too short) or had an unreadable "
"``action`` column. Lower ``chunk_size`` to match your episode "
"lengths."
)
actions = np.stack(actions_buf, axis=0).astype(np.float32) # (N, H, D)
logger.info(
"FAST fit: collected %d chunks of shape %s from %d episodes",
actions.shape[0], actions.shape[1:], eps_visited,
)
# Quantile-normalise per dimension before fitting.
#
# The FAST tokenizer DCT-transforms actions, scales by ``scale`` and
# rounds to integer tokens; the integer *range* must fit the
# codebook (vocab_size, default 1024). Raw motor units (e.g. encoder
# ticks) blow that range up — hence "Vocab size 1024 is too small".
# More importantly, at training time ``ActionTokenizerProcessorStep``
# runs *after* the QUANTILES ``NormalizerProcessorStep``, so it
# encodes normalised actions. Fitting on raw actions would mismatch
# that space. We replicate QUANTILES normalisation here (per-dim
# [q01, q99] → [-1, 1], clipped) so the fit and the training-time
# encode see the same distribution.
flat = actions.reshape(-1, actions.shape[-1])
q01 = np.quantile(flat, 0.01, axis=0)
q99 = np.quantile(flat, 0.99, axis=0)
span = np.where((q99 - q01) > 1e-6, q99 - q01, 1.0)
actions = np.clip((actions - q01) / span * 2.0 - 1.0, -1.0, 1.0).astype(np.float32)
base = AutoProcessor.from_pretrained(base_tokenizer_name, trust_remote_code=True)
if not hasattr(base, "fit"):
raise ImportError(
f"Base FAST tokenizer {base_tokenizer_name!r} has no ``.fit()`` "
"method — your transformers / model snapshot is too old. Update "
"to the current ``physical-intelligence/fast`` revision."
)
fitted = base.fit(actions)
out_dir.mkdir(parents=True, exist_ok=True)
fitted.save_pretrained(str(out_dir))
logger.info("FAST fit: saved fitted tokenizer to %s", out_dir)
return str(out_dir)
@@ -0,0 +1,73 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PI052 inference / runtime orchestration.
Multi-rate runtime that mirrors the recipe-time training shape:
low_level_execution LowLevelForward + DispatchAction (high Hz)
high_level_subtask HighLevelSubtaskFwd (~1 Hz)
memory_update MemoryUpdateFwd (event: subtask_change)
user_interjection_response UserInterjectionFwd (event: stdin)
ask_vqa_* AskVQAFwd (event: stdin question)
speech tool calls DispatchToolCalls (event: tool_call_pending)
The CLI ``lerobot-pi052-runtime`` builds a ``PI052Runtime`` and calls
``run()``.
"""
from .repl import StdinReader
from .runtime import PI052Runtime
from .runtime_state import initial_runtime_state, push_log, set_if_changed, take_event
from .steps import (
AskVQAFwd,
DispatchAction,
DispatchToolCalls,
HighLevelSubtaskFwd,
InferenceStep,
LowLevelForward,
MemoryUpdateFwd,
UserInterjectionFwd,
)
from .triggers import EventTrigger, HzTrigger, Tick, TickClock, Trigger
from .ui import make_state_panel, print_robot_lines, print_user_line
__all__ = [
# runtime
"PI052Runtime",
"StdinReader",
# state helpers
"initial_runtime_state",
"push_log",
"set_if_changed",
"take_event",
# triggers
"Trigger",
"Tick",
"TickClock",
"HzTrigger",
"EventTrigger",
# steps
"InferenceStep",
"LowLevelForward",
"DispatchAction",
"HighLevelSubtaskFwd",
"MemoryUpdateFwd",
"UserInterjectionFwd",
"AskVQAFwd",
"DispatchToolCalls",
# UI
"make_state_panel",
"print_robot_lines",
"print_user_line",
]
@@ -0,0 +1,105 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Stdin REPL event collector for the PI052 runtime.
Reads non-blocking stdin lines, classifies each one heuristically:
"stop" / "quit" / "exit" state["stop"] = True
"/action" / "/pause" set state["mode"]
ends with "?" user_vqa_query event
starts with "task:" or first line set runtime task
anything else user_interjection event
Plugged into the runtime via ``event_collector=StdinReader().poll``.
Note: the shipped CLI (``lerobot-pi052-runtime``) drives stdin
directly in its REPL / autonomous loops and does *not* wire this
collector; it's kept as the documented embedding hook and for tests.
"""
from __future__ import annotations
import select
import sys
from dataclasses import dataclass, field
from typing import Any
@dataclass
class StdinReader:
"""Non-blocking stdin line collector for the runtime loop."""
prompt: str = "> "
_seen_first_line: bool = field(default=False, init=False)
_prompted: bool = field(default=False, init=False)
def poll(self, state: dict[str, Any]) -> None:
"""Drain pending stdin lines into runtime events."""
# Print the input prompt once on every fresh tick if we don't
# already have a pending line; matches the expected REPL feel.
if not self._prompted:
print(self.prompt, end="", flush=True)
self._prompted = True
# ``select`` with timeout=0 makes this non-blocking. Only works
# for actual TTY / pipe stdins; CI / scripted runs hit EOF.
try:
ready, _, _ = select.select([sys.stdin], [], [], 0)
except (ValueError, OSError):
return
if not ready:
return
line = sys.stdin.readline()
if not line: # EOF
state["stop"] = True
return
line = line.strip()
self._prompted = False # we'll re-prompt next tick
if not line:
return
lower = line.lower()
if lower in {"stop", "quit", "exit"}:
state["stop"] = True
return
# Slash commands flip the run mode. ``/pause`` stops the action
# loop (the action steps gate on ``state["mode"]``); ``/action``
# resumes it.
if lower.split(" ", 1)[0] in {"/action", "/act", "/run"}:
state["mode"] = "action"
return
if lower in {"/pause", "/p"}:
state["mode"] = "paused"
queue = state.get("action_queue")
if hasattr(queue, "clear"):
queue.clear()
return
# First non-control line sets the task if no task is active.
if not state.get("task"):
task = line[5:].strip() if lower.startswith("task:") else line
state["task"] = task
print(f"[pi052] Task: {task}", flush=True)
self._seen_first_line = True
return
# Question → VQA; statement → interjection.
if lower.endswith("?"):
state["recent_vqa_query"] = line
state.setdefault("events_this_tick", []).append("user_vqa_query")
else:
state["recent_interjection"] = line
state.setdefault("events_this_tick", []).append("user_interjection")
@@ -0,0 +1,205 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PI052 runtime loop.
Threads the multi-rate inference pipeline together with a stdin REPL
event collector, drives ticks through :class:`TickClock`, and prints
state-change updates to the user.
"""
from __future__ import annotations
import logging
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable
from .runtime_state import initial_runtime_state, push_log
from .steps import (
AskVQAFwd,
DispatchAction,
DispatchToolCalls,
HighLevelSubtaskFwd,
InferenceStep,
LowLevelForward,
MemoryUpdateFwd,
)
from .triggers import EventTrigger, HzTrigger, TickClock
logger = logging.getLogger(__name__)
@dataclass
class PI052Runtime:
"""Compose the inference pipeline and drive it tick-by-tick."""
policy: Any
tools: dict[str, Any] = field(default_factory=dict)
"""Name → tool-instance dict, e.g. ``{"say": SayTool(...)}``. Read
from :func:`lerobot.tools.get_tools(meta)` when wiring the
runtime."""
observation_provider: Callable[[], dict | None] | None = None
"""Closure returning the current preprocessed observation batch.
``None`` for dry-run / language-only sessions."""
robot_executor: Callable[[Any], None] | None = None
"""Closure that takes one action chunk and forwards it to the
robot. ``None`` for dry-run."""
event_collector: Callable[[dict], None] | None = None
"""Per-tick hook that polls external sources (stdin, network) and
appends event names to ``state["events_this_tick"]``."""
chunk_hz: float = 4.0
ctrl_hz: float = 50.0
high_level_hz: float = 1.0
max_rate_hz: float = 50.0
pipeline: list[InferenceStep] = field(init=False)
state: dict[str, Any] = field(init=False)
_stop: bool = field(default=False, init=False)
def __post_init__(self) -> None:
# Subtask + memory + VQA configuration. Pipeline:
#
# HighLevelSubtaskFwd → generate the next subtask via the LM
# head at ~``high_level_hz``; writes
# ``current_subtask`` and emits
# ``subtask_change`` on a transition.
# MemoryUpdateFwd → on ``subtask_change``, refresh
# ``current_memory`` from the
# ``memory_update`` head.
# AskVQAFwd → answer camera-grounded stdin questions.
# LowLevelForward → action chunk conditioned on the
# generated ``current_subtask``.
# DispatchAction → drain the chunk to the robot.
# DispatchToolCalls → fire any pending tool calls.
#
# Order matters: ``HighLevelSubtaskFwd`` must run before
# ``MemoryUpdateFwd`` so the event is visible the same tick, and
# both must run before ``LowLevelForward`` (which is gated on
# "action queue empty") so the chunk consumes the freshest
# subtask. ``UserInterjectionFwd`` is still importable but
# disabled until plan generation is wired in.
self.pipeline = [
HighLevelSubtaskFwd(
trigger=HzTrigger(self.high_level_hz),
policy=self.policy,
observation_provider=self.observation_provider,
),
# Listens for the ``subtask_change`` event raised by
# ``HighLevelSubtaskFwd`` and refreshes ``current_memory``.
MemoryUpdateFwd(
trigger=EventTrigger("subtask_change"),
policy=self.policy,
observation_provider=self.observation_provider,
),
AskVQAFwd(
policy=self.policy,
observation_provider=self.observation_provider,
),
LowLevelForward(
trigger=HzTrigger(self.chunk_hz),
policy=self.policy,
observation_provider=self.observation_provider,
),
DispatchAction(
trigger=HzTrigger(self.ctrl_hz),
robot_executor=self.robot_executor,
),
DispatchToolCalls(tools=self.tools),
]
self.state = initial_runtime_state()
# ------------------------------------------------------------------
# Lifecycle
# ------------------------------------------------------------------
def set_task(self, task: str) -> None:
"""Set or replace the active task. Logged for the REPL."""
self.state["task"] = task
push_log(self.state, f"Task: {task}")
def stop(self) -> None:
self._stop = True
def run(self, *, max_ticks: int | None = None) -> None:
"""Main loop. Returns when ``stop()`` is called or after
``max_ticks`` ticks (useful for tests / dry-run)."""
clock = TickClock(max_rate_hz=self.max_rate_hz)
while not self._stop:
tick = clock.advance()
self.state["_tick"] = tick
self.state["events_this_tick"] = []
self.state["log_lines"] = []
if self.event_collector is not None:
self.event_collector(self.state)
if self.state.get("stop"):
self._stop = True
break
for step in self.pipeline:
self.state = step(self.state)
self._flush_logs()
if max_ticks is not None and tick.index >= max_ticks:
break
self._on_shutdown()
# ------------------------------------------------------------------
# REPL helper: drive one full pipeline pass and return its logs
# ------------------------------------------------------------------
def step_once(self) -> list[str]:
"""Run one tick of the pipeline and return the log lines.
Used by the interactive REPL: instead of a background thread,
the CLI drives ticks synchronously after each user input. Logs
are returned (not printed) so the caller can route them into
the rich-Live chat scrollback.
"""
from .triggers import Tick # noqa: PLC0415
# Synthesize a tick. We don't need the real wall-clock pacing
# here — the REPL drives the runtime, not vice versa — but
# ``HzTrigger`` uses ``tick.monotonic_seconds`` to gate, so we
# bump it generously so every Hz-triggered step considers
# itself due.
import time as _time # noqa: PLC0415
prev_index = self.state.get("_tick").index if isinstance(self.state.get("_tick"), Tick) else 0
self.state["_tick"] = Tick(index=prev_index + 1, monotonic_seconds=_time.monotonic())
self.state["log_lines"] = []
# ``events_this_tick`` is set up by the caller before
# ``step_once`` (the REPL pushes user-driven events first).
self.state.setdefault("events_this_tick", [])
for step in self.pipeline:
self.state = step(self.state)
return list(self.state.get("log_lines") or [])
# ------------------------------------------------------------------
# I/O
# ------------------------------------------------------------------
def _flush_logs(self) -> None:
for line in self.state.get("log_lines") or []:
print(f"[pi052] {line}", flush=True)
def _on_shutdown(self) -> None:
# Drain any queued action chunks safely.
queue = self.state.get("action_queue")
if isinstance(queue, deque):
queue.clear()
print("[pi052] runtime stopped", flush=True)
@@ -0,0 +1,95 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Runtime state passed between inference steps each tick.
The runtime threads a single dict through the pipeline; this module
documents the shape and provides factories. We use a plain ``dict``
rather than a frozen dataclass because steps freely add and remove
keys (``events_this_tick``, ``messages_pending``, ``tool_calls_pending``,
) and dataclass field churn would just get in the way.
Stable keys (read by multiple steps):
task str the current top-level task
current_plan str | None latest plan emitted by the planner
current_subtask str | None latest subtask the policy is executing
current_memory str | None latest compressed memory
recent_interjection str | None most recent user interjection text (consumed)
action_queue collections.deque[Tensor] pending action chunks
tool_calls_pending list[dict] parsed but not-yet-dispatched tool calls
events_this_tick list[str] triggers consumed this tick
_tick Tick current tick (set by the loop)
mode str "action" (run the robot) | "paused"
(action loop stopped robot holds)
log_lines list[str] human-readable status lines printed each tick
"""
from __future__ import annotations
from collections import deque
from typing import Any
def initial_runtime_state(task: str | None = None) -> dict[str, Any]:
"""Build a fresh runtime state dict with sensible defaults."""
return {
"task": task,
"current_plan": None,
"current_subtask": None,
"current_memory": None,
"recent_interjection": None,
"action_queue": deque(),
"tool_calls_pending": [],
"events_this_tick": [],
"log_lines": [],
"mode": "action",
"stop": False,
}
def take_event(state: dict[str, Any], event_name: str) -> bool:
"""Pop ``event_name`` from ``events_this_tick`` if present.
Steps that consume an event call this so the same event doesn't
re-fire on a sibling step within the same tick.
"""
events: list[str] = state.get("events_this_tick") or []
if event_name in events:
events.remove(event_name)
return True
return False
def push_log(state: dict[str, Any], line: str) -> None:
"""Append ``line`` to the per-tick log buffer; the runtime prints
it at the end of the tick."""
state.setdefault("log_lines", []).append(line)
def set_if_changed(state: dict[str, Any], key: str, value: Any, label: str | None = None) -> bool:
"""Update ``state[key]`` and log a diff line if the value changed.
Returns ``True`` if the value actually changed.
"""
prev = state.get(key)
if prev == value:
return False
state[key] = value
if label is not None:
push_log(state, f" {label}: {value}")
return True
@@ -0,0 +1,955 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Inference steps for the PI052 multi-rate runtime.
Each step is a tiny class with a ``trigger`` and an ``__call__(state)``;
the runtime applies them in order each tick. When a step's trigger
doesn't fire, the step is a no-op and the runtime moves on.
Stream-to-step mapping mirrors the ``subtasks_vqa.yaml`` recipe:
* ``LowLevelForward`` calls ``policy.select_action`` for the
action chunk; trained by
``low_level_execution``
* ``EnqueueChunk`` pushes the chunk to ``action_queue``
* ``DispatchAction`` pops one action per control tick and
forwards to the robot
* ``HighLevelSubtaskFwd`` calls ``policy.select_message`` for the
next subtask; trained by
``high_level_subtask``
* ``MemoryUpdateFwd`` fires on subtask boundary; trained by
``memory_update``
* ``UserInterjectionFwd`` fires on stdin interjection; trained by
``user_interjection_response``
* ``AskVQAFwd`` fires on stdin question; trained by
``ask_vqa_*``
* ``DispatchToolCalls`` pops ``tool_calls_pending`` and calls
the matching ``Tool`` instance
"""
from __future__ import annotations
import logging
import re
from dataclasses import dataclass, field
from typing import Any
from .runtime_state import push_log, set_if_changed, take_event
from .triggers import EventTrigger, HzTrigger, Trigger
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Step base + runner
# ---------------------------------------------------------------------------
@dataclass
class InferenceStep:
"""A trigger-gated callable. Subclasses override :meth:`run`."""
trigger: Trigger
def __call__(self, state: dict[str, Any]) -> dict[str, Any]:
if not self.trigger.should_fire(state["_tick"], state):
return state
return self.run(state) or state
def run(self, state: dict[str, Any]) -> dict[str, Any] | None: # pragma: no cover
raise NotImplementedError
# ---------------------------------------------------------------------------
# Low-level (action) path
# ---------------------------------------------------------------------------
@dataclass
class LowLevelForward(InferenceStep):
"""Run the policy's action head and produce one action chunk."""
policy: Any = None
observation_provider: Any = None
"""Callable ``() -> dict``: returns the current observation batch
(already preprocessed). Typically wraps the robot's camera /
proprio reads. ``None`` in dry-run mode step skips."""
trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=4.0))
def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
if self.policy is None or self.observation_provider is None:
return None
# ``/vlm`` mode pauses the whole action loop so the robot holds
# position while the operator probes the VLM with VQA.
if state.get("mode", "action") != "action":
return None
if not state.get("task"):
return None
# PI052 produces *action chunks* (typically 50 steps via
# flow-matching). Every step gets dispatched to the robot;
# popping one per dispatch tick is essentially free. Only
# generate a new chunk once the previous one has fully
# drained — this is the canonical "sense → think → act"
# loop. Refreshing while a chunk is still queued causes the
# new chunk to "telescope" past the old one (planned from an
# observation that's already 25+ steps stale by the time it
# starts dispatching).
queue = state.setdefault("action_queue", [])
if len(queue) > 0:
return None
observation = self.observation_provider()
if observation is None:
return None
# The action expert is conditioned on the SUBTASK generated by
# the high-level loop (``HighLevelSubtaskFwd`` runs earlier in
# the pipeline and writes ``current_subtask``). Matches the
# training-time ``low_level_execution`` recipe — ``user(${subtask})``.
# Falls back to the task string only on the very first frame,
# before the high-level loop has produced a subtask.
subtask = state.get("current_subtask") or state.get("task") or ""
ctx = [{"role": "user", "content": subtask}]
# ``add_generation_prompt=False`` to match the training-time
# prefix shape: at training the action expert sees the rendered
# user turn ending at ``<|im_end|>`` (no trailing
# ``<|im_start|>assistant\n``). Passing True here would append
# extra role-marker tokens the action expert never saw during
# training.
text_batch = _build_text_batch(self.policy, ctx, add_generation_prompt=False)
from lerobot.utils.constants import ( # noqa: PLC0415
OBS_LANGUAGE_ATTENTION_MASK,
OBS_LANGUAGE_TOKENS,
)
observation = dict(observation)
observation[OBS_LANGUAGE_TOKENS] = text_batch["lang_tokens"]
observation[OBS_LANGUAGE_ATTENTION_MASK] = text_batch["lang_masks"]
try:
# ``predict_action_chunk`` returns the *full* chunk shape
# ``(batch, n_action_steps, action_dim)``. Enqueue every
# step so DispatchAction at ctrl_hz can drain them
# smoothly until the next refresh.
chunk = self.policy.predict_action_chunk(observation)
except Exception as exc: # noqa: BLE001
logger.warning(
"predict_action_chunk failed: %s",
exc,
exc_info=logger.isEnabledFor(logging.DEBUG),
)
push_log(
state,
f" [warn] predict_action_chunk failed: "
f"{type(exc).__name__}: {exc}",
)
return None
# ``chunk`` shape: ``(batch, n_action_steps, action_dim)``. Push
# each step as a ``(1, action_dim)`` tensor so the existing
# action executor's batch-squeeze logic works unchanged.
if chunk.ndim == 3:
chunk_iter = chunk[0] # ``(n_action_steps, action_dim)``
elif chunk.ndim == 2:
chunk_iter = chunk
else:
chunk_iter = chunk.unsqueeze(0)
for step in chunk_iter:
queue.append(step.unsqueeze(0))
state["last_chunk_size"] = int(chunk_iter.shape[0])
return None
@dataclass
class DispatchAction(InferenceStep):
"""Pop one action per tick and hand it to the robot.
In dry-run mode (``robot_executor=None``) the step still pops the
queue so it doesn't grow unbounded — the popped tensor is logged
instead of executed.
Wall-clock catch-up: the action queue represents an open-loop
trajectory at a fixed step rate (``trigger.hz`` ``ctrl_hz``).
When the main loop stalls e.g. an LLM call for the high-level
subtask blocks for ~2 s on MPS the dispatch trigger fires only
once over that whole interval. Naively popping a single entry per
fire makes the robot lag further and further behind the planned
timeline, and a 50-step chunk would take ~125 s to drain instead
of ~1.7 s. Track real elapsed time between dispatches and pop
``round(elapsed * hz)`` entries, sending the most recent one. The
skipped intermediate joint targets are stale anyway the dynamixel
will smooth toward the latest goal position.
"""
robot_executor: Any = None
trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=50.0))
_last_dispatch_t: float | None = field(default=None, init=False)
def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
import time as _time # noqa: PLC0415
# ``/vlm`` mode pauses dispatch — the robot holds its last
# commanded position while the operator runs VQA.
if state.get("mode", "action") != "action":
self._last_dispatch_t = None
return None
queue = state.get("action_queue")
if not queue:
# Reset wall-clock anchor when the queue is empty so the
# next chunk doesn't see a huge fake "elapsed" window.
self._last_dispatch_t = None
return None
now = _time.monotonic()
hz = getattr(self.trigger, "hz", 30.0)
if self._last_dispatch_t is None or hz <= 0:
n_to_pop = 1
else:
elapsed = now - self._last_dispatch_t
# ``max(1, ...)`` so we always pop at least one when the
# trigger fires; ``min(len(queue), ...)`` so we don't run
# off the end of the chunk.
n_to_pop = max(1, min(len(queue), int(round(elapsed * hz))))
self._last_dispatch_t = now
# Drain ``n_to_pop`` stale entries, keep only the latest as the
# action actually sent. The intermediate joint targets would
# all be ~1030 ms apart in chunk time — the robot can't track
# them individually anyway when the host loop is slow.
latest = None
for _ in range(n_to_pop):
if not queue:
break
latest = queue.popleft() if hasattr(queue, "popleft") else queue.pop(0)
state["actions_dispatched"] = state.get("actions_dispatched", 0) + 1
if latest is not None and self.robot_executor is not None:
self.robot_executor(latest)
return None
# ---------------------------------------------------------------------------
# High-level (text) paths — all use policy.select_message
# ---------------------------------------------------------------------------
_LOC_TOKENIZER_CACHE: dict[str, Any] = {}
def _get_loc_tokenizer(tok_name: str, auto_tokenizer_cls: Any, register_loc_fn: Any) -> Any:
"""Return a loc-token-registered tokenizer, loading from disk only once.
``AutoTokenizer.from_pretrained`` + loc-token registration is expensive and
the result is immutable, so cache per ``tok_name``.
"""
tokenizer = _LOC_TOKENIZER_CACHE.get(tok_name)
if tokenizer is None:
tokenizer = register_loc_fn(auto_tokenizer_cls.from_pretrained(tok_name))
_LOC_TOKENIZER_CACHE[tok_name] = tokenizer
return tokenizer
def _build_text_batch(
policy: Any,
prompt_messages: list[dict[str, Any]],
*,
add_generation_prompt: bool = True,
) -> dict[str, Any]:
"""Tokenize chat messages into the batch ``select_message`` expects.
PI052's backbone (PaliGemma) ships no chat template, so we train on
a plain role-prefixed concatenation built by
``PI052TextTokenizerStep``. We reuse that exact formatter so the
inference prefix matches training; ``add_generation_prompt`` appends
the bare ``Assistant: `` header the LM head continues from.
"""
import torch # noqa: PLC0415
from transformers import AutoTokenizer # noqa: PLC0415
from lerobot.policies.pi052.text_processor_pi052 import ( # noqa: PLC0415
_flatten_say_tool_calls,
_format_messages,
_strip_blocks,
register_paligemma_loc_tokens,
)
tok_name = (
getattr(policy.config, "tokenizer_name", None) or "google/paligemma-3b-pt-224"
)
# Register PaliGemma's <locDDDD> tokens so inference encoding /
# decoding sees them as single vocab ids — must match training.
# The tokenizer is read-only after registration, so cache it: rebuilding it
# from disk on every call dominated eval runtime (this runs twice per env
# per replan — subtask gen + action prompt).
tokenizer = _get_loc_tokenizer(tok_name, AutoTokenizer, register_paligemma_loc_tokens)
messages = [_strip_blocks(_flatten_say_tool_calls(m)) for m in prompt_messages]
prompt, _spans = _format_messages(messages)
if add_generation_prompt:
prompt = prompt + "Assistant: "
encoded = tokenizer(prompt, return_tensors="pt")
ids = encoded["input_ids"]
attn = encoded.get("attention_mask")
if attn is None and tokenizer.pad_token_id is not None:
attn = ids != tokenizer.pad_token_id
if attn is not None and hasattr(attn, "dtype") and attn.dtype != torch.bool:
attn = attn.bool()
# Move tokens onto the policy's device — otherwise prefix embedding
# raises a device-mismatch on every forward (CPU tensor vs MPS / CUDA
# model), which the caller's broad except would swallow silently.
device = getattr(getattr(policy, "config", None), "device", None)
if device is not None:
try:
ids = ids.to(device)
if attn is not None and hasattr(attn, "to"):
attn = attn.to(device)
except Exception as exc: # noqa: BLE001
logger.debug("could not move pi052 lang tokens to %s: %s", device, exc)
return {"lang_tokens": ids, "lang_masks": attn, "tokenizer": tokenizer}
def _strip_recipe_keys(m: dict[str, Any]) -> dict[str, Any]:
new = dict(m)
new.pop("stream", None)
new.pop("target", None)
return new
@dataclass
class HighLevelSubtaskFwd(InferenceStep):
"""At ~1 Hz, ask the policy for the next subtask.
Mirrors the ``high_level_subtask`` recipe layout exactly:
user: "${task}\\nPlan: ${plan}\\nMemory: ${memory}"
user: "Current subtask: ${subtask}" (if subtask present)
generate
assistant: <next subtask>
"""
policy: Any = None
observation_provider: Any = None
"""Same shape as ``LowLevelForward.observation_provider``. When
set, the resulting observation is merged into ``select_message``'s
batch so text generation runs against real video + state."""
trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=1.0))
def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
if self.policy is None or not state.get("task"):
return None
# ``/vlm`` mode pauses subtask generation along with the rest of
# the action loop.
if state.get("mode", "action") != "action":
return None
# Gate to chunk boundaries: only generate a fresh subtask when
# the action queue is empty (i.e. right before LowLevelForward
# refreshes the chunk). ``select_message`` takes ~2 s on MPS,
# and running it every loop iteration starves DispatchAction
# at ctrl_hz=30 — the queue drains at ~0.4 actions/sec instead
# of 30/sec and the robot barely moves. Tying it to the same
# "queue empty" condition as the chunk refresh produces a
# clean sense → think → act cycle.
#
# Rearm the trigger when skipping so a low-hz schedule
# (e.g. ``--high_level_hz=0.2`` = once per 5 s) doesn't lose
# the slot: the trigger fires once on the timer but the brief
# queue-empty window almost never coincides, so without rearm
# HL would effectively never run.
queue = state.get("action_queue") or []
if len(queue) > 0:
if hasattr(self.trigger, "rearm"):
self.trigger.rearm()
return None
# Per-chunk-boundary throttle: at each "queue empty" moment we
# increment a counter; subtask gen only fires once the counter
# reaches ``subtask_chunks_per_gen``. Lets the operator run e.g.
# 5 action chunks per subtask-gen so the LM head doesn't churn
# every 1.7 s (a fresh subtask while the previous one is still
# being executed is wasted compute *and* causes the action
# expert's flow trajectory to be re-planned mid-grasp).
chunks_per_gen = max(1, int(state.get("subtask_chunks_per_gen", 1) or 1))
# Initialise so the first chunk boundary fires immediately
# (counter starts at chunks_per_gen, decrements per skip,
# generates and resets when it hits 0).
if "_hl_chunks_until_gen" not in state:
state["_hl_chunks_until_gen"] = 0
if state["_hl_chunks_until_gen"] > 0:
state["_hl_chunks_until_gen"] -= 1
if hasattr(self.trigger, "rearm"):
self.trigger.rearm()
return None
state["_hl_chunks_until_gen"] = chunks_per_gen - 1
ctx = _msgs_for_subtask(state)
observation = _maybe_observation(self.observation_provider)
# Default: greedy argmax, no min_new_tokens, no special-token
# suppression — matches training. Operator can override via
# ``--text_min_new_tokens=N --text_temperature=T --text_top_p=P``
# on the CLI; useful for under-trained checkpoints whose LM
# head still favours EOS at position 0 (pre-trained chat
# backbone's short-turn prior hasn't been fully overridden
# by the fine-tuning supervision yet).
msg = _generate_with_policy(
self.policy,
ctx,
observation=observation,
state=state,
label="subtask gen",
min_new_tokens=int(state.get("text_gen_min_new_tokens") or 0),
temperature=float(state.get("text_gen_temperature") or 0.0),
top_p=float(state.get("text_gen_top_p") or 1.0),
# Subtasks never legitimately contain PaliGemma ``<loc>``
# tokens — suppress them so a checkpoint whose LM head
# has drifted toward the pretrained loc-prior falls back
# to its (still-correct) text mass.
suppress_loc_tokens=True,
)
# Diagnostics: surface what the model is *actually* producing
# at chunk boundaries, even when the output gets rejected or
# repeats. Memorisation collapse looks like "same accepted
# subtask N times in a row" or "gibberish_count rising while
# current_subtask is stuck". The state panel renders these.
state["last_subtask_raw"] = msg or ""
# Persistent empty completion is its own failure mode (model
# immediately EOS-es from the chat-template generation
# prompt) — surface it once every N occurrences so the
# operator can distinguish "generation failing silently"
# from "generating fine but filter rejecting".
if not msg:
empties = state.get("subtask_empty_count", 0) + 1
state["subtask_empty_count"] = empties
if empties == 1 or empties % 5 == 0:
debug = getattr(self.policy, "_last_select_message_debug", "") or ""
if debug:
push_log(
state,
f" [info] subtask gen empty (×{empties}); {debug}",
)
else:
push_log(
state,
f" [info] subtask gen returned empty (×{empties}) — "
"no tokens generated (head EOS-ing before any "
"non-special token).",
)
if msg and _looks_like_gibberish(msg):
# Bump a counter so the operator can see the model is
# struggling without spamming the log every tick. A first
# rejection still logs once so the failure is visible.
count = state.get("subtask_gibberish_count", 0) + 1
state["subtask_gibberish_count"] = count
if count == 1 or count % 30 == 0:
push_log(
state,
f" [info] subtask gen rejected (gibberish ×{count}): {msg[:60]!r}",
)
return None
if msg:
prev_subtask = state.get("current_subtask")
changed = set_if_changed(state, "current_subtask", msg, label="subtask")
if changed:
# Stash the just-completed subtask so ``MemoryUpdateFwd``
# can drop it into its prompt as ``Completed subtask:``
# — the recipe binds ``completed_subtask`` to
# ``nth_prev(style=subtask, offset=1)``, i.e. the subtask
# that was active *before* the change.
if prev_subtask:
state["prior_subtask"] = prev_subtask
# Subtask change is a downstream trigger.
state.setdefault("events_this_tick", []).append("subtask_change")
state["subtask_repeat_count"] = 0
else:
# Same accepted string regenerated — memorisation tell.
# Once this counter climbs past a few, you're seeing
# the model unable to move past the current subtask
# despite the chunk having drained (visual scene may
# have changed but the LM is replaying training
# tokens).
state["subtask_repeat_count"] = (
state.get("subtask_repeat_count", 0) + 1
)
# Silently skip empty completions — common when the model
# warms up or generates only EOS; logging it every tick at
# ctrl_hz is just noise.
return None
@dataclass
class MemoryUpdateFwd(InferenceStep):
"""On subtask boundary, refresh the compressed memory.
Mirrors the ``memory_update`` recipe layout exactly:
user: "${task}"
assistant: "Previous memory: ${prior_memory}" (if prior memory)
user: "Completed subtask: ${completed_subtask}" (if subtask)
generate
assistant: <new memory>
"""
policy: Any = None
observation_provider: Any = None
trigger: Trigger = field(default_factory=lambda: EventTrigger("subtask_change"))
def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
# Don't consume the event — multiple steps may want to react.
if self.policy is None:
return None
ctx = _msgs_for_memory(state)
observation = _maybe_observation(self.observation_provider)
new_memory = _generate_with_policy(
self.policy,
ctx,
observation=observation,
state=state,
label="memory gen",
suppress_loc_tokens=True,
)
state["last_memory_raw"] = new_memory or ""
if new_memory and _looks_like_gibberish(new_memory):
count = state.get("memory_gibberish_count", 0) + 1
state["memory_gibberish_count"] = count
push_log(
state,
f" [info] memory gen rejected (gibberish ×{count}): {new_memory[:60]!r}",
)
return None
if new_memory:
set_if_changed(state, "current_memory", new_memory, label="memory")
return None
@dataclass
class UserInterjectionFwd(InferenceStep):
"""On stdin interjection, refresh the plan + emit a paired ``say``.
Mirrors the ``user_interjection_response`` recipe layout exactly:
user: "${task}"
assistant: "Previous plan:\\n${prior_plan}" (if prior plan)
user: "${interjection}" (the new utterance)
generate
assistant: <plan + <say>...</say>>
"""
policy: Any = None
observation_provider: Any = None
trigger: Trigger = field(default_factory=lambda: EventTrigger("user_interjection"))
def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
if self.policy is None or not take_event(state, "user_interjection"):
return None
ctx = _msgs_for_interjection(state)
observation = _maybe_observation(self.observation_provider)
out = _generate_with_policy(
self.policy,
ctx,
observation=observation,
state=state,
label="plan/say gen",
suppress_loc_tokens=True,
)
if not out:
# Don't log every empty completion — happens repeatedly on
# MPS during warm-up and floods the panel. The user can
# re-trigger by typing again.
return None
if _looks_like_gibberish(out):
count = state.get("plan_gibberish_count", 0) + 1
state["plan_gibberish_count"] = count
push_log(
state,
f" [info] plan/say gen rejected (gibberish ×{count}): {out[:60]!r}",
)
return None
# Heuristic split: model is trained to emit one assistant turn
# carrying both plan text AND a `say` tool call. Look for a
# "<say>...</say>" or "say(...)" marker; fall back to whole
# text → plan, no speech.
plan_text, speech_text = _split_plan_and_say(out)
if plan_text and _looks_like_gibberish(plan_text):
plan_text = ""
if plan_text:
set_if_changed(state, "current_plan", plan_text, label="plan")
if speech_text:
push_log(state, f" speech: {speech_text}")
state.setdefault("tool_calls_pending", []).append(
{
"type": "function",
"function": {"name": "say", "arguments": {"text": speech_text}},
}
)
state.setdefault("events_this_tick", []).append("tool_call_pending")
# Mark interjection consumed.
state["recent_interjection"] = None
return None
@dataclass
class AskVQAFwd(InferenceStep):
"""On stdin question, answer a frame-grounded VQA.
Mirrors the ``ask_vqa_*`` recipe layout exactly: a single user
turn carrying just the VQA question, plus the camera image block
in training (we drop the image at inference because the dataset's
image preprocessing doesn't match SmolVLM's vision tower input).
user: <question>
generate
assistant: <vqa answer>
"""
policy: Any = None
observation_provider: Any = None
trigger: Trigger = field(default_factory=lambda: EventTrigger("user_vqa_query"))
def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
if self.policy is None or not take_event(state, "user_vqa_query"):
return None
question = state.get("recent_vqa_query")
if not question:
return None
ctx = _msgs_for_vqa(question)
observation = _maybe_observation(self.observation_provider)
answer = _generate_with_policy(
self.policy,
ctx,
observation=observation,
state=state,
label="vqa gen",
)
# VQA answers are intentionally JSON-like during training, so
# ``_looks_like_gibberish`` would false-positive on them. Keep
# the answer as-is — the VQA panel line lets the user judge.
if answer:
push_log(state, f" vqa: {answer}")
state["recent_vqa_query"] = None
return None
# ---------------------------------------------------------------------------
# Tool dispatch
# ---------------------------------------------------------------------------
@dataclass
class DispatchToolCalls(InferenceStep):
"""Pop ``tool_calls_pending`` and execute them via :data:`TOOL_REGISTRY`."""
tools: dict[str, Any] = field(default_factory=dict)
trigger: Trigger = field(default_factory=lambda: EventTrigger("tool_call_pending"))
def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
take_event(state, "tool_call_pending")
pending = state.get("tool_calls_pending") or []
for call in pending:
try:
fn = (call or {}).get("function") or {}
name = fn.get("name")
args = fn.get("arguments") or {}
tool = self.tools.get(name)
if tool is None:
push_log(state, f" [warn] tool {name!r} not registered — skipping call")
continue
tool.call(args)
except Exception as exc: # noqa: BLE001
push_log(state, f" [error] tool dispatch failed: {exc}")
state["tool_calls_pending"] = []
return None
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _looks_like_gibberish(text: str) -> bool:
"""Heuristically detect generation that's clearly off the rails.
Memorised models can collapse to dominant-mode outputs when the
prompt drifts even slightly from training distribution. Reject:
* empty / whitespace-only
* too few alphabetic characters (mostly punctuation)
* a single character repeated past the threshold
* starts with ``":"`` and contains no letters
* too few unique tokens e.g. ``"the"``, ``"the the the"``,
``"Ass\\n::\\nthe"`` (the collapse seen on real-robot frames
where the model emits one or two memorised tokens repeatedly)
* chat-template fragment leakage (``Assistant:``, ``User:``,
``Ass\\n``)
Real subtasks look like ``"close the gripper to grasp the blue
cube"`` — multiple unique alphabetic tokens, no role-marker
fragments. Anything materially shorter than that is rejected.
"""
if not text or not text.strip():
return True
stripped = text.strip()
alpha = sum(1 for c in stripped if c.isalpha())
if alpha < max(3, len(stripped) // 8):
return True
if stripped.startswith('":') and stripped.count('"') > stripped.count(" "):
return True
# Single repeating char: e.g. ``""""""``.
if len(set(stripped)) <= 2 and len(stripped) > 4:
return True
# Chat-template fragment leakage — the model emits ``Ass``,
# ``Assistant:``, ``User:``, often with extra newlines/colons.
# Reject if the cleaned text is mostly role-marker shards.
cleaned = stripped.replace("\n", " ").replace(":", " ")
for marker in ("Assistant", "User", "Ass "):
if marker in cleaned and len(cleaned.split()) < 4:
return True
tokens = [t for t in cleaned.split() if any(c.isalpha() for c in t)]
unique_alpha = {t.lower() for t in tokens}
# Short degenerate output — model stuck on ``the`` or a couple of
# memorised single-token continuations.
if len(unique_alpha) < 3 and len(stripped) < 80:
return True
# Long repetition collapse — the LM head loops an n-gram for the
# whole generation budget ("the arm the arm … the the the the").
# Length-independent: many tokens but a tiny unique ratio. The
# earlier ``< 80`` check missed these because the looped string
# blows well past 80 chars.
if len(tokens) >= 8 and len(unique_alpha) <= max(3, len(tokens) // 10):
return True
return False
def _control_context_messages(
state: dict[str, Any],
*,
include_completed: bool = False,
extra_user: str | None = None,
) -> list[dict[str, Any]]:
"""Build a chat-template-ready prompt from current runtime state.
Mirrors what ``subtasks_vqa.yaml`` renders into ``${task}\nPlan:
${plan}\nMemory: ${memory}`` for the high-level branches.
"""
# Always emit ``Plan: `` / ``Memory: `` labels — even with empty
# values — to mirror the training-time recipe substitution.
task = state.get("task") or ""
plan = state.get("current_plan") or ""
memory = state.get("current_memory") or ""
parts = [task, f"Plan: {plan}", f"Memory: {memory}"]
if include_completed and state.get("current_subtask"):
parts.append(f"Completed subtask: {state['current_subtask']}")
head = "\n".join(parts)
msgs: list[dict[str, Any]] = [{"role": "user", "content": head}]
if extra_user:
msgs.append({"role": "user", "content": extra_user})
return msgs
# ---------------------------------------------------------------------------
# Per-recipe prompt builders. Each one mirrors a single sub-recipe's
# message layout in ``subtasks_vqa.yaml`` so the chat-templated
# prompt at inference matches what the model saw during training.
# Generic ``_control_context_messages`` is kept around as a fallback
# for ad-hoc callers but the four high-level steps now use these.
# ---------------------------------------------------------------------------
def _hirobot_user_head(state: dict[str, Any]) -> str:
"""Build the ``task\\nPlan: …\\nMemory: …`` user content string.
Mirrors what the recipe renders at training time, where
``language_render._substitute`` substitutes empty strings for
missing ``${plan}`` / ``${memory}`` bindings i.e. the
``Plan: `` / ``Memory: `` prefix labels are *always* in the
user turn, even when their values aren't set yet. Skipping them
here (the previous behaviour) produced a different prompt shape
on early frames before plan / memory are populated and on
samples where the dataset has no plan / memory annotation.
"""
task = state.get("task") or ""
plan = state.get("current_plan") or ""
memory = state.get("current_memory") or ""
return f"{task}\nPlan: {plan}\nMemory: {memory}"
def _msgs_for_subtask(state: dict[str, Any]) -> list[dict[str, Any]]:
"""``high_level_subtask`` recipe layout — predict the subtask from the
task. The v-current recipe's user turn is just ``${task}`` (plan and
memory are not trained), so the inference prompt is the bare task
no ``Plan: `` / ``Memory: `` lines.
"""
return [{"role": "user", "content": state.get("task") or ""}]
def _msgs_for_memory(state: dict[str, Any]) -> list[dict[str, Any]]:
"""Memory-update prompt — mirrors ``memory_update`` recipe layout.
Recipe layout (``subtask_mem.yaml``):
user: "${task}"
assistant: "Previous memory: ${prior_memory}" (if_present prior)
user: "Completed subtask: ${completed}" (if_present completed)
assistant: predicts new memory
Fired by ``MemoryUpdateFwd`` on a ``subtask_change`` event:
``state['current_memory']`` is the memory the policy last emitted
(= the ``prior_memory`` binding at training), and
``state['prior_subtask']`` is the subtask that just got replaced
(= the ``completed_subtask`` binding at training).
"""
msgs: list[dict[str, Any]] = [
{"role": "user", "content": state.get("task") or ""},
]
prior_memory = state.get("current_memory")
if prior_memory:
msgs.append(
{"role": "assistant", "content": f"Previous memory: {prior_memory}"}
)
completed_subtask = state.get("prior_subtask")
if completed_subtask:
msgs.append(
{"role": "user", "content": f"Completed subtask: {completed_subtask}"}
)
return msgs
def _msgs_for_interjection(state: dict[str, Any]) -> list[dict[str, Any]]:
"""``user_interjection_response`` recipe layout."""
msgs: list[dict[str, Any]] = [
{"role": "user", "content": state.get("task") or ""}
]
if state.get("current_plan"):
msgs.append(
{"role": "assistant", "content": f"Previous plan:\n{state['current_plan']}"}
)
interjection = state.get("recent_interjection")
if interjection:
msgs.append({"role": "user", "content": interjection})
return msgs
def _msgs_for_plan(state: dict[str, Any]) -> list[dict[str, Any]]:
"""``plan_generation`` recipe layout — bare task → plan.
The assistant turn is the generation target, so we only render
the user turn at inference; the runtime appends the predicted
plan after sampling.
"""
return [{"role": "user", "content": state.get("task") or ""}]
def _msgs_for_vqa(question: str) -> list[dict[str, Any]]:
"""``ask_vqa_*`` recipe layout (text-only at inference)."""
return [{"role": "user", "content": question}]
def _maybe_observation(provider: Any) -> dict | None:
"""Pull one observation from ``provider`` if it's set, else ``None``.
Errors from the provider are logged at debug level and swallowed
text generation still runs (in text-only mode) so a flaky frame
source doesn't kill the REPL.
"""
if provider is None:
return None
try:
return provider()
except Exception as exc: # noqa: BLE001
logger.debug("observation_provider raised %s — falling back to text-only", exc)
return None
def _generate_with_policy(
policy: Any,
messages: list[dict[str, Any]],
*,
observation: dict | None = None,
state: dict[str, Any] | None = None,
label: str = "select_message",
min_new_tokens: int = 0,
temperature: float = 0.0,
top_p: float = 1.0,
suppress_loc_tokens: bool = False,
) -> str:
"""Drive ``policy.select_message`` with a chat batch (and optional obs).
When ``observation`` carries ``observation.images.*`` and
``observation.state``, those are merged into the batch so
``select_message`` runs the same VLM prefix the policy was trained
on. Without an observation the runtime falls back to a text-only
prompt the text head still runs, but generations may drift from
the training distribution.
Failures are surfaced both to the module logger (``warning``) and,
when ``state`` is given, to the runtime's user-visible log via
:func:`push_log`, so the REPL no longer "looks dead" when
something goes wrong inside generation.
"""
if not hasattr(policy, "select_message"):
if state is not None:
push_log(state, f" [warn] policy has no select_message — skipping {label}")
return ""
text_batch = _build_text_batch(policy, messages)
try:
from lerobot.utils.constants import ( # noqa: PLC0415
OBS_LANGUAGE_ATTENTION_MASK,
OBS_LANGUAGE_TOKENS,
)
batch: dict[str, Any] = {
OBS_LANGUAGE_TOKENS: text_batch["lang_tokens"],
OBS_LANGUAGE_ATTENTION_MASK: text_batch["lang_masks"],
}
if observation:
for k, v in observation.items():
if isinstance(k, str) and k.startswith("observation.") and k not in batch:
batch[k] = v
kwargs: dict[str, Any] = {
"tokenizer": text_batch["tokenizer"],
"min_new_tokens": min_new_tokens,
"temperature": temperature,
"top_p": top_p,
}
kwargs["suppress_loc_tokens"] = suppress_loc_tokens
return policy.select_message(batch, **kwargs)
except Exception as exc: # noqa: BLE001
logger.warning("%s failed: %s", label, exc, exc_info=logger.isEnabledFor(logging.DEBUG))
if state is not None:
push_log(state, f" [warn] {label} failed: {type(exc).__name__}: {exc}")
return ""
_SAY_RE = re.compile(r"<\s*say\s*>(.*?)<\s*/\s*say\s*>", re.IGNORECASE | re.DOTALL)
def _split_plan_and_say(text: str) -> tuple[str, str]:
"""Pull a ``<say>...</say>`` snippet out of ``text``; remainder is plan.
The training-time tool-call serializer wraps ``say(text="")`` in a
deterministic textual marker so prefix-LM-style training learns to
emit it. The runtime parses it back here. If no marker is present,
the entire text is treated as plan with no speech.
"""
if not text:
return "", ""
match = _SAY_RE.search(text)
if not match:
return text.strip(), ""
speech = match.group(1).strip().strip('"').strip("'")
plan = (text[: match.start()] + text[match.end() :]).strip()
return plan, speech
@@ -0,0 +1,134 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Trigger primitives for PI052's multi-rate inference runtime.
Mirrors the plan's Section "Runtime orchestration": each
``InferenceStep`` is gated by a :class:`Trigger` that decides per tick
whether the step fires. Two trigger flavours cover all the cadences
the canonical recipe needs:
* :class:`HzTrigger` for periodic beats (action chunks at ~3-5 Hz,
high-level subtask generation at ~1 Hz, action dispatch at ~50 Hz)
* :class:`EventTrigger` for one-shot reactions (subtask boundary
memory update; user interjection plan refresh; user VQA query
vqa answer; pending tool call dispatcher)
Triggers are stateless except for ``HzTrigger``'s last-fire timestamp.
The runtime stores the :class:`Tick` clock as ``state["_tick"]`` so
every step shares a single time source.
"""
from __future__ import annotations
import time
from dataclasses import dataclass, field
from typing import Any, Protocol
@dataclass
class Tick:
"""Single tick from :class:`TickClock`. Carries time references the
runtime steps consume to gate themselves."""
index: int
"""Monotonic counter — increments by one per tick."""
monotonic_seconds: float
"""``time.monotonic()`` at the start of this tick."""
@dataclass
class TickClock:
"""Drives the runtime loop at up to ``max_rate_hz``.
Sleeps just enough between :meth:`advance` calls to enforce the
rate. With ``max_rate_hz=50`` the loop wakes ~every 20ms; the
higher-level ``HzTrigger`` slices that timeline into sub-cadences.
"""
max_rate_hz: float = 50.0
_index: int = field(default=0, init=False)
_last_seconds: float | None = field(default=None, init=False)
def advance(self) -> Tick:
period = 1.0 / max(self.max_rate_hz, 0.1)
now = time.monotonic()
if self._last_seconds is not None:
sleep_for = (self._last_seconds + period) - now
if sleep_for > 0:
time.sleep(sleep_for)
now = time.monotonic()
self._last_seconds = now
self._index += 1
return Tick(index=self._index, monotonic_seconds=now)
class Trigger(Protocol):
"""Decide whether the next ``InferenceStep`` should fire."""
def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool: ...
@dataclass
class HzTrigger:
"""Fire at most ``hz`` times per second.
A step that gates further (e.g. ``HighLevelSubtaskFwd`` skipping
when the action queue is non-empty) and wants the trigger to
retry next tick instead of waiting a full period can call
:meth:`rearm` from inside ``run``. Without this, a low-hz trigger
(e.g. ``hz=0.2`` = once per 5 s) almost never coincides with the
brief queue-empty window and the step never fires at all.
"""
hz: float
_last_seconds: float | None = field(default=None, init=False)
def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool:
period = 1.0 / max(self.hz, 1e-6)
if self._last_seconds is None or (tick.monotonic_seconds - self._last_seconds) >= period:
self._last_seconds = tick.monotonic_seconds
return True
return False
def rearm(self) -> None:
"""Mark the trigger as not having fired, so the next tick re-evaluates.
Used by a step that decided to skip after ``should_fire`` already
committed the firing keeps the cadence honest without losing
the slot.
"""
self._last_seconds = None
@dataclass
class EventTrigger:
"""Fire when ``event_name`` is in ``state["events_this_tick"]``.
The runtime fills ``events_this_tick`` once per tick from:
* stdin / network input (``user_interjection``, ``user_vqa_query``,
``stop``)
* internal state transitions (``subtask_change``,
``tool_call_pending``)
The list is consumed (cleared at the end of the tick) so events
fire at most once.
"""
event_name: str
def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool:
events: list[str] = state.get("events_this_tick") or []
return self.event_name in events
+127
View File
@@ -0,0 +1,127 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Rich-based REPL layout for the PI052 runtime.
Two-zone terminal layout:
[chat scrollback user messages / robot responses, scrolls naturally]
State
task please clean up the kitchen
subtask grasp the handle of the sponge
plan 1. grasp sponge 2. wipe 3. tidy
memory sponge picked up; counter still dirty
> _
The state panel re-renders on every state change. Chat lines are
``console.print``'d above the live region so they accumulate naturally
in scrollback. Implemented with :class:`rich.live.Live` plus
:func:`rich.console.Console.input` for the prompt when an input is
pending, ``rich.Live`` auto-suspends so the input doesn't fight the
panel for cursor position.
"""
from __future__ import annotations
from typing import Any
try: # rich is optional; only required for the interactive REPL.
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.text import Text
_HAS_RICH = True
except ImportError: # pragma: no cover
_HAS_RICH = False
Console = Any # type: ignore[assignment]
Panel = Any # type: ignore[assignment]
Table = Any # type: ignore[assignment]
Text = Any # type: ignore[assignment]
_STATE_KEYS = (
("task", "task"),
("current_subtask", "subtask"),
("current_plan", "plan"),
("current_memory", "memory"),
)
def make_state_panel(state: dict[str, Any]) -> Any:
"""Render the persistent state panel for the live region.
Returns a :class:`rich.panel.Panel`. Caller passes it to
``Live.update(panel)`` whenever the state changes.
"""
if not _HAS_RICH:
raise RuntimeError(
"rich is required for the interactive REPL. "
"`pip install rich` (it's a transitive dep of lerobot)."
)
table = Table.grid(padding=(0, 2), expand=True)
table.add_column(justify="right", style="dim", no_wrap=True, width=10)
table.add_column(justify="left")
for key, label in _STATE_KEYS:
value = state.get(key)
if value is None:
rendered = Text("(not set)", style="dim italic")
else:
rendered = Text(str(value), style="bold")
table.add_row(label, rendered)
queue = state.get("action_queue")
queue_len = len(queue) if hasattr(queue, "__len__") else 0
pending = state.get("tool_calls_pending") or []
footer = Text.assemble(
("queued actions: ", "dim"),
(str(queue_len), "bold cyan"),
(" pending tool calls: ", "dim"),
(str(len(pending)), "bold magenta"),
)
table.add_row("", footer)
run_mode = state.get("mode", "action")
mode_tag = (
"[green]action[/]" if run_mode == "action" else "[yellow]paused[/]"
)
return Panel(
table,
title=f"[bold]PI052 state[/] · mode: {mode_tag}",
border_style="cyan",
)
def print_user_line(console: Any, line: str) -> None:
"""Append a user-typed line to the chat scrollback."""
if not _HAS_RICH:
print(f"you: {line}", flush=True)
return
console.print(f"[bold cyan]you:[/] {line}")
def print_robot_lines(console: Any, lines: list[str]) -> None:
"""Append robot/runtime log lines to the chat scrollback."""
if not _HAS_RICH:
for line in lines:
print(f"robot: {line.lstrip()}", flush=True)
return
for line in lines:
# The runtime uses leading whitespace + "label: text"; render
# the label in green and the value in default for readability.
stripped = line.lstrip()
if ":" in stripped:
label, _, value = stripped.partition(":")
console.print(f"[bold green]robot[/] [dim]({label.strip()})[/] {value.strip()}")
else:
console.print(f"[bold green]robot:[/] {stripped}")
+423
View File
@@ -0,0 +1,423 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Interactive VQA for the PI052 runtime.
In ``/vlm`` mode a typed line is treated as a VQA question. This module
runs the full interactive flow:
1. pull the current observation and list available cameras,
2. ask the operator which camera to ground the question on,
3. generate the answer with the VLM conditioned on that one camera,
4. parse the JSON answer; if it carries a bounding box (``bbox``) or a
point (``keypoint``), draw the overlay on the camera frame, save a
PNG to ``./vqa_overlays/`` and auto-open it.
VQA answer schemas mirror the annotation pipeline's ``VQA_ANSWER_SHAPES``
(see ``lerobot.annotations.steerable_pipeline.validator``):
* ``bbox`` ``{"detections": [{"label", "bbox_format": "xyxy",
"bbox": [x1, y1, x2, y2]}, ...]}``
* ``keypoint`` ``{"label", "point_format": "xy", "point": [x, y]}``
* ``count`` / ``attribute`` / ``spatial`` text-only, no overlay.
"""
from __future__ import annotations
import json
import logging
import os
import re
import subprocess
import sys
import time
import webbrowser
from pathlib import Path
from typing import Any
from .runtime_state import push_log
logger = logging.getLogger(__name__)
_IMAGE_PREFIX = "observation.images."
# PaliGemma detection / pointing vocabulary. PI052 trains spatial VQA
# answers in this native ``<locNNNN>`` format (index in [0, 1023],
# normalized to the image axis) instead of pixel-coordinate JSON, so the
# answer string the runtime parses can be e.g.
# ``<loc0512><loc0301> blue cube`` (point) or
# ``<loc0100><loc0080><loc0400><loc0360> blue cube`` (box).
_LOC_RE = re.compile(r"<loc(\d{1,4})>")
# Iteration order for shape matching — most specific keys first so an
# answer is classified deterministically.
_SHAPE_ORDER = ("bbox", "keypoint", "count", "attribute", "spatial")
_BBOX_COLOR = (255, 64, 64)
_POINT_COLOR = (64, 220, 64)
# ---------------------------------------------------------------------------
# Camera selection
# ---------------------------------------------------------------------------
def available_cameras(observation: dict | None) -> list[str]:
"""Return the sorted ``observation.images.*`` keys present in ``observation``."""
if not observation:
return []
return sorted(k for k in observation if isinstance(k, str) and k.startswith(_IMAGE_PREFIX))
def camera_short_name(camera_key: str) -> str:
"""Strip the ``observation.images.`` prefix for display."""
return camera_key[len(_IMAGE_PREFIX) :] if camera_key.startswith(_IMAGE_PREFIX) else camera_key
def prompt_camera_choice(
cameras: list[str],
*,
input_fn: Any = input,
print_fn: Any = print,
) -> str | None:
"""Ask the operator which camera frame to draw a VQA overlay on.
Accepts either the menu number or the (short or full) camera name.
A single-camera setup auto-selects without prompting. Returns the
chosen ``observation.images.*`` key, or ``None`` if the operator
cancels / gives an invalid answer.
"""
if not cameras:
return None
if len(cameras) == 1:
return cameras[0]
print_fn("Draw the result on which camera?")
for i, cam in enumerate(cameras, 1):
print_fn(f" [{i}] {camera_short_name(cam)}")
try:
raw = str(input_fn("camera> ")).strip()
except (EOFError, KeyboardInterrupt):
return None
if not raw:
return cameras[0]
if raw.isdigit():
idx = int(raw) - 1
return cameras[idx] if 0 <= idx < len(cameras) else None
for cam in cameras:
if raw == cam or raw == camera_short_name(cam):
return cam
return None
# ---------------------------------------------------------------------------
# Answer parsing
# ---------------------------------------------------------------------------
def _loc_to_norm(idx: int) -> float:
"""PaliGemma ``<locNNNN>`` index → normalized [0, 1] axis coordinate."""
return max(0.0, min(1023.0, float(idx))) / 1023.0
def parse_loc_answer(answer: str) -> dict | None:
"""Parse a PaliGemma ``<loc>``-format spatial VQA answer.
PI052 trains spatial answers in PaliGemma's native detection
vocabulary, label-first: a point is ``<label> <locY><locX>``, a box
is ``<label> <locY0><locX0><locY1><locX1>``, and multiple boxes are
joined by `` ; `` (e.g. ``cube <loc..><loc..><loc..><loc..> ; box
<loc..><loc..><loc..><loc..>``). Loc-first formats are also accepted
this parser strips loc tokens and treats the remainder as the
label, so order is irrelevant. Coordinates come back *normalized*
([0, 1]); the overlay denormalizes them against the chosen camera
frame's pixel size.
Returns ``{"kind", "payload", "normalized": True}`` on success
(``payload`` mirrors the JSON shapes so the overlay code is shared),
or ``None`` when the answer carries no ``<loc>`` tokens.
"""
if not answer or "<loc" not in answer:
return None
segments = [seg for seg in answer.split(";") if "<loc" in seg]
points: list[tuple[float, float, str]] = []
boxes: list[tuple[float, float, float, float, str]] = []
for seg in segments:
locs = [int(m) for m in _LOC_RE.findall(seg)]
label = _LOC_RE.sub("", seg).strip()
if len(locs) == 2:
y, x = (_loc_to_norm(v) for v in locs[:2])
points.append((x, y, label))
elif len(locs) >= 4:
y1, x1, y2, x2 = (_loc_to_norm(v) for v in locs[:4])
boxes.append((x1, y1, x2, y2, label))
if boxes:
detections = [
{"label": lbl, "bbox_format": "xyxy", "bbox": [x1, y1, x2, y2]}
for (x1, y1, x2, y2, lbl) in boxes
]
return {"kind": "bbox", "payload": {"detections": detections}, "normalized": True}
if len(points) == 1:
x, y, lbl = points[0]
return {
"kind": "keypoint",
"payload": {"label": lbl, "point_format": "xy", "point": [x, y]},
"normalized": True,
}
if points: # several bare points → treat as detections-as-points
detections = [
{"label": lbl, "bbox_format": "xyxy", "bbox": [x, y, x, y]} for (x, y, lbl) in points
]
return {"kind": "bbox", "payload": {"detections": detections}, "normalized": True}
return None
def parse_vqa_answer(answer: str) -> dict | None:
"""Parse a VQA answer string into ``{"kind", "payload"}``.
``kind`` is one of the ``VQA_ANSWER_SHAPES`` names (``bbox``,
``keypoint``, ``count``, ``attribute``, ``spatial``) or ``"unknown"``
when the JSON doesn't match any known shape. PaliGemma ``<loc>``
spatial answers are detected first (PI052 trains them in that native
format). Returns ``None`` when the answer is neither ``<loc>`` text
nor a parseable JSON object.
"""
if not answer or not answer.strip():
return None
loc_parsed = parse_loc_answer(answer)
if loc_parsed is not None:
return loc_parsed
try:
payload = json.loads(answer)
except (ValueError, TypeError):
return None
if not isinstance(payload, dict):
return None
try:
from lerobot.annotations.steerable_pipeline.validator import ( # noqa: PLC0415
VQA_ANSWER_SHAPES,
)
shapes = VQA_ANSWER_SHAPES
except ImportError: # pragma: no cover - annotation extra not installed
shapes = {
"bbox": {"detections"},
"keypoint": {"label", "point_format", "point"},
"count": {"label", "count"},
"attribute": {"label", "attribute", "value"},
"spatial": {"subject", "relation", "object"},
}
keys = set(payload)
for kind in _SHAPE_ORDER:
required = shapes.get(kind)
if required and required <= keys:
return {"kind": kind, "payload": payload}
return {"kind": "unknown", "payload": payload}
def answer_has_overlay(parsed: dict | None) -> bool:
"""True iff ``parsed`` carries drawable spatial coordinates."""
return bool(parsed) and parsed.get("kind") in ("bbox", "keypoint")
# ---------------------------------------------------------------------------
# Overlay drawing
# ---------------------------------------------------------------------------
def observation_image_to_pil(image_tensor: Any) -> Any:
"""Convert an ``observation.images.*`` tensor to a PIL RGB image.
The runtime observation stores images as ``(1, C, H, W)`` (or
``(C, H, W)``) float tensors in ``[0, 1]``. Reuses
``image_array_to_pil_image`` which handles the CHWHWC transpose and
the floatuint8 scaling.
"""
from lerobot.datasets.image_writer import image_array_to_pil_image # noqa: PLC0415
arr = image_tensor
if hasattr(arr, "detach"):
arr = arr.detach().cpu()
if hasattr(arr, "numpy"):
arr = arr.numpy()
while arr.ndim > 3: # drop leading batch dim(s)
arr = arr[0]
return image_array_to_pil_image(arr).convert("RGB")
def draw_vqa_overlay(image: Any, parsed: dict) -> Any:
"""Draw ``bbox`` / ``keypoint`` answers onto a copy of ``image``.
Non-spatial answers (``count`` / ``attribute`` / ``spatial`` /
``unknown``) are returned as an unmodified copy. When ``parsed`` has
``normalized=True`` (PaliGemma ``<loc>`` answers) the [0, 1]
coordinates are scaled to the image's pixel size.
"""
from PIL import ImageDraw # noqa: PLC0415
img = image.convert("RGB").copy()
kind = parsed.get("kind")
payload = parsed.get("payload") or {}
draw = ImageDraw.Draw(img)
w, h = img.size
sx, sy = (w, h) if parsed.get("normalized") else (1, 1)
if kind == "bbox":
for det in payload.get("detections") or []:
if not isinstance(det, dict):
continue
box = det.get("bbox")
if not (isinstance(box, list | tuple) and len(box) == 4):
continue
try:
x1, y1, x2, y2 = (float(v) for v in box)
except (TypeError, ValueError):
continue
x1, x2 = x1 * sx, x2 * sx
y1, y2 = y1 * sy, y2 * sy
draw.rectangle([x1, y1, x2, y2], outline=_BBOX_COLOR, width=3)
label = str(det.get("label", "")).strip()
if label:
draw.text((x1 + 3, max(0.0, y1 - 12)), label, fill=_BBOX_COLOR)
elif kind == "keypoint":
point = payload.get("point")
if isinstance(point, list | tuple) and len(point) == 2:
try:
x, y = float(point[0]) * sx, float(point[1]) * sy
except (TypeError, ValueError):
return img
r = 6
draw.ellipse([x - r, y - r, x + r, y + r], outline=_POINT_COLOR, width=3)
draw.line([x - 2 * r, y, x + 2 * r, y], fill=_POINT_COLOR, width=2)
draw.line([x, y - 2 * r, x, y + 2 * r], fill=_POINT_COLOR, width=2)
label = str(payload.get("label", "")).strip()
if label:
draw.text((x + r + 3, y - r), label, fill=_POINT_COLOR)
return img
def _open_file(path: Path) -> None:
"""Best-effort open ``path`` in the OS default viewer."""
try:
if sys.platform == "darwin":
subprocess.run(["open", str(path)], check=False)
elif sys.platform.startswith("linux"):
subprocess.run(["xdg-open", str(path)], check=False)
elif os.name == "nt":
os.startfile(str(path)) # type: ignore[attr-defined] # noqa: S606
else: # pragma: no cover - exotic platform
webbrowser.open(path.resolve().as_uri())
except Exception as exc: # noqa: BLE001
logger.debug("could not auto-open %s: %s", path, exc)
def save_and_open_overlay(image: Any, out_dir: str | Path = "./vqa_overlays") -> Path:
"""Save ``image`` as a timestamped PNG under ``out_dir`` and auto-open it."""
out = Path(out_dir)
out.mkdir(parents=True, exist_ok=True)
path = out / f"vqa_{int(time.time() * 1000)}.png"
image.save(path)
_open_file(path)
return path
# ---------------------------------------------------------------------------
# Orchestrator
# ---------------------------------------------------------------------------
def handle_vqa_query(
*,
policy: Any,
observation_provider: Any,
question: str,
state: dict[str, Any],
input_fn: Any = input,
print_fn: Any = print,
) -> None:
"""Run one interactive VQA question end to end.
Called synchronously from the input layer while the runtime is in
``/question`` mode (the action loop is gated off, so the policy is
not in concurrent use). Progress is reported via both
:func:`push_log` (REPL panel scrollback) and ``print_fn`` (direct
stdout) in autonomous question mode the panel redraw is suspended,
so the direct print is what the operator actually sees.
"""
from .steps import _generate_with_policy, _msgs_for_vqa # noqa: PLC0415
def report(line: str) -> None:
"""Surface a line both to the panel scrollback and to stdout."""
push_log(state, line)
try:
print_fn(line)
except Exception: # noqa: BLE001
pass
if policy is None or not hasattr(policy, "select_message"):
report(" [warn] vqa: policy has no select_message — skipping")
return
observation: dict | None = None
if observation_provider is not None:
try:
observation = observation_provider()
except Exception as exc: # noqa: BLE001
logger.debug("observation_provider raised %s", exc)
# Feed the FULL observation (every camera + state) to the VLM. The
# ``ask_vqa_*`` recipes look single-camera, but the image *block* is
# stripped before tokenization — the actual frames reach the model
# via PI052's ``OBS_IMAGES_*`` channels, and ``embed_prefix``
# consumes *all* ``config.image_features`` regardless of which
# camera the sub-recipe was tagged for. So the model always sees
# every camera; the operator never has to name one to ask.
answer = _generate_with_policy(
policy,
_msgs_for_vqa(question),
observation=observation,
state=state,
label="vqa gen",
)
if not answer:
report(" [info] vqa gen returned empty")
return
report(f" vqa: {answer}")
parsed = parse_vqa_answer(answer)
if not answer_has_overlay(parsed):
if parsed is None:
report(" [info] vqa answer is not JSON — no overlay")
return
# The answer carries a bounding box / point. Its pixel coordinates
# are camera-specific and the text answer doesn't say which camera,
# so ask the operator *now* — only when there is actually something
# to draw — which camera frame to render the overlay on.
cameras = available_cameras(observation)
if observation is None or not cameras:
report(" [info] no camera image — cannot draw overlay")
return
chosen = prompt_camera_choice(cameras, input_fn=input_fn, print_fn=print_fn)
if chosen is None:
report(" [info] overlay skipped — no camera selected")
return
try:
pil = observation_image_to_pil(observation[chosen])
overlay = draw_vqa_overlay(pil, parsed)
path = save_and_open_overlay(overlay)
report(f" vqa overlay ({camera_short_name(chosen)}) saved: {path}")
except Exception as exc: # noqa: BLE001
logger.warning("vqa overlay failed: %s", exc, exc_info=logger.isEnabledFor(logging.DEBUG))
report(f" [warn] vqa overlay failed: {type(exc).__name__}: {exc}")

Some files were not shown because too many files have changed in this diff Show More