Commit Graph

1744 Commits

Author SHA1 Message Date
pepijn 673cc6b0fe pi052: opt-in Liger fused kernels (rope + geglu + layer_norm)
Adds ``PI052Config.use_hf_kernels`` (default off). When enabled,
``PI052Policy.__init__`` calls ``apply_liger_kernel_to_paligemma``
before the backbone is built so PaliGemma / Gemma / Siglip layers
pick up Liger's fused Triton forwards.

Measured at BS=16 / L=512 / H100 80GB with KI+GC on (bench job
22161421, see ``examples/benchmark/bench_pi052_kernels.slurm``):

  rope only        →  -2.5% step time
  geglu only       →  -2.2% step time
  layer_norm only  →  -1.1% step time
  all three        →  -4.5% step time, peak_mem unchanged

``cross_entropy`` / ``fused_linear_cross_entropy`` are deliberately
skipped — pi052 calls ``F.cross_entropy`` directly and bypasses
``PaliGemmaForConditionalGeneration.forward``, so neither patch
fires without invasive model-code changes (left for a follow-up).
``rms_norm`` measured as noise on this workload (GC dominates),
so it stays off to keep the patch surface minimal.

Requires ``pip install liger-kernel``; falls back to a warning if
missing so the default path is unaffected.
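
A minimal sketch of the opt-in path with the warning fallback. The helper
name ``maybe_apply_liger_kernels`` is hypothetical; the real wiring lives in
``PI052Policy.__init__`` and may differ. The keyword flags mirror the patch
selection described above (rope, geglu, layer_norm on; rms_norm and both
cross-entropy patches off):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_apply_liger_kernels(use_hf_kernels: bool) -> bool:
    """Patch PaliGemma with Liger kernels if requested and available.

    Returns True only when the patch was actually applied.
    Hypothetical helper; not the function shipped in the commit.
    """
    if not use_hf_kernels:
        return False  # default path: nothing is imported or patched
    try:
        from liger_kernel.transformers import apply_liger_kernel_to_paligemma
    except ImportError:
        logger.warning(
            "use_hf_kernels=True but liger-kernel is not installed; "
            "continuing with the stock kernels."
        )
        return False
    # Must run before the backbone is built so the patched classes are used.
    apply_liger_kernel_to_paligemma(
        rope=True,
        geglu=True,
        layer_norm=True,
        rms_norm=False,
        cross_entropy=False,
        fused_linear_cross_entropy=False,
    )
    return True
```

With the flag off the helper returns immediately, which is what keeps the
default path unaffected.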

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-25 20:50:07 +00:00
Pepijn 2ed6519a93 ema: enable by default (matches openpi JAX behavior)
Flip EMAConfig.enable default from False -> True. Every training run
now maintains an EMA shadow of the policy and uses it for eval + W&B
example dumps. Disable per-run with --ema.enable=false for short or
memory-constrained training.

Rationale:
  * openpi (JAX, official) ships EMA on for every shipped config,
    decay=0.99 by default and 0.999 for pi05_libero. The openpi
    PyTorch port explicitly lists EMA as unsupported, a gap LeRobot
    main inherited. Flipping the default closes that gap for every
    LeRobot policy that ships through lerobot-train.
  * EMA is established best practice for diffusion / flow-matching
    policies (Diffusion Policy §V.D; standard in DDPM/EDM/Stable
    Diffusion training recipes). For autoregressive policies the
    extra cost is real but the safety net (smoother eval, better
    final checkpoint) doesn't hurt.

Trade-offs to be aware of:
  * Memory: 1x model params in fp32 shadow (~13 GB for pi052's
    3.3B params; <500 MB for ACT/Diffusion-Policy class). Memory-
    constrained users on consumer GPUs may need --ema.enable=false.
  * Checkpoint disk: extra .pt file in training_state/, size ~=
    pretrained_model/model.safetensors. Over a 100k-step run with
    save_freq=20000 that's 5x the model size in extra disk.
  * Eval scores will now reflect EMA model instead of live model -
    expected to be 1-3% higher on closed-loop tasks per the
    diffusion-policy literature; might surprise users who memorize
    their last run's numbers.
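
The memory and disk figures above can be checked with a few lines of
arithmetic. This is a back-of-the-envelope sketch: the helper names are
illustrative, 1 GB is taken as 1e9 bytes, and it assumes one EMA file per
save_freq multiple with no extra final save.

```python
BYTES_PER_FP32 = 4


def ema_shadow_gb(num_params: float) -> float:
    """Size of an fp32 shadow copy of the model, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_FP32 / 1e9


def ema_files_written(total_steps: int, save_freq: int) -> int:
    """Number of checkpoints, and hence EMA files, saved over a run."""
    return total_steps // save_freq


shadow_gb = ema_shadow_gb(3.3e9)                # pi052, 3.3B params
n_files = ema_files_written(100_000, 20_000)    # 100k steps, save every 20k
print(f"fp32 shadow: {shadow_gb:.1f} GB, extra EMA files on disk: {n_files}")
```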

Opt out:
  --ema.enable=false           # disable entirely
  --ema.use_for_eval=false     # keep EMA but eval reflects live
  --ema.use_for_wandb_examples=false   # keep EMA but W&B reflects live

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:58:46 +02:00
Pepijn 72ea531017 train: switch EMA from custom ModelEMA to ema-pytorch
Replace the 250-line src/lerobot/utils/ema.py with a direct dependency
on ema-pytorch (lucidrains' canonical PyTorch EMA library). Same
semantics, decay=0.999 default unchanged, but offloads the maintenance
burden to a maintained library used by every diffusion repo.

Why ema-pytorch:
  * Standard PyTorch EMA library; battle-tested across diffusion +
    speech + image-gen codebases.
  * Tiny pure-python dep (no compiled code).
  * Cleaner consumer-side API: ema.ema_model is a full nn.Module
    clone of the policy, so eval / wandb just pass it through instead
    of context-managed swap/restore on the live model.

What changed mechanically:
  * pyproject.toml: add 'ema-pytorch>=0.7.7,<1.0.0' to core deps.
  * deleted src/lerobot/utils/ema.py (the custom ModelEMA).
  * scripts/lerobot_train.py:
      - import EMA from ema_pytorch
      - instantiate with beta=cfg.ema.decay,
        update_after_step=cfg.ema.warmup_steps, update_every=1,
        include_online_model=False (accelerator owns live model
        lifecycle; double-registration would double-count params).
      - ema.update() (no args) — library tracks the online model
        internally.
      - Eval block: pass eval_target_policy = ema.ema_model (when
        cfg.ema.use_for_eval) instead of swap context manager.
      - W&B examples: same pattern.
      - Save: torch.save(ema.state_dict(), .../ema_state.pt) instead
        of custom safetensors writer. .pt format is consistent with
        the rest of training_state which already mixes safetensors +
        json + (now) pt.
      - Resume: ema.load_state_dict(torch.load(.../ema_state.pt)).
      - WandB observability: ema/step (count of ema.update calls),
        ema/initted (bool from library), ema/beta (constant from
        cfg).
  * configs/default.py: EMAConfig.decay stays 0.999 (matches
    openpi's pi05_libero); docstring updated to reflect ema-pytorch
    semantics for warmup_steps (now maps to update_after_step: a hard
    skip, not a smooth decay ramp).

Behavior preserved:
  * Defaults: enable=False, decay=0.999, warmup_steps=0,
    use_for_eval=True, use_for_wandb_examples=True.
  * Same CLI: --ema.enable=true, --ema.decay=X, etc.
  * Same checkpoint layout (training_state/ema_state.pt next to
    optimizer_state.safetensors etc.); resumes silently if present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:51:23 +02:00
Pepijn 56a934ec55 train: EMA of policy parameters (opt-in via --ema.enable=true)
Adds Exponential Moving Average of trainable policy parameters with
warmup, eval-time swap, checkpoint save/resume, and wandb observability.

For diffusion / flow-matching policies (pi052's flow expert exactly
qualifies), averaging late-training parameter oscillations yields a
smoother model that generalises substantially better at inference —
~1–3% absolute success-rate improvement on closed-loop tasks per the
diffusion-policy lit (Chi et al. 2023 §V.D; standard in DDPM/EDM).

New module: src/lerobot/utils/ema.py
  ModelEMA class with:
    * fp32 shadow of every requires_grad parameter
    * decay warmup: min(decay, (1+n)/(10+n)) for first warmup_steps updates
    * update(model) -> effective_decay (for logging)
    * apply_to(model) context manager: temp-swap weights, restore on exit
    * copy_to(model): permanent overwrite
    * save() / load_from_file(): safetensors + JSON sidecar for metadata
    * state_dict() / load_state_dict() for in-process round-tripping
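
A toy sketch of the update rule, the decay warmup, and the temporary swap
described above, operating on a dict of floats rather than tensors. The
class name ``TinyEMA`` is hypothetical and this is not the ModelEMA code; in
particular, whether n counts updates before or after the current one is an
assumption here (n is the number of updates already applied).

```python
from contextlib import contextmanager


class TinyEMA:
    """Illustrative EMA over a dict of float 'parameters'."""

    def __init__(self, params, decay=0.999, warmup_steps=0):
        self.shadow = dict(params)      # shadow starts as a copy of the live values
        self.decay = decay
        self.warmup_steps = warmup_steps
        self.num_updates = 0

    def effective_decay(self):
        n = self.num_updates
        if n < self.warmup_steps:
            # Small decay early on so the shadow tracks the model quickly.
            return min(self.decay, (1 + n) / (10 + n))
        return self.decay

    def update(self, params):
        d = self.effective_decay()
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * value
        self.num_updates += 1
        return d                        # returned for logging

    @contextmanager
    def apply_to(self, params):
        backup = dict(params)
        params.update(self.shadow)      # temporarily swap in EMA values
        try:
            yield params
        finally:
            params.update(backup)       # restore live values on exit


live = {"w": 1.0}
ema = TinyEMA(live, decay=0.999, warmup_steps=100)
live["w"] = 2.0                         # pretend one optimizer step happened
first_decay = ema.update(live)
with ema.apply_to(live):
    value_during_eval = live["w"]
value_after_eval = live["w"]
```

On the first update the warmup term (1+0)/(10+0) = 0.1 wins over 0.999, so
the shadow moves most of the way toward the live value.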

New config: src/lerobot/configs/default.py EMAConfig + wired into
TrainPipelineConfig as 'ema: EMAConfig'.
  Fields:
    enable: bool = False         (off by default, back-compat)
    decay: float = 0.999         (standard; 0.75 for fast Diffusion-Policy)
    warmup_steps: int = 0        (no warmup by default)
    use_for_eval: bool = True    (eval swaps in EMA weights)
    use_for_wandb_examples: bool = True
                                 (W&B training-examples table uses EMA
                                  for predicted-action columns -> matches
                                  what eval / deployment would see)

Training loop integration (src/lerobot/scripts/lerobot_train.py):
  1. After accelerator.prepare + policy.train(), instantiate ModelEMA
     on the main process if cfg.ema.enable. Resume from
     checkpoint_path/training_state/ema_state.safetensors if present.
  2. After each update_policy() call, ema.update(unwrap_model(policy))
     returns the effective decay (logged to wandb during warmup).
  3. The save_checkpoint() block also ema.save(...) the shadow next to
     the existing optimizer/scheduler/rng training state. Resume picks
     it up automatically in (1).
  4. The eval block (cfg.env && is_eval_step) wraps eval_policy_all in
     ema.apply_to() when use_for_eval=True. Live weights restored
     byte-for-byte on context exit.
  5. The W&B training-example dump wraps log_training_examples in
     ema.apply_to() when use_for_wandb_examples=True so the predicted-
     action columns match the eval/deployment behavior.
  6. Two new wandb scalars: ema/effective_decay, ema/num_updates.

Cost:
  Memory: 1x model params in fp32 (~13 GB for pi052's 3.3B params).
          Lives only on main-process GPU. CPU offload available via
          ModelEMA(device='cpu') if needed.
  Compute: one elementwise update per step (~1% of step time).
  Disk: 2x checkpoint files in training_state/ (live optimizer state
        + ema shadow). Negligible relative to model.safetensors.

Usage:
  lerobot-train ... --ema.enable=true
  lerobot-train ... --ema.enable=true --ema.decay=0.9999  # very slow EMA
  lerobot-train ... --ema.enable=true --ema.warmup_steps=1000

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:27:14 +02:00
Pepijn 738e317caa pi052: PaLM-style z-loss on text CE (default weight 1e-4)
Penalise the log-partition function z = log Σ exp(logits) drifting away
from zero on text-CE supervised positions. Without it, large-vocab
models (PaliGemma's 257k vocab) can let logsumexp grow unboundedly
while CE stays low — a uniform additive logit bias cancels in softmax
but pushes the partition function out of bounds, causing numerical
instability and generation drift.

PaLM appendix B / Chinchilla report z-loss is essential for stable
large-vocab CE. It is especially valuable for pi052 because the recent
default lm_head_lr_scale=5.0 amplifies head-drift risk: the 5x boost
keeps the head pinned to fine-tuning targets, and z-loss caps the
partition function so the head can't just bias all logits high uniformly.

Implementation:
  * _shifted_ce(logits, labels, z_loss_weight=0.0) gains the new arg
    with default 0.0 (back-compat for any other caller).
  * Both call sites in PI052Policy.forward read
    self.config.text_ce_z_loss_weight and pass it through.
  * PI052Config.text_ce_z_loss_weight defaults to 1e-4 (commonly cited
    PaLM value); set to 0 to disable.

Cheap to compute: one extra logsumexp shares the softmax kernel that
F.cross_entropy already runs. No memory overhead beyond a (B*T,) tensor.
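
A pure-Python sketch of the idea for a single position, assuming the penalty
takes the PaLM form weight * z**2 with z = logsumexp(logits). The function
names are illustrative and this is not the ``_shifted_ce`` implementation. It
shows that a uniform additive logit bias leaves CE unchanged while the
z-loss term grows:

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def ce_with_z_loss(logits, target, z_loss_weight=0.0):
    """Return (cross_entropy, z_penalty) for one position."""
    z = logsumexp(logits)
    ce = z - logits[target]             # -log softmax(logits)[target]
    z_penalty = z_loss_weight * z * z   # pulls the log-partition toward 0
    return ce, z_penalty


logits = [2.0, 0.5, -1.0]
biased = [x + 50.0 for x in logits]     # same softmax, shifted partition

ce_a, zp_a = ce_with_z_loss(logits, target=0, z_loss_weight=1e-4)
ce_b, zp_b = ce_with_z_loss(biased, target=0, z_loss_weight=1e-4)
print(f"CE: {ce_a:.6f} vs {ce_b:.6f}; z-penalty: {zp_a:.6f} vs {zp_b:.6f}")
```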

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 21:08:56 +02:00
Pepijn 8ba3b187a1 pi052: bump lm_head_lr_scale default to 5.0 (keep base LR at 2.5e-5)
The base optimizer LR (2.5e-5, cosine to 2.5e-6, 1k warmup, AdamW
(0.9, 0.95), wd 0.01, grad_clip 1.0) is the openpi/π0.5 setting used
for the RoboCasa leaderboard baselines and is well-validated for 3B-
class VLAs with a paligemma backbone. Leave it alone.

The one place pi052 needs to diverge from pi05 is the LM-head LR
multiplier:

  * pi05 has no text supervision -> head doesn't get gradients ->
    lm_head_lr_scale is moot, stays at 1.0.
  * pi052 always has text supervision via the recipe (subtask /
    memory / VQA). Under KI, the LM head only sees gradients on
    ~30-45% of the batch (the text-CE mask share). Under aggressive
    cosine decay the head drifts back toward PaliGemma's pretrained
    <loc> first-token bias, despite teacher-forced CE staying near 0.

5x is the documented fix (see PI05Config.lm_head_lr_scale docstring
and PI05Policy.get_optim_params, which is already wired to split the
LM head + tied embed_tokens into their own param group while sharing
the same cosine lambda). Flipping the default here lifts the fix from
opt-in to on-by-default for every pi052 run, with zero downside on
text-free recipes (head still gets no gradients to scale).
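
A sketch of the param-group split, assuming the head and tied embedding can
be recognised by the substrings "lm_head" and "embed_tokens" in the parameter
name. The function name and the matching rule are illustrative; the real
logic is in PI05Policy.get_optim_params. The returned dicts use the
"params" / "lr" keys that PyTorch optimizers accept for per-group options.

```python
def split_param_groups(named_params, base_lr, lm_head_lr_scale=5.0,
                       head_keys=("lm_head", "embed_tokens")):
    """Put LM-head (and tied embedding) params in their own LR group."""
    head, rest = [], []
    for name, param in named_params:
        if any(key in name for key in head_keys):
            head.append(param)
        else:
            rest.append(param)
    groups = [{"params": rest, "lr": base_lr}]
    if head:
        # Same schedule shape, scaled peak LR for the head group.
        groups.append({"params": head, "lr": base_lr * lm_head_lr_scale})
    return groups


# Placeholder strings stand in for tensors in this toy example.
named = [
    ("model.layers.0.mlp.weight", "p0"),
    ("model.embed_tokens.weight", "p1"),
    ("lm_head.weight", "p2"),
]
groups = split_param_groups(named, base_lr=2.5e-5, lm_head_lr_scale=5.0)
```

Because both groups share one scheduler lambda, the head LR stays at 5x the
base LR throughout the cosine decay.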

Other LR knobs reviewed and intentionally NOT changed:
  - optimizer_lr=2.5e-5: openpi-validated, matches leaderboard.
  - scheduler_warmup_steps=1000: standard for VLA finetuning.
  - scheduler_decay_steps=30000: auto-scales for short runs.
  - optimizer_betas=(0.9, 0.95): GPT/LLM convention, works for
    flow-matching + LM-CE.
  - optimizer_weight_decay=0.01, grad_clip=1.0: standard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:57:43 +02:00
Pepijn 057c794ffe wandb: flip training-example logging defaults to on (every 5000 steps)
The training-example wandb.Table dump (camera images + text fields +
GT/predicted action chunk endpoints) was opt-in. Flip defaults so any
run with --wandb.enable=true gets visual training observability for free.

  log_examples_freq:           0     -> 5000   (push table every 5k steps)
  log_examples_n:              4     -> 4      (unchanged)
  log_examples_predict_actions: False -> True   (extra forward in eval mode)

Runs without --wandb.enable=true are unaffected (the training loop gate
checks wandb_logger is not None first). Set log_examples_freq=0 to opt
out of the dump even with wandb enabled; set
log_examples_predict_actions=false to skip the extra inference forward pass.
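
The gating order described above can be sketched as follows. The function
name is hypothetical, and skipping step 0 is an assumption of this sketch,
not something the commit states.

```python
def should_log_examples(step, log_examples_freq, wandb_logger):
    """Decide whether this step pushes a training-example table."""
    if wandb_logger is None:        # wandb disabled: never log
        return False
    if log_examples_freq <= 0:      # freq=0 opts out even with wandb on
        return False
    return step > 0 and step % log_examples_freq == 0


fake_logger = object()              # stands in for a WandBLogger instance
logged_steps = [s for s in range(0, 15001, 2500)
                if should_log_examples(s, 5000, fake_logger)]
```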

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 18:00:04 +02:00
Pepijn b1e83f556c train: periodic wandb log of training examples (images + text + actions)
Adds an opt-in cadence for pushing rich training examples to W&B,
independent of the scalar log_freq. Off by default; turn on with
--wandb.log_examples_freq=5000 (one wandb.Table dump every 5k steps).

WandBConfig (configs/default.py):
  + log_examples_freq: int = 0       # 0 disables
  + log_examples_n: int = 4          # batch elements per dump
  + log_examples_predict_actions: bool = False
                                     # opt-in extra forward pass to
                                     # show predicted vs GT action chunk

WandBLogger.log_training_examples (common/wandb_utils.py):
  Builds one wandb.Table row per sampled batch element with:
    * one wandb.Image column per camera (auto handles CHW/HWC,
      uint8/float32 [0,1])
    * any text fields present in the batch (task / subtask /
      memory / instruction)
    * gt_action_first / gt_action_last (chunk endpoints)
    * pred_action_first / pred_action_last when
      --wandb.log_examples_predict_actions=true (policy.eval() + no_grad;
      restores train mode after)
  Defensive: per-camera failures don't poison the row;
  predict_action_chunk exceptions are logged and the predicted columns
  are dropped.

Training loop (scripts/lerobot_train.py):
  One new gated block right after the existing scalar log_step clause.
  Reads batch + dataset.meta.camera_keys, hands them to
  log_training_examples. Wrapped in try/except so a bad sample never
  kills the run.

Usage:
  lerobot-train ... \
    --wandb.enable=true --wandb.project=robocasa_composite_seen \
    --wandb.log_examples_freq=5000 \
    --wandb.log_examples_n=4 \
    --wandb.log_examples_predict_actions=true

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 16:57:15 +02:00
Pepijn da3e87ee86 Merge branch 'feat/smolvla-on-steerable' of https://github.com/huggingface/lerobot into feat/smolvla-on-steerable 2026-05-25 16:56:50 +02:00
Pepijn 1e9a6d044d Merge remote-tracking branch 'origin/feat/language-annotation-pipeline' into feat/smolvla-on-steerable
# Conflicts:
#	src/lerobot/datasets/__init__.py
#	src/lerobot/policies/__init__.py
#	src/lerobot/policies/factory.py
#	src/lerobot/processor/render_messages_processor.py
#	uv.lock
2026-05-25 16:56:22 +02:00
pepijn 3fdfcb912a examples(port_datasets): generalize RoboCasa builder + add smoke script
- Add ATOMIC_TASKS, COMPOSITE_UNSEEN_TASKS and four new --task-set keys
  (atomic, composite_unseen, composite_all, composite_atomic) so the same
  builder produces the 50-task target benchmark or the 300-task Human300
  pretraining slice (via --split=pretrain --task-set=all) without
  duplicating logic.
- Stop hardcoding the composite_seen tag on the HF push; tags are now
  derived from --split / --source / --task-set so atomic, composite_all,
  and pretrain runs land with accurate metadata.
- Refresh module docstring to match the broader scope.
- Add scripts/build_robocasa_smoke.sh: 2-atomic-task smoke dataset
  (~1k episodes, ~131k frames) for fast end-to-end training validation
  before kicking off Human300-scale runs.
2026-05-25 14:54:00 +00:00
Pepijn c37b1fc7d0 Merge origin/feat/language-annotation-pipeline (8 fix(annotate) commits + vocabulary phase) 2026-05-25 15:47:25 +02:00
Pepijn 9020635b14 Merge branch 'main' into feat/language-annotation-pipeline
Resolves conflicts from 32 commits on main:

* docs/source/_toctree.yml — keep both new toc entries
  (annotation_pipeline + video_encoding_parameters).
* docs/source/language_and_recipes.mdx — adopt main's section
  ordering (Layer 2 before "Temporal semantics") and float32
  timestamp dtype to match the codebase.
* src/lerobot/configs/__init__.py — keep both export sets
  (recipe + video encoder).
* src/lerobot/datasets/dataset_metadata.py — drop redundant lazy
  imports (top-level imports cover both LANGUAGE_COLUMNS and
  DEFAULT_TOOLS); adopt main's @tools.setter for info.json
  write-back.
* src/lerobot/datasets/feature_utils.py — call the real
  validate_feature_language() instead of returning "".
* src/lerobot/datasets/language.py — float32 timestamps to match
  pa.float32() used in video_utils.py and the rest of the codebase.
* src/lerobot/datasets/language_render.py — adopt main's
  unwrap_scalar() helper (drops two hand-rolled .item()/list
  unwrappers); float32 in docstring.
* src/lerobot/processor/render_messages_processor.py — drop
  PR-local _scalar() helper, use shared unwrap_scalar().
* tests/datasets/test_language.py — adopt main's new float32 dtype
  + validate_feature_language warning tests.
* tests/datasets/test_dataset_metadata.py — adopt main's new
  tools.setter persist/clear tests.
* uv.lock — regenerated cleanly from main's resolver.

90 of 92 touched tests pass. Two pre-existing test failures
(test_module1_plan_memory_subtask_smoke,
test_module2_mid_episode_emits_paired_interjection_and_speech in
tests/annotations/test_modules.py) are unrelated to this merge —
that test file doesn't exist on main, so the failures originate on
the branch and are addressed by the 8 newer fix(annotate) commits
already on origin that will land in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:46:32 +02:00
Pepijn 83d0c390da pi052: drop debug scaffolding left over from training/inference bug hunts
Three diagnostic surfaces shipped in PR3 that don't belong in a clean
release:

* ``LEROBOT_DUMP_RECIPE_SAMPLES`` env-var dump (~70 LOC in
  text_processor_pi052.py): pretty-prints the next N rendered samples
  with ``[TGT]...[/TGT]`` markers over supervised spans. One-off
  training-inspection tool — no production user, never wired into a
  CLI flag, only useful while iterating on the recipe. Drop the module
  constants, the ``_is_dump_rank`` / ``_dump_recipe_sample`` helpers,
  the call site, and the now-unused ``import os``.

* ``_log_obs_tensors_once()`` in lerobot_pi052_runtime.py: the
  docstring literally says "Used to bisect train/inference mismatches"
  — a debugging artifact from when the LM head was collapsing on the
  live robot. Logged unconditionally at WARNING level from both the
  dataset-driven and robot-driven providers, with no ``--verbose``
  gate. Drop the function, both call sites, and the ``_logged`` /
  ``_obs_logged`` flag dicts that fed them. (``_resize_logged`` is
  kept — it gates the operationally useful camera-size sanity log.)

* Defensive ``unsqueeze(0)`` block in the dataset observation
  provider: papered over an upstream bug where some preprocessor step
  could produce an unbatched tensor. ``AddBatchDimensionProcessorStep``
  is reliable in the current pipeline — pi052 tests still pass with
  the block removed. If the bug ever resurfaces it should be fixed
  at the source, not silently re-batched here.

Net: -169 LOC. All 30 ``tests/policies/pi052/`` tests pass.

The ``<loc>`` token plumbing (``register_paligemma_loc_tokens``,
``_loc_token``, ``suppress_loc_tokens`` runtime gate) is left as-is —
it's the actual mechanism for VQA spatial answers, not scaffolding,
and the ``suppress_loc_tokens=True`` callers on subtask/memory/
interjection paths and ``=False`` on the VQA path are intentional
asymmetric behaviour, not a bug-routing knob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:07:43 +02:00
Pepijn 1ff10b935c Merge branch 'feat/language-annotation-pipeline' into feat/smolvla-on-steerable
Resolves conflicts from 66 commits on the base branch:

* pyproject.toml — keep base's transformers>=5.4.0,<5.6.0; add the
  sentencepiece-dep entry pi052 (FAST action tokenizer) needs.
* policies/__init__.py — keep pi052 export; drop the
  RewardClassifierConfig export that base removed.
* policies/factory.py — docstring list resolution (keep pi052; drop
  reward_classifier, removed by base).
* annotations/steerable_pipeline/executor.py — adopt base's renamed
  _ensure_annotation_metadata_in_info (it already advertises the say
  tool); drop pi052's older _ensure_tools_in_info call.
* configs/train.py — keep pi052's vqa_target_fraction; adopt base's
  SampleWeightingConfig (legacy RA-BC inline params already covered
  by the migration shim base added).
* scripts/lerobot_train.py — merge pi052's per-policy processor
  rebuild + dataset_repo_id pass-through with base's active_cfg /
  is_reward_model_training tightening, and re-route vqa-weighted
  sampler to active_cfg.drop_n_last_frames.
* datasets/language_render.py — adopt base's _select_one + timestamp
  tolerance (drops pi052's stale _select_latest / per-style sort_key).
* tests — adopt base's parametrized per-camera blend + tolerance
  test; drop pi052 tests that overlap with base's tighter rewrites;
  keep pi052's flow-only / VQA-blend coverage; add a
  test_canonical_recipe_loads check on subtask_mem_vqa_speech.yaml.
* policies/pi052/processor_pi052.py — import RenderMessagesStep
  directly from render_messages_processor (base intentionally
  dropped it from lerobot.processor's re-exports).
* uv.lock — regenerated cleanly from base + pi052's pocket-tts /
  beartype.

All 67 touched tests pass (30 pi052 + 37 recipe / language-render /
pipeline / render-messages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:47:09 +02:00
Pepijn 67bdf4690e examples(port_datasets): rewrite RoboCasa composite_seen builder
Replace the earlier wrapper (which depended on robocasa.scripts.download
+ dataset_registry) with a self-contained pipeline that:

* downloads each task tarball directly from Box via box_links_ds.json
* converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30
* standardizes camera keys under observation.images.robot0_* and
  flattens observation.state by concatenating base/EE/gripper subkeys
  when the source dataset stores them separately
* builds per-rank unified shards then aggregates into one dataset

Filter: composite_seen task-set restricts discovery to the 16 multi-step
target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use
--task-set=all to keep every discovered task in the split/source slice;
--tasks=... overrides for arbitrary subsets.

Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task.

Adapted from a battle-tested port_robocasa.py reference shared by the
user; the only semantic addition is the task-set filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:27:42 +02:00
Pepijn 8085feab6e pi052(runtime): factor out shared observation-prep boilerplate
Both observation providers in lerobot_pi052_runtime.py ended a sample
dict the same way — strip the runtime-owned language columns and hand
the policy a device-resident ``observation.*``-only subset. Extract
two tiny helpers (``_strip_runtime_owned_language_cols`` and
``_select_observation_to_device``) so the dataset and robot paths
read as a clear linear pipeline. Path-specific concerns (defensive
unsqueeze on the dataset path; camera resize + state-vector sanity
logging on the robot path) stay inline at the call sites.

Behaviour unchanged; all 30 ``tests/policies/pi052/`` tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:25:08 +02:00
Pepijn a088c10c80 examples(port_datasets): SLURM+datatrove RoboCasa composite_seen build
Parallel variant of build_robocasa_composite_seen.py modeled after the
existing slurm_port_shards.py / slurm_aggregate_shards.py pattern.

Two-phase datatrove pipeline:
  * Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task),
    each worker downloads its assigned tar via RoboCasa's own
    download_datasets helper. Network-bound, idempotent.
  * Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets
    over the 16 extracted directories. Submitted with depends=phase1 so
    SLURM only releases it once all 16 downloads succeed.

Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve
helpers from the single-machine script via aliased imports — single
source of truth for 'what does it mean to download a composite_seen
task'.

Local (--slurm 0) mode runs the two phases sequentially in-process for
debugging on a workstation.

Usage on SLURM:
    uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \
        --output-dir=/scratch/${USER}/robocasa_composite_seen \
        --hub-repo-id=${HF_USER}/robocasa_composite_seen \
        --logs-dir=/scratch/${USER}/logs/robocasa \
        --partition=cpu --push-to-hub

Prereq: uv sync --extra annotations  (pulls datatrove)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:10:05 +02:00
Pepijn 9c3d5ab7ce scripts: build_robocasa_composite_seen — aggregate 16 target tasks
RoboCasa 1.0 ships its target/human demos in LeRobot format (parquet +
mp4) as lerobot.tar archives distributed via Box. This script wraps
RoboCasa's own download_datasets helper to pull each of the 16
composite_seen tasks, opens each extracted directory as a
LeRobotDataset, and merges them into a single combined dataset via
merge_datasets (a thin wrapper over aggregate_datasets that revalidates
fps/robot_type/features, unifies task indices, concatenates videos and
parquet, and recomputes stats).

The 16-task slice corresponds exactly to the 'Composite-Seen' column of
the published RoboCasa365 leaderboard, so the resulting dataset is the
right substrate for an apples-to-apples pi05 vs pi052 comparison on
multi-step kitchen manipulation.

Usage:
    uv run python -m lerobot.scripts.build_robocasa_composite_seen \
        --output-dir=/data/lerobot/robocasa_composite_seen \
        --hub-repo-id=${HF_USER}/robocasa_composite_seen \
        --push-to-hub

Idempotent: re-running skips already-downloaded tasks. Defensive
fallbacks handle RoboCasa API drift in get_ds_path / download_datasets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 14:01:28 +02:00
Pepijn e84f97a8c1 smolvla2(runtime): interactive task picker + drop action diagnostic
Task picker:
The dataset bootstrap used to silently overwrite args.task with the
canonical training task. Replace that with an interactive picker
(_select_task_interactively) that shows every unique task in
ds_meta.tasks as a numbered menu (canonical task first as default) plus
a 'type a custom task' option. --task on the CLI still skips the
picker, and non-TTY runs fall back to the bootstrap task so scripted
invocations are unchanged.

Action diagnostic removal:
Drop the [act] log block in LowLevelForward.run (|a|_mean / spread /
normalized + unnormalized first/last + state) that was added while
debugging the 'barely moving' issue. Robot motion is now healthy, the
output is noise in steady-state, and it depended on stashing the
postprocessor on runtime.state — also removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:59:08 +02:00
Pepijn 6d2b8c80ab smolvla2(runtime): wire MemoryUpdateFwd into the inference pipeline
MemoryUpdateFwd was importable but never installed, so subtask_change
events fired by HighLevelSubtaskFwd had no listener and current_memory
stayed at its initial None value — the runtime panel always showed
'memory (not set)' even when the policy was trained with the
memory_update recipe (e.g. subtask_mem_vqa_speech.yaml, weight 0.15).

Insert MemoryUpdateFwd between HighLevelSubtaskFwd and AskVQAFwd so
the event is visible the same tick it is emitted, and refresh the
stale comment that claimed memory was not in scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:52:44 +02:00
Pepijn 793c7c4ddd feat(runtime): --subtask_chunks_per_gen throttles HL gen vs action chunks
Adds a per-chunk-boundary counter to HighLevelSubtaskFwd: subtask gen
fires only once every N chunk boundaries (default 1 = current
behavior). Lets the operator run e.g. 5 flow-matching action chunks
per LM-head subtask gen so the subtask doesn't churn every 1.7s while
the previous one is still being executed — saves compute and avoids
re-planning the action trajectory mid-grasp.

  --subtask_chunks_per_gen=5    # 5 chunks per subtask refresh

The counter starts at 0 so the very first chunk boundary fires
immediately (no startup delay). Trigger is rearmed when skipping so
a low high_level_hz doesn't lose slots.
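
A minimal sketch of the counter, assuming a hypothetical ``ChunkThrottle``
class; the real logic sits inside HighLevelSubtaskFwd, and the trigger
rearming mentioned above is not modelled here.

```python
class ChunkThrottle:
    """Fire once every `chunks_per_gen` chunk boundaries, starting at the first."""

    def __init__(self, chunks_per_gen=1):
        if chunks_per_gen < 1:
            raise ValueError("chunks_per_gen must be >= 1")
        self.chunks_per_gen = chunks_per_gen
        self._count = 0             # starts at 0 so the first boundary fires

    def on_chunk_boundary(self):
        fire = self._count % self.chunks_per_gen == 0
        self._count += 1
        return fire


throttle = ChunkThrottle(chunks_per_gen=5)
fired = [throttle.on_chunk_boundary() for _ in range(11)]
fired_at = [i for i, f in enumerate(fired) if f]
```

With chunks_per_gen=1 every boundary fires, which matches the default
behavior described above.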

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:34:59 +02:00
Pepijn db927ab40b feat(runtime): action chunk diagnostic — log normalized + unnormalized values
Adds a per-chunk log line in LowLevelForward that surfaces what the
action expert actually emits and what the robot receives after the
postprocessor unnormalizes it, so "barely moving" can be diagnosed
at a glance:

  [act] T=50 |a|_mean=0.234 spread=0.512
  [act] norm  first=[0.12, -0.31, ...]  last=[0.45, -0.22, ...]
  [act] joint first=[3.2, -47.8, ...]  last=[12.4, -41.0, ...]  state=[0.5, -55.3, ...]

|a|_mean ~ 0.3–0.6 with spread ~ 0.3+ and visible delta from first to
last → healthy trajectory. |a|_mean near 0 across the chunk → model
defaulting to median pose. joint values that don't differ much from
state → safety cap or model output near current state.

Postprocessor is stashed on runtime.state["_postprocessor"] at startup
so the diagnostic can replay the same unnormalize the dispatcher uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 12:10:52 +02:00
pepijn 471b2b1b1d fix(annotate): bump same-frame subtasks onto distinct frames
If two consecutive VLM-emitted subtask spans have ``start`` timestamps
that round to the same source frame after ``snap_to_frame`` (e.g. on
short episodes the VLM sometimes nominates two ~adjacent action
boundaries within one 30 Hz step), the writer emits two
``style=subtask`` rows at the identical persistent timestamp. The
training-time renderer's default binding
``subtask: active_at(t, style=subtask)`` then raises:

    ValueError: Ambiguous resolver for style='subtask';
                add role=..., tool_name=..., or camera=... to disambiguate.

… and the whole training run dies on the first batch.

Observed concretely on ``pepijn223/super_poulain_vocab2`` (job
22159979): episodes 3 and 30 each had two subtask rows at the same
timestamp (``release yellow cube`` + ``retract arm`` snapping to the
same frame).

Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list
and, whenever a snapped start collides with one already used, push the
later span onto the next free frame timestamp. Both subtasks survive
on distinct timestamps; the renderer can now disambiguate. If the
episode genuinely has no later free frame (extremely unlikely — would
require a same-timestamp collision on the very last frame of the
episode), the later span is dropped with a warning rather than left
to poison the render.
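
The bump-or-drop rule can be sketched on frame indices. The function below
is illustrative only (the name, signature, and return shape are assumptions,
not the ``_dedupe_starts_to_distinct_frames`` code) and takes starts that
have already been snapped to integer frames.

```python
import logging

logger = logging.getLogger(__name__)


def dedupe_frames(snapped_frames, num_frames):
    """Bump colliding start frames onto the next free frame.

    snapped_frames: start frame index per span, in span order.
    Returns (kept, dropped): kept is a list of (span_index, frame);
    dropped lists span indices that found no free frame in the episode.
    """
    used = set()
    kept, dropped = [], []
    for span_index, frame in enumerate(snapped_frames):
        while frame in used:            # collision: try the next frame
            frame += 1
        if frame >= num_frames:         # ran off the end of the episode
            logger.warning("dropping span %d: no free frame left", span_index)
            dropped.append(span_index)
            continue
        used.add(frame)
        kept.append((span_index, frame))
    return kept, dropped


bumped = dedupe_frames([0, 10, 10, 25], num_frames=30)
at_episode_end = dedupe_frames([29, 29], num_frames=30)
```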

New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames``
locks in the contract; full vocabulary suite is 14/14 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 19:31:44 +00:00
pepijn a15e16c072 fix(annotate): replace fuzzy subtask snapping with strict match + one-shot retry
The Jaccard-overlap snap was warping VLM output into wrong canonical
labels — e.g. an off-vocab "consult the wizard" span would silently
become "grasp blue cube" if that scored highest. Even with a higher
floor the operator can't tell which subtasks were paraphrases vs
genuine mislabels in the resulting dataset.

Replace with strict exact-match validation + a single targeted retry:

  1. Generate subtasks as before.
  2. If any returned subtask's normalised form (lowercased, articles
     stripped, whitespace collapsed) isn't in the canonical vocab,
     fire one retry call naming the offending strings and re-sending
     the full canonical list. The retry prompt requires byte-identical
     output from the vocab.
  3. After the retry, validate again. Spans still off-vocab are
     dropped — no fuzzy snapping ever produces a different canonical
     label than the VLM actually emitted.
  4. If every span ends up off-vocab even after the retry, warn loudly
     so the operator extends ``meta/canonical_vocabulary.json`` to
     cover the missing phase. The episode is left with empty subtasks
     rather than silently fabricated ones — visibility > sweep-under-
     the-rug.
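
The four steps above can be sketched as follows. This is an illustration
under stated assumptions, not the module's code: ``ARTICLES``,
``normalize``, ``validate_with_retry`` and the ``retry_fn`` callback
(assumed to return a full replacement list of subtask strings) are all
hypothetical names.

```python
ARTICLES = {"a", "an", "the"}  # assumed stand-in for _NORMALIZE_STRIP_TOKENS


def normalize(text):
    """Lowercase, strip articles, collapse whitespace."""
    return " ".join(t for t in text.lower().split() if t not in ARTICLES)


def validate_with_retry(subtasks, vocab, retry_fn):
    """Strict vocabulary match with a single targeted retry.

    retry_fn(offending, vocab) is a hypothetical callback that re-queries
    the VLM and returns a new list of subtask strings.
    """
    canonical = {normalize(v): v for v in vocab}
    offending = [s for s in subtasks if normalize(s) not in canonical]
    if offending:
        # One retry only, naming the offending strings and the full vocab.
        subtasks = retry_fn(offending, list(vocab))
    # Anything still off-vocab is dropped; nothing is snapped to a neighbour.
    return [canonical[normalize(s)] for s in subtasks if normalize(s) in canonical]
```

An empty return value corresponds to step 4: the caller should warn and
leave the episode without subtasks.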

Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the
normalisation helper out so the retry-validation and the final
canonicalisation share one source of truth.

Tests:
  - test_plan_module_accepts_article_only_difference: "grasp the blue
    cube" still maps to canonical "grasp blue cube" (article-tolerant).
  - test_plan_module_retries_when_subtask_off_vocab: paraphrase
    triggers the retry which the VLM corrects in pass 2.
  - test_plan_module_drops_off_vocab_subtask_after_retry: VLM that
    refuses to correct → bad span dropped, in-vocab span kept.
  - test_plan_module_empty_when_all_off_vocab_after_retry: every
    span off-vocab → episode left empty (no warping).
All 13 vocabulary tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 09:57:27 +00:00
pepijn 336af85c09 fix(annotate): never leave an episode with zero canonical subtasks
When the canonical vocabulary is enabled and the VLM produces spans
that don't overlap any canonical label, the previous Jaccard-floor
(0.5) dropped them and the episode came out with no subtasks at all
— invisible to the downstream policy. Observed on
``pepijn223/super_poulain_vocab``: some episodes had empty subtask
columns because every VLM-emitted phrase scored below 0.5 against
the discovered vocabulary.

Two-pass canonicalisation:

  - First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to
    let mild paraphrases through) and drops everything below.
  - If that first pass leaves the episode with **zero** subtasks,
    fall back to a second pass that always snaps each VLM span to
    its nearest canonical label by Jaccard (no floor). The episode
    ends up with subtasks even when the vocabulary missed a phase
    — a slightly-wrong canonical label is still closer to the right
    motion than nothing at all.
  - Log loudly when the fallback fires so the operator can spot
    coverage gaps in ``meta/canonical_vocabulary.json``.
  - Log a per-episode count at INFO when some (but not all) spans
    were dropped so it's visible without spamming the run output.
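
A rough sketch of the two-pass logic, assuming a plain token-set Jaccard
score. The helper names and signatures here are illustrative; only the
0.25 floor and the ``force`` idea come from this commit message.

```python
def jaccard(a, b):
    """Token-set overlap between two phrases."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def canonicalize(phrase, vocab, floor=0.25, force=False):
    """Return the nearest canonical label, or None if it scores below the floor."""
    best = max(vocab, key=lambda v: jaccard(phrase, v))
    if force or jaccard(phrase, best) >= floor:
        return best
    return None


def canonicalize_episode(phrases, vocab, floor=0.25):
    # First pass: keep only spans that clear the floor.
    kept = [c for p in phrases if (c := canonicalize(p, vocab, floor)) is not None]
    if not kept and phrases:
        # Second pass (fallback): snap every span to its nearest label, no floor.
        kept = [canonicalize(p, vocab, force=True) for p in phrases]
    return kept
```

The fallback only runs when the first pass leaves nothing, so an episode
with at least one in-vocab span still drops its off-vocab spans.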

Promote the Jaccard floor + ignore-tokens to class constants so
they're a single edit point. Add ``force=True`` parameter to
``_canonicalize_subtask`` for the no-floor fallback path.

New test ``test_plan_module_snaps_when_all_off_vocab`` covers the
fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is
adjusted to keep at least one in-vocab span so the floor path can
still fire and is exercised. All 12 vocabulary tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 12:44:03 +00:00
pepijn 54221ceea2 feat(annotate): let the VLM decide vocabulary size
Hardcoding ``n_subtask_target=10`` and ``n_memory_target=6`` baked task
complexity into the config — a simple pick-and-place needs ~6, a
multi-step recipe needs ~20. The VLM already sees the clips, so let it
pick the count itself from what's recurring across episodes.

Drop both knobs from ``VocabularyConfig`` and the ``module_0_vocabulary``
prompt template. The prompt now says "decide the count yourself based
on what you see — the smallest set that still covers every recurring
phase" and adds an "each label must recur across the demos" rule so
the VLM filters out one-off motions.

Update the launcher script + docs to remove the old knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:46:31 +00:00
pepijn 369ab17110 fix(annotate): update run_hf_job CLI args for renamed namespaces + phase 0
Three stale things in the launcher script:

  - ``--module_1/2/3.*`` no longer exist; review commit fd18beb renamed
    the CLI namespaces to ``--plan/interjections/vqa``. Forwarded all
    eight existing args to their new names.
  - ``--push_to_hub`` is now a bool; the destination repo lives at
    ``--dest_repo_id``. Split the single positional into both args.
  - ``openai`` was missing from the pip install list, which the prior
    review (claude bot, 2026-05-08) flagged; the default vlm
    backend is ``openai`` so the job would have ImportError'd. Added.

Also expose the new phase 0 (canonical vocabulary discovery) knobs
explicitly: ``--vocabulary.sample_episodes``, ``--n_subtask_target``,
``--n_memory_target``. Defaults are sane (3 / 10 / 6) but worth
flagging in the example so the operator knows what they're running.

Update the docstring + section comments to match the current phase
layout (vocabulary → plan → interjections → vqa → writer).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:43:06 +00:00
pepijn 86a7edc590 feat(annotate): phase 0 — derive canonical vocabulary from sample episodes
The pipeline previously emitted near-unique subtask + memory phrasings
per episode (free-form LLM rephrasing). On the downstream low-level
policy that collapses the action expert's conditioning to noise: every
episode pairs a different paraphrase with similar motions, so the
expert learns a flat scene-prior that ignores the subtask string —
then at inference the high-level head invents *yet another* paraphrase
and the expert produces tiny "uncertain hover" chunks.

Add a vocabulary-discovery phase (phase 0) that runs once per dataset:

  - watches the first ``vocabulary.sample_episodes`` (default 3)
    episode videos as one Qwen-VL prompt,
  - asks the VLM to derive ~``n_subtask_target`` canonical imperative
    subtask labels and ~``n_memory_target`` first-person past-tense
    memory milestones that recur across the demos,
  - persists them to ``meta/canonical_vocabulary.json`` (human-
    inspectable, hand-editable), and
  - wires the resulting ``Vocabulary`` into the ``plan`` module so
    every per-episode subtask + memory call is constrained to those
    exact strings (both as prompt-side instructions *and* post-VLM
    validation: paraphrases snap to the closest canonical entry via
    token-set overlap; below a 0.5 Jaccard floor the subtask is
    dropped rather than warped into something semantically wrong).

Operator workflow:

  - first run discovers the vocabulary, writes the JSON, and runs
    the ``plan`` module against it,
  - subsequent runs reuse the on-disk file (``reuse_existing=True``
    default) so hand-edits stick,
  - set ``--vocabulary.enabled=False`` to fall back to free-form
    generation (the original behaviour).

The discovery prompt forbids gerunds / third-person / adverbs and
caps the lists to the requested counts, matching the Hi-Robot /
π0.6-MEM convention of small per-environment vocabularies. The
``plan`` module's subtask + memory prompts grow a conditional
``{vocabulary_block}`` slot rendered only when a vocabulary is
present; without one the templates collapse to their previous
free-form form.

Tests: 11 new unit tests under tests/annotations/test_vocabulary.py
cover the on-disk round-trip, discovery against the fixture dataset,
``reuse_existing`` short-circuit, paraphrase canonicalisation, off-
vocab subtask dropping, and the no-vocabulary pass-through path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:40:05 +00:00
pepijn 77a16db529 fix(smolvla2): make HighLevelSubtaskFwd actually fire at low hz + quiet startup log
Two runtime fixes that surfaced from on-robot testing.

(1) HighLevelSubtaskFwd was double-gated: HzTrigger fires every period
(e.g. every 5s at --high_level_hz=0.2) AND the step requires the
action queue to be empty. The queue-empty window is brief (~tens of
ms between drain and refill) and almost never coincides with the
low-hz timer, so HL effectively never fired and the subtask shown
in the runtime panel stayed on the dataset's frame-0 annotation.

Add HzTrigger.rearm() and have HighLevelSubtaskFwd call it when
skipping due to queue-non-empty — the trigger stays armed and tries
again on the next tick instead of waiting another full period.
LowLevelForward keeps the original "skip" semantics because chunk_hz
is meant as a true upper bound on chunk-generation rate.
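
To illustrate the re-arm behaviour, here is a toy trigger. It is a sketch
only: the real ``HzTrigger`` internals are not shown in this message, so
the ``should_fire`` method, the ``next_fire`` field and the
``high_level_step`` wrapper are assumptions.

```python
class HzTrigger:
    """Fires at most once per period (toy version, explicit clock)."""

    def __init__(self, hz):
        self.period = 1.0 / hz
        self.next_fire = 0.0

    def should_fire(self, now):
        if now >= self.next_fire:
            self.next_fire = now + self.period
            return True
        return False

    def rearm(self):
        # Stay armed so the next tick retries instead of waiting a full period.
        self.next_fire = 0.0


def high_level_step(trigger, queue_len, now):
    """Return True when the high-level forward pass would actually run."""
    if not trigger.should_fire(now):
        return False
    if queue_len > 0:
        trigger.rearm()  # skipped because the action queue is not empty
        return False
    return True
```

Without the ``rearm()`` call, a skip at t=0 would push the next attempt
out to t=5 s at 0.2 Hz; with it, the step runs on the first tick where
the queue is empty.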

(2) The "robot state at startup" warning in _build_robot_observation_provider
was meant to fire once but wasn't gated by _resize_logged like the
sibling "camera ... live=AxB" warning. Result: it spammed every
observation tick (~1-2s). Gate it on first_call (snapshot of
_resize_logged["done"]) so both logs fire once at session start.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 11:04:12 +00:00
Pepijn 8194897994 fix(deps): cap placo below 0.9.16 and harden kinematics import (#3647)
* fix(deps): cap placo below 0.9.16 and harden kinematics import

placo 0.9.16 links against liburdfdom_sensor.so.4, which is unavailable
on Ubuntu 24.04 (noble ships urdfdom 3.x). Importing placo on that base
crashes with:

  ImportError: liburdfdom_sensor.so.4.0: cannot open shared object file

This broke nightly Latest Deps tests (CPU and GPU) when the lockfile
upgrade picked placo 0.9.16, since lerobot.model.kinematics
unconditionally imports placo when _placo_available is true, and that
check (importlib.util.find_spec) cannot detect dlopen failures of
transitive shared libraries — so unrelated subsystems (RL actor,
gym_manipulator) became unimportable.

Two changes:

1. Pin placo to <0.9.16 in pyproject.toml + regenerate uv.lock
   (0.9.16 → 0.9.15). Short-term unblock for nightly CI until system
   urdfdom 4.x is broadly available.

2. Harden the import guard in src/lerobot/model/kinematics.py:
   wrap 'import placo' in try/except ImportError so a missing
   transitive .so no longer crashes module import. RobotKinematics
   instantiation now raises an informative ImportError citing the
   underlying dlopen failure via _raise_if_placo_unusable().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(kinematics): hoist _placo_runtime_error to module scope for mypy

Mypy walks the TYPE_CHECKING branch in which the runtime else-block is
not executed, so _placo_runtime_error was only defined at runtime and
mypy reported 'Name "_placo_runtime_error" is not defined' on the
three references inside _raise_if_placo_unusable. Declare the symbol
unconditionally at module scope with a default of None; the runtime
import-failure branch still assigns to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(kinematics): drop verbose comments around placo import guard

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 12:03:07 +02:00
pepijn ca1b951e7b feat(pi05): expose lm_head_lr_scale for stronger text-CE gradient
With knowledge_insulation=True the LM head only receives gradients on
text-CE samples (e.g. ~45% of the mix for subtask_mem.yaml). Under
aggressive cosine LR decay this is enough for the head's first-token
distribution to drift back toward PaliGemma's pretrained <loc>
detection prior — teacher-forced argmax stays high while autoregressive
generation collapses to <locDDDD> tokens.

Add `lm_head_lr_scale` (default 1.0, no behavior change) on PI05Config.
When != 1.0, PI05Policy.get_optim_params splits the policy into two
param groups: the PaliGemma lm_head projection plus its tied
embed_tokens at lr * lm_head_lr_scale, and the rest at lr. The cosine
scheduler multiplies both groups by the same lambda each step, so the
ratio is preserved across decay.
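
A small arithmetic sketch of why the ratio survives decay. The parameter
names, the name-matching rule and the cosine schedule below are
assumptions for illustration; they are not taken from
``PI05Policy.get_optim_params``.

```python
import math


def split_param_groups(param_names, lr, lm_head_lr_scale):
    """Build optimizer-style param groups keyed by name (toy version)."""
    head = [n for n in param_names if "lm_head" in n or "embed_tokens" in n]
    rest = [n for n in param_names if n not in head]
    return [
        {"params": rest, "lr": lr},
        {"params": head, "lr": lr * lm_head_lr_scale},
    ]


def cosine_lambda(step, total_steps, floor_ratio):
    """Multiplier shared by every group at a given step."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return floor_ratio + (1.0 - floor_ratio) * cos


names = ["vision.proj.weight", "lm_head.weight", "embed_tokens.weight"]
groups = split_param_groups(names, lr=2.5e-5, lm_head_lr_scale=5.0)
for step in (0, 5000, 10000):
    lam = cosine_lambda(step, total_steps=10000, floor_ratio=0.04)
    base_lr, head_lr = (g["lr"] * lam for g in groups)
    print(step, head_lr / base_lr)  # the ratio stays at the configured scale
```

Because the scheduler applies one multiplier to both groups, the head
learning rate is always ``lm_head_lr_scale`` times the base rate.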

Recommended starting point for pi052 + subtask_mem.yaml runs: 5.0,
combined with a higher scheduler_decay_lr floor (e.g. 5e-6 instead of
1e-6) so the head doesn't get starved in the second half of training.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 09:56:46 +00:00
pepijn 9d30d91021 fix(pi052,smolvla2): unblock text generation when LM head drifted to <loc>
PaliGemma's pretraining puts heavy first-token mass on its <loc0000>..
<loc1023> ids at any "Assistant:" continuation. Our pi052 fine-tunes
with knowledge_insulation=True and a small text-CE budget (~45% of
samples) drift back toward that prior on long runs at low LR — teacher-
forced argmax stays at 100% (CE only measures next-token given correct
prefix) while autoregressive first-token selection collapses onto <loc>.
On the running poulain11 checkpoint at step 8000 this manifests as a
stream of <locDDDD> tokens for every subtask call — confirmed locally
against the saved checkpoint on a dataset frame.

Add a `suppress_loc_tokens` knob to `PI052Policy.select_message` that
masks ids [256000, 257024) to -inf before sampling, and pass it from
the three text-only inference steps (HighLevelSubtaskFwd,
MemoryUpdateFwd, UserInterjectionFwd). VQA steps keep the default
False so spatial answers can still emit locs. Verified end-to-end:
suppressed → "the robot arm moves the blue block to the green basket".
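
The masking step can be pictured with plain lists. This is a toy sketch,
not the ``select_message`` code: only the id range [256000, 257024) comes
from this message, and the helper names and logit values are made up.

```python
LOC_START, LOC_END = 256000, 257024  # id range named in this commit


def suppress_loc_tokens(logits):
    """Return a copy of the logits with the <loc> id range set to -inf."""
    masked = list(logits)
    for i in range(LOC_START, min(LOC_END, len(masked))):
        masked[i] = float("-inf")
    return masked


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy logits: a <loc> id has the highest score, a word id is second.
logits = [0.0] * (LOC_END + 8)
logits[256879] = 9.0
logits[3387] = 5.0

print(argmax(logits))                       # a <loc> id wins unmasked
print(argmax(suppress_loc_tokens(logits)))  # the word id wins once masked
```

Steps that need spatial answers would skip the mask, matching the
default of ``False`` for VQA.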

Also fix `_msgs_for_memory`: it was emitting the older
`User: ${task}\nPlan:..\nMemory:..` / `Assistant: ${subtask}` template,
which no longer matches the `memory_update` recipe layout
(`User: ${task}` / `Assistant: Previous memory: ..` /
`User: Completed subtask: ..`). The new prompt mirrors the training
recipe; `HighLevelSubtaskFwd` stashes the just-completed subtask in
`state['prior_subtask']` so the memory prompt can render
`Completed subtask: ..` for `MemoryUpdateFwd`.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 09:50:14 +00:00
Haoming Song 9f437d86b6 fix(groot): align GR00TN15Config with transformers config dataclasses (#3606)
* fix(gr00t): fix gr00t config dataclass init TypeError

* fix(groot): guard strict config decorator without transformers for passing CI

---------

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-05-22 10:31:04 +02:00
Haoming Song b74a551d38 fix(pi0, pi05): stabilize torch.compile and expand test coverage (#3610)
* chore(gr00t): sync with #3606 for fixing gr00t config crash

* fix(pi0&pi05): fix graph break caused by deepcopy of past_key_values in sample_actions

* fix(pi0&pi05): fix frequent recompile caused by compute_layer_complete

* feat(test): add compile test and benchmark for pi0 and pi05

* feat(test): add comprehensive testing for pi0 and pi05. Including processor, forward, sample action, etc.
2026-05-22 10:29:34 +02:00
Nikodem Bartnik c0a2e9814d fix examples (#3623)
- Fixed broken API examples in LeRobot Imitation Learning Documentation
- Teleoperation with cameras improved by adding a fixed frequency in the loop (without it the camera feed gets very slow)
- Wrapped record example script in main() to avoid problems on Mac
- Previously teleoperation example was using SO-ARM and teleoperation with cameras was using Koch. I changed it to use SO-ARM in all of the examples.
- Added section on how to train with HF Jobs - CLI and Python examples
- Replaced lerobot-record with lerobot-rollout in policies examples
2026-05-21 22:14:07 +02:00
pepijn e050d0fe0a fix(recipes): use active_at for memory_update, rebalance subtask_mem
memory_update was bound to `emitted_at(t, style=memory)`, which requires
the frame's exact timestamp to match a memory annotation. Memory rows are
placed at subtask-boundary timestamps and at 30 fps that's ~1% of frames,
so 99% of memory_update draws couldn't render and silently fell through
to _fallback_low_level_render — injecting task-conditioned low-level
training on ~30% of samples (subtask_mem.yaml).

Switch to `active_at`. At inference `MemoryUpdateFwd` is triggered on
`subtask_change` events, but the model only needs to learn the stateless
mapping (prior_memory, completed_subtask) -> current_memory. active_at
supervises this mapping on every frame inside a subtask interval, against
varied observations; the trigger lives outside the model. Net effect:
memory_update renders on ~87% of frames, the fallback leak drops from
~30% to ~4%, and memory CE gets a meaningful (not 0.3%) training share.
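
The coverage difference between the two bindings can be shown with a toy
resolver pair. The real resolvers take ``style=...`` arguments and work
on the dataset's annotation rows; the two-argument functions and integer
frame indices below are simplifying assumptions.

```python
import bisect


def emitted_at(t, rows):
    """Return the annotation placed exactly at t, else None."""
    for ts, text in rows:
        if ts == t:
            return text
    return None


def active_at(t, rows):
    """Return the most recent annotation at or before t, else None."""
    timestamps = [ts for ts, _ in rows]
    i = bisect.bisect_right(timestamps, t)
    return rows[i - 1][1] if i else None


# Two memory rows at subtask boundaries (frame indices), 60-frame episode.
rows = [(0, "memory after subtask 1"), (30, "memory after subtask 2")]
frames = range(60)
print(sum(emitted_at(t, rows) is not None for t in frames))  # boundary frames only
print(sum(active_at(t, rows) is not None for t in frames))   # every covered frame
```

In this toy episode the exact-match binding renders on 2 of 60 frames,
while the interval binding renders on all 60.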

subtask_mem.yaml: rebalance to 0.30 / 0.55 / 0.15 so memory CE is
~13% effective and the freed weight goes to low_level_execution.
subtask_mem_vqa_speech.yaml: keep weights (memory_update=0.10 was
already balanced against the other text-CE branches).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 14:53:13 +00:00
pepijn 2ca030fa28 fix(pi052): build processors from current config
When fine-tuning from pi05_base, reuse only the pretrained weights so pi052 still generates recipe text labels and FAST action labels.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 13:54:29 +00:00
pepijn 36f828221c fix(pi05): preserve pretrained paligemma lm head
Keep the PaliGemma LM head in float32 and initialize it from pretrained weights or token embeddings when loading pi05 checkpoints.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 13:25:24 +00:00
Pepijn d41d874581 fix(pi052): debug parity harness truncates prompt instead of masking
The parity check in debug_text_predictions was producing false ✗
DIVERGED reports. Root cause: I built the "inference" batch by
zero-masking the attention past the supervised span, but kept the
full 512-token padded sequence. select_message reads the prompt-end
hidden state via ``vlm_out[:, -1:]`` — the LAST position of the
prefix — which in a padded batch is a padding-token hidden state,
not the last prompt token. PaliGemma's prior on those padded
positions reliably argmaxes to <loc0879>, falsely flagging a
training/inference mismatch.

Fix: truncate both tokens AND mask to length == first_sup before
calling select_message, mirroring what the real runtime does
(``tokenizer(prompt)`` returns un-padded ids). Now the parity check
compares like-with-like.
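
A toy illustration of the position bug, using plain lists in place of
tensors. The token ids, ``PAD`` value and helper name are made up; the
point is only which position ``[-1]`` lands on.

```python
PAD = 0


def truncate_to_prompt(tokens, mask, first_sup):
    """Cut tokens AND mask to the prompt length, as an un-padded tokenizer would."""
    return tokens[:first_sup], mask[:first_sup]


# Prompt is 4 tokens, then 2 supervised target tokens, then padding.
tokens = [2, 11, 12, 13, 21, 22, PAD, PAD]
first_sup = 4

# Old harness: zero the mask past the prompt but keep the padded length.
masked_only = [1 if i < first_sup else 0 for i in range(len(tokens))]
print(tokens[-1])  # last position is still a padding token

# Fixed harness: truncate, so the last position is the last prompt token.
short_tokens, short_mask = truncate_to_prompt(tokens, masked_only, first_sup)
print(short_tokens[-1])
```

Zeroing the mask changes what each position can attend to, but not which
position is last, which is why truncation is needed.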

The actual training argmax in the dump was sensible English
("' move the blue cube into the green bin'" at acc=6/9) — the head
is learning correctly. The "<loc>" salad was purely the harness
reading from the wrong position.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 15:09:36 +02:00
Khalil Meftah bac4f61eae refactor: support custom progress parquet overlays (#3640) 2026-05-21 14:32:10 +02:00
Pepijn efa05f0ada fix(train): unwrap DDP policy in debug_text_predictions hook
At training time the policy is wrapped by Accelerator/DDP into a
.module attribute and custom methods are NOT proxied through the
wrapper, so ``hasattr(policy, "debug_text_predictions")`` was False
and the periodic dump was silently no-op'ing. Walk through .module
indirection to reach the raw PI052Policy that defines the method.
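
A minimal sketch of the unwrapping walk, assuming each wrapper exposes
the wrapped object as ``.module``. The ``unwrap_policy`` name and the
depth limit are illustrative, not the hook's actual code.

```python
def unwrap_policy(policy, max_depth=8):
    """Follow .module attributes until the raw policy is reached."""
    for _ in range(max_depth):
        inner = getattr(policy, "module", None)
        if inner is None:
            break
        policy = inner
    return policy


class RawPolicy:
    def debug_text_predictions(self):
        return "dump"


class Wrapper:
    """Stand-in for a DDP-style wrapper that does not proxy custom methods."""

    def __init__(self, module):
        self.module = module


wrapped = Wrapper(Wrapper(RawPolicy()))
print(hasattr(wrapped, "debug_text_predictions"))                 # False
print(hasattr(unwrap_policy(wrapped), "debug_text_predictions"))  # True
```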

Also surface why the dump didn't fire (no method / empty supervised
positions / generation error) so users can see what's blocking it
instead of staring at silence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 13:41:20 +02:00
Pepijn e98b6f726b feat(train): debug dump runs inference too, with parity check
Extends the periodic LM-head dump (LEROBOT_DEBUG_PREDS_EVERY) to ALSO
run select_message autoregressively on the same prompt prefix and show:

  prompt                          : '<bos>User: ... Assistant: '
  target  (ground truth)          : ' close the gripper ...'
  training argmax (teacher-fed)   : ' close the gri lift ...'  acc=12/15=80%
  inference (autoregressive)      : ' close the gripper around ...'
  first-token parity              : train=3387 (' close') vs infer=3387 (' close')  ✓ MATCH

The first-token parity check is decisive: training-side argmax at the
prompt-end position and inference's first generated token both compute
``argmax(lm_head(h_last_prompt))`` on identical context, so they MUST
match. Any divergence signals a training↔inference bug (mask, dtype,
KI routing, embedding scale, etc.). Subsequent tokens can diverge
because training uses teacher forcing while inference free-runs.

debug_text_predictions now also returns an ``inference`` list keyed
by sample, each entry carrying ``first_sup_pos`` and ``decoded``.
Limited to 24 new tokens per sample to keep the dump fast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:27:32 +02:00
Pepijn f7747d02a9 feat(train): periodic LM-head prediction dump for live debugging
Adds an opt-in diagnostic that, every N training steps, dumps 5 batch
samples plus the LM head's argmax prediction at every supervised
position alongside the label and a ✓/✗ marker — the cheapest signal
for "is text training actually learning what we expect, or collapsing
to a fixed token". Refills the recipe-sample dump budget on the same
cadence so the raw input shapes are also re-dumped.

Opt in via env var:
  LEROBOT_DEBUG_PREDS_EVERY=1000 lerobot-train ...

PI052 implements ``debug_text_predictions`` (mirrors the text-loss
forward but returns argmax instead of CE); other policies are silently
skipped. The dump runs in eval() mode under no_grad, slicing the
current batch to N samples — no extra data fetch, no train-state
mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:23:05 +02:00
pepijn 86ecd4bc2e add subtask memory training recipe
Add a recipe that blends subtask prediction, low-level execution, and memory update supervision.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 09:56:10 +00:00
pepijn 28b86449a2 fix(pi05): cast attention masks to model dtype
Ensure attention masks follow the backbone dtype during bf16 inference to avoid mixed dtype failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-21 09:52:46 +00:00
Virgileboat f4b834844e Feat/clean can bus (#3526)
* change timeout for handshake

* enforce last state read when querying

* change import order

* fix(motors): flush stale robstride RX and harden feedback drain

* robstride: remove redundant timeout and max_messages casts

* bugfix + %-style

* update exception catch
2026-05-21 11:44:04 +02:00
Pepijn 5bb2da4da6 fix(pi052): VQA target format = "label <loc><loc>" not "<loc><loc> label"
The trained model collapsed to spewing 40+ <loc> tokens for *every*
prompt — subtask, memory, anything — because VQA targets were supervised
to *start* with <loc>. With ~25% of all text samples beginning with a
<loc> token, the LM head learned "Assistant: → <loc>" as a strong
attractor; once one loc is emitted, autoregression chains the rest.

Flip the format so every text target — subtask, memory, speech, AND VQA
— starts with a regular word. The model still learns the <loc>
vocabulary for the spatial portion of the answer, but loc can no
longer be the first generation step out of a clean prompt.

Examples:
  point  : "green box <loc0162><loc0759>"
  bbox   : "cube <loc0082>…<loc0409>"
  multi  : "blue <locs> ; yellow <locs>"

The runtime parser (parse_loc_answer) strips loc tokens and uses the
remainder as label, so it's order-tolerant and works under either
format. Old loc-first checkpoints still parse cleanly at inference;
new training will use label-first.
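
Order tolerance follows directly from stripping the loc tokens before
reading the label. The sketch below is an assumption about how such a
parser could work, not the real ``parse_loc_answer``; the
``<locDDDD>`` four-digit pattern is inferred from the examples above.

```python
import re

_LOC = re.compile(r"<loc(\d{4})>")


def parse_loc_answer_sketch(text):
    """Split an answer into (label, loc values), whatever the token order."""
    locs = [int(m) for m in _LOC.findall(text)]
    # Remove the loc tokens and collapse the leftover whitespace.
    label = " ".join(_LOC.sub(" ", text).split())
    return label, locs


print(parse_loc_answer_sketch("green box <loc0162><loc0759>"))
print(parse_loc_answer_sketch("<loc0162><loc0759> green box"))
```

Both calls yield the same label and the same loc values, so old
loc-first outputs and new label-first outputs parse identically.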

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 18:56:48 +02:00
Pepijn f7b989ad97 fix(pi052): read backbone dtype from q_proj, not first parameter
select_message's bf16 cast used next(paligemma.parameters()).dtype,
which lands on a fp32-kept param (norm / embedding) under
to_bfloat16_for_selected_params. Mask stayed fp32 while q/k/v were
bf16 → SDPA still raised "invalid dtype for bias". Read the dtype
from layers[0].self_attn.q_proj.weight instead — q_proj is always
cast with the rest, so its dtype matches what SDPA sees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 18:46:08 +02:00
Pepijn 3b4376aa33 fix(pi052): cast attention bias to model dtype for bf16 inference
`_prepare_attention_masks_4d` always returns fp32 (the 0.0 / -inf
literals); with bf16 weights, HF PaliGemma's SDPA path raises
"invalid dtype for bias - should match query's dtype" and
select_message returns empty every step. Cast in both attention
sites: `_compute_layer_ki` (training, when both experts run) and
`select_message` (inference, VLM-only branch). Bf16 training +
bf16 inference now run end to end with no dtype mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 18:42:26 +02:00