lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-12 20:41:58 +00:00

Author	SHA1	Message	Date
pepijn	673cc6b0fe	pi052: opt-in Liger fused kernels (rope + geglu + layer_norm) Adds ``PI052Config.use_hf_kernels`` (default off). When enabled, ``PI052Policy.__init__`` calls ``apply_liger_kernel_to_paligemma`` before the backbone is built so PaliGemma / Gemma / Siglip layers pick up Liger's fused Triton forwards. Measured at BS=16 / L=512 / H100 80GB with KI+GC on (bench job 22161421, see ``examples/benchmark/bench_pi052_kernels.slurm``): rope only → -2.5% step time geglu only → -2.2% step time layer_norm only → -1.1% step time all three → -4.5% step time, peak_mem unchanged ``cross_entropy`` / ``fused_linear_cross_entropy`` are deliberately skipped — pi052 calls ``F.cross_entropy`` directly and bypasses ``PaliGemmaForConditionalGeneration.forward``, so neither patch fires without invasive model-code changes (left for a follow-up). ``rms_norm`` measured as noise on this workload (GC dominates), so it stays off to keep the patch surface minimal. Requires ``pip install liger-kernel``; falls back to a warning if missing so the default path is unaffected. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-25 20:50:07 +00:00
Pepijn	2ed6519a93	ema: enable by default (matches openpi JAX behavior) Flip EMAConfig.enable default from False -> True. Every training run now maintains an EMA shadow of the policy and uses it for eval + W&B example dumps. Disable per-run with --ema.enable=false for short or memory-constrained training. Rationale: * openpi (JAX, official) ships EMA on for every shipped config, decay=0.99 by default and 0.999 for pi05_libero. The openpi PyTorch port explicitly lists EMA as unsupported, a gap LeRobot main inherited. Flipping the default closes that gap for every LeRobot policy that ships through lerobot-train. * EMA is established best practice for diffusion / flow-matching policies (Diffusion Policy §V.D; standard in DDPM/EDM/Stable Diffusion training recipes). For autoregressive policies the extra cost is real but the safety net (smoother eval, better final checkpoint) doesn't hurt. Trade-offs to be aware of: * Memory: 1x model params in fp32 shadow (~13 GB for pi052's 3.3B params; <500 MB for ACT/Diffusion-Policy class). Memory-constrained users on consumer GPUs may need --ema.enable=false. * Checkpoint disk: extra .pt file in training_state/, size ~= pretrained_model/model.safetensors. Over a 100k-step run with save_freq=20000 that's 5x the model size in extra disk. * Eval scores will now reflect EMA model instead of live model - expected to be 1-3% higher on closed-loop tasks per the diffusion-policy literature; might surprise users who memorize their last run's numbers. Opt out: --ema.enable=false # disable entirely --ema.use_for_eval=false # keep EMA but eval reflects live --ema.use_for_wandb_examples=false # keep EMA but W&B reflects live Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 21:58:46 +02:00
Pepijn	72ea531017	train: switch EMA from custom ModelEMA to ema-pytorch Replace the 250-line src/lerobot/utils/ema.py with a direct dependency on ema-pytorch (lucidrains' canonical PyTorch EMA library). Same semantics, decay=0.999 default unchanged, but offloads the maintenance burden to a maintained library used by every diffusion repo. Why ema-pytorch: * Standard PyTorch EMA library; battle-tested across diffusion + speech + image-gen codebases. * Tiny pure-python dep (no compiled code). * Cleaner consumer-side API: ema.ema_model is a full nn.Module clone of the policy, so eval / wandb just pass it through instead of context-managed swap/restore on the live model. What changed mechanically: * pyproject.toml: add 'ema-pytorch>=0.7.7,<1.0.0' to core deps. * deleted src/lerobot/utils/ema.py (the custom ModelEMA). * scripts/lerobot_train.py: - import EMA from ema_pytorch - instantiate with beta=cfg.ema.decay, update_after_step=cfg.ema.warmup_steps, update_every=1, include_online_model=False (accelerator owns live model lifecycle; double-registration would double-count params). - ema.update() (no args) - library tracks the online model internally. - Eval block: pass eval_target_policy = ema.ema_model (when cfg.ema.use_for_eval) instead of swap context manager. - W&B examples: same pattern. - Save: torch.save(ema.state_dict(), .../ema_state.pt) instead of custom safetensors writer. .pt format is consistent with the rest of training_state which already mixes safetensors + json + (now) pt. - Resume: ema.load_state_dict(torch.load(.../ema_state.pt)). - WandB observability: ema/step (count of ema.update calls), ema/initted (bool from library), ema/beta (constant from cfg). * configs/default.py: EMAConfig.decay stays 0.999 (matches openpi's pi05_libero); docstring updated to reflect ema-pytorch semantics for warmup_steps (now maps to update_after_step - a hard skip, not a smooth decay ramp). 
Behavior preserved: * Defaults: enable=False, decay=0.999, warmup_steps=0, use_for_eval=True, use_for_wandb_examples=True. * Same CLI: --ema.enable=true, --ema.decay=X, etc. * Same checkpoint layout (training_state/ema_state.pt next to optimizer_state.safetensors etc.); resumes silently if present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 21:51:23 +02:00
Pepijn	56a934ec55	train: EMA of policy parameters (opt-in via --ema.enable=true) Adds Exponential Moving Average of trainable policy parameters with warmup, eval-time swap, checkpoint save/resume, and wandb observability. For diffusion / flow-matching policies (pi052's flow expert exactly qualifies), averaging late-training parameter oscillations yields a smoother model that generalises substantially better at inference — ~1–3% absolute success-rate improvement on closed-loop tasks per the diffusion-policy lit (Chi et al. 2023 §V.D; standard in DDPM/EDM). New module: src/lerobot/utils/ema.py ModelEMA class with: * fp32 shadow of every requires_grad parameter * decay warmup: min(decay, (1+n)/(10+n)) for first warmup_steps updates * update(model) -> effective_decay (for logging) * apply_to(model) context manager: temp-swap weights, restore on exit * copy_to(model): permanent overwrite * save() / load_from_file(): safetensors + JSON sidecar for metadata * state_dict() / load_state_dict() for in-process round-tripping New config: src/lerobot/configs/default.py EMAConfig + wired into TrainPipelineConfig as 'ema: EMAConfig'. Fields: enable: bool = False (off by default, back-compat) decay: float = 0.999 (standard; 0.75 for fast Diffusion-Policy) warmup_steps: int = 0 (no warmup by default) use_for_eval: bool = True (eval swaps in EMA weights) use_for_wandb_examples: bool = True (W&B training-examples table uses EMA for predicted-action columns -> matches what eval / deployment would see) Training loop integration (src/lerobot/scripts/lerobot_train.py): 1. After accelerator.prepare + policy.train(), instantiate ModelEMA on the main process if cfg.ema.enable. Resume from checkpoint_path/training_state/ema_state.safetensors if present. 2. After each update_policy() call, ema.update(unwrap_model(policy)) returns the effective decay (logged to wandb during warmup). 3. The save_checkpoint() block also ema.save(...) 
the shadow next to the existing optimizer/scheduler/rng training state. Resume picks it up automatically in (1). 4. The eval block (cfg.env && is_eval_step) wraps eval_policy_all in ema.apply_to() when use_for_eval=True. Live weights restored byte-for-byte on context exit. 5. The W&B training-example dump wraps log_training_examples in ema.apply_to() when use_for_wandb_examples=True so the predicted-action columns match the eval/deployment behavior. 6. Two new wandb scalars: ema/effective_decay, ema/num_updates. Cost: Memory: 1x model params in fp32 (~13 GB for pi052's 3.3B params). Lives only on main-process GPU. CPU offload available via ModelEMA(device='cpu') if needed. Compute: one elementwise update per step (~1% of step time). Eval: 2x checkpoint files in training_state/ (live optimizer state + ema shadow). Negligible relative to model.safetensors. Usage: lerobot-train ... --ema.enable=true lerobot-train ... --ema.enable=true --ema.decay=0.9999 # very slow EMA lerobot-train ... --ema.enable=true --ema.warmup_steps=1000 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 21:27:14 +02:00
Pepijn	738e317caa	pi052: PaLM-style z-loss on text CE (default weight 1e-4) Penalise the log-partition function z = log Σ exp(logits) drifting away from zero on text-CE supervised positions. Without it, large-vocab models (PaliGemma's 257k vocab) can let logsumexp grow unboundedly while CE stays low - a uniform additive logit bias cancels in softmax but pushes the partition function out of bounds, causing numerical instability and generation drift. PaLM appendix B / Chinchilla report z-loss is essential for stable large-vocab CE. It is especially valuable for pi052 because the recent default lm_head_lr_scale=5.0 amplifies head-drift risk: the 5x boost keeps the head pinned to fine-tuning targets, and z-loss caps the partition function so the head can't just bias all logits high uniformly. Implementation: * _shifted_ce(logits, labels, z_loss_weight=0.0) gains the new arg with default 0.0 (back-compat for any other caller). * Both call sites in PI052Policy.forward read self.config.text_ce_z_loss_weight and pass it through. * PI052Config.text_ce_z_loss_weight defaults to 1e-4 (commonly cited PaLM value); set to 0 to disable. Cheap to compute: one extra logsumexp shares the softmax kernel that F.cross_entropy already runs. No memory overhead beyond a (B*T,) tensor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 21:08:56 +02:00
Pepijn	8ba3b187a1	pi052: bump lm_head_lr_scale default to 5.0 (keep base LR at 2.5e-5) The base optimizer LR (2.5e-5, cosine to 2.5e-6, 1k warmup, AdamW (0.9, 0.95), wd 0.01, grad_clip 1.0) is the openpi/π0.5 setting used for the RoboCasa leaderboard baselines and is well-validated for 3B- class VLAs with a paligemma backbone. Leave it alone. The one place pi052 needs to diverge from pi05 is the LM-head LR multiplier: * pi05 has no text supervision -> head doesn't get gradients -> lm_head_lr_scale is moot, stays at 1.0. * pi052 always has text supervision via the recipe (subtask / memory / VQA). Under KI, the LM head only sees gradients on ~30-45% of the batch (the text-CE mask share). Under aggressive cosine decay the head drifts back toward PaliGemma's pretrained <loc> first-token bias, despite teacher-forced CE staying near 0. 5x is the documented fix (see PI05Config.lm_head_lr_scale docstring and PI05Policy.get_optim_params, which is already wired to split the LM head + tied embed_tokens into their own param group while sharing the same cosine lambda). Flipping the default here lifts the fix from opt-in to on-by-default for every pi052 run, with zero downside on text-free recipes (head still gets no gradients to scale). Other LR knobs reviewed and intentionally NOT changed: - optimizer_lr=2.5e-5: openpi-validated, matches leaderboard. - scheduler_warmup_steps=1000: standard for VLA finetuning. - scheduler_decay_steps=30000: auto-scales for short runs. - optimizer_betas=(0.9, 0.95): GPT/LLM convention, works for flow-matching + LM-CE. - optimizer_weight_decay=0.01, grad_clip=1.0: standard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:57:43 +02:00
Pepijn	057c794ffe	wandb: flip training-example logging defaults to on (every 5000 steps) The training-example wandb.Table dump (camera images + text fields + GT/predicted action chunk endpoints) was opt-in. Flip defaults so any run with --wandb.enable=true gets visual training observability for free. log_examples_freq: 0 -> 5000 (push table every 5k steps) log_examples_n: 4 -> 4 (unchanged) log_examples_predict_actions: False -> True (extra forward in eval mode) Runs without --wandb.enable=true are unaffected (the training loop gate checks wandb_logger is not None first). Set log_examples_freq=0 to opt out of the dump even with wandb enabled; set log_examples_predict_actions=false to skip the extra inference forward pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 18:00:04 +02:00
Pepijn	b1e83f556c	train: periodic wandb log of training examples (images + text + actions) Adds an opt-in cadence for pushing rich training examples to W&B, independent of the scalar log_freq. Off by default; turn on with --wandb.log_examples_freq=5000 (one wandb.Table dump every 5k steps). WandBConfig (configs/default.py): + log_examples_freq: int = 0 # 0 disables + log_examples_n: int = 4 # batch elements per dump + log_examples_predict_actions: bool = False # opt-in extra forward pass to # show predicted vs GT action chunk WandBLogger.log_training_examples (common/wandb_utils.py): Builds one wandb.Table row per sampled batch element with: * one wandb.Image column per camera (auto handles CHW/HWC, uint8/float32 [0,1]) * any text fields present in the batch (task / subtask / memory / instruction) * gt_action_first / gt_action_last (chunk endpoints) * pred_action_first / pred_action_last when --wandb.log_examples_predict_actions=true (policy.eval() + no_grad; restores train mode after) Defensive: per-camera failures don't poison the row; predict_action_chunk exceptions are logged and the predicted columns are dropped. Training loop (scripts/lerobot_train.py): One new gated block right after the existing scalar log_step clause. Reads batch + dataset.meta.camera_keys, hands them to log_training_examples. Wrapped in try/except so a bad sample never kills the run. Usage: lerobot-train ... \ --wandb.enable=true --wandb.project=robocasa_composite_seen \ --wandb.log_examples_freq=5000 \ --wandb.log_examples_n=4 \ --wandb.log_examples_predict_actions=true Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 16:57:15 +02:00
Pepijn	da3e87ee86	Merge branch 'feat/smolvla-on-steerable' of https://github.com/huggingface/lerobot into feat/smolvla-on-steerable	2026-05-25 16:56:50 +02:00
Pepijn	1e9a6d044d	Merge remote-tracking branch 'origin/feat/language-annotation-pipeline' into feat/smolvla-on-steerable # Conflicts: # src/lerobot/datasets/__init__.py # src/lerobot/policies/__init__.py # src/lerobot/policies/factory.py # src/lerobot/processor/render_messages_processor.py # uv.lock	2026-05-25 16:56:22 +02:00
pepijn	3fdfcb912a	examples(port_datasets): generalize RoboCasa builder + add smoke script - Add ATOMIC_TASKS, COMPOSITE_UNSEEN_TASKS and four new --task-set keys (atomic, composite_unseen, composite_all, composite_atomic) so the same builder produces the 50-task target benchmark or the 300-task Human300 pretraining slice (via --split=pretrain --task-set=all) without duplicating logic. - Stop hardcoding the composite_seen tag on the HF push; tags are now derived from --split / --source / --task-set so atomic, composite_all, and pretrain runs land with accurate metadata. - Refresh module docstring to match the broader scope. - Add scripts/build_robocasa_smoke.sh: 2-atomic-task smoke dataset (~1k episodes, ~131k frames) for fast end-to-end training validation before kicking off Human300-scale runs.	2026-05-25 14:54:00 +00:00
Pepijn	c37b1fc7d0	Merge origin/feat/language-annotation-pipeline (8 fix(annotate) commits + vocabulary phase)	2026-05-25 15:47:25 +02:00
Pepijn	9020635b14	Merge branch 'main' into feat/language-annotation-pipeline Resolves conflicts from 32 commits on main: * docs/source/_toctree.yml — keep both new toc entries (annotation_pipeline + video_encoding_parameters). * docs/source/language_and_recipes.mdx — adopt main's section ordering (Layer 2 before "Temporal semantics") and float32 timestamp dtype to match the codebase. * src/lerobot/configs/__init__.py — keep both export sets (recipe + video encoder). * src/lerobot/datasets/dataset_metadata.py — drop redundant lazy imports (top-level imports cover both LANGUAGE_COLUMNS and DEFAULT_TOOLS); adopt main's @tools.setter for info.json write-back. * src/lerobot/datasets/feature_utils.py — call the real validate_feature_language() instead of returning "". * src/lerobot/datasets/language.py — float32 timestamps to match pa.float32() used in video_utils.py and the rest of the codebase. * src/lerobot/datasets/language_render.py — adopt main's unwrap_scalar() helper (drops two hand-rolled .item()/list unwrappers); float32 in docstring. * src/lerobot/processor/render_messages_processor.py — drop PR-local _scalar() helper, use shared unwrap_scalar(). * tests/datasets/test_language.py — adopt main's new float32 dtype + validate_feature_language warning tests. * tests/datasets/test_dataset_metadata.py — adopt main's new tools.setter persist/clear tests. * uv.lock — regenerated cleanly from main's resolver. 90 of 92 touched tests pass. Two pre-existing test failures (test_module1_plan_memory_subtask_smoke, test_module2_mid_episode_emits_paired_interjection_and_speech in tests/annotations/test_modules.py) are unrelated to this merge — that test file doesn't exist on main, so the failures originate on the branch and are addressed by the 8 newer fix(annotate) commits already on origin that will land in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:46:32 +02:00
Pepijn	83d0c390da	pi052: drop debug scaffolding left over from training/inference bug hunts Three diagnostic surfaces shipped in PR3 that don't belong in a clean release: * ``LEROBOT_DUMP_RECIPE_SAMPLES`` env-var dump (~70 LOC in text_processor_pi052.py): pretty-prints the next N rendered samples with ``[TGT]...[/TGT]`` markers over supervised spans. One-off training-inspection tool - no production user, never wired into a CLI flag, only useful while iterating on the recipe. Drop the module constants, the ``_is_dump_rank`` / ``_dump_recipe_sample`` helpers, the call site, and the now-unused ``import os``. * ``_log_obs_tensors_once()`` in lerobot_pi052_runtime.py: the docstring literally says "Used to bisect train/inference mismatches" - a debugging artifact from when the LM head was collapsing on the live robot. Logged unconditionally at WARNING level from both the dataset-driven and robot-driven providers, with no ``--verbose`` gate. Drop the function, both call sites, and the ``_logged`` / ``_obs_logged`` flag dicts that fed them. (``_resize_logged`` is kept - it gates the operationally useful camera-size sanity log.) * Defensive ``unsqueeze(0)`` block in the dataset observation provider: papered over an upstream bug where some preprocessor step could produce an unbatched tensor. ``AddBatchDimensionProcessorStep`` is reliable in the current pipeline - pi052 tests still pass with the block removed. If the bug ever resurfaces it should be fixed at the source, not silently re-batched here. Net: -169 LOC. All 30 ``tests/policies/pi052/`` tests pass. The ``<loc>`` token plumbing (``register_paligemma_loc_tokens``, ``_loc_token``, ``suppress_loc_tokens`` runtime gate) is left as-is - it's the actual mechanism for VQA spatial answers, not scaffolding, and the ``suppress_loc_tokens=True`` callers on subtask/memory/interjection paths and ``=False`` on the VQA path are intentional asymmetric behaviour, not a bug-routing knob. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:07:43 +02:00
Pepijn	1ff10b935c	Merge branch 'feat/language-annotation-pipeline' into feat/smolvla-on-steerable Resolves conflicts from 66 commits on the base branch: * pyproject.toml — keep base's transformers>=5.4.0,<5.6.0; add the sentencepiece-dep entry pi052 (FAST action tokenizer) needs. * policies/__init__.py — keep pi052 export; drop the RewardClassifierConfig export that base removed. * policies/factory.py — docstring list resolution (keep pi052; drop reward_classifier, removed by base). * annotations/steerable_pipeline/executor.py — adopt base's renamed _ensure_annotation_metadata_in_info (it already advertises the say tool); drop pi052's older _ensure_tools_in_info call. * configs/train.py — keep pi052's vqa_target_fraction; adopt base's SampleWeightingConfig (legacy RA-BC inline params already covered by the migration shim base added). * scripts/lerobot_train.py — merge pi052's per-policy processor rebuild + dataset_repo_id pass-through with base's active_cfg / is_reward_model_training tightening, and re-route vqa-weighted sampler to active_cfg.drop_n_last_frames. * datasets/language_render.py — adopt base's _select_one + timestamp tolerance (drops pi052's stale _select_latest / per-style sort_key). * tests — adopt base's parametrized per-camera blend + tolerance test; drop pi052 tests that overlap with base's tighter rewrites; keep pi052's flow-only / VQA-blend coverage; add a test_canonical_recipe_loads check on subtask_mem_vqa_speech.yaml. * policies/pi052/processor_pi052.py — import RenderMessagesStep directly from render_messages_processor (base intentionally dropped it from lerobot.processor's re-exports). * uv.lock — regenerated cleanly from base + pi052's pocket-tts / beartype. All 67 touched tests pass (30 pi052 + 37 recipe / language-render / pipeline / render-messages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:47:09 +02:00
Pepijn	67bdf4690e	examples(port_datasets): rewrite RoboCasa composite_seen builder Replace the earlier wrapper (which depended on robocasa.scripts.download + dataset_registry) with a self-contained pipeline that: * downloads each task tarball directly from Box via box_links_ds.json * converts v2.1 -> v3.0 in place using convert_dataset_v21_to_v30 * standardizes camera keys under observation.images.robot0_* and flattens observation.state by concatenating base/EE/gripper subkeys when the source dataset stores them separately * builds per-rank unified shards then aggregates into one dataset Filter: composite_seen task-set restricts discovery to the 16 multi-step target tasks (DeliverStraw, GetToastedBread, ..., WashLettuce). Use --task-set=all to keep every discovered task in the split/source slice; --tasks=... overrides for arbitrary subsets. Defaults sized for hopper-cpu @ 128 cores: 16 workers x 8 cpus-per-task. Adapted from a battle-tested port_robocasa.py reference shared by the user; the only semantic addition is the task-set filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:27:42 +02:00
Pepijn	8085feab6e	pi052(runtime): factor out shared observation-prep boilerplate Both observation providers in lerobot_pi052_runtime.py ended a sample dict the same way — strip the runtime-owned language columns and hand the policy a device-resident ``observation.*``-only subset. Extract two tiny helpers (``_strip_runtime_owned_language_cols`` and ``_select_observation_to_device``) so the dataset and robot paths read as a clear linear pipeline. Path-specific concerns (defensive unsqueeze on the dataset path; camera resize + state-vector sanity logging on the robot path) stay inline at the call sites. Behaviour unchanged; all 30 ``tests/policies/pi052/`` tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:25:08 +02:00
Pepijn	a088c10c80	examples(port_datasets): SLURM+datatrove RoboCasa composite_seen build Parallel variant of build_robocasa_composite_seen.py modeled after the existing slurm_port_shards.py / slurm_aggregate_shards.py pattern. Two-phase datatrove pipeline: * Phase 1 DOWNLOAD: tasks=16 (one per RoboCasa composite_seen task), each worker downloads its assigned tar via RoboCasa's own download_datasets helper. Network-bound, idempotent. * Phase 2 AGGREGATE: tasks=1, single worker calls aggregate_datasets over the 16 extracted directories. Submitted with depends=phase1 so SLURM only releases it once all 16 downloads succeed. Reuses the COMPOSITE_SEEN_TASKS list and per-task download/resolve helpers from the single-machine script via aliased imports — single source of truth for 'what does it mean to download a composite_seen task'. Local (--slurm 0) mode runs the two phases sequentially in-process for debugging on a workstation. Usage on SLURM: uv run python examples/port_datasets/slurm_build_robocasa_composite_seen.py \ --output-dir=/scratch/${USER}/robocasa_composite_seen \ --hub-repo-id=${HF_USER}/robocasa_composite_seen \ --logs-dir=/scratch/${USER}/logs/robocasa \ --partition=cpu --push-to-hub Prereq: uv sync --extra annotations (pulls datatrove) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:10:05 +02:00
Pepijn	9c3d5ab7ce	scripts: build_robocasa_composite_seen — aggregate 16 target tasks RoboCasa 1.0 ships its target/human demos in LeRobot format (parquet + mp4) as lerobot.tar archives distributed via Box. This script wraps RoboCasa's own download_datasets helper to pull each of the 16 composite_seen tasks, opens each extracted directory as a LeRobotDataset, and merges them into a single combined dataset via merge_datasets (a thin wrapper over aggregate_datasets that revalidates fps/robot_type/features, unifies task indices, concatenates videos and parquet, and recomputes stats). The 16-task slice corresponds exactly to the 'Composite-Seen' column of the published RoboCasa365 leaderboard, so the resulting dataset is the right substrate for an apples-to-apples pi05 vs pi052 comparison on multi-step kitchen manipulation. Usage: uv run python -m lerobot.scripts.build_robocasa_composite_seen \ --output-dir=/data/lerobot/robocasa_composite_seen \ --hub-repo-id=${HF_USER}/robocasa_composite_seen \ --push-to-hub Idempotent: re-running skips already-downloaded tasks. Defensive fallbacks handle RoboCasa API drift in get_ds_path / download_datasets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 14:01:28 +02:00
Pepijn	e84f97a8c1	smolvla2(runtime): interactive task picker + drop action diagnostic Task picker: The dataset bootstrap used to silently overwrite args.task with the canonical training task. Replace that with an interactive picker (_select_task_interactively) that shows every unique task in ds_meta.tasks as a numbered menu (canonical task first as default) plus a 'type a custom task' option. --task on the CLI still skips the picker, and non-TTY runs fall back to the bootstrap task so scripted invocations are unchanged. Action diagnostic removal: Drop the [act] log block in LowLevelForward.run (|a|_mean / spread / normalized + unnormalized first/last + state) that was added while debugging the 'barely moving' issue. Robot motion is now healthy, the output is noise in steady-state, and it depended on stashing the postprocessor on runtime.state - also removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:59:08 +02:00
Pepijn	6d2b8c80ab	smolvla2(runtime): wire MemoryUpdateFwd into the inference pipeline MemoryUpdateFwd was importable but never installed, so subtask_change events fired by HighLevelSubtaskFwd had no listener and current_memory stayed at its initial None value — the runtime panel always showed 'memory (not set)' even when the policy was trained with the memory_update recipe (e.g. subtask_mem_vqa_speech.yaml, weight 0.15). Insert MemoryUpdateFwd between HighLevelSubtaskFwd and AskVQAFwd so the event is visible the same tick it is emitted, and refresh the stale comment that claimed memory was not in scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:52:44 +02:00
Pepijn	793c7c4ddd	feat(runtime): --subtask_chunks_per_gen throttles HL gen vs action chunks Adds a per-chunk-boundary counter to HighLevelSubtaskFwd: subtask gen fires only once every N chunk boundaries (default 1 = current behavior). Lets the operator run e.g. 5 flow-matching action chunks per LM-head subtask gen so the subtask doesn't churn every 1.7s while the previous one is still being executed — saves compute and avoids re-planning the action trajectory mid-grasp. --subtask_chunks_per_gen=5 # 5 chunks per subtask refresh The counter starts at 0 so the very first chunk boundary fires immediately (no startup delay). Trigger is rearmed when skipping so a low high_level_hz doesn't lose slots. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:34:59 +02:00
Pepijn	db927ab40b	feat(runtime): action chunk diagnostic - log normalized + unnormalized values Adds a per-chunk log line in LowLevelForward that surfaces what the action expert actually emits and what the robot receives after the postprocessor unnormalizes it, so "barely moving" can be diagnosed at a glance: [act] T=50 |a|_mean=0.234 spread=0.512 [act] norm first=[0.12, -0.31, ...] last=[0.45, -0.22, ...] [act] joint first=[3.2, -47.8, ...] last=[12.4, -41.0, ...] state=[0.5, -55.3, ...] |a|_mean ~ 0.3–0.6 with spread ~ 0.3+ and visible delta from first to last → healthy trajectory. |a|_mean near 0 across the chunk → model defaulting to median pose. joint values that don't differ much from state → safety cap or model output near current state. Postprocessor is stashed on runtime.state["_postprocessor"] at startup so the diagnostic can replay the same unnormalize the dispatcher uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 12:10:52 +02:00
pepijn	471b2b1b1d	fix(annotate): bump same-frame subtasks onto distinct frames If two consecutive VLM-emitted subtask spans have ``start`` timestamps that round to the same source frame after ``snap_to_frame`` (e.g. on short episodes the VLM sometimes nominates two ~adjacent action boundaries within one 30 Hz step), the writer emits two ``style=subtask`` rows at the identical persistent timestamp. The training-time renderer's default binding ``subtask: active_at(t, style=subtask)`` then raises: ValueError: Ambiguous resolver for style='subtask'; add role=..., tool_name=..., or camera=... to disambiguate. … and the whole training run dies on the first batch. Observed concretely on ``pepijn223/super_poulain_vocab2`` (job 22159979): episodes 3 and 30 each had two subtask rows at the same timestamp (``release yellow cube`` + ``retract arm`` snapping to the same frame). Add ``_dedupe_starts_to_distinct_frames`` to walk the cleaned span list and, whenever a snapped start collides with one already used, push the later span onto the next free frame timestamp. Both subtasks survive on distinct timestamps; the renderer can now disambiguate. If the episode genuinely has no later free frame (extremely unlikely — would require a same-timestamp collision on the very last frame of the episode), the later span is dropped with a warning rather than left to poison the render. New test ``test_plan_module_bumps_collocated_subtasks_to_distinct_frames`` locks in the contract; full vocabulary suite is 14/14 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-23 19:31:44 +00:00
pepijn	a15e16c072	fix(annotate): replace fuzzy subtask snapping with strict match + one-shot retry The Jaccard-overlap snap was warping VLM output into wrong canonical labels - e.g. an off-vocab "consult the wizard" span would silently become "grasp blue cube" if that scored highest. Even with a higher floor the operator can't tell which subtasks were paraphrases vs genuine mislabels in the resulting dataset. Replace with strict exact-match validation + a single targeted retry: 1. Generate subtasks as before. 2. If any returned subtask's normalised form (lowercased, articles stripped, whitespace collapsed) isn't in the canonical vocab, fire one retry call naming the offending strings and re-sending the full canonical list. The retry prompt requires byte-identical output from the vocab. 3. After the retry, validate again. Spans still off-vocab are dropped - no fuzzy snapping ever produces a different canonical label than the VLM actually emitted. 4. If every span ends up off-vocab even after the retry, warn loudly so the operator extends ``meta/canonical_vocabulary.json`` to cover the missing phase. The episode is left with empty subtasks rather than silently fabricated ones - visibility > sweep-under-the-rug. Promote ``_NORMALIZE_STRIP_TOKENS`` to a class constant and split the normalisation helper out so the retry-validation and the final canonicalisation share one source of truth. Tests: - test_plan_module_accepts_article_only_difference: "grasp the blue cube" still maps to canonical "grasp blue cube" (article-tolerant). - test_plan_module_retries_when_subtask_off_vocab: paraphrase triggers the retry which the VLM corrects in pass 2. - test_plan_module_drops_off_vocab_subtask_after_retry: VLM that refuses to correct → bad span dropped, in-vocab span kept. - test_plan_module_empty_when_all_off_vocab_after_retry: every span off-vocab → episode left empty (no warping). All 13 vocabulary tests pass. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-23 09:57:27 +00:00
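A minimal sketch of the strict-match flow in commit a15e16c072, assuming a three-word article list and a retry callable that returns a full replacement list of subtasks. The names `normalize`, `validate`, and `strict_canonicalize` are hypothetical and the real `_NORMALIZE_STRIP_TOKENS` contents are not shown in the log.

```python
import re

ARTICLES = {"a", "an", "the"}  # assumed strip list, not taken from the repo


def normalize(text):
    """Lowercase, drop articles, collapse whitespace."""
    tokens = re.split(r"\s+", text.strip().lower())
    return " ".join(t for t in tokens if t and t not in ARTICLES)


def validate(subtasks, vocab):
    """Split subtasks into (canonical labels kept, off-vocab strings)."""
    canon = {normalize(v): v for v in vocab}
    kept, off = [], []
    for s in subtasks:
        key = normalize(s)
        if key in canon:
            kept.append(canon[key])
        else:
            off.append(s)
    return kept, off


def strict_canonicalize(generate, retry, vocab):
    """One generation pass, at most one retry, then drop what is still off-vocab."""
    kept, off = validate(generate(), vocab)
    if off:
        # Assumption: the retry returns a full corrected list of subtasks.
        kept, off = validate(retry(off, vocab), vocab)
    return kept  # may be empty; no fuzzy snapping happens here


vocab = ["grasp blue cube", "retract arm"]
print(validate(["Grasp the blue cube", "consult the wizard"], vocab))
```

The article-only difference maps to the canonical label, while the off-vocab string is reported rather than snapped to a neighbour.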
pepijn	336af85c09	fix(annotate): never leave an episode with zero canonical subtasks When the canonical vocabulary is enabled and the VLM produces spans that don't overlap any canonical label, the previous Jaccard-floor (0.5) dropped them and the episode came out with no subtasks at all — invisible to the downstream policy. Observed on ``pepijn223/super_poulain_vocab``: some episodes had empty subtask columns because every VLM-emitted phrase scored below 0.5 against the discovered vocabulary. Two-pass canonicalisation: - First pass keeps the Jaccard floor (lowered from 0.5 → 0.25, to let mild paraphrases through) and drops everything below. - If that first pass leaves the episode with zero subtasks, fall back to a second pass that always snaps each VLM span to its nearest canonical label by Jaccard (no floor). The episode ends up with subtasks even when the vocabulary missed a phase — a slightly-wrong canonical label is still closer to the right motion than nothing at all. - Log loudly when the fallback fires so the operator can spot coverage gaps in ``meta/canonical_vocabulary.json``. - Log a per-episode count at INFO when some (but not all) spans were dropped so it's visible without spamming the run output. Promote the Jaccard floor + ignore-tokens to class constants so they're a single edit point. Add ``force=True`` parameter to ``_canonicalize_subtask`` for the no-floor fallback path. New test ``test_plan_module_snaps_when_all_off_vocab`` covers the fallback; existing ``test_plan_module_drops_off_vocab_subtask`` is adjusted to keep at least one in-vocab span so the floor path can still fire and is exercised. All 12 vocabulary tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 12:44:03 +00:00
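The two-pass canonicalisation in commit 336af85c09 (later replaced by the strict match in a15e16c072) might look roughly like this. It is a sketch under assumptions: token sets are plain whitespace splits, the ignore-token list is omitted, and the function names are illustrative rather than the repository's.

```python
def jaccard(a, b):
    """Token-set overlap between two phrases."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def canonicalize(span, vocab, floor=0.25, force=False):
    """Return the nearest canonical label, or None if it scores below the floor."""
    best = max(vocab, key=lambda v: jaccard(span, v))
    if force or jaccard(span, best) >= floor:
        return best
    return None


def canonicalize_episode(spans, vocab):
    # First pass: keep only spans that clear the Jaccard floor.
    first = [c for c in (canonicalize(s, vocab) for s in spans) if c is not None]
    if first or not spans:
        return first
    # Fallback pass: nothing survived, so snap every span with no floor.
    return [canonicalize(s, vocab, force=True) for s in spans]


vocab = ["grasp blue cube", "retract arm"]
print(canonicalize_episode(["pick up blue cube", "consult the wizard"], vocab))
print(canonicalize_episode(["consult the wizard"], vocab))
```

The second call shows the failure mode the later commit calls out: with no floor, an unrelated phrase is still mapped to some canonical label.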
pepijn	54221ceea2	feat(annotate): let the VLM decide vocabulary size Hardcoding ``n_subtask_target=10`` and ``n_memory_target=6`` baked task complexity into the config — a simple pick-and-place needs ~6, a multi-step recipe needs ~20. The VLM already sees the clips, so let it pick the count itself from what's recurring across episodes. Drop both knobs from ``VocabularyConfig`` and the ``module_0_vocabulary`` prompt template. The prompt now says "decide the count yourself based on what you see — the smallest set that still covers every recurring phase" and adds an "each label must recur across the demos" rule so the VLM filters out one-off motions. Update the launcher script + docs to remove the old knobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:46:31 +00:00
pepijn	369ab17110	fix(annotate): update run_hf_job CLI args for renamed namespaces + phase 0 Three stale things in the launcher script: - ``--module_1/2/3.*`` no longer exist; review commit `fd18beb` renamed the CLI namespaces to ``--plan/interjections/vqa``. Forwarded all eight existing args to their new names. - ``--push_to_hub`` is now a bool; the destination repo lives at ``--dest_repo_id``. Split the single positional into both args. - ``openai`` was missing from the pip install list, which the prior review (claude bot, 2026-05-08) flagged: the default vlm backend is ``openai`` so the job would have ImportError'd. Added. Also expose the new phase 0 (canonical vocabulary discovery) knobs explicitly: ``--vocabulary.sample_episodes``, ``--n_subtask_target``, ``--n_memory_target``. Defaults are sane (3 / 10 / 6) but worth flagging in the example so the operator knows what they're running. Update the docstring + section comments to match the current phase layout (vocabulary → plan → interjections → vqa → writer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:43:06 +00:00
pepijn	86a7edc590	feat(annotate): phase 0 — derive canonical vocabulary from sample episodes The pipeline previously emitted near-unique subtask + memory phrasings per episode (free-form LLM rephrasing). On the downstream low-level policy that collapses the action expert's conditioning to noise: every episode pairs a different paraphrase with similar motions, so the expert learns a flat scene-prior that ignores the subtask string — then at inference the high-level head invents yet another paraphrase and the expert produces tiny "uncertain hover" chunks. Add a vocabulary-discovery phase (phase 0) that runs once per dataset: - watches the first ``vocabulary.sample_episodes`` (default 3) episode videos as one Qwen-VL prompt, - asks the VLM to derive ~``n_subtask_target`` canonical imperative subtask labels and ~``n_memory_target`` first-person past-tense memory milestones that recur across the demos, - persists them to ``meta/canonical_vocabulary.json`` (human- inspectable, hand-editable), and - wires the resulting ``Vocabulary`` into the ``plan`` module so every per-episode subtask + memory call is constrained to those exact strings (both as prompt-side instructions and post-VLM validation: paraphrases snap to the closest canonical entry via token-set overlap; below a 0.5 Jaccard floor the subtask is dropped rather than warped into something semantically wrong). Operator workflow: - first run discovers the vocabulary, writes the JSON, and runs the ``plan`` module against it, - subsequent runs reuse the on-disk file (``reuse_existing=True`` default) so hand-edits stick, - set ``--vocabulary.enabled=False`` to fall back to free-form generation (the original behaviour). The discovery prompt forbids gerunds / third-person / adverbs and caps the lists to the requested counts, matching the Hi-Robot / π0.6-MEM convention of small per-environment vocabularies. 
The ``plan`` module's subtask + memory prompts grow a conditional ``{vocabulary_block}`` slot rendered only when a vocabulary is present; without one the templates collapse to their previous free-form form. Tests: 11 new unit tests under tests/annotations/test_vocabulary.py cover the on-disk round-trip, discovery against the fixture dataset, ``reuse_existing`` short-circuit, paraphrase canonicalisation, off- vocab subtask dropping, and the no-vocabulary pass-through path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:40:05 +00:00
pepijn	77a16db529	fix(smolvla2): make HighLevelSubtaskFwd actually fire at low hz + quiet startup log Two runtime fixes that surfaced from on-robot testing. (1) HighLevelSubtaskFwd was double-gated: HzTrigger fires every period (e.g. every 5s at --high_level_hz=0.2) AND the step requires the action queue to be empty. The queue-empty window is brief (~tens of ms between drain and refill) and almost never coincides with the low-hz timer, so HL effectively never fired and the subtask shown in the runtime panel stayed on the dataset's frame-0 annotation. Add HzTrigger.rearm() and have HighLevelSubtaskFwd call it when skipping due to queue-non-empty — the trigger stays armed and tries again on the next tick instead of waiting another full period. LowLevelForward keeps the original "skip" semantics because chunk_hz is meant as a true upper bound on chunk-generation rate. (2) The "robot state at startup" warning in _build_robot_observation_provider was meant to fire once but wasn't gated by _resize_logged like the sibling "camera ... live=AxB" warning. Result: it spammed every observation tick (~1-2s). Gate it on first_call (snapshot of _resize_logged["done"]) so both logs fire once at session start. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 11:04:12 +00:00
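The double-gating problem and the `rearm()` fix in commit 77a16db529 can be illustrated with a toy trigger. This is a sketch only: the class body, `should_fire`, and `high_level_step` are assumptions for the example and do not reproduce the runtime's actual `HzTrigger` or `HighLevelSubtaskFwd`.

```python
class HzTriggerSketch:
    """Toy rate limiter: fires at most once per period unless re-armed."""

    def __init__(self, hz):
        self.period = 1.0 / hz
        self.next_fire = 0.0

    def should_fire(self, now):
        if now >= self.next_fire:
            self.next_fire = now + self.period  # consume this tick
            return True
        return False

    def rearm(self):
        # Undo the consumed tick so the next call can fire immediately.
        self.next_fire = 0.0


def high_level_step(trigger, now, queue_len):
    if not trigger.should_fire(now):
        return "idle"
    if queue_len > 0:
        trigger.rearm()  # stay armed instead of waiting a full period
        return "skipped"
    return "fired"


trigger = HzTriggerSketch(hz=0.2)  # one tick every 5 s
print(high_level_step(trigger, now=0.00, queue_len=3))
print(high_level_step(trigger, now=0.05, queue_len=0))
print(high_level_step(trigger, now=0.10, queue_len=0))
```

Without the `rearm()` call, the second step would have to wait until `now >= 5.0`, by which time the queue is unlikely to be empty again.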
Pepijn	8194897994	fix(deps): cap placo below 0.9.16 and harden kinematics import (#3647 ) * fix(deps): cap placo below 0.9.16 and harden kinematics import placo 0.9.16 links against liburdfdom_sensor.so.4, which is unavailable on Ubuntu 24.04 (noble ships urdfdom 3.x). Importing placo on that base crashes with: ImportError: liburdfdom_sensor.so.4.0: cannot open shared object file This broke nightly Latest Deps tests (CPU and GPU) when the lockfile upgrade picked placo 0.9.16, since lerobot.model.kinematics unconditionally imports placo when _placo_available is true, and that check (importlib.util.find_spec) cannot detect dlopen failures of transitive shared libraries — so unrelated subsystems (RL actor, gym_manipulator) became unimportable. Two changes: 1. Pin placo to <0.9.16 in pyproject.toml + regenerate uv.lock (0.9.16 → 0.9.15). Short-term unblock for nightly CI until system urdfdom 4.x is broadly available. 2. Harden the import guard in src/lerobot/model/kinematics.py: wrap 'import placo' in try/except ImportError so a missing transitive .so no longer crashes module import. RobotKinematics instantiation now raises an informative ImportError citing the underlying dlopen failure via _raise_if_placo_unusable(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(kinematics): hoist _placo_runtime_error to module scope for mypy Mypy walks the TYPE_CHECKING branch in which the runtime else-block is not executed, so _placo_runtime_error was only defined at runtime and mypy reported 'Name "_placo_runtime_error" is not defined' on the three references inside _raise_if_placo_unusable. Declare the symbol unconditionally at module scope with a default of None; the runtime import-failure branch still assigns to it. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style(kinematics): drop verbose comments around placo import guard Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 12:03:07 +02:00
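The hardened import guard from commit 8194897994 follows a common pattern: catch the import failure once, remember the exception, and raise a descriptive error only when the feature is actually used. The sketch below generalises it with a hypothetical `guarded_import` helper; the error wording and the helper itself are assumptions, not the code in `src/lerobot/model/kinematics.py`.

```python
import importlib


def guarded_import(name):
    """Return (module, None) on success or (None, exception) on ImportError."""
    try:
        return importlib.import_module(name), None
    except ImportError as exc:  # also covers dlopen failures surfaced as ImportError
        return None, exc


# Module-scope default so static checkers always see the name defined.
placo, _placo_runtime_error = guarded_import("placo")


def raise_if_unusable(error, name="placo"):
    """Raise an informative ImportError at use time, citing the root cause."""
    if error is not None:
        raise ImportError(f"{name} could not be imported: {error}") from error


if _placo_runtime_error is not None:
    print("placo unavailable:", _placo_runtime_error)
```

Importing the module that contains this guard never fails; only code paths that call `raise_if_unusable` see the error.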
pepijn	ca1b951e7b	feat(pi05): expose lm_head_lr_scale for stronger text-CE gradient With knowledge_insulation=True the LM head only receives gradients on text-CE samples (e.g. ~45% of the mix for subtask_mem.yaml). Under aggressive cosine LR decay this is enough for the head's first-token distribution to drift back toward PaliGemma's pretrained <loc> detection prior — teacher-forced argmax stays high while autoregressive generation collapses to <locDDDD> tokens. Add `lm_head_lr_scale` (default 1.0, no behavior change) on PI05Config. When != 1.0, PI05Policy.get_optim_params splits the policy into two param groups: the PaliGemma lm_head projection plus its tied embed_tokens at lr * lm_head_lr_scale, and the rest at lr. The cosine scheduler multiplies both groups by the same lambda each step, so the ratio is preserved across decay. Recommended starting point for pi052 + subtask_mem.yaml runs: 5.0, combined with a higher scheduler_decay_lr floor (e.g. 5e-6 instead of 1e-6) so the head doesn't get starved in the second half of training. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 09:56:46 +00:00
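The claim in commit ca1b951e7b that the ratio between the two parameter groups survives the cosine decay can be checked with a small sketch. Everything here is illustrative: parameters are stand-in strings, the name-substring split and the cosine formula are assumptions, and this is not `PI05Policy.get_optim_params`.

```python
import math


def split_param_groups(named_params, lr, lm_head_lr_scale):
    """Hypothetical split: LM head + tied embeddings vs. everything else."""
    def is_head(name):
        return "lm_head" in name or "embed_tokens" in name

    if lm_head_lr_scale == 1.0:
        return [{"params": [p for _, p in named_params], "lr": lr}]
    head = [p for n, p in named_params if is_head(n)]
    rest = [p for n, p in named_params if not is_head(n)]
    return [
        {"params": head, "lr": lr * lm_head_lr_scale},
        {"params": rest, "lr": lr},
    ]


def cosine_lambda(step, total_steps):
    """Assumed multiplicative cosine factor in [0, 1]."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


named = [("paligemma.lm_head.weight", "w_head"), ("layers.0.q_proj.weight", "w_q")]
groups = split_param_groups(named, lr=2.5e-5, lm_head_lr_scale=5.0)
for step in (0, 5000, 9000):
    lam = cosine_lambda(step, total_steps=10000)
    head_lr, rest_lr = (g["lr"] * lam for g in groups)
    print(step, head_lr / rest_lr)
```

Because the scheduler multiplies every group's base learning rate by the same factor, the printed ratio stays at the configured scale for every step.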
pepijn	9d30d91021	fix(pi052,smolvla2): unblock text generation when LM head drifted to <loc> PaliGemma's pretraining puts heavy first-token mass on its <loc0000>.. <loc1023> ids at any "Assistant:" continuation. Our pi052 fine-tunes with knowledge_insulation=True and a small text-CE budget (~45% of samples) drift back toward that prior on long runs at low LR — teacher- forced argmax stays at 100% (CE only measures next-token given correct prefix) while autoregressive first-token selection collapses onto <loc>. On the running poulain11 checkpoint at step 8000 this manifests as a stream of <locDDDD> tokens for every subtask call — confirmed locally against the saved checkpoint on a dataset frame. Add a `suppress_loc_tokens` knob to `PI052Policy.select_message` that masks ids [256000, 257024) to -inf before sampling, and pass it from the three text-only inference steps (HighLevelSubtaskFwd, MemoryUpdateFwd, UserInterjectionFwd). VQA steps keep the default False so spatial answers can still emit locs. Verified end-to-end: suppressed → "the robot arm moves the blue block to the green basket". Also fix `_msgs_for_memory`: it was emitting the older `User: ${task}\nPlan:..\nMemory:..` / `Assistant: ${subtask}` template, which no longer matches the `memory_update` recipe layout (`User: ${task}` / `Assistant: Previous memory: ..` / `User: Completed subtask: ..`). The new prompt mirrors the training recipe; `HighLevelSubtaskFwd` stashes the just-completed subtask in `state['prior_subtask']` so the memory prompt can render `Completed subtask: ..` for `MemoryUpdateFwd`. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-22 09:50:14 +00:00
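The `suppress_loc_tokens` masking described in commit 9d30d91021 amounts to setting a contiguous id range to negative infinity before token selection. A framework-free sketch, using plain Python lists instead of tensors and a hypothetical helper name, with the id range taken from the commit message:

```python
LOC_START, LOC_END = 256000, 257024  # id range quoted in the commit message


def suppress_loc_ids(logits):
    """Return a copy of `logits` with the <loc> id range masked to -inf."""
    masked = list(logits)
    for i in range(LOC_START, min(LOC_END, len(masked))):
        masked[i] = float("-inf")
    return masked


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


logits = [0.0] * 257100   # toy vocabulary size, chosen only for the example
logits[256500] = 10.0     # a <loc> id dominates the raw distribution
logits[42] = 1.0          # best ordinary token
print(argmax(logits), argmax(suppress_loc_ids(logits)))
```

Greedy selection moves from the `<loc>` id to the best ordinary token once the range is masked, which is the behaviour the text-only inference steps rely on.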
Haoming Song	9f437d86b6	fix(groot): align GR00TN15Config with transformers config dataclasses (#3606) * fix(gr00t): fix gr00t config dataclass init TypeError * fix(groot): guard strict config decorator without transformers for passing CI --------- Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-05-22 10:31:04 +02:00
Haoming Song	b74a551d38	fix(pi0, pi05): stabilize torch.compile and expand test coverage (#3610) * chore(gr00t): sync with #3606 for fixing gr00t config crash * fix(pi0&pi05): fix graph break caused by deepcopy of past_key_values in sample_actions * fix(pi0&pi05): fix frequent recompile caused by compute_layer_complete * feat(test): add compile test and benchmark for pi0 and pi05 * feat(test): add comprehensive testing for pi0 and pi05. Including processor, forward, sample action, etc.	2026-05-22 10:29:34 +02:00
Nikodem Bartnik	c0a2e9814d	fix examples (#3623) - Fixed broken API examples in LeRobot Imitation Learning Documentation - Teleoperation with cameras improved by adding a fixed frequency in the loop (without it the camera feed gets very slow) - Wrapped record example script in main() to avoid problems on Mac - Previously teleoperation example was using SO-ARM and teleoperation with cameras was using Koch. I changed it to use SO-ARM in all of the examples. - Added section on how to train with HF Jobs - CLI and Python examples - Replaced lerobot-record with lerobot-rollout in policies examples	2026-05-21 22:14:07 +02:00
pepijn	e050d0fe0a	fix(recipes): use active_at for memory_update, rebalance subtask_mem memory_update was bound to `emitted_at(t, style=memory)`, which requires the frame's exact timestamp to match a memory annotation. Memory rows are placed at subtask-boundary timestamps and at 30 fps that's ~1% of frames, so 99% of memory_update draws couldn't render and silently fell through to _fallback_low_level_render — injecting task-conditioned low-level training on ~30% of samples (subtask_mem.yaml). Switch to `active_at`. At inference `MemoryUpdateFwd` is triggered on `subtask_change` events, but the model only needs to learn the stateless mapping (prior_memory, completed_subtask) -> current_memory. active_at supervises this mapping on every frame inside a subtask interval, against varied observations; the trigger lives outside the model. Net effect: memory_update renders on ~87% of frames, the fallback leak drops from ~30% to ~4%, and memory CE gets a meaningful (not 0.3%) training share. subtask_mem.yaml: rebalance to 0.30 / 0.55 / 0.15 so memory CE is ~13% effective and the freed weight goes to low_level_execution. subtask_mem_vqa_speech.yaml: keep weights (memory_update=0.10 was already balanced against the other text-CE branches). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 14:53:13 +00:00
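The coverage gap between `emitted_at` and `active_at` in commit e050d0fe0a can be illustrated with a toy resolver. The semantics below are assumptions inferred from the commit message (exact-frame match versus most recent row at or before the frame), and integer frame indices stand in for timestamps to avoid float comparisons.

```python
def emitted_at(rows, frame):
    """Assumed semantics: rows written exactly at this frame."""
    return [r for r in rows if r["frame"] == frame]


def active_at(rows, frame):
    """Assumed semantics: the most recent row at or before this frame."""
    past = [r for r in rows if r["frame"] <= frame]
    return max(past, key=lambda r: r["frame"]) if past else None


# Two memory rows placed at subtask boundaries in a 90-frame episode.
rows = [{"frame": 30, "text": "memory 1"}, {"frame": 60, "text": "memory 2"}]
frames = range(90)

emitted_hits = sum(1 for f in frames if emitted_at(rows, f))
active_hits = sum(1 for f in frames if active_at(rows, f) is not None)
print(emitted_hits, active_hits)
```

Only the boundary frames resolve under the exact-match binding, while the interval binding resolves on every frame after the first memory row; the rest fall through to the fallback path the commit describes.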
pepijn	2ca030fa28	fix(pi052): build processors from current config When fine-tuning from pi05_base, reuse only the pretrained weights so pi052 still generates recipe text labels and FAST action labels. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 13:54:29 +00:00
pepijn	36f828221c	fix(pi05): preserve pretrained paligemma lm head Keep the PaliGemma LM head in float32 and initialize it from pretrained weights or token embeddings when loading pi05 checkpoints. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 13:25:24 +00:00
Pepijn	d41d874581	fix(pi052): debug parity harness truncates prompt instead of masking The parity check in debug_text_predictions was producing false ✗ DIVERGED reports. Root cause: I built the "inference" batch by zero-masking the attention past the supervised span, but kept the full 512-token padded sequence. select_message reads the prompt-end hidden state via ``vlm_out[:, -1:]`` — the LAST position of the prefix — which in a padded batch is a padding-token hidden state, not the last prompt token. PaliGemma's prior on those padded positions reliably argmaxes to <loc0879>, falsely flagging a training/inference mismatch. Fix: truncate both tokens AND mask to length == first_sup before calling select_message, mirroring what the real runtime does (``tokenizer(prompt)`` returns un-padded ids). Now the parity check compares like-with-like. The actual training argmax in the dump was sensible English ("' move the blue cube into the green bin'" at acc=6/9) — the head is learning correctly. The "<loc>" salad was purely the harness reading from the wrong position. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 15:09:36 +02:00
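The harness bug in commit d41d874581 comes down to which position "last" refers to. A list-based sketch, with made-up token ids and a hypothetical helper name, shows why masking alone is not enough and truncation is:

```python
PAD_ID = 0  # assumed padding id for the example


def truncate_to_prompt(tokens, mask, first_sup):
    """Keep only the prompt prefix, so index -1 is the last prompt token."""
    return tokens[:first_sup], mask[:first_sup]


# prompt (4 tokens) + supervised target (2 tokens) + padding (2 tokens)
tokens = [2, 10, 11, 12, 50, 51, PAD_ID, PAD_ID]
first_sup = 4

# Zero-masking everything past the prompt leaves the sequence length unchanged,
# so reading position -1 still lands on a padding token.
masked_only = [1, 1, 1, 1, 0, 0, 0, 0]
print(tokens[-1] == PAD_ID)

prompt_tokens, prompt_mask = truncate_to_prompt(tokens, masked_only, first_sup)
print(prompt_tokens[-1])
```

After truncation the final position is the last prompt token, which matches what an un-padded tokenizer call produces at runtime.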
Khalil Meftah	bac4f61eae	refactor: support custom progress parquet overlays (#3640)	2026-05-21 14:32:10 +02:00
Pepijn	efa05f0ada	fix(train): unwrap DDP policy in debug_text_predictions hook At training time the policy is wrapped by Accelerator/DDP into a .module attribute and custom methods are NOT proxied through the wrapper, so ``hasattr(policy, "debug_text_predictions")`` was False and the periodic dump was silently no-op'ing. Walk through .module indirection to reach the raw PI052Policy that defines the method. Also surface why the dump didn't fire (no method / empty supervised positions / generation error) so users can see what's blocking it instead of staring at silence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 13:41:20 +02:00
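The unwrapping described in commit efa05f0ada can be sketched as a loop over the `.module` attribute. The helper name and the dummy classes are hypothetical, and the loop assumes the raw policy itself has no attribute called `module`.

```python
def unwrap_policy(policy):
    """Follow .module indirection until the raw policy object is reached."""
    while hasattr(policy, "module"):
        policy = policy.module
    return policy


class RawPolicy:
    def debug_text_predictions(self):
        return "dump"


class Wrapper:
    """Stand-in for a DDP-style wrapper that does not proxy custom methods."""

    def __init__(self, module):
        self.module = module


wrapped = Wrapper(Wrapper(RawPolicy()))
print(hasattr(wrapped, "debug_text_predictions"))
print(hasattr(unwrap_policy(wrapped), "debug_text_predictions"))
```

The first check fails on the wrapper and the second succeeds on the unwrapped object, which is the silent no-op the commit fixes.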
Pepijn	e98b6f726b	feat(train): debug dump runs inference too, with parity check Extends the periodic LM-head dump (LEROBOT_DEBUG_PREDS_EVERY) to ALSO run select_message autoregressively on the same prompt prefix and show: prompt : '<bos>User: ... Assistant: ' target (ground truth) : ' close the gripper ...' training argmax (teacher-fed) : ' close the gri lift ...' acc=12/15=80% inference (autoregressive) : ' close the gripper around ...' first-token parity : train=3387 (' close') vs infer=3387 (' close') ✓ MATCH The first-token parity check is decisive: training-side argmax at the prompt-end position and inference's first generated token both compute ``argmax(lm_head(h_last_prompt))`` on identical context, so they MUST match. Any divergence signals a training↔inference bug (mask, dtype, KI routing, embedding scale, etc.). Subsequent tokens can diverge because training uses teacher forcing while inference free-runs. debug_text_predictions now also returns an ``inference`` list keyed by sample, each entry carrying ``first_sup_pos`` and ``decoded``. Limited to 24 new tokens per sample to keep the dump fast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:27:32 +02:00
Pepijn	f7747d02a9	feat(train): periodic LM-head prediction dump for live debugging Adds an opt-in diagnostic that, every N training steps, dumps 5 batch samples plus the LM head's argmax prediction at every supervised position alongside the label and a ✓/✗ marker — the cheapest signal for "is text training actually learning what we expect, or collapsing to a fixed token". Refills the recipe-sample dump budget on the same cadence so the raw input shapes are also re-dumped. Opt in via env var: LEROBOT_DEBUG_PREDS_EVERY=1000 lerobot-train ... PI052 implements ``debug_text_predictions`` (mirrors the text-loss forward but returns argmax instead of CE); other policies are silently skipped. The dump runs in eval() mode under no_grad, slicing the current batch to N samples — no extra data fetch, no train-state mutation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:23:05 +02:00
pepijn	86ecd4bc2e	add subtask memory training recipe Add a recipe that blends subtask prediction, low-level execution, and memory update supervision. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 09:56:10 +00:00
pepijn	28b86449a2	fix(pi05): cast attention masks to model dtype Ensure attention masks follow the backbone dtype during bf16 inference to avoid mixed dtype failures. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-21 09:52:46 +00:00
Virgileboat	f4b834844e	Feat/clean can bus (#3526) * change timeout for handshake * enforce last state read when query * change import order * fix(motors): flush stale robstride RX and harden feedback drain * robstride: remove redundant timeout and max_messages casts * bugfix + %-style * update exception catch	2026-05-21 11:44:04 +02:00
Pepijn	5bb2da4da6	fix(pi052): VQA target format = "label <loc><loc>" not "<loc><loc> label" The trained model collapsed to spewing 40+ <loc> tokens for every prompt — subtask, memory, anything — because VQA targets were supervised to start with <loc>. With ~25% of all text samples beginning with a <loc> token, the LM head learned "Assistant: → <loc>" as a strong attractor; once one loc is emitted, autoregression chains the rest. Flip the format so every text target — subtask, memory, speech, AND VQA — starts with a regular word. The model still learns the <loc> vocabulary for the spatial portion of the answer, but loc can no longer be the first generation step out of a clean prompt. Examples: point : "green box <loc0162><loc0759>" bbox : "cube <loc0082>…<loc0409>" multi : "blue <locs> ; yellow <locs>" The runtime parser (parse_loc_answer) strips loc tokens and uses the remainder as label, so it's order-tolerant and works under either format. Old loc-first checkpoints still parse cleanly at inference; new training will use label-first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:56:48 +02:00
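Commit 5bb2da4da6 notes that the runtime parser strips loc tokens and treats the remainder as the label, which makes it order-tolerant. A regex-based sketch of that idea follows; the function name, the four-digit pattern, and the `(label, locs)` return shape are assumptions, not the actual `parse_loc_answer`.

```python
import re

_LOC = re.compile(r"<loc(\d{4})>")  # assumed token shape, e.g. <loc0162>


def parse_loc_answer_sketch(text):
    """Return (label, loc_values); works for label-first and loc-first text."""
    locs = [int(m) for m in _LOC.findall(text)]
    label = " ".join(_LOC.sub(" ", text).split())  # drop tokens, tidy spaces
    return label, locs


print(parse_loc_answer_sketch("green box <loc0162><loc0759>"))
print(parse_loc_answer_sketch("<loc0162><loc0759> green box"))
```

Both orderings yield the same label and coordinate list, so checkpoints trained on either target format parse the same way.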
Pepijn	f7b989ad97	fix(pi052): read backbone dtype from q_proj, not first parameter select_message's bf16 cast used next(paligemma.parameters()).dtype, which lands on a fp32-kept param (norm / embedding) under to_bfloat16_for_selected_params. Mask stayed fp32 while q/k/v were bf16 → SDPA still raised "invalid dtype for bias". Read the dtype from layers[0].self_attn.q_proj.weight instead — q_proj is always cast with the rest, so its dtype matches what SDPA sees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:46:08 +02:00
Pepijn	3b4376aa33	fix(pi052): cast attention bias to model dtype for bf16 inference `_prepare_attention_masks_4d` always returns fp32 (the 0.0 / -inf literals); with bf16 weights, HF PaliGemma's SDPA path raises "invalid dtype for bias - should match query's dtype" and select_message returns empty every step. Cast in both attention sites: `_compute_layer_ki` (training, when both experts run) and `select_message` (inference, VLM-only branch). Bf16 training + bf16 inference now run end to end with no dtype mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:42:26 +02:00

1744 Commits