lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-19 01:07:18 +00:00

Author	SHA1	Message	Date
Pepijn	c5965d4971	Merge branch 'main' into feat/smolvla-on-steerable	2026-06-08 11:02:54 +02:00
Maxime Ellerbach	09808183ca	feat(rollout): adding episodic strategy (#3717) * feat(rollout): adding legacy strategy * adding legacy to existing tests * updating docs and docstring * changing misleading docstring Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * adding extra guard like dagged with try except finally * Potential fix for pull request finding Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * adding reset to initial position * moving smooth teleop handover to control_utils and adding this behavior to legacy strategy * reducing duration of the handover * renaming to episodic * changing semantics of the docstring * fixing leader - follower handover disable torque * adding optional config to disable handover * wiring the smooth_leader_follower_handover config * renaming config smooth_leader_to_follower_handover --------- Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>	2026-06-06 00:32:38 +02:00
pepijn223	470fdd195d	fix(ema): default EMA decay to 0.99 Matches openpi's top-level default (ema_decay=0.99, ~last 100 steps). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 16:10:00 +02:00
pepijn223	384feca91a	fix(ema): default EMAConfig.enable to False (opt-in) EMA was on by default, so every training run on the branch (incl. VLA-JEPA and other non-flow-matching policies) created a full fp32 shadow copy. EMA only benefits flow-matching/diffusion policies (pi0/pi05/pi052). Make it opt-in via --ema.enable=true; the pi05/pi052 recipes already pass that flag. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 16:09:08 +02:00
pepijn223	7b35af6eca	Merge remote-tracking branch 'origin/main' into feat/smolvla-on-steerable Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # uv.lock	2026-06-05 14:38:47 +02:00
pepijn223	aca02ff24c	fix(robocasa): align env state/action order to openpi/robocasa convention LeRobot's RoboCasaEnv used a divergent flat state/action layout vs the robocasa package (robocasa.utils.env_utils.convert_action) and the openpi robocasa pipeline. This scrambles I/O when using openpi-convention checkpoints (e.g. the JAX->PyTorch->LeRobot converted pi05 robocasa model: CloseFridge 20% -> 60% once both orders match openpi). - convert_action: ee_pos(3)+ee_rot(3)+gripper(1)+base_motion(4)+control_mode(1) - observation.state: ee_pos_rel(3)+ee_rot_rel(4)+base_pos(3)+base_rot(4)+gripper(2) Matches openpi examples/robocasa/main.py + RobocasaInputs ordering. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 13:47:43 +02:00
pepijn223	de7ba67556	style: drop decorative === comment banners from pi052 split Replace the === separator banners (against repo style) with plain comments. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:21:10 +02:00
pepijn223	c020c0d053	refactor(pi052): split pi05_backbone into pi_gemma + modeling_pi052 Eliminate the standalone pi052/pi05_backbone.py by distributing its contents: - Generic dual-expert transformer machinery -> lerobot/policies/pi_gemma.py (sdpa_attention_forward, compute_layer_complete, PaliGemmaWithExpertModel, get_gemma_config; the openpi width/depth config is renamed GemmaConfig -> GemmaVariantConfig to avoid clashing with transformers' GemmaConfig). These sit next to the existing PiGemma layer code they already depend on. - pi052-specific model + helpers -> pi052/modeling_pi052.py (PI05Pytorch, ActionSelectKwargs, make_att_2d_masks, pad_vector, resize_with_pad_torch, create_sinusoidal_pos_embedding, sample_beta, get_safe_dtype). DEFAULT_IMAGE_SIZE is duplicated as a plain constant in pi_gemma to avoid a pi_gemma -> pi05 import cycle. Additive to pi_gemma; pi0/pi05 unaffected. Verified bit-exact on pepijn223/pi052_robocasa_full (embed/predict/forward identical) and all 34 pi052 tests pass. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:18:18 +02:00
pepijn223	4cbd91a04e	chore: drop one-off bench/build/train scripts from the PR Remove development-only tooling that doesn't belong in the PR: - examples/benchmark/* (pi052 step/kernel benchmark slurm + harness) - examples/port_datasets/slurm_build_robocasa_composite_seen.py and src/lerobot/scripts/build_robocasa_composite_seen.py (composite_seen dataset build scripts) - scripts/build_episode_filter.py, scripts/build_robocasa_smoke.sh, scripts/train_pi052_human300_exclude_unannotated.sh None are imported by the library, tests, or entry points. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:05:25 +02:00
pepijn223	afe30630cc	test(pi052): repair stale-name CE tests for fused linear CE _fast_ce/_shifted_ce were renamed to _fast_lin_ce/_shifted_lin_ce and changed from logits-based to Liger fused-linear-CE (hidden @ lm_head_weightᵀ). Update the tests via thin adapters that pass an identity lm_head_weight (so the computed logits equal the provided ones), run on CUDA (Liger is GPU-only) and skip otherwise, and loosen the allclose tolerance to absorb GPU-vs-CPU float noise on the tiny losses. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:03:18 +02:00
pepijn223	a594ad7969	refactor(pi052): self-contained policy; revert pi0/pi05 to upstream main The smolvla branch had modified the shared pi0/pi05 modeling + pi05 config to support pi052 (SDPA attention, layernorm/lm_head handling, optimizer foreach/fused/lm_head_lr_scale, embedding scaling). Decouple pi052 instead: - Vendor the PI0.5 backbone (PaliGemmaWithExpertModel, PI05Pytorch, helpers) into pi052/pi05_backbone.py (verbatim copy, no PI05Policy). - Flatten PI052Policy to subclass PreTrainedPolicy directly (no longer PI05Policy); inline the needed PI05Policy methods. - Restore optimizer_foreach/fused + get_optimizer_preset on PI052Config. - Revert pi0, pi0_fast, pi05 modeling and configuration_pi05 to origin/main (byte-identical), so the shared policies carry no smolvla modifications. Behavior verified bit-exact on pepijn223/pi052_robocasa_full: embed_language_tokens, predict_action_chunk, and the fused flow+text+FAST training loss are identical before/after (max_abs_diff=0). pi052 tests pass (pre-existing stale-name collection errors unchanged). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 19:59:27 +02:00
Maxime Ellerbach	2e9cd87bbd	feat(policies): add VLA-JEPA (#3568) * first commit * feat(policies): add VLA-JEPA * feat(policies): add VLA-JEPA * support vla_jepa * (feat)policies: add VLA-JEPA * linting * adding deps to pyproject.toml * updating uv lock * adding guards to avoid needing transformers and diffusers for type checking and basic tests * fixing action and state dim * fix warnings with qwen processor kwargs * fixing wm_loss not propagating * adjusting obs steps, tubelets size to match original implementation * some more fixes to be closer to the original implem * adding more tests to ensure good coverage * align VLA-JEPA architecture with original checkpoint - Remove stale `action_num_heads` / `action_attention_head_dim` config fields; DiT head dimensions are now always derived from the preset (DiT-B/L/test). - Add `num_target_vision_tokens` and `action_max_seq_len` config fields required by the action head's future-token embedding and positional embedding tables. - Fix default `qwen_model_name` to 2B (matches all released checkpoints). - Rename `ActionEncoder` attrs w1/w2/w3 → layer1/layer2/layer3 to match checkpoint key names; replace `nn.Sequential` decoder/state-encoder with `_MLP2` (layer1/layer2 naming). - Fix `VLAJEPAActionHead` to size ActionEncoder and StateEncoder at `inner_dim` (DiT input width) rather than `action_hidden_size` (DiT output width). - Rename `DiT.blocks` → `transformer_blocks` and `attn` → `attn1` to match checkpoint; add alternating cross/self attention (even blocks cross-attend to Qwen context, odd blocks self-attend). - Add `DiT-test` preset for unit tests. - Rewrite `ActionConditionedVideoPredictor` with explicit ViT-style blocks (`_PredictorBlock` with fused qkv) to match checkpoint structure; rename `encoder`/`norm`/`proj` → `predictor_blocks`/`predictor_norm`/`predictor_proj`. * propagate action_is_pad masking through VLA-JEPA policy pipeline Pass the `action_is_pad` tensor from the batch through to the action head so padded timesteps are excluded from the flow-matching loss. * update VLA-JEPA tests for arch changes and action_is_pad - Switch conftest to use `action_model_type="DiT-test"` now that `action_num_heads` / `action_attention_head_dim` have been removed. - Add action_head tests covering fully-padded loss (zero) and equivalence of action_is_pad=None vs all-zeros mask. - Remove obsolete `test_native_to_lerobot_wm_only` test. * add VLA-JEPA documentation Covers architecture overview, pretrained checkpoints, config reference, training/eval commands for LIBERO-10, and guidance on fine-tuning for single-camera datasets. * add one-shot script to convert ginwind/VLA-JEPA checkpoints to safetensors (will remove once migrated) * make default params more aligned with paper and pretrained models - adding possibility of freezing qwen backbone and world model - added tests for weight loading * trying out to re-init the action head to avoid pretraining dimension mismatch * allow different state dim and action dim * removing misleading future_action_window_size to just use chunk_size * lots of changes to make existing weights work, need to massively refactor the pre and post processing * refactoring into using pre and post processor * pre-commit cleanup * fixing doc defaults args Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * addressing dtype zeros issue * adding guard for diffusers * fixing training and eval examples * trying to close success rate gap * fix qwen norm layer output libero eval is now as expected * adding instructions for different embodiment + fixing some tests * smol fix to avoid having default CPU device when training * fixing misconception about multiview / singleview handling * removing conversion script * adding licences * adding .mdx docs and shortening polivy_vla_jepa_README.md * removing useless pre-processor * cleanup * removing swish in favor of silu * adding configuration gripper index and threshold * fixing symlink --------- Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> Co-authored-by: ginwind <ginwind@mail.ustc.edu.cn>	2026-06-04 19:22:51 +02:00
pepijn223	8292548f0d	fix(pi052): stop double-scaling FAST/text token embeddings embed_language_tokens already applies Gemma's sqrt(hidden) normalizer (GemmaTextScaledWordEmbedding, transformers >=5.4.0). pi052 multiplied FAST action-token and autoregressive subtask-text embeddings by sqrt(emb_dim) on top of that, double-scaling them (~2048x). Remove the manual scaling so FAST and text tokens are single-scaled, consistent with the pi05 fix and OpenPI. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 18:31:41 +02:00
pepijn223	77cc35b932	fix(pi0,pi05,pi0_fast): stop double-scaling text embeddings transformers >=5.4.0 (PR #44432) makes Gemma's embed_tokens a GemmaTextScaledWordEmbedding that already multiplies token embeddings by sqrt(hidden_size). The manual `* sqrt(embed_dim)` applied on top therefore double-scaled text (~2048x instead of ~45x), breaking VLM alignment for models trained/run on stock transformers. Remove the manual scaling and rely on embed_tokens' internal normalizer (matches main #3603). Image features stay raw (un-normalized), as before. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 18:22:34 +02:00
pepijn223	f0757fc707	fix(pi0,pi0_fast): scale text embeddings by sqrt(embed_dim) to match OpenPI OpenPI (pi0 and pi0-FAST) multiplies language token embeddings by sqrt(embed_dim) — the Gemma embedder normalizer — before the transformer. LeRobot pi0/pi0_fast omitted it, leaving text tokens ~45x under-scaled relative to the residual stream (same class of bug as the pi05 image scaling). pi0: applied in embed_prefix's lang_embed_func. pi0_fast: applied inside embed_language_tokens so prompt, FAST action tokens, and autoregressive next-token embeds are all scaled consistently. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 18:14:27 +02:00
pepijn223	a48d4e32a1	fix(pi05): don't scale image features by sqrt(hidden_size) lerobot/pi05_base was trained in the OpenPI/big_vision regime where image (soft) tokens are NOT multiplied by the Gemma embedder normalizer (sqrt(hidden_size)) — only text tokens are. Scaling image features here over-scaled them ~45x, breaking the pretrained vision-language alignment and yielding ~0% closed-loop success on RoboCasa across all pi05 runs. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 17:20:34 +02:00
Pepijn	9596e3d53f	Merge remote-tracking branch 'origin/feat/smolvla-on-steerable' into feat/smolvla-on-steerable	2026-06-04 17:14:33 +02:00
Pepijn	0a6a799317	Merge feat/language-annotation-pipeline into feat/smolvla-on-steerable Bring the authoritative annotation pipeline from the annotation branch. The annotation surface is forced to EXACTLY match feat/language-annotation-pipeline (the annotation branch is the source of truth for annotation code), which also removes smolvla's stale copies: - deleted: steerable_pipeline/vocabulary.py, tests/annotations/test_vocabulary.py, prompts/module_0_vocabulary.txt, module_1_action_record.txt, module_3_vqa.txt, module_1_plan.txt, and the old module_* prompt names (now plan_/interjections_/vqa.txt). - synced: all of src/lerobot/annotations/, lerobot_annotate.py, examples/annotations/, tests/annotations/, datasets/language.py, tests/datasets/test_language.py, docs/annotation_pipeline.mdx. Non-annotation conflicts resolved by union (keeping both branches' intent): - pyproject.toml: keep smolvla's pi extra (+sentencepiece) and add the molmoact2 extra from main. - policies/factory.py: keep both dataset_repo_id (pi052 FAST tokenizer) and dataset_meta (both are referenced); union the policy-type docstring. - scripts/lerobot_train.py: keep smolvla's pi052 / use_relative_actions processor-rebuild block. - uv.lock: regenerated from the merged pyproject. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 17:13:36 +02:00
pepijn	e660a51e78	pi052(debug): drop misleading inference/parity dump from text preds The first-token parity check re-tokenized the decoded (stripped) inference string, so the leading-space SentencePiece variant always mismatched the training argmax — a false "DIVERGED" alarm. Remove the autoregressive inference print and parity comparison (and the now-dead per-sample select_message generation), keeping only the prompt, ground-truth target, and teacher-forced argmax accuracy. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 13:32:44 +00:00
Pepijn	cdd94a703f	annotate(config): tighten field comments to one line each Collapse the remaining multi-line field comments / docstrings in config.py to single lines (or two where a knob genuinely needs it), keeping the essential rationale. Comments only — no field or behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 15:12:31 +02:00
Pepijn	cd59c8b312	annotate: remove the action_record style/feature entirely Drop the optional structured per-subtask action records — not a feature we want to ship. * language.py: remove 'action_record' from CORE_STYLES + PERSISTENT_STYLES (and the matching assertion in tests/datasets/test_language.py). * config.py: delete ActionRecordsConfig (verb/grasp vocabularies, frames_per_subtask, emit_record_row) and the PlanConfig.action_records field. * plan_subtasks_memory.py: delete _extract_action_record and the run_episode block that emitted style='action_record' rows; drop the now-unused json / to_image_blocks imports. * remove the plan_action_record.txt prompt. * run_hf_job.py: drop the action_records comment. Verified: 40 tests pass; pre-commit (ruff, mypy, bandit) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:40:34 +02:00
Pepijn	99baae012f	annotate(config): further compact field comments Tighten the remaining multi-line comment blocks in config.py (derive_task, frames/window, describe_first, action-record/vqa/vlm fields, video_backend, repo ids, executor) to 1-3 lines each. Also fix a stale path typo ('examples/annotation' -> the docstring now just says HF Jobs). Comments only — no field or behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:36:02 +02:00
Pepijn	973318ef65	annotate: dedup task_aug + row-normalization; docs module on/off table Two behavior-preserving simplifications: * plan_subtasks_memory.run_episode: the task_aug 'axes' and free-form branches built identical deduped rows via copy-pasted seen/append loops. Collapse to one branch that picks the variant source, then a shared _task_aug_rows() helper does the dedup + row build (-~25 LOC). * writer: _normalize_persistent_row / _normalize_event_row shared the same camera-validate + struct construction. Extract _normalize_row(), keeping the exact key order (the parquet struct schema is inferred from insertion order, so timestamp must stay between style and camera). docs: 'Which modules run' is now a table giving each module's on/off flag (--plan.enabled / --interjections.enabled / --vqa.enabled) and what it turns off. Verified: 40 tests pass (incl. test_writer struct round-trip); pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:18:36 +02:00
Pepijn	7471a6b1ed	annotate: compress conftest + pyproject comments (fix stale backend note) The pyproject annotations-extra comment still described the removed vllm/transformers in-process backends ('vllm preferred ... transformers fallback', '_make_vllm_client'); rewrite it for the openai-only reality and trim it. Also condense the conftest lazy-import NOTE. Comments only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:12:04 +02:00
Pepijn	20c7a12dd5	annotate: remove dead code, document CLI options, compact config Dead code (defined but never referenced anywhere in src/tests/examples): * reader.py: keyframe_indices, episode_frame_timestamps, lookup_data_path, and the now-orphaned gather_data_paths + episode_offsets_per_path (lookup_data_path was their only caller). * staging.py: iter_staged_episodes. * writer.py: normalize_rows_for_writer. * config.py VlmConfig: json_mode, batch_size, tensor_parallel_size, gpu_memory_utilization, trust_remote_code — consumed only by the in-process vllm/transformers backends that were removed; the openai auto-serve path carries those vLLM flags via serve_command instead. Kept max_model_len (still used as the serve-command default). * config.py TaskAugAxesConfig.total property. Docs: new 'Key options' section in annotation_pipeline.mdx — grouped tables (dataset in/out, module toggles, --vlm., --plan., interjections + vqa) describing the flags users actually reach for, with defaults. config.py: compact the verbose field comments + ActionRecordsConfig / TaskAugAxesConfig docstrings; fix two stale 'verify' references (the verify pass was removed — it's describe -> segment now) and the stale 'renders record back to subtask text' note (that path was removed). vlm_client docstring no longer mentions the removed json_mode field. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (40 passed); pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:05:46 +02:00
Pepijn	dbe02f0c4f	annotate(plan): condense verbose comments + docstrings Trim the long inline comment blocks (effective_task / task_aug, action records, plan-boundary rows, plan-update span closing, windowed + coverage-stitch sections) and the _generate_plan / run_plan_updates docstrings to a few lines each. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 13:52:24 +02:00
Pepijn	56cbb5f9ec	annotate(example): trim run_hf_job comments to one line each Same flags and rationale, condensed — each plan-module flag now has a short one/two-line comment instead of a paragraph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 13:48:55 +02:00
Pepijn	2af2402a0c	docs(annotate): cleaner architecture diagram layout Top-down flow (read episodes → 3 modules fan out → validator → writer → parquet) with aligned boxes, instead of the cramped bordered version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 11:59:31 +02:00
Pepijn	7bec991cdf	docs(annotate): friendlier rewrite + architecture diagram; drop reproducibility section Rewrite annotation_pipeline.mdx in plainer, easier-to-read language (shorter sentences, active voice, a plain-text intro), add an ASCII 'How it fits together' architecture diagram, and remove the 'Reproducibility via seed and prompt hashes' section. Content/links are preserved; only wording and structure change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 11:48:59 +02:00
Pepijn	c6f682b3f4	annotate docs: install lerobot from main (post-merge wording) The example already pins '@main'; update the doc step and the script docstring from 'the branch under test' to 'lerobot (from main)' now that the pipeline is merging to main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 11:45:38 +02:00
Pepijn	eba3ab3741	annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup Bugs * validator: don't re-raise on unknown style. The second column_for_style lookup (used to route persistent vs event) now sits in try/except so an unknown style is recorded by _check_column_routing and skipped instead of crashing the whole validation pass. * general_vqa._target_cameras: when restrict_to_default_camera is set but the configured camera_key isn't one the provider exposes, warn and fall back to all cameras instead of returning a phantom key that KeyErrors deep in frame decode. * interjections: clamp interjection timestamps to frame_timestamps[0] rather than a hardcoded 0.0 (datasets can start at non-zero t). Docs / code drift * annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase 0 / --vocabulary.* / canonical_vocabulary.json' section (none of it exists); describe the real describe->segment + coverage-stitch flow. Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of this PR' (matches tools.mdx, which already marks the runtime layer as not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note the default is now a single h200. Add a 'Contributing new modules' section inviting module / prompt / quality contributions. * executor docstring: six phases, no phantom phase 0. run_hf_job.py * add the Apache 2.0 license header (was flagged repeatedly). * default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1 (scale to h200x4 noted in the docstring). * pin the install to @main instead of the feature branch (won't break after merge). Naming / cleanup * rename dest_repo_id -> new_repo_id across config / script / example / test to match the LeRobot dataset edit tools. * rename prompt templates module_N_.txt -> descriptive (plan_, interjections_, vqa.txt) and update every load_prompt() call. remove dead _messages_to_prompt (used only by the removed in-process backends). * declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as real init=False dataclass fields instead of getattr monkey-patches. * scope bandit B607 to the two ffmpeg subprocess.run sites via '# nosec B607' and drop it from the global skip list. Tests * fix stale canned-VLM markers ('ONE realistic interruption' -> 'compact interjection', 'Update the memory' -> 'compressed semantic memory') and drop the dead 'concise hierarchical PLAN' plan responders (plan generation is deterministic now) in run_e2e_smoke, test_pipeline_recipe_render, test_modules. * run_e2e_smoke now asserts interjection + speech rows are produced so a stale marker can't silently pass again. * drop remaining 'PR 1' / 'PR 2' references from test comments / names. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke (interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit, prettier) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 18:30:46 +02:00
Pepijn	3a24e426df	language: register action_record in CORE_STYLES so STYLE_REGISTRY contains it action_record is in PERSISTENT_STYLES but was missing from CORE_STYLES, so STYLE_REGISTRY (= CORE_STYLES | EXTENDED_STYLES) didn't contain it and the PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY invariant in test_style_registry_routes_columns failed. Add it to CORE_STYLES so the registry, the persistent-set, and column_for_style() stay consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:38:06 +02:00
Pepijn	b9a0187335	annotate: drop local in-process VLM backends — HF Jobs (openai) only for now The shipped workflow is Hugging Face Jobs (examples/annotations/run_hf_job.py): it serves the model with vLLM in the vllm/vllm-openai image and the pipeline talks to it over the OpenAI-compatible API. The in-process vllm / transformers local backends added surface (and the vllm one pinned an old torch) without being part of that path, so they're removed for now. * vlm_client.make_vlm_client: keep only backend='openai' (+ 'stub' rejected with the usual guidance). Requesting 'vllm'/'transformers' now raises a clear 'not supported for now — use the HF Jobs flow' error. Removed _make_vllm_client and _make_transformers_client. * config: backend docstring updated (openai-only); default model_id bumped to Qwen/Qwen3.6-27B to match run_hf_job. * docs/annotation_pipeline.mdx: remove the '## Running locally' section; the launcher description now says one vLLM server per GPU over the OpenAI API, and the 'One Qwen-VL pass' note drops the 'vLLM/transformers fallback' wording. Tests are unaffected (they construct StubVlmClient directly; nothing referenced the removed backends). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:28:40 +02:00
Pepijn	a18d969753	tests(annotations): fix stale canned-VLM markers + action_record style assertion The annotation tests had never actually run in CI (collection failed on the missing 'datasets' extra); now that they do, three stale assertions surfaced against the evolved pipeline: * test_module1_plan_memory_subtask_smoke: the memory canned-responder marker 'Update the memory' no longer appears in module_1_memory.txt (now 'compressed semantic memory'), so the stub returned no memory row and the {subtask,plan,memory} subset check failed. Marker updated to match the current prompt. * test_module2_mid_episode_emits_paired_interjection_and_speech: the interjection marker 'Write ONE interjection' is now 'Write ONE compact interjection' in module_2_interjection.txt, so 0 interjections were emitted. Marker updated. * tests/datasets/test_language.py::test_style_registry_routes_columns: PERSISTENT_STYLES gained 'action_record' in this PR; add it to the expected set. These are test/prompt-marker syncs — no production behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:21:17 +02:00
Pepijn	273a8fc335	deps(annotations): drop hard vllm dependency to unblock CI torch/torchcodec resolution Fast Pytest 'dataset' tier failed collecting tests/datasets/test_video_decoder_cache.py with 'Could not load libtorchcodec ... undefined symbol: torch_dtype_float4_e2m1fn_x2' — a torch/torchcodec ABI mismatch. Root cause: the annotations extra's vllm hard-pins an older torch (via xformers/xgrammar -> torch 2.8). uv resolves a SINGLE unified lock across all extras, so vllm capped torch to 2.8 for every tier — including dataset, whose torchcodec 0.11.1 needs torch 2.11. The result was torch 2.8 + torchcodec 0.11.1 installed together -> ABI break. (main has no vllm, so it resolves torch 2.11 + torchcodec 0.11.1 cleanly.) Fix: remove vllm from the annotations extra. It is not needed by the shipped workflow — examples/annotations/run_hf_job.py gets vllm from the vllm/vllm-openai image and talks to it over the OpenAI-compatible API (--vlm.backend=openai), and vlm_client._make_vllm_client imports vllm lazily. For the in-process --vlm.backend=vllm path, install vllm separately (the ImportError now says so). After the fix uv resolves torch 2.11.0 + torchcodec 0.11.1 (matching main); uv lock --check is clean. The annotations extra still provides datasets / transformers / openai. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:09:22 +02:00
Pepijn	b9246ef61b	tests(annotations): guard on the 'dataset' extra so base fast-test tier skips cleanly Fast Pytest Tests failed at COLLECTION in the base '--extra test' tier with 'ModuleNotFoundError: No module named datasets': tests/annotations/conftest.py imported the fixture dataset builder (-> lerobot.datasets -> the HF 'datasets' lib + pandas/pyarrow), which only ship under the 'dataset' extra, so the whole annotations package crashed. Fix uses the repo's proven module-level guard pattern (see tests/datasets/test_language.py), NOT a conftest-level importorskip — verified empirically that pytest.importorskip raised during conftest import is treated as a collection ERROR (exit 1), while module-level importorskip is a clean SKIP. * conftest.py: import build_annotation_dataset LAZILY inside the fixtures so the conftest itself imports cleanly in every tier. * test_modules / test_validator / test_writer / test_pipeline_recipe_render: add module-level pytest.importorskip('datasets') + ('pandas') before the pyarrow / lerobot.* imports (# noqa: E402 to match the existing convention). pyarrow-importing modules place the guard before the pyarrow import. * tests/scripts/test_lerobot_annotate.py: same guard (its _push_to_hub path imports lerobot.datasets). Result: - base / hardware / viz tiers (no dataset extra): annotation tests skip cleanly; the rest of the suite runs -> exit 0. - dataset tier: datasets present -> guards pass through -> annotation tests run with the stub VLM. The pipeline modules import only stdlib + relative + lerobot.datasets (no module-level datatrove / vllm / openai), so they import fine there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 15:57:04 +02:00
Pepijn	870980efd6	Merge branch 'main' into feat/language-annotation-pipeline	2026-06-03 15:46:13 +02:00
Jaimin	d1b1c5c8cf	docs: fix broken dataset script paths (datasets/v30 -> scripts) (#3695) The docs pointed at src/lerobot/datasets/v30/, which does not exist. Both scripts actually live in src/lerobot/scripts/: - convert_dataset_v21_to_v30.py - augment_dataset_quantile_stats.py Updated the four references (one python -m module path and three file-path invocations) to the correct location, matching each script's own usage docstring.	2026-06-03 14:48:19 +02:00
Nikodem Bartnik	741c2d0a39	Docs/add lelab (#3707) * first text draft (no images) * simplified docs * fix formatting * add youtube video * add a tip about compatibility * fix broken link	2026-06-03 14:22:05 +02:00
Haoming Song	19fe315971	fix(train): enable relative action overrides for pretrained processors (#3711) * fix(train): enable relative action overrides for pretrained processors Keep pretrained processor pipelines when use_relative_actions is enabled and apply relative/absolute action processor settings through overrides. Rename the relative action processor registry key to relative_actions_processor. * fix(config): reject rename_map without pretrained checkpoint Fail fast when rename_map is set during fresh initialization, since fresh configs derive feature names from the current dataset and no rename is applied. --------- Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-06-03 11:46:35 +02:00
Khalil Meftah	906b585826	fix(datasets): default `private` to `None` in `push_to_hub` to respect Hub org visibility settings (#3713)	2026-06-02 19:25:13 +02:00
Pepijn	4c86332fe3	feat(annotate): add plan toggle, drop subtask verify pass, 4xH200 job - PlanConfig.emit_plan (default True): keep subtasks + memory but skip the per-boundary "plan" rows and their VLM call when False. - Remove the subtask_verify pass entirely: pruning dropped legitimate subtasks and the stitch step already guarantees full-episode coverage. Deletes _verify_subtasks, both call sites, and the now-unused module_1_subtask_verify prompt. - run_hf_job example: 4xH200 (4 vllm servers), emit_plan=false, vqa off. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-02 18:02:13 +02:00
pepijn	23419026d5	pi052: parquet-direct FAST tokenizer fit (fix v3 dataset hang) ``fit_fast_tokenizer`` previously called ``LeRobotDataset(repo_id, episodes=[N])`` per sampled episode, which on v3-format datasets routes through HF datasets' split lookup and raises ``ValueError: Instruction "train" corresponds to no data!`` on every episode. On ``pepijn223/robocasa_pretrain_human300_v4`` (32 k episodes) this looped through 13,293 skipped episodes for ~2.5 h before the NCCL watchdog killed the run via the 2 h ALLREDUCE timeout (job 22182985). Switch to reading the ``action`` column directly from the dataset's ``data/chunk-/file-.parquet`` shards (same pattern as the audit scripts). Verified end-to-end on the 32 k-episode dataset: 1000 chunks collected from 1000 episodes in 70.7 s. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-02 15:54:31 +00:00
Pepijn	1417fd69b2	docs(annotate): prettier format annotation_pipeline.mdx Quality-gate fix: ruff-format/markdown prettier hook reflow of the annotation pipeline doc. No content change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:41:46 +02:00
Pepijn	53c7b4c69a	annotate: ruff lint + format pass Quality-gate fixes after the main merge: * UP037: drop redundant quotes from PlanConfig forward-ref annotations (action_records / task_aug_axes) — safe under 'from __future__ import annotations'. * ruff format applied to config.py, executor.py, general_vqa.py, plan_subtasks_memory.py, validator.py, lerobot_annotate.py. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:38:18 +02:00
Pepijn	3662c41b85	Merge remote-tracking branch 'origin/main' into feat/language-annotation-pipeline # Conflicts: # uv.lock	2026-06-02 17:36:07 +02:00
Pepijn	518e191337	annotate: windowed subtask generation for constant temporal density Long episodes no longer get sparse subtasks. Previously a long episode was subsampled to max_video_frames=32 across its whole duration (~1 frame/4s for a 2-min clip). New opt-in windowing keeps a CONSTANT frames_per_second density by splitting the episode into fixed-length windows and running the subtask chain per window. New PlanConfig.subtask_window_seconds (default 0.0 = off). When > 0 and the episode is longer than one window: * episode is split into consecutive [w0, w1] windows of this length * each window's frames are sampled at frames_per_second (so a 32s window at 1 fps = 32 frames, filling but not exceeding the per-call context budget) * the full describe -> segment -> verify chain runs PER window, in window-relative time [0, L]; spans are offset back to absolute * all windows' spans are merged, frame-snap-deduped, and stitched into one contiguous whole-episode cover Implementation: * _episode_video_block / _video_message / _describe_episode / _verify_subtasks gain an optional window=(w0,w1); when set they embed frames sampled in that absolute range at frames_per_second (video_url path skipped — it's whole-episode). * _clean_spans gains bounds= (override clamp range, for window-relative spans) and dedupe= (skip frame-snap until the merged absolute set). * new _generate_subtasks_windowed + _subtasks_for_window orchestrate the loop; _generate_subtasks branches to them when window_s > 0. run_hf_job.py: --plan.subtask_window_seconds=32 (32s windows at 1 fps). Cost scales with episode length (chain calls × ceil(duration/window)). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:26:14 +02:00
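The windowing scheme described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's code: `split_into_windows`, `sample_times`, and `to_absolute` are hypothetical helper names, and spans are assumed to be `(start, end, label)` tuples in seconds.

```python
import math
from typing import List, Tuple

Span = Tuple[float, float, str]  # assumed shape: (start_s, end_s, label)


def split_into_windows(duration_s: float, window_s: float) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into consecutive windows of length window_s.

    window_s <= 0 means windowing is off; an episode no longer than one
    window is also returned as a single whole-episode window.
    """
    if window_s <= 0 or duration_s <= window_s:
        return [(0.0, duration_s)]
    n_windows = math.ceil(duration_s / window_s)
    return [
        (i * window_s, min((i + 1) * window_s, duration_s))
        for i in range(n_windows)
    ]


def sample_times(w0: float, w1: float, fps: float) -> List[float]:
    """Absolute timestamps sampled at a constant fps inside [w0, w1)."""
    n_frames = max(1, int((w1 - w0) * fps))
    return [w0 + k / fps for k in range(n_frames)]


def to_absolute(spans: List[Span], w0: float) -> List[Span]:
    """Offset window-relative spans back to absolute episode time."""
    return [(start + w0, end + w0, label) for start, end, label in spans]
```

With a 120 s episode and 32 s windows at 1 fps, this gives four windows, the first three carrying 32 frames each and the last one 24, so the per-call frame count stays bounded while the number of calls grows with `ceil(duration / window)`.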
Pepijn	3236c6ee4a	examples(annotate): switch run_hf_job to Qwen3.6-27B (dense VLM) Swap the annotation VLM from Qwen3.6-35B-A3B (sparse MoE, ~3B active) to Qwen3.6-27B (dense, 27B all-active). Per Scale's dense-captioning study, model capacity is the #1 lever and the dominant failure is visual grounding — both helped by ~9x more active params. Qwen3.6-27B is a vision-language model (vision encoder, image + video), same family so the chat template / video handling / enable_thinking=false flag are unchanged, and at 27B dense it still fits one H200 per server, so the two-parallel-server layout (TP=1, one per GPU) is preserved — no throughput-layout change, just a much stronger model. Kept: parallel_servers=2, num_gpus=2, max-model-len 32768 (the 32-frame embedded budget is ~10k tokens, well under), gpu-mem 0.8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:16:26 +02:00
Pepijn	cd128cbbd5	annotate: add verb-scoped disambiguation rules to subtask prompt Adopt the one prompt technique Scale's dense-captioning study found reliably positive: targeted, verb-scoped, visually-grounded disambiguation rules. Their lesson was that such a rule must fire ONLY on the spatial situation it names (their narrow 'Stack vs Put' rule helped; an over-broad directional 'Scoop' rule bled into other verbs and hurt), so each rule here is phrased visually and scoped to one confusable pair: * stack-vs-put (on top of an object vs on a surface) * insert-vs-put (fitted slot vs surface) * pick-up/retrieve-vs-put (decide by which way the OBJECT moves: gripper closes + object moves with hand = pick up; gripper opens + object stays = put — directly targets Scale's dominant direction-flip failure) * pour-vs-put (tilt + flow vs untilted move) This is the highest-confidence, lowest-risk change from the Scale findings; our pipeline already aligns with their 'avoid' list (no temporal tokens, no overlays, no fancy sampling, no sequential context injection, uniform sampling, describe-don't-predict framing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:10:49 +02:00
Pepijn	1fb46ab300	annotate: cap embedded-frame budget to fit VLM context (fix 32k overflow) Switching the plan module to embedded frames (use_video_url=false) exposed a context overflow: at frames_per_second=2.0 with the old max_video_frames=128 default, a 480x640 episode embeds ~128 frames ≈ 33-39k vision tokens, over the model's 32768 context — every plan call died with 'Input length exceeds maximum context length' (HTTP 400), crashing the whole annotation job. The video_url path never hit this because the server downsampled; the embedded path sends every sampled frame, so the frame count is a hard token budget. Fix: * config default max_video_frames 128 -> 32 (~8-10k vision tokens, comfortable headroom for the prompt + describe/verify passes). Frames are still sampled UNIFORMLY across the whole episode, so longer episodes are subsampled, not truncated — full temporal coverage preserved, just coarser density. * run_hf_job.py: frames_per_second 2.0 -> 1.0, explicit --plan.max_video_frames=32, with a comment explaining the token budget and the 'do not raise toward 128 with embedded frames' rule. Only the plan module embeds the full episode; VQA (1 frame/tick) and interjections (4-frame window) were never at risk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:02:25 +02:00
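The token-budget reasoning in the commit above can be checked with a short sketch. The helper names are hypothetical, and the figure of about 300 vision tokens per 480x640 frame is an assumption back-derived from the commit's "128 frames ≈ 33-39k tokens" estimate, not a measured value.

```python
from typing import List

CONTEXT_LIMIT = 32768          # model context length quoted in the commit
TOKENS_PER_FRAME = 300         # assumption: roughly 33-39k tokens / 128 frames


def uniform_frame_times(duration_s: float, fps: float, max_frames: int) -> List[float]:
    """Sample timestamps uniformly over the whole episode, capped at max_frames.

    Long episodes are subsampled more coarsely rather than truncated, so the
    last timestamp still falls near the end of the episode.
    """
    wanted = max(1, int(duration_s * fps))
    n_frames = min(wanted, max_frames)
    # Use the center of each of n_frames equal-width bins.
    return [(k + 0.5) * duration_s / n_frames for k in range(n_frames)]


def estimate_vision_tokens(n_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Rough vision-token cost of embedding n_frames frames."""
    return n_frames * tokens_per_frame
```

Under that assumption, 128 embedded frames cost about 38k tokens and overflow the 32768-token context, while 32 frames cost about 10k and leave room for the prompt and the follow-up passes.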

1820 Commits