lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-18 00:37:10 +00:00

Author	SHA1	Message	Date
pepijn223	aca02ff24c	fix(robocasa): align env state/action order to openpi/robocasa convention LeRobot's RoboCasaEnv used a divergent flat state/action layout vs the robocasa package (robocasa.utils.env_utils.convert_action) and the openpi robocasa pipeline. This scrambles I/O when using openpi-convention checkpoints (e.g. the JAX->PyTorch->LeRobot converted pi05 robocasa model: CloseFridge 20% -> 60% once both orders match openpi). - convert_action: ee_pos(3)+ee_rot(3)+gripper(1)+base_motion(4)+control_mode(1) - observation.state: ee_pos_rel(3)+ee_rot_rel(4)+base_pos(3)+base_rot(4)+gripper(2) Matches openpi examples/robocasa/main.py + RobocasaInputs ordering. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-05 13:47:43 +02:00
pepijn223	de7ba67556	style: drop decorative === comment banners from pi052 split Replace the === separator banners (against repo style) with plain comments. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:21:10 +02:00
pepijn223	c020c0d053	refactor(pi052): split pi05_backbone into pi_gemma + modeling_pi052 Eliminate the standalone pi052/pi05_backbone.py by distributing its contents: - Generic dual-expert transformer machinery -> lerobot/policies/pi_gemma.py (sdpa_attention_forward, compute_layer_complete, PaliGemmaWithExpertModel, get_gemma_config; the openpi width/depth config is renamed GemmaConfig -> GemmaVariantConfig to avoid clashing with transformers' GemmaConfig). These sit next to the existing PiGemma layer code they already depend on. - pi052-specific model + helpers -> pi052/modeling_pi052.py (PI05Pytorch, ActionSelectKwargs, make_att_2d_masks, pad_vector, resize_with_pad_torch, create_sinusoidal_pos_embedding, sample_beta, get_safe_dtype). DEFAULT_IMAGE_SIZE is duplicated as a plain constant in pi_gemma to avoid a pi_gemma -> pi05 import cycle. Additive to pi_gemma; pi0/pi05 unaffected. Verified bit-exact on pepijn223/pi052_robocasa_full (embed/predict/forward identical) and all 34 pi052 tests pass. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:18:18 +02:00
pepijn223	4cbd91a04e	chore: drop one-off bench/build/train scripts from the PR Remove development-only tooling that doesn't belong in the PR: - examples/benchmark/* (pi052 step/kernel benchmark slurm + harness) - examples/port_datasets/slurm_build_robocasa_composite_seen.py and src/lerobot/scripts/build_robocasa_composite_seen.py (composite_seen dataset build scripts) - scripts/build_episode_filter.py, scripts/build_robocasa_smoke.sh, scripts/train_pi052_human300_exclude_unannotated.sh None are imported by the library, tests, or entry points. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:05:25 +02:00
pepijn223	afe30630cc	test(pi052): repair stale-name CE tests for fused linear CE _fast_ce/_shifted_ce were renamed to _fast_lin_ce/_shifted_lin_ce and changed from logits-based to Liger fused-linear-CE (hidden @ lm_head_weightᵀ). Update the tests via thin adapters that pass an identity lm_head_weight (so the computed logits equal the provided ones), run on CUDA (Liger is GPU-only) and skip otherwise, and loosen the allclose tolerance to absorb GPU-vs-CPU float noise on the tiny losses. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 20:03:18 +02:00
pepijn223	a594ad7969	refactor(pi052): self-contained policy; revert pi0/pi05 to upstream main The smolvla branch had modified the shared pi0/pi05 modeling + pi05 config to support pi052 (SDPA attention, layernorm/lm_head handling, optimizer foreach/fused/lm_head_lr_scale, embedding scaling). Decouple pi052 instead: - Vendor the PI0.5 backbone (PaliGemmaWithExpertModel, PI05Pytorch, helpers) into pi052/pi05_backbone.py (verbatim copy, no PI05Policy). - Flatten PI052Policy to subclass PreTrainedPolicy directly (no longer PI05Policy); inline the needed PI05Policy methods. - Restore optimizer_foreach/fused + get_optimizer_preset on PI052Config. - Revert pi0, pi0_fast, pi05 modeling and configuration_pi05 to origin/main (byte-identical), so the shared policies carry no smolvla modifications. Behavior verified bit-exact on pepijn223/pi052_robocasa_full: embed_language_tokens, predict_action_chunk, and the fused flow+text+FAST training loss are identical before/after (max_abs_diff=0). pi052 tests pass (pre-existing stale-name collection errors unchanged). Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 19:59:27 +02:00
pepijn223	8292548f0d	fix(pi052): stop double-scaling FAST/text token embeddings embed_language_tokens already applies Gemma's sqrt(hidden) normalizer (GemmaTextScaledWordEmbedding, transformers >=5.4.0). pi052 multiplied FAST action-token and autoregressive subtask-text embeddings by sqrt(emb_dim) on top of that, double-scaling them (~2048x). Remove the manual scaling so FAST and text tokens are single-scaled, consistent with the pi05 fix and OpenPI. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 18:31:41 +02:00
pepijn223	77cc35b932	fix(pi0,pi05,pi0_fast): stop double-scaling text embeddings transformers >=5.4.0 (PR #44432) makes Gemma's embed_tokens a GemmaTextScaledWordEmbedding that already multiplies token embeddings by sqrt(hidden_size). The manual `* sqrt(embed_dim)` applied on top therefore double-scaled text (~2048x instead of ~45x), breaking VLM alignment for models trained/run on stock transformers. Remove the manual scaling and rely on embed_tokens' internal normalizer (matches main #3603). Image features stay raw (un-normalized), as before. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 18:22:34 +02:00
pepijn223	f0757fc707	fix(pi0,pi0_fast): scale text embeddings by sqrt(embed_dim) to match OpenPI OpenPI (pi0 and pi0-FAST) multiplies language token embeddings by sqrt(embed_dim) — the Gemma embedder normalizer — before the transformer. LeRobot pi0/pi0_fast omitted it, leaving text tokens ~45x under-scaled relative to the residual stream (same class of bug as the pi05 image scaling). pi0: applied in embed_prefix's lang_embed_func. pi0_fast: applied inside embed_language_tokens so prompt, FAST action tokens, and autoregressive next-token embeds are all scaled consistently. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 18:14:27 +02:00
pepijn223	a48d4e32a1	fix(pi05): don't scale image features by sqrt(hidden_size) lerobot/pi05_base was trained in the OpenPI/big_vision regime where image (soft) tokens are NOT multiplied by the Gemma embedder normalizer (sqrt(hidden_size)) — only text tokens are. Scaling image features here over-scaled them ~45x, breaking the pretrained vision-language alignment and yielding ~0% closed-loop success on RoboCasa across all pi05 runs. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 17:20:34 +02:00
Pepijn	9596e3d53f	Merge remote-tracking branch 'origin/feat/smolvla-on-steerable' into feat/smolvla-on-steerable	2026-06-04 17:14:33 +02:00
Pepijn	0a6a799317	Merge feat/language-annotation-pipeline into feat/smolvla-on-steerable Bring the authoritative annotation pipeline from the annotation branch. The annotation surface is forced to EXACTLY match feat/language-annotation-pipeline (the annotation branch is the source of truth for annotation code), which also removes smolvla's stale copies: - deleted: steerable_pipeline/vocabulary.py, tests/annotations/test_vocabulary.py, prompts/module_0_vocabulary.txt, module_1_action_record.txt, module_3_vqa.txt, module_1_plan.txt, and the old module_* prompt names (now plan_/interjections_/vqa.txt). - synced: all of src/lerobot/annotations/, lerobot_annotate.py, examples/annotations/, tests/annotations/, datasets/language.py, tests/datasets/test_language.py, docs/annotation_pipeline.mdx. Non-annotation conflicts resolved by union (keeping both branches' intent): - pyproject.toml: keep smolvla's pi extra (+sentencepiece) and add the molmoact2 extra from main. - policies/factory.py: keep both dataset_repo_id (pi052 FAST tokenizer) and dataset_meta (both are referenced); union the policy-type docstring. - scripts/lerobot_train.py: keep smolvla's pi052 / use_relative_actions processor-rebuild block. - uv.lock: regenerated from the merged pyproject. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 17:13:36 +02:00
pepijn	e660a51e78	pi052(debug): drop misleading inference/parity dump from text preds The first-token parity check re-tokenized the decoded (stripped) inference string, so the leading-space SentencePiece variant always mismatched the training argmax — a false "DIVERGED" alarm. Remove the autoregressive inference print and parity comparison (and the now-dead per-sample select_message generation), keeping only the prompt, ground-truth target, and teacher-forced argmax accuracy. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 13:32:44 +00:00
Pepijn	cdd94a703f	annotate(config): tighten field comments to one line each Collapse the remaining multi-line field comments / docstrings in config.py to single lines (or two where a knob genuinely needs it), keeping the essential rationale. Comments only — no field or behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 15:12:31 +02:00
Pepijn	cd59c8b312	annotate: remove the action_record style/feature entirely Drop the optional structured per-subtask action records — not a feature we want to ship. * language.py: remove 'action_record' from CORE_STYLES + PERSISTENT_STYLES (and the matching assertion in tests/datasets/test_language.py). * config.py: delete ActionRecordsConfig (verb/grasp vocabularies, frames_per_subtask, emit_record_row) and the PlanConfig.action_records field. * plan_subtasks_memory.py: delete _extract_action_record and the run_episode block that emitted style='action_record' rows; drop the now-unused json / to_image_blocks imports. * remove the plan_action_record.txt prompt. * run_hf_job.py: drop the action_records comment. Verified: 40 tests pass; pre-commit (ruff, mypy, bandit) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:40:34 +02:00
Pepijn	99baae012f	annotate(config): further compact field comments Tighten the remaining multi-line comment blocks in config.py (derive_task, frames/window, describe_first, action-record/vqa/vlm fields, video_backend, repo ids, executor) to 1-3 lines each. Also fix a stale path typo ('examples/annotation' -> the docstring now just says HF Jobs). Comments only — no field or behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:36:02 +02:00
Pepijn	973318ef65	annotate: dedup task_aug + row-normalization; docs module on/off table Two behavior-preserving simplifications: * plan_subtasks_memory.run_episode: the task_aug 'axes' and free-form branches built identical deduped rows via copy-pasted seen/append loops. Collapse to one branch that picks the variant source, then a shared _task_aug_rows() helper does the dedup + row build (-~25 LOC). * writer: _normalize_persistent_row / _normalize_event_row shared the same camera-validate + struct construction. Extract _normalize_row(), keeping the exact key order (the parquet struct schema is inferred from insertion order, so timestamp must stay between style and camera). docs: 'Which modules run' is now a table giving each module's on/off flag (--plan.enabled / --interjections.enabled / --vqa.enabled) and what it turns off. Verified: 40 tests pass (incl. test_writer struct round-trip); pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:18:36 +02:00
Pepijn	7471a6b1ed	annotate: compress conftest + pyproject comments (fix stale backend note) The pyproject annotations-extra comment still described the removed vllm/transformers in-process backends ('vllm preferred ... transformers fallback', '_make_vllm_client'); rewrite it for the openai-only reality and trim it. Also condense the conftest lazy-import NOTE. Comments only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:12:04 +02:00
Pepijn	20c7a12dd5	annotate: remove dead code, document CLI options, compact config Dead code (defined but never referenced anywhere in src/tests/examples): * reader.py: keyframe_indices, episode_frame_timestamps, lookup_data_path, and the now-orphaned gather_data_paths + episode_offsets_per_path (lookup_data_path was their only caller). * staging.py: iter_staged_episodes. * writer.py: normalize_rows_for_writer. * config.py VlmConfig: json_mode, batch_size, tensor_parallel_size, gpu_memory_utilization, trust_remote_code — consumed only by the in-process vllm/transformers backends that were removed; the openai auto-serve path carries those vLLM flags via serve_command instead. Kept max_model_len (still used as the serve-command default). * config.py TaskAugAxesConfig.total property. Docs: new 'Key options' section in annotation_pipeline.mdx — grouped tables (dataset in/out, module toggles, --vlm., --plan., interjections + vqa) describing the flags users actually reach for, with defaults. config.py: compact the verbose field comments + ActionRecordsConfig / TaskAugAxesConfig docstrings; fix two stale 'verify' references (the verify pass was removed — it's describe -> segment now) and the stale 'renders record back to subtask text' note (that path was removed). vlm_client docstring no longer mentions the removed json_mode field. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (40 passed); pre-commit clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 14:05:46 +02:00
Pepijn	dbe02f0c4f	annotate(plan): condense verbose comments + docstrings Trim the long inline comment blocks (effective_task / task_aug, action records, plan-boundary rows, plan-update span closing, windowed + coverage-stitch sections) and the _generate_plan / run_plan_updates docstrings to a few lines each. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 13:52:24 +02:00
Pepijn	56cbb5f9ec	annotate(example): trim run_hf_job comments to one line each Same flags and rationale, condensed — each plan-module flag now has a short one/two-line comment instead of a paragraph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 13:48:55 +02:00
Pepijn	2af2402a0c	docs(annotate): cleaner architecture diagram layout Top-down flow (read episodes → 3 modules fan out → validator → writer → parquet) with aligned boxes, instead of the cramped bordered version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 11:59:31 +02:00
Pepijn	7bec991cdf	docs(annotate): friendlier rewrite + architecture diagram; drop reproducibility section Rewrite annotation_pipeline.mdx in plainer, easier-to-read language (shorter sentences, active voice, a plain-text intro), add an ASCII 'How it fits together' architecture diagram, and remove the 'Reproducibility via seed and prompt hashes' section. Content/links are preserved; only wording and structure change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 11:48:59 +02:00
Pepijn	c6f682b3f4	annotate docs: install lerobot from main (post-merge wording) The example already pins '@main'; update the doc step and the script docstring from 'the branch under test' to 'lerobot (from main)' now that the pipeline is merging to main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-04 11:45:38 +02:00
Pepijn	eba3ab3741	annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup Bugs * validator: don't re-raise on unknown style. The second column_for_style lookup (used to route persistent vs event) now sits in try/except so an unknown style is recorded by _check_column_routing and skipped instead of crashing the whole validation pass. * general_vqa._target_cameras: when restrict_to_default_camera is set but the configured camera_key isn't one the provider exposes, warn and fall back to all cameras instead of returning a phantom key that KeyErrors deep in frame decode. * interjections: clamp interjection timestamps to frame_timestamps[0] rather than a hardcoded 0.0 (datasets can start at non-zero t). Docs / code drift * annotation_pipeline.mdx: drop the phantom 'vocabulary discovery / phase 0 / --vocabulary.* / canonical_vocabulary.json' section (none of it exists); describe the real describe->segment + coverage-stitch flow. Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not part of this PR' (matches tools.mdx, which already marks the runtime layer as not-yet-implemented). Fix the --push_to_hub/--new_repo_id wording. Note the default is now a single h200. Add a 'Contributing new modules' section inviting module / prompt / quality contributions. * executor docstring: six phases, no phantom phase 0. run_hf_job.py * add the Apache 2.0 license header (was flagged repeatedly). * default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1 (scale to h200x4 noted in the docstring). * pin the install to @main instead of the feature branch (won't break after merge). Naming / cleanup * rename dest_repo_id -> new_repo_id across config / script / example / test to match the LeRobot dataset edit tools. * rename prompt templates module_N_.txt -> descriptive (plan_, interjections_, vqa.txt) and update every load_prompt() call. remove dead _messages_to_prompt (used only by the removed in-process backends). * declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as real init=False dataclass fields instead of getattr monkey-patches. * scope bandit B607 to the two ffmpeg subprocess.run sites via '# nosec B607' and drop it from the global skip list. Tests * fix stale canned-VLM markers ('ONE realistic interruption' -> 'compact interjection', 'Update the memory' -> 'compressed semantic memory') and drop the dead 'concise hierarchical PLAN' plan responders (plan generation is deterministic now) in run_e2e_smoke, test_pipeline_recipe_render, test_modules. * run_e2e_smoke now asserts interjection + speech rows are produced so a stale marker can't silently pass again. * drop remaining 'PR 1' / 'PR 2' references from test comments / names. Verified: tests/annotations + tests/datasets/test_language + tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke (interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit, prettier) clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 18:30:46 +02:00
Pepijn	3a24e426df	language: register action_record in CORE_STYLES so STYLE_REGISTRY contains it action_record is in PERSISTENT_STYLES but was missing from CORE_STYLES, so STYLE_REGISTRY (= CORE_STYLES | EXTENDED_STYLES) didn't contain it and the PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY invariant in test_style_registry_routes_columns failed. Add it to CORE_STYLES so the registry, the persistent-set, and column_for_style() stay consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:38:06 +02:00
Pepijn	b9a0187335	annotate: drop local in-process VLM backends — HF Jobs (openai) only for now The shipped workflow is Hugging Face Jobs (examples/annotations/run_hf_job.py): it serves the model with vLLM in the vllm/vllm-openai image and the pipeline talks to it over the OpenAI-compatible API. The in-process vllm / transformers local backends added surface (and the vllm one pinned an old torch) without being part of that path, so they're removed for now. * vlm_client.make_vlm_client: keep only backend='openai' (+ 'stub' rejected with the usual guidance). Requesting 'vllm'/'transformers' now raises a clear 'not supported for now — use the HF Jobs flow' error. Removed _make_vllm_client and _make_transformers_client. * config: backend docstring updated (openai-only); default model_id bumped to Qwen/Qwen3.6-27B to match run_hf_job. * docs/annotation_pipeline.mdx: remove the '## Running locally' section; the launcher description now says one vLLM server per GPU over the OpenAI API, and the 'One Qwen-VL pass' note drops the 'vLLM/transformers fallback' wording. Tests are unaffected (they construct StubVlmClient directly; nothing referenced the removed backends). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:28:40 +02:00
Pepijn	a18d969753	tests(annotations): fix stale canned-VLM markers + action_record style assertion The annotation tests had never actually run in CI (collection failed on the missing 'datasets' extra); now that they do, three stale assertions surfaced against the evolved pipeline: * test_module1_plan_memory_subtask_smoke: the memory canned-responder marker 'Update the memory' no longer appears in module_1_memory.txt (now 'compressed semantic memory'), so the stub returned no memory row and the {subtask,plan,memory} subset check failed. Marker updated to match the current prompt. * test_module2_mid_episode_emits_paired_interjection_and_speech: the interjection marker 'Write ONE interjection' is now 'Write ONE compact interjection' in module_2_interjection.txt, so 0 interjections were emitted. Marker updated. * tests/datasets/test_language.py::test_style_registry_routes_columns: PERSISTENT_STYLES gained 'action_record' in this PR; add it to the expected set. These are test/prompt-marker syncs — no production behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:21:17 +02:00
Pepijn	273a8fc335	deps(annotations): drop hard vllm dependency to unblock CI torch/torchcodec resolution Fast Pytest 'dataset' tier failed collecting tests/datasets/test_video_decoder_cache.py with 'Could not load libtorchcodec ... undefined symbol: torch_dtype_float4_e2m1fn_x2' — a torch/torchcodec ABI mismatch. Root cause: the annotations extra's vllm hard-pins an older torch (via xformers/xgrammar -> torch 2.8). uv resolves a SINGLE unified lock across all extras, so vllm capped torch to 2.8 for every tier — including dataset, whose torchcodec 0.11.1 needs torch 2.11. The result was torch 2.8 + torchcodec 0.11.1 installed together -> ABI break. (main has no vllm, so it resolves torch 2.11 + torchcodec 0.11.1 cleanly.) Fix: remove vllm from the annotations extra. It is not needed by the shipped workflow — examples/annotations/run_hf_job.py gets vllm from the vllm/vllm-openai image and talks to it over the OpenAI-compatible API (--vlm.backend=openai), and vlm_client._make_vllm_client imports vllm lazily. For the in-process --vlm.backend=vllm path, install vllm separately (the ImportError now says so). After the fix uv resolves torch 2.11.0 + torchcodec 0.11.1 (matching main); uv lock --check is clean. The annotations extra still provides datasets / transformers / openai. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 16:09:22 +02:00
Pepijn	b9246ef61b	tests(annotations): guard on the 'dataset' extra so base fast-test tier skips cleanly Fast Pytest Tests failed at COLLECTION in the base '--extra test' tier with 'ModuleNotFoundError: No module named datasets': tests/annotations/conftest.py imported the fixture dataset builder (-> lerobot.datasets -> the HF 'datasets' lib + pandas/pyarrow), which only ship under the 'dataset' extra, so the whole annotations package crashed. Fix uses the repo's proven module-level guard pattern (see tests/datasets/test_language.py), NOT a conftest-level importorskip — verified empirically that pytest.importorskip raised during conftest import is treated as a collection ERROR (exit 1), while module-level importorskip is a clean SKIP. * conftest.py: import build_annotation_dataset LAZILY inside the fixtures so the conftest itself imports cleanly in every tier. * test_modules / test_validator / test_writer / test_pipeline_recipe_render: add module-level pytest.importorskip('datasets') + ('pandas') before the pyarrow / lerobot.* imports (# noqa: E402 to match the existing convention). pyarrow-importing modules place the guard before the pyarrow import. * tests/scripts/test_lerobot_annotate.py: same guard (its _push_to_hub path imports lerobot.datasets). Result: - base / hardware / viz tiers (no dataset extra): annotation tests skip cleanly; the rest of the suite runs -> exit 0. - dataset tier: datasets present -> guards pass through -> annotation tests run with the stub VLM. The pipeline modules import only stdlib + relative + lerobot.datasets (no module-level datatrove / vllm / openai), so they import fine there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-03 15:57:04 +02:00
Pepijn	870980efd6	Merge branch 'main' into feat/language-annotation-pipeline	2026-06-03 15:46:13 +02:00
Jaimin	d1b1c5c8cf	docs: fix broken dataset script paths (datasets/v30 -> scripts) (#3695) The docs pointed at src/lerobot/datasets/v30/, which does not exist. Both scripts actually live in src/lerobot/scripts/: - convert_dataset_v21_to_v30.py - augment_dataset_quantile_stats.py Updated the four references (one python -m module path and three file-path invocations) to the correct location, matching each script's own usage docstring.	2026-06-03 14:48:19 +02:00
Nikodem Bartnik	741c2d0a39	Docs/add lelab (#3707) * first text draft (no images) * simplified docs * fix formatting * add youtube video * add a tip about compatibility * fix broken link	2026-06-03 14:22:05 +02:00
Haoming Song	19fe315971	fix(train): enable relative action overrides for pretrained processors (#3711) * fix(train): enable relative action overrides for pretrained processors Keep pretrained processor pipelines when use_relative_actions is enabled and apply relative/absolute action processor settings through overrides. Rename the relative action processor registry key to relative_actions_processor. * fix(config): reject rename_map without pretrained checkpoint Fail fast when rename_map is set during fresh initialization, since fresh configs derive feature names from the current dataset and no rename is applied. --------- Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-06-03 11:46:35 +02:00
Khalil Meftah	906b585826	fix(datasets): default `private` to `None` in `push_to_hub` to respect Hub org visibility settings (#3713)	2026-06-02 19:25:13 +02:00
Pepijn	4c86332fe3	feat(annotate): add plan toggle, drop subtask verify pass, 4xH200 job - PlanConfig.emit_plan (default True): keep subtasks + memory but skip the per-boundary "plan" rows and their VLM call when False. - Remove the subtask_verify pass entirely: pruning dropped legitimate subtasks and the stitch step already guarantees full-episode coverage. Deletes _verify_subtasks, both call sites, and the now-unused module_1_subtask_verify prompt. - run_hf_job example: 4xH200 (4 vllm servers), emit_plan=false, vqa off. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-02 18:02:13 +02:00
pepijn	23419026d5	pi052: parquet-direct FAST tokenizer fit (fix v3 dataset hang) ``fit_fast_tokenizer`` previously called ``LeRobotDataset(repo_id, episodes=[N])`` per sampled episode, which on v3-format datasets routes through HF datasets' split lookup and raises ``ValueError: Instruction "train" corresponds to no data!`` on every episode. On ``pepijn223/robocasa_pretrain_human300_v4`` (32 k episodes) this looped through 13,293 skipped episodes for ~2.5 h before the NCCL watchdog killed the run via the 2 h ALLREDUCE timeout (job 22182985). Switch to reading the ``action`` column directly from the dataset's ``data/chunk-/file-.parquet`` shards (same pattern as the audit scripts). Verified end-to-end on the 32 k-episode dataset: 1000 chunks collected from 1000 episodes in 70.7 s. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-02 15:54:31 +00:00
Pepijn	1417fd69b2	docs(annotate): prettier format annotation_pipeline.mdx Quality-gate fix: ruff-format/markdown prettier hook reflow of the annotation pipeline doc. No content change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:41:46 +02:00
Pepijn	53c7b4c69a	annotate: ruff lint + format pass Quality-gate fixes after the main merge: * UP037: drop redundant quotes from PlanConfig forward-ref annotations (action_records / task_aug_axes) — safe under 'from __future__ import annotations'. * ruff format applied to config.py, executor.py, general_vqa.py, plan_subtasks_memory.py, validator.py, lerobot_annotate.py. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:38:18 +02:00
Pepijn	3662c41b85	Merge remote-tracking branch 'origin/main' into feat/language-annotation-pipeline # Conflicts: # uv.lock	2026-06-02 17:36:07 +02:00
Pepijn	518e191337	annotate: windowed subtask generation for constant temporal density Long episodes no longer get sparse subtasks. Previously a long episode was subsampled to max_video_frames=32 across its whole duration (~1 frame/4s for a 2-min clip). New opt-in windowing keeps a CONSTANT frames_per_second density by splitting the episode into fixed-length windows and running the subtask chain per window. New PlanConfig.subtask_window_seconds (default 0.0 = off). When > 0 and the episode is longer than one window: * episode is split into consecutive [w0, w1] windows of this length * each window's frames are sampled at frames_per_second (so a 32s window at 1 fps = 32 frames, filling but not exceeding the per-call context budget) * the full describe -> segment -> verify chain runs PER window, in window-relative time [0, L]; spans are offset back to absolute * all windows' spans are merged, frame-snap-deduped, and stitched into one contiguous whole-episode cover Implementation: * _episode_video_block / _video_message / _describe_episode / _verify_subtasks gain an optional window=(w0,w1); when set they embed frames sampled in that absolute range at frames_per_second (video_url path skipped — it's whole-episode). * _clean_spans gains bounds= (override clamp range, for window-relative spans) and dedupe= (skip frame-snap until the merged absolute set). * new _generate_subtasks_windowed + _subtasks_for_window orchestrate the loop; _generate_subtasks branches to them when window_s > 0. run_hf_job.py: --plan.subtask_window_seconds=32 (32s windows at 1 fps). Cost scales with episode length (chain calls × ceil(duration/window)). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:26:14 +02:00
Pepijn	3236c6ee4a	examples(annotate): switch run_hf_job to Qwen3.6-27B (dense VLM) Swap the annotation VLM from Qwen3.6-35B-A3B (sparse MoE, ~3B active) to Qwen3.6-27B (dense, 27B all-active). Per Scale's dense-captioning study, model capacity is the #1 lever and the dominant failure is visual grounding — both helped by ~9x more active params. Qwen3.6-27B is a vision-language model (vision encoder, image + video), same family so the chat template / video handling / enable_thinking=false flag are unchanged, and at 27B dense it still fits one H200 per server, so the two-parallel-server layout (TP=1, one per GPU) is preserved — no throughput-layout change, just a much stronger model. Kept: parallel_servers=2, num_gpus=2, max-model-len 32768 (the 32-frame embedded budget is ~10k tokens, well under), gpu-mem 0.8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:16:26 +02:00
Pepijn	cd128cbbd5	annotate: add verb-scoped disambiguation rules to subtask prompt Adopt the one prompt technique Scale's dense-captioning study found reliably positive: targeted, verb-scoped, visually-grounded disambiguation rules. Their lesson was that such a rule must fire ONLY on the spatial situation it names (their narrow 'Stack vs Put' rule helped; an over-broad directional 'Scoop' rule bled into other verbs and hurt), so each rule here is phrased visually and scoped to one confusable pair: * stack-vs-put (on top of an object vs on a surface) * insert-vs-put (fitted slot vs surface) * pick-up/retrieve-vs-put (decide by which way the OBJECT moves: gripper closes + object moves with hand = pick up; gripper opens + object stays = put — directly targets Scale's dominant direction-flip failure) * pour-vs-put (tilt + flow vs untilted move) This is the highest-confidence, lowest-risk change from the Scale findings; our pipeline already aligns with their 'avoid' list (no temporal tokens, no overlays, no fancy sampling, no sequential context injection, uniform sampling, describe-don't-predict framing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:10:49 +02:00
Pepijn	1fb46ab300	annotate: cap embedded-frame budget to fit VLM context (fix 32k overflow) Switching the plan module to embedded frames (use_video_url=false) exposed a context overflow: at frames_per_second=2.0 with the old max_video_frames=128 default, a 480x640 episode embeds ~128 frames ≈ 33-39k vision tokens, over the model's 32768 context — every plan call died with 'Input length exceeds maximum context length' (HTTP 400), crashing the whole annotation job. The video_url path never hit this because the server downsampled; the embedded path sends every sampled frame, so the frame count is a hard token budget. Fix: * config default max_video_frames 128 -> 32 (~8-10k vision tokens, comfortable headroom for the prompt + describe/verify passes). Frames are still sampled UNIFORMLY across the whole episode, so longer episodes are subsampled, not truncated — full temporal coverage preserved, just coarser density. * run_hf_job.py: frames_per_second 2.0 -> 1.0, explicit --plan.max_video_frames=32, with a comment explaining the token budget and the 'do not raise toward 128 with embedded frames' rule. Only the plan module embeds the full episode; VQA (1 frame/tick) and interjections (4-frame window) were never at risk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 16:02:25 +02:00
Pepijn	79f9a84407	annotate: make full-episode subtask coverage unconditional Remove the subtask_full_coverage config flag. Stitching subtask spans into a contiguous full-episode cover is now always applied in _generate_subtasks — a sparse / gap-ridden subtask timeline is never desirable for conditioning, so there's no reason to make it optional. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 15:36:23 +02:00
Pepijn	799d0e3bcc	annotate: stitch subtasks to full-episode coverage The verify pass prunes subtasks, which could leave the first subtask starting after t0 or leave gaps between spans, so the subtask timeline no longer tiled the episode and frames fell through with no active subtask label. New deterministic post-step (no VLM call), default on via PlanConfig.subtask_full_coverage: * first subtask start pulled back to the episode's first frame t0 (idle / approach before the first labelled action folds into it) * each subtask end snapped to the next subtask start (gaps closed) * last subtask end extended to the last frame t_last Runs after segment + verify in _generate_subtasks. Starts other than the first are left as the VLM/verify produced them (already frame-snapped + distinct), so the cover is contiguous and non-overlapping. Disable with --plan.subtask_full_coverage=false if a consumer wants sparse subtasks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 15:34:34 +02:00
Pepijn	1fe1463ae0	annotate: enable subtask describe->segment->verify chain by default Flip PlanConfig.subtask_describe_first and subtask_verify defaults False -> True. Every subtask annotation now runs the 3-call grounding + pruning chain by default, since the single-call path reliably hallucinates steps from the task text. Costs 2 extra VLM calls/episode; disable with --plan.subtask_describe_first=false / --plan.subtask_verify=false on easy datasets where fewer calls matter more than label fidelity. run_hf_job.py: drop the now-redundant explicit flags, leave a note that the chain is default-on and how to opt out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 15:13:50 +02:00
Pepijn	dcd368e1f8	annotate: multi-call subtask quality chain (describe -> segment -> verify) The single-call 'watch video -> emit subtask JSON' pattern makes the VLM commit to structured output before reasoning about what it saw, so it pattern-matches the task text and hallucinates steps. Split it into an opt-in multi-call chain that grounds first and prunes last. New PlanConfig flags (both default False -> single-call unchanged): * subtask_describe_first: a grounding pass narrates ONLY what is visible in the video (no subtask JSON yet). That description is injected into the segmentation prompt via a new {observation_block} placeholder, so the model segments its own grounded observations instead of the instruction text. +1 VLM call/episode. * subtask_verify: after segmentation, an adversarial pass re-watches the video and drops any candidate subtask it cannot see. Can only PRUNE (never add/rewrite/move) and fails open (keeps un-verified spans if the call returns nothing). +1 VLM call/episode. Implementation: * _generate_subtasks now orchestrates describe -> segment -> verify. * Factored span cleaning into _clean_spans (shared by segment + verify outputs); added _describe_episode and _verify_subtasks helpers. * New prompts module_1_subtask_describe.txt (returns {description}) and module_1_subtask_verify.txt (returns pruned {subtasks}). * module_1_subtasks.txt gains a {observation_block} slot at the top. run_hf_job.py enables both for the RoboCasa run (3 VLM calls/episode for subtasks). Combined with single-camera grounding + the embedded-frame path, this is the high-quality configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 15:12:46 +02:00
Pepijn	ba5d4c5cd8	annotate: kill subtask hallucination + single-camera grounding Two fixes for 'subtasks describe actions not in the video' plus a way to focus the whole pipeline on one camera. ANTI-HALLUCINATION 1. _episode_video_block: when use_video_url is set but clip extraction fails, FALL BACK to embedded frames instead of returning an empty block. An empty block left the VLM with zero visual grounding, so it invented subtasks from the task text alone — the likely root cause of hallucinated steps. Now logs a warning and embeds frames. 2. module_1_subtasks.txt gains a GROUNDING preamble (overrides all other rules): label only motion visible in specific frames; never invent/anticipate/pad; max_steps is a CEILING not a target; atomic demos may be exactly ONE subtask; the VIDEO is ground truth, not the instruction text. SINGLE-CAMERA GROUNDING * New VqaConfig.restrict_to_default_camera (default False). When True, the VQA module grounds on only the --vlm.camera_key stream instead of iterating every camera — matching the plan / interjection modules, which already use that single camera. Now the whole pipeline can focus on one view (e.g. observation.images.base). run_hf_job.py updated: * use_video_url=false + frames_per_second=2.0 — embed frames directly (most reliable; no silent text-only failure mode) with dense grounding. * vqa.restrict_to_default_camera=true — VQA on the single camera too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 15:08:25 +02:00
Pepijn	7454b4c993	annotate: remove action-record subtask-text replacement entirely Drops the replace_subtask_text option and the _render_action_record_to_subtask_text renderer. Action records are now strictly additive: when action_records.enabled=True the module emits style='action_record' rows (the typed {verb,object,arm,grasp,dest, mistake} schema) and NEVER rewrites the subtask text the policy conditions on. The render-back-to-text path was the source of corrupted subtasks (navigation tasks produced 'move stove to stove', manipulation tasks got spurious 'with left arm using pinch grip' suffixes). Reconstructing natural-language subtasks from hallucinated structured fields is inherently fragile, so the capability is removed rather than guarded. Removed: * ActionRecordsConfig.replace_subtask_text field * PlanSubtasksMemoryModule._render_action_record_to_subtask_text * the span['text'] = canonical_text overwrite in run_episode Updated docstrings + run_hf_job.py comment accordingly. emit_record_row (default True) is now the feature's only output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 14:42:36 +02:00
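
The stitching post-step described in commit 799d0e3bcc (first start pulled back to t0, each end snapped to the next start, last end extended to t_last) can be illustrated with a minimal sketch. This is not the repository's code: the `Span` dataclass and the `stitch_full_coverage` function are hypothetical names chosen for illustration, and timestamps are assumed to be plain floats in seconds.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    """Hypothetical subtask span; not the type used in lerobot."""
    text: str
    start: float
    end: float


def stitch_full_coverage(spans, t0, t_last):
    """Return spans that tile [t0, t_last] with no gaps or overlaps.

    Only the first start is moved; every other start is kept as given,
    matching the behaviour described in the commit message.
    """
    if not spans:
        return []
    ordered = sorted(spans, key=lambda s: s.start)
    stitched = []
    for i, span in enumerate(ordered):
        # First subtask absorbs any idle/approach time before it.
        start = t0 if i == 0 else span.start
        # Each end snaps to the next start; the last end goes to t_last.
        end = ordered[i + 1].start if i + 1 < len(ordered) else t_last
        stitched.append(replace(span, start=start, end=end))
    return stitched


pruned = [Span("pick up cup", 1.0, 2.0), Span("place cup", 3.0, 4.5)]
cover = stitch_full_coverage(pruned, t0=0.0, t_last=6.0)
```

Because no VLM call is involved, the step is deterministic: the same pruned spans always yield the same cover.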

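The frame-budget reasoning in commit 1fb46ab300 (uniform sampling across the whole episode, capped at max_video_frames so the embedded frames fit the 32768-token context) can be sketched as follows. The function names and the figure of roughly 300 vision tokens per 480x640 frame are assumptions for illustration; the per-frame figure is only back-derived from the commit's "128 frames ≈ 33-39k tokens" estimate and is not a measured value.

```python
def sample_frame_indices(num_frames, max_frames):
    """Pick at most max_frames indices spread uniformly over the episode.

    Longer episodes are subsampled rather than truncated, so the first
    and last frames are always included.
    """
    if num_frames <= 0 or max_frames <= 0:
        return []
    if num_frames <= max_frames:
        return list(range(num_frames))
    if max_frames == 1:
        return [0]
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


def fits_context(n_frames, tokens_per_frame=300, context_len=32768):
    """Rough check; tokens_per_frame is an assumed value, not measured."""
    return n_frames * tokens_per_frame < context_len


indices = sample_frame_indices(num_frames=300, max_frames=32)
```

Under the assumed per-frame cost, 32 frames come to about 9.6k tokens, which leaves room for the prompt, while 128 frames exceed the context on their own.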

1814 Commits