lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-24 02:06:15 +00:00

Author	SHA1	Message	Date
Pepijn	bce5387e04	Merge branch 'main' into feat/language-columns	2026-05-08 10:29:49 +02:00
Steven Palma	c8ce413d73	fix(robots): allign lekiwi default with so100 use_degrees (#3531 )	2026-05-07 17:52:34 +02:00
Pepijn	82dffde7fa	fix(ci): speed up multi-task benchmark evals (parallelize + cap VLABench steps) (#3529 ) * fix(ci): run multi-task benchmark evals 5-at-a-time in parallel The eval script supports running tasks concurrently via a ThreadPoolExecutor (env.max_parallel_tasks). Apply it to the four multi-task benchmark CI jobs (RoboTwin, RoboCasa, RoboMME, LIBERO-plus — 8-10 tasks/task_ids each) so they finish in ~2 waves of 5 instead of running sequentially. Single-task jobs (Libero, MetaWorld, RoboCerebra) are unchanged. * fix(ci): cap VLABench smoke eval at 50 steps per task VLABench's default episode_length is 500 steps; with 10 tasks at ~1 it/s the smoke eval took ~80 minutes of rollouts on top of the image build. The eval is a pipeline smoke test (running_success_rate stays at 0% on this short rollout anyway), so we don't need full episodes — cap each task at 50 steps to bring total rollout time down ~10x. * fix(ci): run VLABench tasks 5-at-a-time in parallel The eval script already supports running multiple tasks concurrently via a ThreadPoolExecutor (env.max_parallel_tasks). Set it to 5 so the 10 VLABench tasks finish in ~2 waves instead of running sequentially.	2026-05-07 13:37:16 +02:00
Ville Kuosmanen	eaf0218bc8	feat(policy): use pretrained vision encoder weights by default for diffusion and vqbet (#3202 ) * feat: add pretrained vision encoder weights for diffusion and vqbet * fix test by re-generating artifacts --------- Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>	2026-05-07 12:10:38 +02:00
Pepijn	a0e52d52fe	fix(ci): bump robotwin benchmark image to CUDA 12.6 (#3525 ) The robotwin benchmark Dockerfile still installed cuda-nvcc-12-4 and cuda-cudart-dev-12-4 after #3505 upgraded the base image to CUDA 12.6.3 on Ubuntu 24.04. Those packages aren't available in the ubuntu2404 CUDA repo, so the build failed at apt-get install. Bumping both to -12-6 to match the base image.	2026-05-07 11:11:12 +02:00
Pepijn	85576acc29	docs(tools): drop follow-up-PR references Reword the two callouts in `tools.mdx` to describe the runtime layer in present tense ("not part of the catalog layer shipped today", "those modules don't yet exist in the tree") instead of pointing at a specific follow-up PR. Keeps the doc honest about what works now without coupling it to a particular release order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 20:29:42 +02:00
Pepijn	e7e5fca5de	review: emitted_at uses 0.1s tolerance; MessageTurn requires stream at construction * Float tolerance in `emitted_at` for persistent styles. The ``_timestamp(row) == t`` exact-equality check silently missed any caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``) even though the parquet timestamp would only differ by ULPs. Added ``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance`` instead, with a docstring explaining why exact equality wasn't enough and why 0.1 s is safe at typical 30–100 Hz control rates. Test asserts the new behavior at half-window (matches) and double-window (no match) using the constant so it stays in sync. * `MessageTurn.stream` is required at construction. It was typed ``MessageStream \| None = None`` so YAML could omit ``stream:`` and pass the dataclass invariant — but ``_validate_rendered`` rejected ``None`` streams later, surfacing the error at the first sample instead of at recipe load. Now ``__post_init__`` raises ``ValueError`` if ``stream`` is ``None``, with the list of valid streams in the message. The redundant late-stage check in ``_validate_rendered`` is replaced with a one-line comment that cites the upstream invariant. Test pins the new construction-time rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 19:55:08 +02:00
Pepijn	beb22afd81	review: dedupe regex, centralize column names, harden collate, more tests * #2 — dedupe `_PLACEHOLDER_RE`. The same regex was compiled in `recipe.py` and `language_render.py`. Promote to module-level `PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares template syntax) and import from `language_render.py`. * #3 — centralize language column names. `io_utils.py` had hardcoded `{"language_persistent", "language_events"}` literals at two sites. Replace with `LANGUAGE_COLUMNS` import so a future column rename can't silently desync. * #4 — defensive collate preserved-keys. `lerobot_collate_fn` silently filtered language fields from samples that didn't have them, which would hand downstream consumers a preserved list shorter than the tensor batch. Now: if any sample carries a key, every sample in the batch must carry it; otherwise raise a `ValueError` so the upstream rendering bug surfaces at the boundary. * #5 — `_scalar` rejects non-singleton lists. Previously a zero- or multi-element list fell through and triggered confusing `float([])` errors downstream. Now raises `ValueError` with the actual length. * #6 — refactor `_extract_complementary_data`. Replace 11 lines of `key = {... if ... else {}}` plus an 11-line splat dict with a single `_COMPLEMENTARY_KEYS` tuple iterated once. * #7 — document `EXTENDED_STYLES`. Was an empty `set()` with no comment. Add a docstring explaining it's an intentional extension point: downstream modules append project-local styles before `column_for_style` is called. * #9 — `tools.mdx` notes the runtime layer is future work. The page referenced `src/lerobot/tools/`, `registry.py`, and `get_tools(meta)` — none exist in this PR. Added a callout at the start of "How to add your own tool" plus a note on the implementations paragraph. * #10 — tests for YAML round-trip, malformed rows, blend validation. `test_recipe.py` grew from 1 case to 12 covering: blend-or-messages exclusivity, target-turn requirement, blend emptiness, weight presence/positivity, nested-blend rejection, `from_dict` with nested blends, `from_yaml` / `load_recipe` agreement, top-level non-mapping rejection. Added a malformed-row test for `_normalize_rows` that asserts non-dict entries raise `TypeError`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 19:06:38 +02:00
Pepijn	33a4b4a5a0	feat(smolvla2): autonomous robot mode in lerobot-smolvla2-runtime The runtime CLI was deliberately scoped to dry-run only: it hard-coded ``robot_executor=None`` and printed a "real-robot integration is a follow-up" warning even when ``--no_robot`` was omitted. The runtime engine was already structured for real-robot operation (separate ``LowLevelForward`` chunk-rate generation + ``DispatchAction`` ctrl-rate dispatch with a ``robot_executor`` hook); only the wiring was missing. Add the wiring: * ``_load_policy_and_preprocessor`` now also returns the postprocessor (action denormaliser). * ``--robot.type`` / ``--robot.port`` / ``--robot.id`` / ``--robot.cameras`` (JSON) build a ``Robot`` via ``make_robot_from_config`` and connect it. * ``_build_robot_observation_provider`` reads ``robot.get_observation()`` each call, drops the language columns (runtime drives messages itself), and runs the policy's preprocessor (rename → batch → device → normalise). * ``_build_robot_action_executor`` postprocesses the policy's action tensor (denormalise), converts to the ``{joint: value}`` dict via ``make_robot_action(action, ds_meta.features)``, and calls ``robot.send_action(...)``. Optional ``--max_action_norm`` safety clip rejects ticks whose action L2 norm exceeds the threshold (kill-switch when bringing up a new robot). * ``_run_autonomous`` runs ``runtime.run()`` in a background thread (the policy must keep generating chunks at chunk_hz and dispatching at ctrl_hz regardless of stdin) and handles user interjections / VQA queries from the foreground stdin loop. Confirmation prompt before start (skip with ``--auto_start``); Ctrl+C stops the thread and disconnects the robot cleanly. * Autonomous mode requires ``--dataset.repo_id`` for action stats / feature shapes — pass the same dataset the policy was trained on. The bootstrap path that pulls canonical task / plan / memory runs in both REPL and autonomous modes so the model's first prompt matches training distribution. Dry-run REPL behaviour is unchanged when ``--robot.type`` is not passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 18:30:56 +02:00
Haoming Song	e99c55af4b	feat(policies): add EO-1 model (#3403 ) * feat(policies): add EO-1 model * chore(eo1): adjust policy_eo1_README.md to to avoid duplicate with eo1.mdx * chore(eo1): remove policy_eo1_README.md, link eo1.mdx in policy folder --------- Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-05-06 18:01:16 +02:00
Steven Palma	408e0ca763	fix(robots): openarm features with openarmmini (#3524 )	2026-05-06 17:03:09 +02:00
Pepijn	d55b581ca1	fix(language): address review — tools accessor, motion docs, conditional collate * `meta.tools` actually reads `info.json["tools"]`. `DatasetInfo` had no `tools` field, so `from_dict` silently dropped the key (it warned about unknown fields then discarded them) and the property always returned `DEFAULT_TOOLS`. Added `tools: list[dict] \| None` to the dataclass; `to_dict()` drops it when unset so existing datasets keep a clean `info.json`. Fixed the accessor to read `self.info.tools` (the previous `.get(...)` would have raised AttributeError on the dataclass anyway). Added regression tests: fallback when absent, round-trip from disk, and round-trip through `DatasetInfo.from_dict` / `to_dict`. * `motion` is not view-dependent — fix the docs. The mdx claimed rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES = {"vqa", "trace"}` and the validator agrees: motion primitives are joint/Cartesian-frame, not pixel-space. Updated both call-out paragraphs in `language_and_recipes.mdx`. * Conditional `collate_fn` swap. Added `meta.has_language_columns` and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it, so non-language datasets keep PyTorch's `default_collate`. Also added a pass-through test in `test_collate.py` that asserts on a plain tensor batch the custom collate matches `default_collate` key-for-key, plus a test for the `None`-sample drop path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:51:06 +02:00
Pepijn	24d2ffe3c6	fix(language): keep base install green — drop processor re-export, gate dataset-extra tests `lerobot.processor` re-exported `RenderMessagesStep` at the package level, so importing anything from `lerobot.processor` pulled in `lerobot.datasets.language` → `lerobot.datasets/__init__.py` → `require_package("datasets")`, which fails in the Tier 1 base install that intentionally omits the `[dataset]` extra. The chain bricked collection for unrelated suites (`tests/policies/pi0_pi05/...`, `tests/envs/...`, etc.). * Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The only consumer (the test) already imports from the submodule. Document the deliberate omission in the module docstring. * Add `pytest.importorskip("datasets", ...)` (and `pandas` where needed) at the top of the four PR-added tests that exercise the language stack: - tests/datasets/test_language.py - tests/datasets/test_language_render.py - tests/processor/test_render_messages_processor.py - tests/utils/test_collate.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:12:54 +02:00
Pepijn	789f29aa56	chore: fix CI — collapse short ValueError to one line, refresh uv.lock * `ruff format` on CI (newer version) wants the short `camera=None` ValueError on a single line. * `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0` pin (and picked up upstream `s390x` marker fixes for cuda packages). CI runs `uv sync --locked` which rejected the divergence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:05:42 +02:00
Pepijn	a356b12c41	fix(language): always raise on ambiguous resolver matches `_select_one` previously skipped its ambiguity check whenever any of `role`/`tool_name`/`camera` was set, on the assumption that the caller had already pinned down a unique row. That left a real ambiguity hole for VQA: with two cameras emitting `(vqa, assistant)` at the same frame, `emitted_at(..., role="assistant")` silently picked the first sorted row instead of telling the recipe to add `camera=...`. The existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test already encoded the desired behavior. Tighten the check: any time `len(rows) > 1` we now raise with the selectors echoed back, so users see exactly which fields they passed and that more is needed to disambiguate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:00:45 +02:00
Pepijn	e8327b8e62	refactor(language): unify resolver dispatch and prune redundant test scaffolding * Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`; only `emitted_at` actually consults events. The dispatcher in `_resolve_spec` now passes events conditionally. * Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a single `_row_sort_key` and drop the `sort_key` parameter from `_select_one`. Event rows lack `timestamp` (it is implicit in the frame) and now default to `0.0` for sort purposes — the `(style, role)` tiebreaker is unchanged. * Inline `_select_latest` into `active_at` (its only caller). * Collapse `emitted_at`'s dual-branch into one `_select_one` call. * Tighten `_validate_persistent_resolver` to a single `column_for_style(style) != LANGUAGE_PERSISTENT` check. * Parameterize `test_per_camera_blend_renders_both_views` over the two cameras and factor the sub-recipe builder into `_vqa_subrecipe` so the test no longer hand-rolls two near-identical recipe blocks. Net -98 LOC; behavior, public resolver names, and test expectations unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 13:15:45 +02:00
Pepijn	c450298147	Apply ruff and prettier formatting after merge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 12:10:41 +02:00
Pepijn	5c30b14929	Merge remote-tracking branch 'origin/main' into feat/language-columns	2026-05-06 12:09:13 +02:00
Pepijn	a764c3e1d6	fix(datasets,annotate): tag pushed dataset + clean revision error Two bugs combining to make the brand-new ``_tool3`` dataset unloadable: 1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated dataset folder but never creates a codebase-version tag, so ``api/datasets/<repo>/refs`` returns ``"tags": []``. Then ``LeRobotDatasetMetadata`` → ``get_safe_version`` → ``get_repo_versions`` returns empty and the loader raises ``RevisionNotFoundError``. 2. ``RevisionNotFoundError`` itself was unconstructible: its ``HfHubHTTPError.__init__`` indexes ``response.headers`` unconditionally on current ``huggingface_hub`` versions, so constructing it without a real ``Response`` blew up with ``AttributeError: 'NoneType' object has no attribute 'headers'``, masking the real "no tag" message. Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and ``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset', exist_ok=True)`` so the dataset is loadable straight from the Hub on the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree ``CODEBASE_VERSION`` if info.json is missing/malformed; on tag creation failure, prints the manual one-liner the user needs. Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which inherits HfHubHTTPError) for what is really a config issue, not an HTTP failure. Raise plain ``RuntimeError`` with the same message — the caller actually sees what's wrong instead of an upstream attribute error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 18:23:18 +02:00
Pepijn	b416f287f2	fix(datasets): raise readable error when repo has no version tags ``RevisionNotFoundError`` inherits from ``huggingface_hub.HfHubHTTPError`` which made ``response`` a required keyword-only argument on recent versions. Constructing it with just a message string blew up with ``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'`` instead of surfacing the actual problem (the dataset/checkpoint repo doesn't exist on the Hub yet). Pass ``response=None`` explicitly. Fall back to the bare-message form for older ``huggingface_hub`` versions that don't accept the kwarg. Also clarify the message to call out the most common cause: typing a hub repo id that hasn't been pushed yet (instead of just "needs a version tag"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 18:12:40 +02:00
Pepijn	aa749d4947	chore(annotate): throttle Module 3 + executor parallelism to fix vLLM stall Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and ``executor.episode_parallelism=32``. With 2 cameras per dataset that produced ~12× the original VQA call volume, all submitted concurrently. Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's KV cache pegged at 94% with 800+ in-flight requests, and the multimodal cache corrupted with ``AssertionError: Expected a cached item for mm_hash='...'`` (a known vLLM bug under image-heavy concurrency). Module 1 and 2 ran fine; Module 3 was the bottleneck. Pull back the multipliers to land in a sustainable spot: * module_3.K: 3 (kept) — three diverse questions per emission, where the diversity actually helps the LM head. * module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original emission rate. Net VQA volume is now ~3× original (K alone) on a single camera, ~6× across both cameras — manageable. * module_2.max_interjections_per_episode: 9 → 6 — still 2× the default, fewer than the prior 3× to keep total request volume in check. * vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the multimodal request path so the mm_cache doesn't desync. * executor.episode_parallelism: 32 → 16 — half the episodes in flight at once, so peak vLLM load is ~half. n_task_rephrasings stays at 30 (text-only, doesn't load the image path) and vlm.temperature stays at 0.7. The diversity gains are preserved; only the throughput knobs come down. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 15:07:18 +02:00
Maxime Ellerbach	ce24063efd	feat(dagger): adding smooth handover (#3506 ) * feat(dagger): adding smooth handover * update docstring * small phase fix and documenting potential issues * cleaning up	2026-05-05 14:44:32 +02:00
Pepijn	1394a6ab5d	chore(annotate): bump diversity knobs ~3x to fight memorisation Following Pi0.7 §V (prompt expansion / diverse context conditioning), push more atom variants per episode and higher VLM sampling temperature so the training distribution has enough wording diversity that the LM head is forced to use its parameters rather than memorise specific (prompt, target) pairs. Changes vs prior annotation pass: * vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call now produces diverse phrasings; same prompt yields different completions across emissions. * module_1.n_task_rephrasings: 10 → 30 — three times as many ``task_aug`` rows in language_persistent. ``${task}`` already rotates through them deterministically per sample_idx (see ``_resolve_task`` in language_render.py). * module_2.max_interjections_per_episode: 3 (default) → 9 — more ``user_interjection_response`` training samples + more plan refresh events. * module_3.K: 1 → 3 — three VQA pairs per emission tick instead of one. Combined with the hz bump below, ~6× more VQA samples. * module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission rate within each subtask span. Pushes to a new hub repo (``_tool3``) so the working ``_tool2`` dataset stays intact for comparison. ``${task}`` already wired to rotate through ``task_aug`` rows, so no renderer change needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 14:32:05 +02:00
Steven Palma	82934719db	chore(dep): bump transformers to 5.4.0 (#3374 ) * fix(deps): breaking change from transformers 5.4.0 * Update src/lerobot/policies/xvla/modeling_florence2.py Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * Update src/lerobot/policies/wall_x/qwen_model/qwen2_5_vl_moe.py Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * removing dataclass * bumping transformers 5.4.0 * weird i can't even pass the test on main * oops, typo * chore(style): fix pre-commit run * chore: update uv.lock * seems like a weird numerical precision issue, lets check in runners * chore: update uv.lock * chore(dependecies): adjust transformers version * chore: update uv.lock --------- Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> Co-authored-by: Maximellerbach <maxime.ellerbach@huggingface.co> Co-authored-by: raushan <raushan@huggingface.co>	2026-05-05 14:19:09 +02:00
Pepijn	db9118f16f	fix(smolvla2): reject gibberish high-level generations Memorised models can collapse to dominant-mode outputs (the JSON-token salad ``":":":":...`` from VQA training) when the prompt drifts even slightly from training distribution. Without a guard, that gibberish lands in ``current_subtask`` / ``current_plan`` / ``current_memory``, which feeds the next tick's prompt and cascades into worse outputs. The user observed exactly this: a clean run followed by a tick that wrote ``" " "`` into plan and memory, then slow recovery several ticks later. Add ``_looks_like_gibberish`` heuristic (alpha density, repeating chars, JSON-prefix sniff) and apply it before mutating state in ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``. Bad generations are logged inline (``[info] subtask gen rejected (gibberish): "":":":..."``) so the user can see what was dropped, but the state stays at its last-known-good value (typically the dataset bootstrap) instead of being polluted. VQA path is intentionally exempt — its training targets are JSON-shaped, so the heuristic would false-positive on them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 14:07:25 +02:00
Pepijn	7a945d7bdc	fix(smolvla2): bootstrap canonical task + plan/memory from dataset The user-typed task and the dataset's canonical task differ in wording (capitalisation, ``green box`` vs ``green bin``, etc.). With ``text_loss`` driven down to ~6e-6 across 78 epochs the model is memorised on the exact rendered training prompts: any wording drift puts the prompt out of distribution and the model collapses to its dominant training mode (VQA JSON output). When ``--dataset.repo_id`` is set, automatically: * read the canonical task string from the chosen episode (and use it as ``--task`` when the user didn't pass one); * pull the active ``plan`` / ``memory`` / ``subtask`` rows from the persistent slice (latest row whose timestamp ≤ start frame's timestamp — same semantics as the renderer's ``active_at``) and seed them into the runtime state. The first prompt the runtime builds at REPL start now mirrors what the recipe rendered during training (task + active plan + active memory + optional current subtask). The user can still override any of these by typing. Memorisation itself is upstream (training mix collapsed to too few unique high-level targets); this commit only fixes the inference-side prompt mismatch that was making the memorisation surface as gibberish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 14:00:36 +02:00
Pepijn	a47e535b02	fix(smolvla2): per-recipe inference prompts to match training shape The four high-level steps shared one generic ``_control_context_messages`` that jammed task + plan + memory + completed_subtask into a single user message. The recipes in ``smolvla2_hirobot.yaml`` each have a specific multi-message layout (``memory_update``: ``user(task) → assistant(prev memory) → user(completed subtask)``; ``high_level_subtask``: ``user(task+plan+ memory) → user(current subtask)``; ``user_interjection_response``: ``user(task) → assistant(prev plan) → user(interjection)``). After ``apply_chat_template`` those layouts produce different prompts than the runtime's flattened single-user-turn version, and the model fell back to its dominant training mode (VQA JSON output) — generating ``":":":":":":...`` repetition. Add four per-recipe prompt builders (``_msgs_for_subtask``, ``_msgs_for_memory``, ``_msgs_for_interjection``, ``_msgs_for_vqa``), each mirroring its sub-recipe's exact message structure including the ``if_present`` skips. Wire each high-level step to its matching builder. Inference prompts now line up with what the model saw in training, so generation should produce coherent text instead of repeated tokens. Generic ``_control_context_messages`` is kept (still used by tests and the no-recipe fallback path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:47:22 +02:00
Pepijn	6d9b431b54	fix(smolvla2): match training's text-loss forward in select_message Previous rewrite drove generation through ``vlm.generate()`` (the standard SmolVLM path), which ignores SmolVLA's custom ``embed_prefix`` that interleaves images + lang + state. Result: the model received a prompt format it had never been trained on at inference and emitted JSON-fragment gibberish (``" " " ,",","`` ``cube lift {"...``). Revert to the cumulative-buffer AR loop driven through ``vlm_with_expert.forward`` — the same forward call ``_compute_text_loss`` makes during training (``inputs_embeds=[prefix_embs, None], use_cache=False, fill_kv_cache=True``). With ``fill_kv_cache=True``, every layer routes through ``forward_attn_layer``, which gracefully skips ``None`` expert inputs (``if hidden_states is None or layer is None: continue``); cross-attention layers — which would otherwise hard- require a non-None expert input — are bypassed entirely. Inference now sees the same prefix structure as training: images + lang + state, with new tokens appended to the lang region. The text distribution matches what the model was trained to produce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:42:15 +02:00
Pepijn	347e706326	fix(smolvla2): drop pixel_values from select_message generate path SmolVLA's image preprocessor sizes frames to whatever the action expert was trained on, but SmolVLM's standard vision tower expects its own default tile grid (e.g. 384/14 → 27×27 patches). The mismatch surfaces deep in the post-vision reshape as ``RuntimeError: shape '[2, 34, 34, 768]' is invalid for input of size 1843200`` — the model has 1200 patches but expects 34×34=1156. Drop ``pixel_values`` from ``vlm.generate(...)`` so SmolVLM runs as a text-only LM at REPL time. The high-level branches (subtask / plan / memory) are dominated by their text context anyway, so this is acceptable for dry-run inference. VQA loses its image grounding — that will be marked as expected for the dry-run path until a follow-up either re-processes images through SmolVLM's own ``ImageProcessor`` to match its tile grid, or gives ``vlm_with_expert`` a real AR text decode mode that handles state and image embeddings the way training does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:36:53 +02:00
Pepijn	fa8ae1e89b	fix(smolvla2): drive select_message through SmolVLM.generate The hand-rolled AR loop in ``select_message`` was fighting the underlying ``vlm_with_expert.forward`` design, which assumes the "prefix-once + suffix-always-via-expert" pattern that ``denoise_step`` uses for action chunks. Cross-attn layers (every other layer with ``attention_mode='cross_attn'`` + ``self_attn_every_n_layers=2``) hard-require an expert input on every call: passing ``inputs_embeds=[current_embs, None]`` crashed at ``expert_layer.input_layernorm(None)`` with ``'NoneType' object has no attribute 'dtype'``. Earlier KV-cache attempts ran into the matching ``[15, 139] vs [15, 1]`` shape mismatch because the cache gets overwritten, not appended, on each ``fill_kv_cache=True`` call — there's just no AR-text-decode mode in this forward. Stop fighting it: drive AR text generation through the underlying SmolVLM via ``vlm.generate(input_ids=..., attention_mask=..., pixel_values=...)``. KV caching, sampling/greedy, EOS handling all come from HF's standard implementation. Trade-off: ``state`` drops out of the prefix at inference (no slot for it on the standard SmolVLM path), so high-level generations may drift from training distribution slightly. That's acceptable for the dry-run REPL — the high-level branches (subtask / plan / memory / vqa) are mostly vision+language conditioned anyway, and the action expert (where state actually matters) goes through the unchanged ``select_action`` path. Image features the runtime merged in (``observation.images.*``) are stacked into the ``[B, num_images, C, H, W]`` shape SmolVLM expects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:39:34 +02:00
Pepijn	3ff6c6860e	fix(smolvla2): rewrite select_message decode loop without KV cache SmolVLA's ``vlm_with_expert.forward`` doesn't actually support incremental KV cache growth — its only ``fill_kv_cache=True`` mode overwrites the cache with the latest call's key/value states, and its only ``fill_kv_cache=False`` mode concatenates ``cache + new`` into a local ``key_states`` for one matmul without ever updating the cache itself. The original ``select_message`` decode loop tried to use ``fill_kv_cache=True`` per step, which clobbered the cache to 1 token after the first decode and threw ``Expected size for first two dimensions of batch2 tensor to be: [15, 139] but got: [15, 1]`` — the attention mask still expected 139 keys but the cached + new key_states only had 1. Match the pattern ``denoise_step`` already uses successfully: maintain a cumulative ``(embs, pad, att)`` buffer that starts as the prefix and grows by one bool/embedding row per step. Each step forwards the full sequence with ``use_cache=False, fill_kv_cache=False, past_key_values=None`` so the matmul shapes always line up. Generated-token rows are tagged ``pad=1, att=1`` which makes them fully causal among themselves while still able to attend back to the entire prefix (per ``make_att_2d_masks`` semantics: a token can attend to any earlier token whose cumulative ``att`` count is ≤ its own). Image encoding is still done once via the initial ``embed_prefix`` call — the expensive part doesn't repeat. The remaining cost is O(n²) text-only transformer forwards, which is fine for the dry-run REPL's 50–100 token responses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:15:28 +02:00
Pepijn	fd89efb545	fix(smolvla2): 3D attention mask in select_message decode loop SmolVLA's ``eager_attention_forward`` does ``masked = torch.where(attention_mask[:, None, :, :], ...)``, which requires a 3D ``[B, query_len, key_len]`` bool tensor so the broadcast to 4D works. ``select_message``'s prefix forward got this right (passes ``prefix_2d`` from ``make_att_2d_masks``), but the KV-cache decoding loop built ``new_attn = torch.ones((bsize, cur_pos + 1))`` — 2D — and the very first decode step blew up with ``IndexError: too many indices for tensor of dimension 2``. During KV-cache decoding ``query_len = 1`` and ``key_len = cur_pos + 1`` (prefix + every token already generated), so the right shape is ``[B, 1, cur_pos + 1]``. Match the layout SmolVLA's working ``denoise_step`` uses for the equivalent ``prefix_pad_2d_masks`` build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:08:52 +02:00
Pepijn	2776b57c9e	fix(smolvla2): bool attention mask + clean Claude-Code-style REPL Two issues that combined to make the REPL unusable: 1. ``BatchEncoding.attention_mask`` is a ``Long`` tensor, but SmolVLA's ``eager_attention_forward`` does ``torch.where(attention_mask[..., None, :, :], ...)`` which requires a bool condition. Every forward raised ``where expected condition to be a boolean tensor, but got a tensor with dtype Long`` and the diagnostic surfaced it cleanly in the REPL — but generation produced nothing useful. Cast to ``bool`` in ``_build_text_batch`` so the prefix forward goes through. 2. The interactive REPL used ``rich.live.Live`` panels stacked on top of ``logging.basicConfig(level=DEBUG)`` HTTP request lines from ``httpcore`` / ``httpx`` / ``huggingface_hub``. The two rendering loops fought each other in the user's terminal and the output was illegible: hundreds of debug lines interleaved with re-rendered panels. Replace ``Live`` with a simple block redraw — clear screen, print the state block, print any robot log lines, then a single ``> `` prompt. State changes are visible above the prompt, the way Claude Code's REPL renders. No flicker, no re-render races. ``_silence_noisy_loggers`` drops the chatty third-party HTTP / download / model-init loggers to WARNING. ``-v`` still enables DEBUG on the lerobot loggers; if the user needs the HTTP traces, they can flip those individually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 12:03:47 +02:00
Pepijn	0fb5f04965	fix(smolvla2): handle BatchEncoding return from apply_chat_template ``tokenizer.apply_chat_template(..., tokenize=True, return_tensors='pt')`` on newer transformers returns a ``BatchEncoding`` (dict-like) rather than a raw ``Tensor`` — particularly when the underlying call routes through a processor. ``_build_text_batch`` only handled the ``Tensor`` and ``list`` shapes, so the encoding object reached SmolVLA's ``embed_language_tokens`` and ``F.embedding`` blew up with ``argument 'indices' must be Tensor, not BatchEncoding`` on every high-level forward. Normalise the return: * ``BatchEncoding`` / ``dict`` → take ``input_ids`` (and the encoder's ``attention_mask`` when present, since ``pad_token_id`` can be ``None`` for SmolVLM and the fall-back ``ids != pad_token_id`` breaks then), * ``list[int]`` / ``list[list[int]]`` → wrap in a long tensor, * ``Tensor`` → keep as-is. After unwrapping, ensure shape ``(1, seq)`` and that ``attention_mask`` is a tensor on the same device as ``ids``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:59:57 +02:00
Pepijn	7296ac97af	fix(smolvla2): make silent generation failures visible in REPL Two failure modes were combining to make the runtime "look dead": 1. ``_build_text_batch`` produced lang tokens via ``apply_chat_template(return_tensors='pt')`` on CPU, but the policy sits on the configured device (mps / cuda). The first prefix-embed inside ``select_message`` then raised a device-mismatch on every call. The bare ``except Exception`` in ``_generate_with_policy`` swallowed it at debug level — no logs, no chat output, no visible sign anything had run. 2. Even when generation succeeded but returned an empty string (greedy EOS, unhappy chat template, etc.), the high-level steps silently no-op'd, so users saw nothing. Move tokens to ``policy.config.device`` in ``_build_text_batch`` so the prefix forward succeeds in the common case. Bump the swallowing log level to ``warning`` (with optional traceback under ``-v``), and when ``state`` is given route the same diagnostic into the REPL log via ``push_log`` so the user sees ``[warn] subtask gen failed: ...`` inline. Also push an ``[info] ... produced no text this tick`` line when generation runs but yields nothing, so empty completions are distinguishable from "step never ran". Apply the same surface to ``LowLevelForward.select_action`` failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:47:34 +02:00
Pepijn	9cbbcfb6a2	fix(smolvla2): tokenize lang prompt inline before select_action LowLevelForward was handing the observation provider's output straight to ``policy.select_action``, but SmolVLA's ``_get_action_chunk`` indexes ``batch[OBS_LANGUAGE_TOKENS]`` and crashes with ``KeyError: 'observation.language.tokens'`` when the key isn't there. Our provider deliberately strips the dataset's language columns (the runtime drives messages itself), so nothing else was producing those tokens — the chunk path crashed on the very first tick after task was set. Build a low-level prompt from current runtime state inline (task / plan / memory as the user turn, current subtask appended as a continuation assistant turn when known), tokenize it with the same helper the high-level steps use, and merge ``lang_tokens`` / ``lang_masks`` into the observation before the call. Skip the step when no task is set yet, and swallow ``select_action`` exceptions at debug level so a missing observation feature doesn't kill the REPL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:40:18 +02:00
Pepijn	fea41b29f5	fix(datasets): probe parquet for language columns before strict cast ``_load_hf_dataset`` was building the strict cast schema only from ``meta/info.json["features"]``. Datasets annotated by ``lerobot-annotate`` but still tagged at the older codebase version (no ``language_persistent`` / ``language_events`` entry in ``info.json``) carry both columns in the parquet itself but not in the features dict, so ``Dataset.from_parquet`` blew up with ``CastError: column names don't match`` when trying to project a 9-column parquet onto a 7-column schema. Probe one parquet shard's actual schema; if either language column is present in the parquet but missing from ``features``, graft it on using PR 1's ``language_persistent_column_feature`` / ``language_events_column_feature`` helpers. No-op when neither column is present (fully backwards-compatible with v3.0 datasets), no-op when both are already registered (fully forwards-compatible with future v3.1 ``info.json`` writes). This unblocks dry-run inference on PR 2-annotated datasets that weren't re-tagged to v3.1 — including the ones in the field today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:31:19 +02:00
Pepijn	7b4d281ef5	fix(smolvla2): build preprocessor fresh, don't round-trip the recipe ``PolicyProcessorPipeline.from_pretrained`` reconstructs each saved step by passing the persisted JSON config back to ``__init__``, but ``RenderMessagesStep.recipe`` (a ``TrainingRecipe``) doesn't survive the JSON round-trip — the saved entry is ``{}`` and the reconstructor crashes with ``missing 1 required argument: 'recipe'``. Bypass the round-trip in the runtime CLI by passing ``pretrained_path=None`` to ``make_pre_post_processors``. That re-runs ``make_smolvla2_pre_post_processors``, which reloads the recipe YAML referenced by ``cfg.recipe_path`` and wires it back into the step correctly. ``NormalizerProcessorStep`` still gets stats from ``ds_meta.stats`` so normalization matches training. Proper fix is to make ``RenderMessagesStep`` serializable (e.g. by persisting the recipe path / contents); this commit keeps it scoped to the runtime path so dry-run testing isn't blocked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:27:12 +02:00
Pepijn	29bb8bb20e	fix(tools): unblock pocket-tts resolution (>=1.0.0,<3.0.0) The previous bound `>=0.1.0,<1.0.0` matched zero published versions — pocket-tts went straight to 1.0.0 on PyPI, with 0.x never released. That made `uv sync --extra tools` (and any sync that pulls the `dev` / `all` superset) fail with "requirements are unsatisfiable" on every Python version uv tried, including 3.12. Bump to `>=1.0.0,<3.0.0` so 1.x and 2.x are reachable. SayTool only touches `TTSModel.load_model()`, `get_state_for_audio_prompt`, `generate_audio`, and `sample_rate` — small enough surface that 1.x and 2.x should both work; tighten if a real API break shows up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:15:20 +02:00
Pepijn	3fe686ce9f	feat(smolvla2): runtime accepts Hub IDs + dataset-driven dry-run The runtime CLI's loader was broken — it imported a `make_policy_from_path` that doesn't exist in `lerobot.policies.factory` — and the high-level text steps generated plan / subtask / memory / VQA from a text-only batch with no images or state, so dry-runs drifted from the training distribution. Switch to the standard `PreTrainedConfig.from_pretrained` + `make_policy(cfg, ds_meta=...)` flow so `--policy.path` accepts both local directories and Hub repo ids, and add a `--dataset.repo_id` path that walks a chosen episode and feeds preprocessed observations into every forward pass — including the four high-level steps (`HighLevelSubtaskFwd`, `MemoryUpdateFwd`, `UserInterjectionFwd`, `AskVQAFwd`). Frames are routed through the saved preprocessor pipeline with `language_persistent` / `language_events` stripped so the recipe-render step stays a no-op (the runtime supplies its own messages from current state). Also wires the rich-based two-zone REPL layout (`ui.py`) that the script was already importing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 11:09:19 +02:00
pepijn	a1b8134ef1	fix(smolvla2): train on rendered language batches Keep annotated language columns through collation, render batched recipe samples, and make SmolVLA2 text loss robust enough for distributed training on the steerable dataset. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-05 08:55:56 +00:00
Steven Palma	401a217597	chore(ci): increase time stale (#3507 )	2026-05-04 22:35:16 +02:00
Steven Palma	40094b0464	chore(ci): upgrade docker internal (#3505 )	2026-05-04 21:28:52 +02:00
pepijn	8fa8323c91	fix(annotate): sync language metadata after parquet rewrite Ensure annotated datasets advertise language columns in meta/info.json so non-streaming dataset loads cast against the rewritten parquet schema. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-04 15:17:15 +00:00
Pepijn	5f7c6ba61d	feat(annotate): compact steerable annotation prompts Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-04 15:57:04 +02:00
Jash Shah	fdbfc015a2	fix(peft): fix LoRA resume from Hub (PosixPath + double wrap) (#3485 )	2026-05-04 10:52:37 +02:00
Pepijn	223cc8a9e2	feat(smolvla2): inference runtime — select_message + multi-rate REPL Closes the loop on PR 3: SmolVLA2 can now be queried interactively at inference, dispatching the same five sub-recipe shapes it was trained on (action chunks, subtask gen, memory updates, plan/speech on interjection, VQA on questions). Modeling fixes + additions -------------------------- - ``_compute_text_loss``: standard next-token CE shift was missing (logits at position t were CE'd against the label at t — identity- mapped, learning nothing). Adds ``logits[:, :-1]`` / ``labels[:, 1:]`` shift to match HuggingFace ``LlamaForCausalLM``. - New ``select_message`` on ``SmolVLA2Policy``: AR text generation with KV caching, mirroring SmolVLA's ``select_action`` pattern. Single prefix forward fills the cache, then per-token forwards reuse it. Greedy + top-p nucleus sampling. Returns the decoded string with the prompt stripped. Runtime package — ``src/lerobot/policies/smolvla2/inference/`` ------------------------------------------------------------- - ``triggers.py`` — ``Trigger`` Protocol + ``HzTrigger`` / ``EventTrigger`` + ``TickClock``. The whole runtime ticks at ``max_rate_hz=50`` and each step gates itself off its own cadence. - ``runtime_state.py`` — runtime state dict factory plus tiny helpers (``take_event``, ``set_if_changed``, ``push_log``). Stable keys are documented at the top of the module. - ``steps.py`` — :class:`InferenceStep` base + concrete steps: ``LowLevelForward`` / ``DispatchAction`` (action path), ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd`` / ``AskVQAFwd`` (text paths), ``DispatchToolCalls`` (tool registry → ``Tool.call``). Each text step builds a chat-template prompt from current ``RuntimeState`` (task / plan / memory / subtask) matching what ``smolvla2_hirobot.yaml`` renders during training. Includes a tiny ``<say>...</say>`` parser for the ``user_interjection_response`` branch's combined plan + speech output. - ``runtime.py`` — :class:`SmolVLA2Runtime` composes the pipeline, drives ticks via ``TickClock``, polls a user-supplied ``event_collector`` per tick, and prints state-change log lines. - ``repl.py`` — :class:`StdinReader` non-blocking line reader with simple intent classification: ``stop`` / ``quit`` / ``exit`` → terminate; ``?`` suffix → ``user_vqa_query`` event; first line → set task; other lines → ``user_interjection``. CLI --- - ``src/lerobot/scripts/lerobot_smolvla2_runtime.py``: console script ``lerobot-smolvla2-runtime`` that loads a checkpoint, optionally instantiates ``SayTool`` (pocket-tts), wires up ``SmolVLA2Runtime`` + ``StdinReader``, and runs. Real-robot wiring (observation_provider / robot_executor) is intentionally left as a follow-up — v1 is dry-run / language- only so the REPL works without robot hardware. Registered in ``pyproject.toml`` ``[project.scripts]``. Known follow-ups ---------------- - Real-robot integration: today ``LowLevelForward`` only fires when an observation_provider is wired. The CLI prints a warning if ``--no_robot`` is omitted. - ``select_message`` runs an extra prefix forward; could share with the action path's prefix when both are needed in the same tick. - Tests: no end-to-end runtime test yet (would need a tiny SmolVLM fixture). The components compile and the public surface is exercised by the CLI's argument-parsing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:04:00 +02:00
Pepijn	af6d8ebd5b	feat(smolvla2): dual-head forward — flow loss + lm_head text loss The third and final commit of PR 3's SmolVLA2 work. Wires the actual training signal through: * ``predict_actions[i] = True`` → sample i contributes to flow loss * ``text_labels[i, t] != -100`` → token t of sample i contributes to LM-head cross-entropy Both routing knobs come from ``SmolVLA2ChatTokenizerStep`` (previous commit on this branch), which builds them from the recipe's ``message_streams`` / ``target_message_indices``. The per-sample ``predict_actions`` mask preserves the Pi0.5 convention from the plan's Section I.7: "True iff any low_level target exists". Implementation: - ``forward`` reads ``text_labels`` and ``predict_actions`` from the batch. When neither is present (vanilla SmolVLA usage with no recipe), delegates to ``SmolVLAPolicy.forward`` so unannotated datasets keep training as before — full backward compatibility. - ``flow_loss``: super().forward(reduction="none") returns the per-sample (B,) flow loss; we mask non-action samples with the ``predict_actions`` bool and renormalize by the count of action samples. ``flow_loss_weight = 0`` in the config disables this branch entirely (text-only training). - ``text_loss``: a prefix-only forward through the VLM (no action expert / suffix), slicing the lang-token range out of the resulting hidden states (``embed_prefix`` orders the prefix as ``[image_blocks..., lang, state]`` so the slice is unambiguous). Apply ``vlm.lm_head`` to those hidden states, cross-entropy with ``text_labels`` (ignore_index=-100). ``text_loss_weight = 0`` disables this branch (reverts to flow-only behaviour, matching SmolVLA exactly). - The two losses are summed with the config-supplied weights. Mixed-stream samples (one batch containing both action targets and text-only sub-recipes) are handled correctly: each sample contributes where its labels are valid and is masked elsewhere. Limitations / known follow-ups: - Text loss runs an additional prefix-only forward separate from the flow path's prefix forward. The forwards could share their prefix computation; for clarity of this first commit they don't. Optimization is straightforward when needed. - Per-sample loss for ``reduction="none"`` is not yet meaningfully defined for the dual path — we broadcast the scalar to (B,) for caller compatibility (e.g. RA-BC weighting will need follow-up). - Inference ``select_action`` is unchanged from SmolVLA today — it predicts actions only. A separate "generate text" ``select_message`` path is the natural next step for runtime use of the LM head (memory updates, plan refreshes, VQA answers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:54:57 +02:00
Pepijn	37b1eb218a	feat(smolvla2): chat-template processor + label mask + predict_actions Wires PR 1's recipe stack into the SmolVLA2 pipeline so multi-target sub-recipes (memory_update, ask_vqa, user_interjection_response, high_level_subtask) carry meaningful supervision through to the model. - New ``chat_processor_smolvla2.py`` with ``SmolVLA2ChatTokenizerStep``: reads ``messages`` / ``message_streams`` / ``target_message_indices`` from the rendered sample (PR 1 ``RenderMessagesStep``), calls ``apply_chat_template(messages, tools=DEFAULT_TOOLS, ...)`` on the SmolVLM tokenizer, and writes: OBS_LANGUAGE_TOKENS / _ATTENTION_MASK ← chat-templated prompt text_labels ← -100 except target msg tokens predict_actions ← True iff any low_level target Builds the label mask robustly by re-rendering the chat through each target's prefix and reading off the prefix length — same tokenizer, same tools, so the prefix tokens are guaranteed to be a prefix of the full sequence. Image/video content blocks (LeRobot ``feature``-keyed) are stripped before tokenizing; the actual image tensors flow through SmolVLA's existing ``OBS_IMAGES_*`` channels and ``embed_prefix`` puts them before the language embeddings, matching the chat-template-stripped text order. - ``processor_smolvla2.py``: when ``config.recipe_path`` is set, build a new pipeline with ``RenderMessagesStep`` + ``SmolVLA2ChatTokenizerStep`` instead of SmolVLA's plain ``TokenizerProcessorStep``. When ``recipe_path`` is ``None``, fall back to SmolVLA's pipeline so unannotated datasets still work unchanged. Resolves recipe paths relative to ``src/lerobot/configs/`` so ``recipes/smolvla2_hirobot.yaml`` works directly. The next commit on this branch picks up ``text_labels`` and ``predict_actions`` from the batch and routes them through the SmolVLM ``lm_head`` for the actual dual-loss training. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:21:03 +02:00
Pepijn	52e1fd35cb	feat(tools): src/lerobot/tools/ — runnable tool registry + SayTool Ships the runtime side of the OpenAI-style function-calling stack introduced in PR 1 (catalog in ``meta/info.json["tools"]``) and PR 2 (annotation pipeline writes the catalog after a run). One file per tool — heavy deps stay isolated. Layout: - ``base.py`` — :class:`Tool` Protocol: ``name``, ``schema``, ``call(arguments)``. Runtime-checkable so tests can use ``isinstance(...)``. - ``registry.py`` — :data:`TOOL_REGISTRY` (name → class) plus ``get_tools(meta, **kwargs)`` that instantiates every entry whose ``function.name`` is registered. Tools whose name is unknown are silently skipped — the schema still rides through the chat template, the model just can't actually invoke that tool at inference. - ``say.py`` — :class:`SayTool` wrapping Kyutai's pocket-tts (CPU-only, ~100M params, ~6× real-time on a MacBook Air M4). Lazy model load: pocket-tts is imported and the voice state computed on first ``call(...)`` (or eagerly via ``preload()``). Returns the PCM tensor; optionally writes a ``.wav`` to ``output_dir`` for offline inspection. - ``__init__.py`` — re-exports the public surface. Optional install: pip install lerobot[tools] The ``[tools]`` extra in ``pyproject.toml`` pulls in ``pocket-tts`` + ``scipy`` (for the wav writer). Adding more tools later means a new file + a registry entry — no new extras unless the tool brings new deps. To add your own tool, follow the three-step guide in ``docs/source/tools.mdx`` (PR 1): 1. Drop ``src/lerobot/tools/<my_tool>.py`` with a ``Tool``-conforming class. 2. Register the class in ``TOOL_REGISTRY`` (this file). 3. Pre-populate ``meta/info.json["tools"]`` with the schema (or let ``lerobot-annotate`` add it on the next run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:58:04 +02:00

... 3 4 5 6 7 ...

1740 Commits