``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but
``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner
value must be the actual ``CameraConfig`` subclass instance for the
chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts
blew up in ``RobotConfig.__post_init__`` with
``AttributeError: 'dict' object has no attribute 'width'`` when it
iterated cameras and tried to read attributes.
Look up the right subclass per-camera by its ``"type"`` field via
``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import
dance we already do for ``RobotConfig``: eagerly walk
``lerobot.cameras``'s submodules so the registry is populated
before lookup). Construct an instance with the rest of the dict's
fields. On an unknown camera type, raise a clean ``ValueError``
listing the available choices.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``RobotConfig._choice_registry`` is populated as a side-effect of
each robot's ``@RobotConfig.register_subclass`` decorator running,
and those decorators only fire when the corresponding
``lerobot.robots.<name>`` module is imported. The package's
``__init__.py`` doesn't import them — instead ``make_robot_from_config``
does it lazily in its big if/elif chain.
``_build_robot`` jumped the gun: called ``RobotConfig.get_choice_class
(robot_type)`` before any robot module had been imported, so the
registry was empty and every ``--robot.type=<X>`` produced
``KeyError: 'X'`` (e.g. ``KeyError: 'omx_follower'``).
Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules`` and
``importlib.import_module`` each one before the lookup. ~200ms on the
first invocation, negligible for an autonomous run. On a real
``KeyError`` (typo / unsupported robot), raise a clean ``ValueError``
listing the registry's available choices instead of a bare KeyError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled action-norm safety clip duplicated what every
``RobotConfig`` already exposes — ``max_relative_target`` — and at
the wrong layer (after postprocess but before send_action, instead
of inside the robot driver where every other lerobot entry point
puts it). The norm clip also rejected entire actions instead of
clipping per-motor relative motion, so a single rogue joint would
kill the whole tick.
Replace with ``--robot.max_relative_target``: a string parsed as
either a bare float (uniform per-motor cap) or a JSON object
mapping motor name → cap. Passed through to
``RobotConfig(max_relative_target=...)`` at robot construction;
the driver's ``send_action`` clips each commanded joint position
relative to the current measured one before issuing it on the bus —
same behaviour ``lerobot-record`` ships.
Also bump ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new
chunk per second is what the trained checkpoint can comfortably
keep up with on common hardware and gives smoother motion than
sub-second chunk regenerations (no RTC interpolation between
chunks yet).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**#1 Plan-update phase reports correct skip count.**
``_run_plan_update_phase`` only ran ``run_plan_updates`` for episodes
with at least one interjection but hardcoded ``episodes_skipped=0``.
The summary undercounted skipped episodes. Now returns
``len(records) - processed`` so processed + skipped == total.
**#2 ``run_hf_job.py`` installs ``openai``.**
The ``CMD`` block does ``pip install --no-deps lerobot[branch]`` then
explicitly lists transitive deps. ``openai`` was missing — and since
``VlmConfig.backend`` defaults to ``"openai"``, the job would have
``ImportError``'d when ``vlm_client._make_openai_client`` ran.
**#3 Dedupe subtask-span reconstruction.**
Module 1's ``_reconstruct_subtasks_from_rows`` (no ``and spans`` guard)
and Module 2's ``_read_subtask_spans`` (with the guard) had near-
identical logic. Promoted to ``reconstruct_subtask_spans`` in
``reader.py`` using the safer guarded form. Both modules now import
the single helper.
**#5 Atomic staging.py JSONL writes.**
Mirroring the parquet-writer fix from an earlier review round:
``EpisodeStaging.write`` now writes to a sibling ``.tmp`` and
``Path.replace`` atomically. A crash mid-write can no longer leave a
half-written JSONL that ``read()`` would then fail to parse.
**#6 Atomic ``info.json`` write.**
Same pattern in ``executor._ensure_annotation_metadata_in_info`` —
``info.json`` is load-bearing for dataset metadata, so partial writes
brick the dataset.
**#7 Writer's role-key guard.**
``_normalize_persistent_row`` and ``_normalize_event_row`` accessed
``row["role"]`` directly while every other field used ``.get()``.
Pre-validate ``"role" in row`` and raise a friendly ``ValueError``
naming the row, so a future module that accidentally drops ``role``
fails with a triagable message instead of a bare KeyError deep in the
writer.
**#8 Last subtask span's ``end`` extends to episode end.**
``reconstruct_subtask_spans`` (the new shared helper) takes an optional
``episode_end_t``. When provided, the final span's ``end`` is closed
to that timestamp instead of equalling its own ``start`` (zero
duration). Both Module 1's plan-update pass and Module 2's interjection
anchoring pass ``record.frame_timestamps[-1]``, so downstream "current
subtask at refresh_t" lookups no longer miss refreshes that land
inside the final span.
Sweep: 66 passed, 0 failed. Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both tests were stale relative to design changes that landed earlier on
this branch. Update the tests to match the current production contract.
**``test_module1_attaches_video_block_to_subtask_prompt``**
The test took ``captured[0]`` and asserted on its content blocks, but
Module 1 issues several sub-prompts and the rephrasings call (which is
text-only, no video block) usually lands first. Two fixes:
* The test's intent is "the subtask prompt carries the video block" —
not "the first prompt carries it". Pick the call by content
(``"atomic subtasks"`` keyword in the text block) so the test is
resilient to future reordering of unrelated sub-prompts.
* Set ``n_task_rephrasings=0`` so the rephrasings call is skipped
entirely — keeps the test focused on ``_generate_subtasks``.
**``test_module2_mid_episode_emits_paired_interjection_and_speech``**
Two issues both rooted in design changes on the branch:
1. ``InterjectionsAndSpeechModule._mid_episode_interjections`` now
anchors interjections on subtask boundaries from Module 1's staging
tree, bailing out with zero rows when no spans exist. The production
executor runs Module 1 first; the test ran Module 2 in isolation.
Reproduce the contract by seeding two ``style=subtask`` rows in the
staging before calling Module 2 — gives it the single ``0 → 1``
boundary it needs.
2. The test's stub responder used the marker ``"ONE realistic
interruption"`` to match the interjection prompt, but that string is
from a previous prompt version. The current
``module_2_interjection.txt`` says ``"Write ONE interjection..."`` —
the old prompt asked for counterfactual interjections (e.g. "skip the
wipe"), the new one anchors on the upcoming subtask. Marker updated
to ``"Write ONE interjection"``; canned response wording aligned to
the new design.
Sweep on the language stack: 66 passed, 0 failed (was 64 passed, 2
failed). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).
**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.
**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and ``Path.replace`` atomically.
**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
so without it ``_make_openai_client`` would ImportError on a fresh
``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
Promoted to ``snap_to_frame`` in ``reader.py`` (next to
``EpisodeRecord``); both modules now import it. Backwards-compat alias
not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
call. Now memoizes via a private dataclass field so repeat calls from
different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
that was dead because the prior block already handled and reset
``escape``. Replaced with a comment explaining the invariant.
**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
``_server_is_up`` — the URL is the user-configured local-server endpoint
the CLI itself spawned, not arbitrary user input.
**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.
Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five Module 1 sub-prompts (`_derive_task_from_video`,
`_generate_task_rephrasings`, `_generate_subtasks`, `_generate_plan`,
`_generate_memory`) all repeated the same shape:
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict) and isinstance(result.get(<field>), <type>):
...
…each spelled with slightly different field names + post-processing.
Three small helpers replace it:
* `_vlm_field(messages, field)` — single VLM call, returns
``result[field]`` or ``None``. Centralizes the
``generate_json([m])[0]`` + ``isinstance(dict)`` dance.
* `_text_message(text)` — wraps a string in the canonical user-message
shape every text-only prompt builds inline.
* `_video_message(record, prompt)` — combines the episode video block
with a prompt; replaces the duplicated video-block construction
inside `_generate_subtasks` (which previously inlined the same
``use_video_url``/``frames_per_second``/``max_video_frames`` branches
that `_episode_video_block` already implements).
Net -35 LOC. Each call site now is 3-5 lines instead of 10-20. The
public method signatures are unchanged so tests don't move.
Drive-by: `_task_seems_bad` collapsed via SIM103 fix; `zip` in
`run_plan_updates` annotated `strict=True` per ruff B905.
Tests: same 2 pre-existing module-impl failures
(`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`) —
they were failing on `origin/feat/language-annotation-pipeline` before
this commit and continue to do so for the same reasons. 61/63 in the
language stack pass; pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve conflicts and pull in the latest PR 1 fixes.
Conflicts:
- pyproject.toml: PR 1 added `lerobot-rollout` and PR 2 added
`lerobot-annotate` to the same `[project.scripts]` block. Kept both.
- uv.lock: dropped both sides and regenerated against the merged
`pyproject.toml` (PR 2 dropped the `datatrove` dep when distribution
moved to HF Jobs; PR 1's lock didn't have it).
Test follow-up:
- `tests/annotations/test_pipeline_recipe_render.py` — PR 1 deleted
`src/lerobot/configs/recipes/pi05_hirobot.yaml` (review feedback:
remove the canonical-recipe file; recipes are user-supplied). The
cross-PR contract this test guards is "the recipe DSL renders
non-empty messages from pipeline output", which doesn't depend on
any specific YAML, so the test now builds an inline blend recipe
with the same coverage. Passes.
Sweep: 82 passed, 2 failed (pre-existing module-impl bugs:
`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`).
The PR 1 carryover (`test_emitted_at_raises_on_ambiguous_per_camera_vqa`)
is now passing — the merge brought in PR 1's tightened `_select_one`
ambiguity check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The executor previously claimed it would "optionally hand off" to
datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it
already runs phases inline in every code path, and HF Jobs (see
``examples/annotation/run_hf_job.py``) is the actual distribution
strategy. Stop pretending we have an executor selector.
* `executor.py`: drop `select_executor_class`, the "kind" log line, and
the references to LocalPipelineExecutor / SlurmPipelineExecutor.
Module docstring now says distribution is delegated to HF Jobs.
* `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`,
`slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only
`episode_parallelism`. While here, prune the longer "why" docstrings
on every field down to the load-bearing bits — full story moves to
`docs/source/annotation_pipeline.mdx`.
* `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the
`[annotations]` extra; the dep was only there for the (never used)
cluster executors. Comment block notes the new HF-Jobs delegation.
* `reader.py`, `lerobot_annotate.py`: drop their own datatrove /
flavor-namespace mentions.
* `docs/source/annotation_pipeline.mdx`:
- remove the flavor-namespace / sidecar paragraph (out of scope —
"multiple revisions = multiple copies" is dataset-level policy);
- remove the "writer drops the legacy `subtask_index` column" note
(already covered by PR 1's intentional-break call-out);
- remove the chat-template + `apply_chat_template(messages, tools=...)`
line (covered by Tools doc);
- replace the "executor picks Local vs Slurm" paragraph with
`--executor.episode_parallelism` and a pointer to HF Jobs;
- rewrite the style→recipe section to talk about "recipes" generically
instead of pinning a specific YAML;
- add a "Running on Hugging Face Jobs" section pointing at
`examples/annotation/run_hf_job.py`;
- add a "Running locally" example matching the CLI's docstring
(`uv run lerobot-annotate --root=... --vlm.model_id=...`);
- extend the paper-inspirations list with Pi0.7 and Steerable VLA
Policies (Zhao 2025) for Module 3.
Tests: same 3 pre-existing failures as before this commit (2 module
assertions still in flight; 1 carryover from PR 1). 41/44 pass.
Pre-commit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(ci): run multi-task benchmark evals 5-at-a-time in parallel
The eval script supports running tasks concurrently via a
ThreadPoolExecutor (env.max_parallel_tasks). Apply it to the four
multi-task benchmark CI jobs (RoboTwin, RoboCasa, RoboMME, LIBERO-plus
— 8-10 tasks/task_ids each) so they finish in ~2 waves of 5 instead of
running sequentially. Single-task jobs (Libero, MetaWorld, RoboCerebra)
are unchanged.
* fix(ci): cap VLABench smoke eval at 50 steps per task
VLABench's default episode_length is 500 steps; with 10 tasks at ~1 it/s
the smoke eval took ~80 minutes of rollouts on top of the image build.
The eval is a pipeline smoke test (running_success_rate stays at 0% on
this short rollout anyway), so we don't need full episodes — cap each
task at 50 steps to bring total rollout time down ~10x.
* fix(ci): run VLABench tasks 5-at-a-time in parallel
The eval script already supports running multiple tasks concurrently via
a ThreadPoolExecutor (env.max_parallel_tasks). Set it to 5 so the 10
VLABench tasks finish in ~2 waves instead of running sequentially.
* feat: add pretrained vision encoder weights for diffusion and vqbet
* fix test by re-generating artifacts
---------
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
The robotwin benchmark Dockerfile still installed cuda-nvcc-12-4 and
cuda-cudart-dev-12-4 after #3505 upgraded the base image to CUDA 12.6.3
on Ubuntu 24.04. Those packages aren't available in the ubuntu2404 CUDA
repo, so the build failed at apt-get install. Bumping both to -12-6 to
match the base image.
Reword the two callouts in `tools.mdx` to describe the runtime layer
in present tense ("not part of the catalog layer shipped today",
"those modules don't yet exist in the tree") instead of pointing at a
specific follow-up PR. Keeps the doc honest about what works now
without coupling it to a particular release order.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* **Float tolerance in `emitted_at` for persistent styles.** The
``_timestamp(row) == t`` exact-equality check silently missed any
caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``)
even though the parquet timestamp would only differ by ULPs. Added
``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance``
instead, with a docstring explaining why exact equality wasn't
enough and why 0.1 s is safe at typical 30–100 Hz control rates.
Test asserts the new behavior at half-window (matches) and
double-window (no match) using the constant so it stays in sync.
* **`MessageTurn.stream` is required at construction.** It was typed
``MessageStream | None = None`` so YAML could omit ``stream:`` and
pass the dataclass invariant — but ``_validate_rendered`` rejected
``None`` streams later, surfacing the error at the first sample
instead of at recipe load. Now ``__post_init__`` raises
``ValueError`` if ``stream`` is ``None``, with the list of valid
streams in the message. The redundant late-stage check in
``_validate_rendered`` is replaced with a one-line comment that
cites the upstream invariant. Test pins the new construction-time
rejection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* **#2 — dedupe `_PLACEHOLDER_RE`.** The same regex was compiled in
`recipe.py` and `language_render.py`. Promote to module-level
`PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares
template syntax) and import from `language_render.py`.
* **#3 — centralize language column names.** `io_utils.py` had
hardcoded `{"language_persistent", "language_events"}` literals at
two sites. Replace with `LANGUAGE_COLUMNS` import so a future column
rename can't silently desync.
* **#4 — defensive collate preserved-keys.** `lerobot_collate_fn`
silently filtered language fields from samples that didn't have
them, which would hand downstream consumers a preserved list
shorter than the tensor batch. Now: if any sample carries a key,
every sample in the batch must carry it; otherwise raise a
`ValueError` so the upstream rendering bug surfaces at the boundary.
* **#5 — `_scalar` rejects non-singleton lists.** Previously a zero-
or multi-element list fell through and triggered confusing
`float([])` errors downstream. Now raises `ValueError` with the
actual length.
* **#6 — refactor `_extract_complementary_data`.** Replace 11 lines
of `key = {... if ... else {}}` plus an 11-line splat dict with a
single `_COMPLEMENTARY_KEYS` tuple iterated once.
* **#7 — document `EXTENDED_STYLES`.** Was an empty `set()` with no
comment. Add a docstring explaining it's an intentional extension
point: downstream modules append project-local styles before
`column_for_style` is called.
* **#9 — `tools.mdx` notes the runtime layer is future work.** The
page referenced `src/lerobot/tools/`, `registry.py`, and
`get_tools(meta)` — none exist in this PR. Added a callout at the
start of "How to add your own tool" plus a note on the
implementations paragraph.
* **#10 — tests for YAML round-trip, malformed rows, blend
validation.** `test_recipe.py` grew from 1 case to 12 covering:
blend-or-messages exclusivity, target-turn requirement, blend
emptiness, weight presence/positivity, nested-blend rejection,
`from_dict` with nested blends, `from_yaml` / `load_recipe`
agreement, top-level non-mapping rejection. Added a malformed-row
test for `_normalize_rows` that asserts non-dict entries raise
`TypeError`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime CLI was deliberately scoped to dry-run only: it
hard-coded ``robot_executor=None`` and printed a "real-robot
integration is a follow-up" warning even when ``--no_robot`` was
omitted. The runtime *engine* was already structured for real-robot
operation (separate ``LowLevelForward`` chunk-rate generation +
``DispatchAction`` ctrl-rate dispatch with a ``robot_executor``
hook); only the wiring was missing.
Add the wiring:
* ``_load_policy_and_preprocessor`` now also returns the
postprocessor (action denormaliser).
* ``--robot.type`` / ``--robot.port`` / ``--robot.id`` /
``--robot.cameras`` (JSON) build a ``Robot`` via
``make_robot_from_config`` and connect it.
* ``_build_robot_observation_provider`` reads
``robot.get_observation()`` each call, drops the language
columns (runtime drives messages itself), and runs the policy's
preprocessor (rename → batch → device → normalise).
* ``_build_robot_action_executor`` postprocesses the policy's
action tensor (denormalise), converts to the ``{joint: value}``
dict via ``make_robot_action(action, ds_meta.features)``, and
calls ``robot.send_action(...)``. Optional ``--max_action_norm``
safety clip rejects ticks whose action L2 norm exceeds the
threshold (kill-switch when bringing up a new robot).
* ``_run_autonomous`` runs ``runtime.run()`` in a background
thread (the policy must keep generating chunks at chunk_hz and
dispatching at ctrl_hz regardless of stdin) and handles user
interjections / VQA queries from the foreground stdin loop.
Confirmation prompt before start (skip with ``--auto_start``);
Ctrl+C stops the thread and disconnects the robot cleanly.
* Autonomous mode requires ``--dataset.repo_id`` for action stats
/ feature shapes — pass the same dataset the policy was trained
on. The bootstrap path that pulls canonical task / plan / memory
runs in both REPL and autonomous modes so the model's first
prompt matches training distribution.
Dry-run REPL behaviour is unchanged when ``--robot.type`` is not
passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(policies): add EO-1 model
* chore(eo1): adjust policy_eo1_README.md to to avoid duplicate with eo1.mdx
* chore(eo1): remove policy_eo1_README.md, link eo1.mdx in policy folder
---------
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* **`meta.tools` actually reads `info.json["tools"]`.** `DatasetInfo`
had no `tools` field, so `from_dict` silently dropped the key (it
warned about unknown fields then discarded them) and the property
always returned `DEFAULT_TOOLS`. Added `tools: list[dict] | None`
to the dataclass; `to_dict()` drops it when unset so existing
datasets keep a clean `info.json`. Fixed the accessor to read
`self.info.tools` (the previous `.get(...)` would have raised
AttributeError on the dataclass anyway). Added regression tests:
fallback when absent, round-trip from disk, and round-trip
through `DatasetInfo.from_dict` / `to_dict`.
* **`motion` is not view-dependent — fix the docs.** The mdx claimed
rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES
= {"vqa", "trace"}` and the validator agrees: motion primitives are
joint/Cartesian-frame, not pixel-space. Updated both call-out
paragraphs in `language_and_recipes.mdx`.
* **Conditional `collate_fn` swap.** Added `meta.has_language_columns`
and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it,
so non-language datasets keep PyTorch's `default_collate`. Also
added a pass-through test in `test_collate.py` that asserts on a
plain tensor batch the custom collate matches `default_collate`
key-for-key, plus a test for the `None`-sample drop path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`lerobot.processor` re-exported `RenderMessagesStep` at the package
level, so importing anything from `lerobot.processor` pulled in
`lerobot.datasets.language` → `lerobot.datasets/__init__.py` →
`require_package("datasets")`, which fails in the Tier 1 base install
that intentionally omits the `[dataset]` extra. The chain bricked
collection for unrelated suites (`tests/policies/pi0_pi05/...`,
`tests/envs/...`, etc.).
* Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The
only consumer (the test) already imports from the submodule.
Document the deliberate omission in the module docstring.
* Add `pytest.importorskip("datasets", ...)` (and `pandas` where
needed) at the top of the four PR-added tests that exercise the
language stack:
- tests/datasets/test_language.py
- tests/datasets/test_language_render.py
- tests/processor/test_render_messages_processor.py
- tests/utils/test_collate.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* `ruff format` on CI (newer version) wants the short `camera=None`
ValueError on a single line.
* `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0`
pin (and picked up upstream `s390x` marker fixes for cuda packages).
CI runs `uv sync --locked` which rejected the divergence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_select_one` previously skipped its ambiguity check whenever any of
`role`/`tool_name`/`camera` was set, on the assumption that the caller
had already pinned down a unique row. That left a real ambiguity hole
for VQA: with two cameras emitting `(vqa, assistant)` at the same
frame, `emitted_at(..., role="assistant")` silently picked the first
sorted row instead of telling the recipe to add `camera=...`. The
existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test
already encoded the desired behavior.
Tighten the check: any time `len(rows) > 1` we now raise with the
selectors echoed back, so users see exactly which fields they passed
and that more is needed to disambiguate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`;
only `emitted_at` actually consults events. The dispatcher in
`_resolve_spec` now passes events conditionally.
* Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a
single `_row_sort_key` and drop the `sort_key` parameter from
`_select_one`. Event rows lack `timestamp` (it is implicit in the
frame) and now default to `0.0` for sort purposes — the
`(style, role)` tiebreaker is unchanged.
* Inline `_select_latest` into `active_at` (its only caller).
* Collapse `emitted_at`'s dual-branch into one `_select_one` call.
* Tighten `_validate_persistent_resolver` to a single
`column_for_style(style) != LANGUAGE_PERSISTENT` check.
* Parameterize `test_per_camera_blend_renders_both_views` over the two
cameras and factor the sub-recipe builder into `_vqa_subrecipe` so
the test no longer hand-rolls two near-identical recipe blocks.
Net -98 LOC; behavior, public resolver names, and test expectations
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs combining to make the brand-new ``_tool3`` dataset
unloadable:
1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated
dataset folder but never creates a codebase-version tag, so
``api/datasets/<repo>/refs`` returns ``"tags": []``. Then
``LeRobotDatasetMetadata`` → ``get_safe_version`` →
``get_repo_versions`` returns empty and the loader raises
``RevisionNotFoundError``.
2. ``RevisionNotFoundError`` itself was unconstructible: its
``HfHubHTTPError.__init__`` indexes ``response.headers``
unconditionally on current ``huggingface_hub`` versions, so
constructing it without a real ``Response`` blew up with
``AttributeError: 'NoneType' object has no attribute 'headers'``,
masking the real "no tag" message.
Fix#1: after upload, read ``meta/info.json["codebase_version"]`` and
``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset',
exist_ok=True)`` so the dataset is loadable straight from the Hub on
the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree
``CODEBASE_VERSION`` if info.json is missing/malformed; on tag
creation failure, prints the manual one-liner the user needs.
Fix#2: stop trying to instantiate ``RevisionNotFoundError`` (which
inherits HfHubHTTPError) for what is really a config issue, not an
HTTP failure. Raise plain ``RuntimeError`` with the same message —
the caller actually sees what's wrong instead of an upstream
attribute error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``RevisionNotFoundError`` inherits from
``huggingface_hub.HfHubHTTPError`` which made ``response`` a required
keyword-only argument on recent versions. Constructing it with just a
message string blew up with
``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only
argument: 'response'`` instead of surfacing the actual problem (the
dataset/checkpoint repo doesn't exist on the Hub yet).
Pass ``response=None`` explicitly. Fall back to the bare-message form
for older ``huggingface_hub`` versions that don't accept the kwarg.
Also clarify the message to call out the most common cause: typing a
hub repo id that hasn't been pushed yet (instead of just "needs a
version tag").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and
``executor.episode_parallelism=32``. With 2 cameras per dataset that
produced ~12× the original VQA call volume, all submitted concurrently.
Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's
KV cache pegged at 94% with 800+ in-flight requests, and the
multimodal cache corrupted with ``AssertionError: Expected a cached
item for mm_hash='...'`` (a known vLLM bug under image-heavy
concurrency). Module 1 and 2 ran fine; Module 3 was the bottleneck.
Pull back the multipliers to land in a sustainable spot:
* module_3.K: 3 (kept) — three diverse questions per emission,
where the diversity actually helps the LM head.
* module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original
emission rate. Net VQA volume is now ~3× original (K alone) on
a single camera, ~6× across both cameras — manageable.
* module_2.max_interjections_per_episode: 9 → 6 — still 2× the
default, fewer than the prior 3× to keep total request volume
in check.
* vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the
multimodal request path so the mm_cache doesn't desync.
* executor.episode_parallelism: 32 → 16 — half the episodes
in flight at once, so peak vLLM load is ~half.
n_task_rephrasings stays at 30 (text-only, doesn't load the image
path) and vlm.temperature stays at 0.7. The diversity gains are
preserved; only the throughput knobs come down.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following Pi0.7 §V (prompt expansion / diverse context conditioning),
push more atom variants per episode and higher VLM sampling
temperature so the training distribution has enough wording diversity
that the LM head is forced to use its parameters rather than memorise
specific (prompt, target) pairs.
Changes vs prior annotation pass:
* vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call
now produces diverse phrasings; same prompt yields different
completions across emissions.
* module_1.n_task_rephrasings: 10 → 30 — three times as many
``task_aug`` rows in language_persistent. ``${task}`` already
rotates through them deterministically per sample_idx (see
``_resolve_task`` in language_render.py).
* module_2.max_interjections_per_episode: 3 (default) → 9 — more
``user_interjection_response`` training samples + more plan
refresh events.
* module_3.K: 1 → 3 — three VQA pairs per emission tick instead of
one. Combined with the hz bump below, ~6× more VQA samples.
* module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission
rate within each subtask span.
Pushes to a new hub repo (``_tool3``) so the working ``_tool2``
dataset stays intact for comparison. ``${task}`` already wired to
rotate through ``task_aug`` rows, so no renderer change needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Memorised models can collapse to dominant-mode outputs (the
JSON-token salad ``":":":":...`` from VQA training) when the prompt
drifts even slightly from training distribution. Without a guard,
that gibberish lands in ``current_subtask`` / ``current_plan`` /
``current_memory``, which feeds the next tick's prompt and cascades
into worse outputs. The user observed exactly this: a clean run
followed by a tick that wrote ``" " "`` into plan and memory, then
slow recovery several ticks later.
Add ``_looks_like_gibberish`` heuristic (alpha density, repeating
chars, JSON-prefix sniff) and apply it before mutating state in
``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``.
Bad generations are logged inline (``[info] subtask gen rejected
(gibberish): "":":":..."``) so the user can see what was dropped, but
the state stays at its last-known-good value (typically the dataset
bootstrap) instead of being polluted.
VQA path is intentionally exempt — its training targets *are*
JSON-shaped, so the heuristic would false-positive on them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user-typed task and the dataset's canonical task differ in
wording (capitalisation, ``green box`` vs ``green bin``, etc.). With
``text_loss`` driven down to ~6e-6 across 78 epochs the model is
memorised on the *exact* rendered training prompts: any wording drift
puts the prompt out of distribution and the model collapses to its
dominant training mode (VQA JSON output).
When ``--dataset.repo_id`` is set, automatically:
* read the canonical task string from the chosen episode (and use
it as ``--task`` when the user didn't pass one);
* pull the active ``plan`` / ``memory`` / ``subtask`` rows from the
persistent slice (latest row whose timestamp ≤ start frame's
timestamp — same semantics as the renderer's ``active_at``) and
seed them into the runtime state.
The first prompt the runtime builds at REPL start now mirrors what
the recipe rendered during training (task + active plan + active
memory + optional current subtask). The user can still override any
of these by typing.
Memorisation itself is upstream (training mix collapsed to too few
unique high-level targets); this commit only fixes the inference-side
prompt mismatch that was making the memorisation surface as gibberish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The four high-level steps shared one generic
``_control_context_messages`` that jammed task + plan + memory +
completed_subtask into a single user message. The recipes in
``smolvla2_hirobot.yaml`` each have a *specific* multi-message layout
(``memory_update``: ``user(task) → assistant(prev memory) →
user(completed subtask)``; ``high_level_subtask``: ``user(task+plan+
memory) → user(current subtask)``; ``user_interjection_response``:
``user(task) → assistant(prev plan) → user(interjection)``). After
``apply_chat_template`` those layouts produce different prompts than
the runtime's flattened single-user-turn version, and the model fell
back to its dominant training mode (VQA JSON output) — generating
``":":":":":":...`` repetition.
Add four per-recipe prompt builders (``_msgs_for_subtask``,
``_msgs_for_memory``, ``_msgs_for_interjection``, ``_msgs_for_vqa``),
each mirroring its sub-recipe's exact message structure including
the ``if_present`` skips. Wire each high-level step to its matching
builder. Inference prompts now line up with what the model saw in
training, so generation should produce coherent text instead of
repeated tokens.
Generic ``_control_context_messages`` is kept (still used by tests
and the no-recipe fallback path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous rewrite drove generation through ``vlm.generate()`` (the
standard SmolVLM path), which ignores SmolVLA's custom ``embed_prefix``
that interleaves images + lang + state. Result: the model received a
prompt format it had never been trained on at inference and emitted
JSON-fragment gibberish (``" " " ,",","`` ``cube lift {"...``).
Revert to the cumulative-buffer AR loop driven through
``vlm_with_expert.forward`` — the *same* forward call ``_compute_text_loss``
makes during training (``inputs_embeds=[prefix_embs, None],
use_cache=False, fill_kv_cache=True``). With ``fill_kv_cache=True``,
every layer routes through ``forward_attn_layer``, which gracefully
skips ``None`` expert inputs (``if hidden_states is None or layer is
None: continue``); cross-attention layers — which would otherwise hard-
require a non-None expert input — are bypassed entirely.
Inference now sees the same prefix structure as training: images +
lang + state, with new tokens appended to the lang region. The text
distribution matches what the model was trained to produce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's image preprocessor sizes frames to whatever the action
expert was trained on, but SmolVLM's standard vision tower expects
its own default tile grid (e.g. 384/14 → 27×27 patches). The
mismatch surfaces deep in the post-vision reshape as
``RuntimeError: shape '[2, 34, 34, 768]' is invalid for input of
size 1843200`` — the model has 1200 patches but expects 34×34=1156.
Drop ``pixel_values`` from ``vlm.generate(...)`` so SmolVLM runs as
a text-only LM at REPL time. The high-level branches (subtask /
plan / memory) are dominated by their text context anyway, so this
is acceptable for dry-run inference. VQA loses its image grounding
— that will be marked as expected for the dry-run path until a
follow-up either re-processes images through SmolVLM's own
``ImageProcessor`` to match its tile grid, or gives
``vlm_with_expert`` a real AR text decode mode that handles state
and image embeddings the way training does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hand-rolled AR loop in ``select_message`` was fighting the
underlying ``vlm_with_expert.forward`` design, which assumes the
"prefix-once + suffix-always-via-expert" pattern that ``denoise_step``
uses for action chunks. Cross-attn layers (every other layer with
``attention_mode='cross_attn'`` + ``self_attn_every_n_layers=2``)
hard-require an expert input on every call: passing
``inputs_embeds=[current_embs, None]`` crashed at
``expert_layer.input_layernorm(None)`` with ``'NoneType' object has
no attribute 'dtype'``. Earlier KV-cache attempts ran into the
matching ``[15, 139] vs [15, 1]`` shape mismatch because the cache
gets *overwritten*, not appended, on each ``fill_kv_cache=True`` call
— there's just no AR-text-decode mode in this forward.
Stop fighting it: drive AR text generation through the underlying
SmolVLM via ``vlm.generate(input_ids=..., attention_mask=...,
pixel_values=...)``. KV caching, sampling/greedy, EOS handling all
come from HF's standard implementation. Trade-off: ``state`` drops
out of the prefix at inference (no slot for it on the standard
SmolVLM path), so high-level generations may drift from training
distribution slightly. That's acceptable for the dry-run REPL — the
high-level branches (subtask / plan / memory / vqa) are mostly
vision+language conditioned anyway, and the action expert (where
state actually matters) goes through the unchanged ``select_action``
path.
Image features the runtime merged in (``observation.images.*``) are
stacked into the ``[B, num_images, C, H, W]`` shape SmolVLM expects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's ``vlm_with_expert.forward`` doesn't actually support
incremental KV cache growth — its only ``fill_kv_cache=True`` mode
*overwrites* the cache with the latest call's key/value states, and
its only ``fill_kv_cache=False`` mode concatenates ``cache + new``
into a local ``key_states`` for one matmul without ever updating the
cache itself. The original ``select_message`` decode loop tried to
use ``fill_kv_cache=True`` per step, which clobbered the cache to
1 token after the first decode and threw
``Expected size for first two dimensions of batch2 tensor to be:
[15, 139] but got: [15, 1]`` — the attention mask still expected
139 keys but the cached + new key_states only had 1.
Match the pattern ``denoise_step`` already uses successfully:
maintain a cumulative ``(embs, pad, att)`` buffer that starts as the
prefix and grows by one bool/embedding row per step. Each step
forwards the *full* sequence with ``use_cache=False,
fill_kv_cache=False, past_key_values=None`` so the matmul shapes
always line up. Generated-token rows are tagged ``pad=1, att=1``
which makes them fully causal among themselves while still able to
attend back to the entire prefix (per ``make_att_2d_masks``
semantics: a token can attend to any earlier token whose cumulative
``att`` count is ≤ its own).
Image encoding is still done once via the initial ``embed_prefix``
call — the expensive part doesn't repeat. The remaining cost is
O(n²) text-only transformer forwards, which is fine for the dry-run
REPL's 50–100 token responses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SmolVLA's ``eager_attention_forward`` does
``masked = torch.where(attention_mask[:, None, :, :], ...)``, which
requires a 3D ``[B, query_len, key_len]`` bool tensor so the
broadcast to 4D works. ``select_message``'s prefix forward got this
right (passes ``prefix_2d`` from ``make_att_2d_masks``), but the
KV-cache decoding loop built ``new_attn = torch.ones((bsize,
cur_pos + 1))`` — 2D — and the very first decode step blew up with
``IndexError: too many indices for tensor of dimension 2``.
During KV-cache decoding ``query_len = 1`` and
``key_len = cur_pos + 1`` (prefix + every token already generated),
so the right shape is ``[B, 1, cur_pos + 1]``. Match the layout
SmolVLA's working ``denoise_step`` uses for the equivalent
``prefix_pad_2d_masks`` build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues that combined to make the REPL unusable:
1. ``BatchEncoding.attention_mask`` is a ``Long`` tensor, but SmolVLA's
``eager_attention_forward`` does
``torch.where(attention_mask[..., None, :, :], ...)`` which
requires a *bool* condition. Every forward raised ``where expected
condition to be a boolean tensor, but got a tensor with dtype Long``
and the diagnostic surfaced it cleanly in the REPL — but generation
produced nothing useful. Cast to ``bool`` in ``_build_text_batch``
so the prefix forward goes through.
2. The interactive REPL used ``rich.live.Live`` panels stacked on top
of ``logging.basicConfig(level=DEBUG)`` HTTP request lines from
``httpcore`` / ``httpx`` / ``huggingface_hub``. The two rendering
loops fought each other in the user's terminal and the output was
illegible: hundreds of debug lines interleaved with re-rendered
panels.
Replace ``Live`` with a simple block redraw — clear screen, print
the state block, print any robot log lines, then a single ``> ``
prompt. State changes are visible above the prompt, the way Claude
Code's REPL renders. No flicker, no re-render races.
``_silence_noisy_loggers`` drops the chatty third-party HTTP /
download / model-init loggers to WARNING. ``-v`` still enables
DEBUG on the lerobot loggers; if the user needs the HTTP traces,
they can flip those individually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``tokenizer.apply_chat_template(..., tokenize=True, return_tensors='pt')``
on newer transformers returns a ``BatchEncoding`` (dict-like) rather
than a raw ``Tensor`` — particularly when the underlying call routes
through a processor. ``_build_text_batch`` only handled the ``Tensor``
and ``list`` shapes, so the encoding object reached SmolVLA's
``embed_language_tokens`` and ``F.embedding`` blew up with
``argument 'indices' must be Tensor, not BatchEncoding`` on every
high-level forward.
Normalise the return:
* ``BatchEncoding`` / ``dict`` → take ``input_ids`` (and the encoder's
``attention_mask`` when present, since ``pad_token_id`` can be
``None`` for SmolVLM and the fall-back ``ids != pad_token_id``
breaks then),
* ``list[int]`` / ``list[list[int]]`` → wrap in a long tensor,
* ``Tensor`` → keep as-is.
After unwrapping, ensure shape ``(1, seq)`` and that ``attention_mask``
is a tensor on the same device as ``ids``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two failure modes were combining to make the runtime "look dead":
1. ``_build_text_batch`` produced lang tokens via
``apply_chat_template(return_tensors='pt')`` on CPU, but the policy
sits on the configured device (mps / cuda). The first prefix-embed
inside ``select_message`` then raised a device-mismatch on every
call. The bare ``except Exception`` in ``_generate_with_policy``
swallowed it at debug level — no logs, no chat output, no visible
sign anything had run.
2. Even when generation succeeded but returned an empty string
(greedy EOS, unhappy chat template, etc.), the high-level steps
silently no-op'd, so users saw nothing.
Move tokens to ``policy.config.device`` in ``_build_text_batch`` so
the prefix forward succeeds in the common case. Bump the swallowing
log level to ``warning`` (with optional traceback under ``-v``), and
when ``state`` is given route the same diagnostic into the REPL log
via ``push_log`` so the user sees ``[warn] subtask gen failed: ...``
inline. Also push an ``[info] ... produced no text this tick`` line
when generation runs but yields nothing, so empty completions are
distinguishable from "step never ran". Apply the same surface to
``LowLevelForward.select_action`` failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LowLevelForward was handing the observation provider's output straight
to ``policy.select_action``, but SmolVLA's ``_get_action_chunk``
indexes ``batch[OBS_LANGUAGE_TOKENS]`` and crashes with ``KeyError:
'observation.language.tokens'`` when the key isn't there. Our provider
deliberately strips the dataset's language columns (the runtime drives
messages itself), so nothing else was producing those tokens — the
chunk path crashed on the very first tick after task was set.
Build a low-level prompt from current runtime state inline (task /
plan / memory as the user turn, current subtask appended as a
continuation assistant turn when known), tokenize it with the same
helper the high-level steps use, and merge ``lang_tokens`` /
``lang_masks`` into the observation before the call. Skip the step
when no task is set yet, and swallow ``select_action`` exceptions at
debug level so a missing observation feature doesn't kill the REPL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``_load_hf_dataset`` was building the strict cast schema only from
``meta/info.json["features"]``. Datasets annotated by
``lerobot-annotate`` but still tagged at the older codebase version
(no ``language_persistent`` / ``language_events`` entry in
``info.json``) carry both columns in the parquet itself but not in the
features dict, so ``Dataset.from_parquet`` blew up with
``CastError: column names don't match`` when trying to project a
9-column parquet onto a 7-column schema.
Probe one parquet shard's actual schema; if either language column is
present in the parquet but missing from ``features``, graft it on
using PR 1's ``language_persistent_column_feature`` /
``language_events_column_feature`` helpers. No-op when neither column
is present (fully backwards-compatible with v3.0 datasets), no-op when
both are already registered (fully forwards-compatible with future
v3.1 ``info.json`` writes).
This unblocks dry-run inference on PR 2-annotated datasets that
weren't re-tagged to v3.1 — including the ones in the field today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``PolicyProcessorPipeline.from_pretrained`` reconstructs each saved
step by passing the persisted JSON config back to ``__init__``, but
``RenderMessagesStep.recipe`` (a ``TrainingRecipe``) doesn't survive
the JSON round-trip — the saved entry is ``{}`` and the reconstructor
crashes with ``missing 1 required argument: 'recipe'``.
Bypass the round-trip in the runtime CLI by passing
``pretrained_path=None`` to ``make_pre_post_processors``. That re-runs
``make_smolvla2_pre_post_processors``, which reloads the recipe YAML
referenced by ``cfg.recipe_path`` and wires it back into the step
correctly. ``NormalizerProcessorStep`` still gets stats from
``ds_meta.stats`` so normalization matches training.
Proper fix is to make ``RenderMessagesStep`` serializable (e.g. by
persisting the recipe path / contents); this commit keeps it scoped to
the runtime path so dry-run testing isn't blocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous bound `>=0.1.0,<1.0.0` matched zero published versions —
pocket-tts went straight to 1.0.0 on PyPI, with 0.x never released.
That made `uv sync --extra tools` (and any sync that pulls the `dev` /
`all` superset) fail with "requirements are unsatisfiable" on every
Python version uv tried, including 3.12.
Bump to `>=1.0.0,<3.0.0` so 1.x and 2.x are reachable. SayTool only
touches `TTSModel.load_model()`, `get_state_for_audio_prompt`,
`generate_audio`, and `sample_rate` — small enough surface that 1.x
and 2.x should both work; tighten if a real API break shows up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime CLI's loader was broken — it imported a `make_policy_from_path`
that doesn't exist in `lerobot.policies.factory` — and the high-level text
steps generated plan / subtask / memory / VQA from a text-only batch with
no images or state, so dry-runs drifted from the training distribution.
Switch to the standard `PreTrainedConfig.from_pretrained` +
`make_policy(cfg, ds_meta=...)` flow so `--policy.path` accepts both local
directories and Hub repo ids, and add a `--dataset.repo_id` path that walks
a chosen episode and feeds preprocessed observations into every forward
pass — including the four high-level steps (`HighLevelSubtaskFwd`,
`MemoryUpdateFwd`, `UserInterjectionFwd`, `AskVQAFwd`). Frames are routed
through the saved preprocessor pipeline with `language_persistent` /
`language_events` stripped so the recipe-render step stays a no-op (the
runtime supplies its own messages from current state).
Also wires the rich-based two-zone REPL layout (`ui.py`) that the script
was already importing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep annotated language columns through collation, render batched recipe samples, and make SmolVLA2 text loss robust enough for distributed training on the steerable dataset.
Co-authored-by: Cursor <cursoragent@cursor.com>