lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-23 17:56:07 +00:00

Author	SHA1	Message	Date
Pepijn	f7b989ad97	fix(pi052): read backbone dtype from q_proj, not first parameter select_message's bf16 cast used next(paligemma.parameters()).dtype, which lands on a fp32-kept param (norm / embedding) under to_bfloat16_for_selected_params. Mask stayed fp32 while q/k/v were bf16 → SDPA still raised "invalid dtype for bias". Read the dtype from layers[0].self_attn.q_proj.weight instead — q_proj is always cast with the rest, so its dtype matches what SDPA sees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:46:08 +02:00
Pepijn	3b4376aa33	fix(pi052): cast attention bias to model dtype for bf16 inference `_prepare_attention_masks_4d` always returns fp32 (the 0.0 / -inf literals); with bf16 weights, HF PaliGemma's SDPA path raises "invalid dtype for bias - should match query's dtype" and select_message returns empty every step. Cast in both attention sites: `_compute_layer_ki` (training, when both experts run) and `select_message` (inference, VLM-only branch). Bf16 training + bf16 inference now run end to end with no dtype mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 18:42:26 +02:00
Pepijn	a0233f53f4	feat(annotate): default VLM to Qwen3.6-35B-A3B-FP8 Match the production target used in examples/annotations/run_hf_job.py. Per Scale Labs' dense-captioning ablations, model capacity dominates prompt-engineering gains; defaulting to the larger model avoids shipping a worst-tier configuration out of the box. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 11:46:59 +02:00
Pepijn	34269a5d78	fix(pi052): register PaliGemma <loc> tokens so they tokenize as single ids THE bug behind the <loc>-salad. PaliGemma's vocab reserves ids [256000, 257023] for <locDDDD> detection / pointing tokens, but the stock AutoTokenizer does NOT match them on raw text — it BPE-splits <loc0162> into SEVEN pieces (<, loc, 0, 1, 6, 2, >). So a VQA target like "<loc0162><loc0759> green box<eos>" tokenized to 16 pieces, not 5, and training the LM head supervised those generic BPE pieces instead of one detection-vocab id. The piece logits got pumped up across ~25% of supervised positions; at inference they dominated every turn — even subtask prompts produced <loc>-salad followed by the actual answer. Register the 1024 <locDDDD> tokens via tokenizer.add_tokens once on load, in every path the policy uses: PI052TextTokenizerStep (training encode), _build_text_batch_pi052 (runtime encode), and select_message's default tokenizer (runtime decode). Verified empirically with the real PaliGemma tokenizer: VQA target now tokenizes to 5 ids matching the loc-vocab range (256162, 256759, ...) with correct offset_mapping. This unlocks PaliGemma's actual detection prior; <loc>-salad cannot recur because each <locDDDD> is a single class on the LM head, not a character sequence the head accidentally learns to extend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 11:41:41 +02:00
Pepijn	75507491bf	fix(pi052): VQA <loc> conversion treats coords as 0-1000 normalized Confirmed empirically on the published dataset: VQA bbox/keypoint coordinates are Qwen2.5-VL's 0–1000 normalized grounding output, NOT pixels. Scanning 8207 samples showed x and y both spanning 0..1000 with ~30% of values exceeding the camera's pixel dimensions (which is impossible if they were pixels). _vqa_answer_to_loc was dividing by the observation image's H/W, so e.g. point [742, 158] on a 640x480 wrist cam clamped x to <loc1023> (the far-right edge) instead of mapping to <loc0760> (~74% across). Fix: divide by 1000 — the actual Qwen scale. The conversion is now camera-resolution-independent, so _camera_image_shapes and the image_shapes plumbing through __call__ / _encode_messages / _messages_vqa_to_loc are dropped. Tests updated to the new signature and the 0–1000 round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:21:28 +02:00
Pepijn	88519cb14c	fix(pi052): quantile-normalize actions before FAST tokenizer fit base.fit() rejected the data with "Vocab size 1024 is too small for the range of tokens 9339": the FAST tokenizer was fit on raw motor-unit actions, whose DCT-token range vastly exceeds the 1024 codebook. Two problems, one fix. (1) Raw actions blow up the token range. (2) At training time ActionTokenizerProcessorStep runs after the QUANTILES NormalizerProcessorStep, so it encodes normalized actions — fitting on raw actions mismatches that space. Replicate QUANTILES normalization (per-dim [q01,q99] -> [-1,1], clipped) before base.fit() so the fit and the training-time encode see the same distribution and the token range fits the codebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:02:20 +02:00
Pepijn	bc0c993b25	fix(pi052): FAST tokenizer fit read actions from column, not ds[i] fit_fast_tokenizer collected action chunks via ds[i]["action"], which builds a full training item — delta-timestamp expansion, video decode, image transforms. A single video-decode failure threw, was swallowed at debug level, and silently starved the fit of every chunk → "FAST fit collected zero action chunks", falling back to the universal tokenizer. Read the ``action`` column straight from the HF dataset instead: it carries no video, so it is immune to decode errors and far faster. Also fail fast with a clear message when the dataset has no ``action`` feature or all episodes are shorter than chunk_size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:51:53 +02:00
Pepijn	ddf4bc2063	fix(pi052): knowledge insulation crashed on wrong _gated_residual import _compute_layer_ki called modeling_gemma._gated_residual, but that adaRMSNorm gated-residual helper is a lerobot helper in pi_gemma, not part of HF transformers — so enabling knowledge_insulation crashed with AttributeError on the first training step. Import _gated_residual from pi_gemma, matching pi05's own layer code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:48:02 +02:00
Pepijn	b7317b6c29	test(pi052): round-trip coverage for VQA <loc> conversion Pins JSON pixel coords -> PaliGemma <loc> -> runtime parse back: the conversion preserves coordinate order (JSON x-first, <loc> y-first) and per-axis normalization, losing only <loc>-grid quantization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:24:24 +02:00
Pepijn	c026aed8f8	feat(pi052): train VQA spatial answers in PaliGemma <loc> format Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate JSON, which fights PaliGemma's detection prior and leaks <loc>-token salad at inference. Convert them to PaliGemma's native <locNNNN> vocabulary instead so the LM head reuses that prior. Training side (text_processor_pi052.py): a target turn whose content parses as a bbox/keypoint answer is rewritten to <loc> text, using the camera frame's native (H, W) from the observation and the preceding image block. Non-spatial answers, subtask/memory targets and SmolVLA2 keep their JSON form — the dataset stays backbone-agnostic. Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects <loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized [0,1] coords with a normalized flag; draw_vqa_overlay denormalizes against the chosen camera frame's pixel size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 20:23:46 +02:00
pepijn	e425dfd624	fix(processor): fallback to task message when recipe misses Keep action-only samples trainable by rendering the task as a low-level user message when no recipe branch matches. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 15:32:09 +00:00
Pepijn	15f79b5e5e	fix(pi052): supervise an EOS token at the end of each text target PI052TextTokenizerStep masked text_labels over the assistant turn's content only — the trailing newline was excluded and no EOS token was ever a supervised label. So the LM head was never given a stop signal: at inference select_message decoded to max_new_tokens, producing the runaway subtask paragraphs and the "}"}"}-style VQA tails. _format_messages now appends the tokenizer's EOS to each supervised target turn and extends that turn's span to cover it, so the EOS lands in text_labels. _shifted_ce then trains "<last content token> -> EOS" and the model learns to terminate; select_message stops on it. Inference callers (the runtime's _build_text_batch_pi052) pass no target_indices / eos_token, so no EOS is baked into the prompt — the model generates it. Verified end-to-end with the PaliGemma tokenizer: the supervised span is `<content><eos>` and the trailing newline stays unsupervised. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:22:22 +02:00
Roham Z. Nobari	dfdc48a7f1	fix(datasets): bound VideoDecoderCache to prevent OOM on large datasets (#3614) VideoDecoderCache used an unbounded dict keyed on absolute path, with no eviction in the standard LeRobotDataset path. With shuffled iteration over datasets that have many distinct mp4 files, every DataLoader worker accumulated one cached (VideoDecoder, fsspec file handle) pair per distinct path it had ever touched. Per-entry cost is ~3-5 MB of host RAM plus one open FD; at ~8 k entries this is roughly 30 GB per worker. This was hit in the wild during a SmolVLA training run on a 4,195-episode SO-101 dataset (8,390 mp4s, two cameras per episode). dmesg showed anon-rss climbing to 34.9 GB on a single pt_data_worker before the OOM killer fired ~30 min into training; with --num_workers=8 the per-worker peak halved to 17.9 GB, which is the expected inverse-scaling signature when the leak is per-decode and the workload is split across workers. The working workaround on the affected platform was --dataset.video_backend=pyav, because the pyav path opens/closes per call and never touches this cache. Switch the backing store to an OrderedDict and evict LRU entries when the cap is reached, closing the evicted file handle inside the lock so we do not leak FDs either. Default cap is DEFAULT_DECODER_CACHE_SIZE = 100, overridable via LEROBOT_VIDEO_DECODER_CACHE_SIZE or by passing max_size= to the constructor; max_size=None restores the legacy unbounded behaviour for callers that need it.
Validation on the original failing workload (decode_video_frames_torchcodec called over real mp4s from the affected SO-101 dataset): unbounded: 300 files -> +1087 MB host RSS, cache=300, still climbing; cap=50: 500 files -> +266 MB host RSS, cache=50, stable; cap=50: 2000 calls -> +312 MB host RSS, cache=50, stable; cap=100: 1000 calls -> +470 MB host RSS, cache=100, stable. Three independent seeded runs at cap=50 agreed to within 1% (263 / 266 / 265 MB delta), and the 2000-call multi-pass run shows RSS plateaus after the cap is reached instead of drifting. Tests in tests/datasets/test_video_decoder_cache.py cover: default-is-bounded, size cap, LRU ordering, FD close on eviction, FD close on clear(), cache-hit invariance, max_size=None fallback, and env-var override. No regressions in test_video_encoding.py, test_streaming.py, or test_dataset_reader.py (73 prior tests still pass alongside the 8 new ones).	2026-05-19 16:54:25 +02:00
四七	6a8878a639	fix(datasets): normalize shape=(1,) numeric values before HF encoding (#3344) * fix(datasets): normalize shape=(1,) numeric values before save * test(datasets): cover shape=(1,) int/bool and finalize Co-authored-by: Copilot <copilot@github.com>	2026-05-19 16:53:19 +02:00
Caroline Pascal	d38eb89f71	feat(video re-encoding): Adding utility and dataset edition tool for video re-encoding (#3611) * feat(utility): adding video re-encode utility * feat(edit): adding a new lerobot-edit-dataset tool to re-encode all the videos of a dataset * chore(format): formatting code * chore(review): fix Claude reviews * test(reencode dataset): adding missing test for reencode dataset	2026-05-19 14:46:14 +02:00
Pepijn	7ab4936b1b	Add extensive language support (#3467) * Add extensive language support * Address review: split persistent/event schemas, drop event timestamps - recipe.py: derive _VALID_ROLES/_VALID_STREAMS from MessageRole/MessageStream Literals - dataset_metadata.py: keep CODEBASE_VERSION at v3.0 - language.py: remove RESERVED_STYLES; split arrow/feature schemas into persistent (with timestamp) and event (without timestamp); add docstrings - language_render.py: events use frame-row timestamp implicitly; no per-event timestamp filtering or sorting - converters.py: drop unused subtask_key passthrough - add docstrings to new public APIs (recipe, render_messages_processor, collate) - update tests for split schemas; revert uv.lock Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add docstrings to all new helpers; revert uv.lock Covers private helpers in recipe.py, language.py, language_render.py, and render_messages_processor.py. Also reverts uv.lock to main (it was re-generated by `uv run` during local checks). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(language): add motion (persistent) and trace (event-only) styles Promote the previously-reserved motion/trace styles to first-class core styles. motion routes to language_persistent (it tracks robot state over time); trace routes to language_events (single-moment annotations). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(language): per-camera tagging on view-dependent styles Adds a nullable `camera` field to the language row struct (both persistent and event variants) so view-dependent styles like `vqa` can carry which `observation.images.` view they were grounded against. Without this, multi-camera datasets ended up with multiple `(vqa, role)` rows at the same timestamp that the resolver could not disambiguate.
- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS, to both Arrow struct types and the HF datasets feature mappings; introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus `is_view_dependent_style` and `validate_camera_field` helpers (camera required iff style is view-dependent). - `language_render.py`: thread an optional `camera=` kwarg through every resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through `_matching_rows` / `_select_`, so recipes can disambiguate per-camera VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`. Without a `camera` filter, multi-row matches keep raising the existing ambiguity error — which is the desired behaviour on multi-camera data. - `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying the matching image block), keeping the original 0.20 budget and documenting the customization point for datasets with different cameras. - Tests: schema test asserts the new field order; new tests cover `is_view_dependent_style`, `validate_camera_field` (both required and forbidden directions), per-camera `emitted_at` filtering, and the ambiguity error when two cameras emit `(vqa, assistant)` at the same timestamp without a `camera=` filter. RenderMessagesStep + dataset passthrough fixtures updated to include the new field. - `docs/source/language_and_recipes.mdx`: document the `camera` field, the per-camera resolver pattern, and the canonical recipe convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(language): drop motion from VIEW_DEPENDENT_STYLES Motion primitives are described in robot-frame (joint / Cartesian) terms, not pixel space, so they are camera-agnostic. Only `vqa` (event) and `trace` (event, pixel-trajectory) are view-dependent. 
The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry — the validator, resolver, and HF feature mapping behave identically across the two columns regardless of which styles populate `camera` today — but persistent rows now always have `camera=None` in practice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(language): task_aug style + automatic ${task} rephrasing rotation Adds task-prompt diversity (Xiao 2022 / CAST) without touching ``meta/tasks.parquet`` or forcing recipes to opt in. The plan reserved ``task_aug`` as a future style; this lands it now. - ``language.py``: add ``task_aug`` to ``CORE_STYLES`` and ``PERSISTENT_STYLES``. ``column_for_style("task_aug")`` returns ``language_persistent`` so PR 2 writers route it correctly. - ``language_render.py``: ``_resolve_task`` now consults the persistent slice for rows of ``style="task_aug", role="user"``. When any exist it picks one deterministically by ``sample_idx`` (blake2b-keyed, not Python's randomized hash) so an epoch sees every rephrasing of every episode while the same sample still resolves identically across reruns. Falls back to the canonical ``meta/tasks.parquet`` task when no rephrasings are present, so existing datasets and unannotated runs keep their behaviour. Explicit ``task=`` overrides still win. - Tests: rephrasing coverage across samples, determinism on repeat ``sample_idx``, fallback when persistent has no ``task_aug`` rows, and explicit override priority. Recipes get this for free: any ``${task}`` placeholder rotates through the available rephrasings. Recipes that want the literal canonical task can override the binding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(language): tool catalog in meta/info.json + LeRobotDatasetMetadata.tools Stores OpenAI-style function schemas at ``meta/info.json["tools"]`` so datasets can declare which tools are available (today: just ``say``; tomorrow: per-dataset extensions). 
The ``DEFAULT_TOOLS`` constant fills in for unannotated datasets so chat-template consumers don't have to special-case anything. Three pieces: - ``language.py``: ``SAY_TOOL_SCHEMA`` and ``DEFAULT_TOOLS`` constants. Single source of truth — PR 2's writer and PR 3's runtime tool registry will both import from here instead of duplicating the dict. - ``dataset_metadata.py``: ``LeRobotDatasetMetadata.tools`` property reads ``info.json["tools"]`` and falls back to ``DEFAULT_TOOLS``. Returns deep-copied dicts so callers can mutate the result safely. - ``docs/source/tools.mdx``: spec page covering the catalog, per-row invocations, and the three-step "how to add a new tool" workflow (declare schema, implement, register). Linked from the docs toctree under the Datasets section. This lays the groundwork for PR 2's pipeline writing the catalog out during annotation, and PR 3's ``src/lerobot/tools/`` package shipping runnable implementations (one file per tool — first up: ``say.py`` wrapping Kyutai's pocket-tts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Apply ruff and prettier formatting after merge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(language): unify resolver dispatch and prune redundant test scaffolding * Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`; only `emitted_at` actually consults events. The dispatcher in `_resolve_spec` now passes events conditionally. * Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a single `_row_sort_key` and drop the `sort_key` parameter from `_select_one`. Event rows lack `timestamp` (it is implicit in the frame) and now default to `0.0` for sort purposes — the `(style, role)` tiebreaker is unchanged. * Inline `_select_latest` into `active_at` (its only caller). * Collapse `emitted_at`'s dual-branch into one `_select_one` call. 
* Tighten `_validate_persistent_resolver` to a single `column_for_style(style) != LANGUAGE_PERSISTENT` check. * Parameterize `test_per_camera_blend_renders_both_views` over the two cameras and factor the sub-recipe builder into `_vqa_subrecipe` so the test no longer hand-rolls two near-identical recipe blocks. Net -98 LOC; behavior, public resolver names, and test expectations unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(language): always raise on ambiguous resolver matches `_select_one` previously skipped its ambiguity check whenever any of `role`/`tool_name`/`camera` was set, on the assumption that the caller had already pinned down a unique row. That left a real ambiguity hole for VQA: with two cameras emitting `(vqa, assistant)` at the same frame, `emitted_at(..., role="assistant")` silently picked the first sorted row instead of telling the recipe to add `camera=...`. The existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test already encoded the desired behavior. Tighten the check: any time `len(rows) > 1` we now raise with the selectors echoed back, so users see exactly which fields they passed and that more is needed to disambiguate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: fix CI — collapse short ValueError to one line, refresh uv.lock * `ruff format` on CI (newer version) wants the short `camera=None` ValueError on a single line. * `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0` pin (and picked up upstream `s390x` marker fixes for cuda packages). CI runs `uv sync --locked` which rejected the divergence. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(language): keep base install green — drop processor re-export, gate dataset-extra tests `lerobot.processor` re-exported `RenderMessagesStep` at the package level, so importing anything from `lerobot.processor` pulled in `lerobot.datasets.language` → `lerobot.datasets/__init__.py` → `require_package("datasets")`, which fails in the Tier 1 base install that intentionally omits the `[dataset]` extra. The chain bricked collection for unrelated suites (`tests/policies/pi0_pi05/...`, `tests/envs/...`, etc.). * Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The only consumer (the test) already imports from the submodule. Document the deliberate omission in the module docstring. * Add `pytest.importorskip("datasets", ...)` (and `pandas` where needed) at the top of the four PR-added tests that exercise the language stack: - tests/datasets/test_language.py - tests/datasets/test_language_render.py - tests/processor/test_render_messages_processor.py - tests/utils/test_collate.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(language): address review — tools accessor, motion docs, conditional collate * `meta.tools` actually reads `info.json["tools"]`. `DatasetInfo` had no `tools` field, so `from_dict` silently dropped the key (it warned about unknown fields then discarded them) and the property always returned `DEFAULT_TOOLS`. Added `tools: list[dict] | None` to the dataclass; `to_dict()` drops it when unset so existing datasets keep a clean `info.json`. Fixed the accessor to read `self.info.tools` (the previous `.get(...)` would have raised AttributeError on the dataclass anyway). Added regression tests: fallback when absent, round-trip from disk, and round-trip through `DatasetInfo.from_dict` / `to_dict`. * `motion` is not view-dependent — fix the docs.
The mdx claimed rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES = {"vqa", "trace"}` and the validator agrees: motion primitives are joint/Cartesian-frame, not pixel-space. Updated both call-out paragraphs in `language_and_recipes.mdx`. * Conditional `collate_fn` swap. Added `meta.has_language_columns` and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it, so non-language datasets keep PyTorch's `default_collate`. Also added a pass-through test in `test_collate.py` that asserts on a plain tensor batch the custom collate matches `default_collate` key-for-key, plus a test for the `None`-sample drop path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * review: dedupe regex, centralize column names, harden collate, more tests * #2 — dedupe `_PLACEHOLDER_RE`. The same regex was compiled in `recipe.py` and `language_render.py`. Promote to module-level `PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares template syntax) and import from `language_render.py`. * #3 — centralize language column names. `io_utils.py` had hardcoded `{"language_persistent", "language_events"}` literals at two sites. Replace with `LANGUAGE_COLUMNS` import so a future column rename can't silently desync. * #4 — defensive collate preserved-keys. `lerobot_collate_fn` silently filtered language fields from samples that didn't have them, which would hand downstream consumers a preserved list shorter than the tensor batch. Now: if any sample carries a key, every sample in the batch must carry it; otherwise raise a `ValueError` so the upstream rendering bug surfaces at the boundary. * #5 — `_scalar` rejects non-singleton lists. Previously a zero- or multi-element list fell through and triggered confusing `float([])` errors downstream. Now raises `ValueError` with the actual length. * #6 — refactor `_extract_complementary_data`. Replace 11 lines of `key = {... if ... 
else {}}` plus an 11-line splat dict with a single `_COMPLEMENTARY_KEYS` tuple iterated once. * #7 — document `EXTENDED_STYLES`. Was an empty `set()` with no comment. Add a docstring explaining it's an intentional extension point: downstream modules append project-local styles before `column_for_style` is called. * #9 — `tools.mdx` notes the runtime layer is future work. The page referenced `src/lerobot/tools/`, `registry.py`, and `get_tools(meta)` — none exist in this PR. Added a callout at the start of "How to add your own tool" plus a note on the implementations paragraph. * #10 — tests for YAML round-trip, malformed rows, blend validation. `test_recipe.py` grew from 1 case to 12 covering: blend-or-messages exclusivity, target-turn requirement, blend emptiness, weight presence/positivity, nested-blend rejection, `from_dict` with nested blends, `from_yaml` / `load_recipe` agreement, top-level non-mapping rejection. Added a malformed-row test for `_normalize_rows` that asserts non-dict entries raise `TypeError`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * review: emitted_at uses 0.1s tolerance; MessageTurn requires stream at construction * Float tolerance in `emitted_at` for persistent styles. The ``_timestamp(row) == t`` exact-equality check silently missed any caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``) even though the parquet timestamp would only differ by ULPs. Added ``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance`` instead, with a docstring explaining why exact equality wasn't enough and why 0.1 s is safe at typical 30–100 Hz control rates. Test asserts the new behavior at half-window (matches) and double-window (no match) using the constant so it stays in sync. * `MessageTurn.stream` is required at construction. 
It was typed ``MessageStream | None = None`` so YAML could omit ``stream:`` and pass the dataclass invariant — but ``_validate_rendered`` rejected ``None`` streams later, surfacing the error at the first sample instead of at recipe load. Now ``__post_init__`` raises ``ValueError`` if ``stream`` is ``None``, with the list of valid streams in the message. The redundant late-stage check in ``_validate_rendered`` is replaced with a one-line comment that cites the upstream invariant. Test pins the new construction-time rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(tools): drop follow-up-PR references Reword the two callouts in `tools.mdx` to describe the runtime layer in present tense ("not part of the catalog layer shipped today", "those modules don't yet exist in the tree") instead of pointing at a specific follow-up PR. Keeps the doc honest about what works now without coupling it to a particular release order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * review: address CarolinePascal feedback - language timestamps: float64 -> float32 to match LeRobotDataset frame timestamps (Arrow struct + HF feature) - dataset_metadata: hoist `.language` imports to module top — language.py has no lerobot imports, so there is no circular-import risk - dataset_metadata: add a `meta.tools` setter that persists the catalog to info.json and reloads `meta.info` - feature_utils: validate the `language` dtype instead of returning "" — warn (non-fatal) when a non-empty value is written at record time - centralize the scalar-unwrap helper as `lerobot.utils.utils.unwrap_scalar`, shared by render_messages_processor and language_render - docs: move `## Layer 2 — recipe anatomy` ahead of the resolver sections, which describe recipe bindings rather than dataset layout - language_render: note in EMITTED_AT_TOLERANCE_S that persistent rows change on a human-action timescale, not the camera frame rate Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:46:11 +02:00
pepijn	2ea0da2d9f	fix(annotate): tag uploaded dataset revision Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 12:44:35 +00:00
Pepijn	725ac95b0d	feat(runtime): make the interactive runtime drive PI052 too The runtime's text path was hard-wired to SmolVLA2: _build_text_batch read policy.config.vlm_model_name (which PI052Config doesn't have) and built a SmolVLM2 chat-template prompt. PI052/PaliGemma is not chat-pretrained and trains on a flat `User: ... \nAssistant: ...` prompt, so the runtime crashed or fed an out-of-distribution prefix. - _build_text_batch now dispatches on policy.config.type: smolvla2 -> chat template (renamed _build_text_batch_chat); pi052 -> flat role-prefixed text via PI052TextTokenizerStep's own _format_messages / _strip_blocks / _flatten_say_tool_calls, so the inference prefix matches PI052 training exactly. - Add a lerobot-pi052-runtime entry point (alias of the same main; the policy type is read from the checkpoint) so the command name isn't misleading. argparse prog now defaults to the invoked command name. PI052's select_message / predict_action_chunk already work with the runtime; this was the one SmolVLA2-only coupling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:28:55 +02:00
Pepijn	7b64e5498d	revert(annotate): move memory + speech prompts to base PR (#3471) The first-person memory narrative, task-rephrasing and initial-speech prompt tweaks belong in the annotation pipeline itself. Applied to feat/language-annotation-pipeline (#3471); reverting them here to the merge-base so they drop out of this PR's diff. general_vqa.py keeps its docstring fix since it references a recipe this PR introduces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:17:52 +02:00
Pepijn	134a707c7a	feat(annotate): first-person memory narrative + shorter speech prompts - module_1_memory: rewrite as an explicit first-person, past-tense narrative ("I picked up...", "I opened...") matching the MEM (Torne 2026) running-memory style, instead of "one or two short sentences" with no person/tense guidance. - module_1_task_rephrasings: bias rephrasings toward short imperative. - module_2_initial_speech: prefer very short robot acknowledgements. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:17:30 +02:00
Pepijn	182f10184f	revert(annotate): move pipeline changes to base PR (#3471) The deterministic-plan rewrite, single-frame VQA (K 3->1), dataset version tagging, telegraphic-subtask prompt and shorter interjection prompt belong in the annotation pipeline itself, not in the SmolVLA training PR. They have been applied to feat/language-annotation-pipeline (#3471). Reverting these six files here to the merge-base so they drop out of this PR's diff; #3491 will inherit the canonical versions when it next rebases on its base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:07:23 +02:00
von Neumann 101	ca8c60a0ed	Set OpenCV fourcc after size and fps (#3620) * Set OpenCV fourcc after size and fps * Set OpenCV fourcc last on Windows * Add comment explaining DSHOW fourcc ordering	2026-05-19 14:06:41 +02:00
Pepijn	ce47075d6b	feat(annotate): deterministic plan, single-frame VQA, dataset tagging Port the steerable-pipeline refinements developed on feat/smolvla-on-steerable back into the annotation pipeline itself: - module_1_subtasks: imperative verb-first telegraphic labels with a consistent-object-noun rule and good/bad examples (no hard word cap). - _generate_plan: drop the VLM round-trip; the plan is now a deterministic numbered list of still-todo subtasks, re-emitted at every subtask boundary so it shrinks as work progresses. Removes module_1_plan.txt. - VqaConfig.K 3 -> 1: a VQA pair anchors exactly its emission frame, no stale-label temporal smear. - lerobot-annotate: tag the pushed dataset with its codebase_version so LeRobotDataset can resolve a revision and load it. - module_2_interjection: shorter, more natural mid-task cues. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:06:15 +02:00
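The deterministic plan described in the commit above can be illustrated with a minimal sketch. This is not the repository's `_generate_plan`; the function name `render_plan`, its arguments, and the choice to restart numbering at 1 on every re-emission are assumptions made for illustration only.

```python
def render_plan(subtasks, completed):
    """Return a numbered list of the subtasks that are still to do.

    `subtasks` is the ordered list of subtask labels for an episode and
    `completed` is how many of them are already finished. Both names are
    hypothetical; restarting the numbering at 1 is an assumption.
    """
    remaining = subtasks[completed:]
    return "\n".join(f"{i}. {label}" for i, label in enumerate(remaining, start=1))


subtasks = ["pick up the orange", "place the orange in the bowl", "close the drawer"]

# Re-emit the plan at every subtask boundary: it shrinks as work progresses.
plans = [render_plan(subtasks, done) for done in range(len(subtasks))]
```

Because the plan is a pure function of the subtask list and the current boundary, no model call is needed to produce it and the same inputs always give the same text.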
Pepijn	26013da699	feat(annotations): enforce imperative verb-first subtask phrasing Rewrite module_1_subtasks prompt to produce short imperative commands ("pick up the orange") instead of third-person narration ("the robot arm moves to the orange"). Drops the verbose "how, not what" rule and adds a good/bad few-shot table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 13:53:20 +02:00
pepijn	bb31988915	fix(pi052): pass 4d masks to prefix-only forwards Convert PI052 prefix-only attention masks before calling PaliGemma so text-only batches and generation use the same mask shape as fused training. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 21:07:13 +00:00
pepijn	2629175d2d	fix(pi05): use fused AdamW by default Route full PI05/PI052 fine-tuning through PyTorch's fused AdamW path to avoid the single-tensor Adam denominator allocation near GPU memory limits. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 19:23:17 +00:00
pepijn	2b4c5f49e3	fix(pi05): disable foreach AdamW by default Avoid the multi-tensor AdamW temporary that can OOM full PI05/PI052 fine-tuning near GPU memory limits. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 18:58:17 +00:00
pepijn	22c9c4905e	fix(pi052): avoid dense CE over padded tokens Select only supervised text and FAST action-code positions before cross-entropy to avoid full-vocabulary loss tensors over padded sequences. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 18:40:34 +00:00
pepijn	7960cc14ec	fix(pi052): call policy preprocessing helpers Use PI05Policy helpers for action padding and image preprocessing in PI052 fused losses instead of looking them up on the inner PI05Pytorch module. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 17:52:47 +00:00
Pepijn	3c15fd8537	feat(robots): natively integrate Seeed Studio reBot B601-DM arm (#3624) * feat(robots): natively integrate Seeed Studio reBot B601-DM arm Add first-class LeRobot support for the Seeed Studio reBot arm, replacing the out-of-tree `lerobot-robot-seeed-b601` / `lerobot-teleoperator-rebot-arm-102` plugin packages. New devices: - robot `rebot_b601_follower` — single-arm B601-DM follower (6-DOF + gripper, Damiao CAN motors via `motorbridge`) - robot `bi_rebot_b601_follower` — bimanual follower composing two single arms - teleoperator `rebot_102_leader` — single-arm StarArm102 / reBot Arm 102 leader (FashionStar UART servos via `motorbridge-smart-servo`) - teleoperator `bi_rebot_102_leader` — bimanual leader composing two single arms The bimanual variants reuse the single-arm classes and namespace each arm's observation/action keys with `left_` / `right_` prefixes, so a bimanual StarArm102 leader can teleoperate a bimanual reBot B601 follower. Optional SDK imports are guarded; a `rebot` extra installs `motorbridge` and `motorbridge-smart-servo`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add reBot B601-DM calibration & dual-arm teleoperation guide Add docs/source/rebot_b601.mdx covering single-arm and bimanual calibration and teleoperation for the reBot B601-DM follower and reBot Arm 102 leader, with zero-position reference images from the Seeed Studio wiki. Register the page in the docs toctree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: fix reBot B601 MDX build (move JSON example out of <Tip>) The doc-builder parses `{...}` inside MDX component children as a Svelte expression, so the joint_directions JSON example broke the build. Move it into a top-level fenced code block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: apply prettier formatting to reBot B601 page Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: remove duplicate colocated reBot B601 page docs/source/rebot_b601.mdx is the canonical, toctree-registered page; the colocated rebot_b601.md was a redundant thinner copy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify 6-DOF leader fallback comment in reBot B601 follower Explain that holding wrist_yaw at zero is what lets a 6-DOF leader (e.g. so100_leader / so101_leader) teleoperate the 7-DOF follower. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: address Caroline's PR review on reBot B601 integration - leader: remove _validate_config (no other lerobot device validates its config; a key mismatch now surfaces as a plain KeyError) - leader: simplify _round_to_valid_range to direct modular arithmetic instead of a bidirectional search loop - leader: inline the single-use _clamp helper - follower & leader: write MotorCalibration range_min/range_max from the configured joint_limits / joint_ranges instead of a fixed [-90, 90] - docs: add a "Find the USB ports" section (lerobot-find-port) and move the brltty/permissions tip there; link the OpenArm page for SocketCAN adapter configuration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 19:49:21 +02:00
pepijn	1750a87104	fix(pi052): handle batched rendered messages Tokenize batched recipe outputs in PI052 so training batches with nested message lists do not crash before model forward. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 17:41:58 +00:00
pepijn	0e2dc1b76f	fix(pi052): supervise only FAST action-code tokens Mask the FAST auxiliary loss to discrete action-code tokens so wrapper formatting tokens do not affect action co-training. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 17:38:34 +00:00
Pepijn	474c5478d9	tune(annotations): VQA emission anchors a single frame (K 3 -> 1) Module 3 anchored each VQA emission tick to K=3 consecutive frames (~0.1s at 30fps). The VLM grounds the answer — bbox/keypoint coordinates especially — against the first frame's image, so copying it onto frames 2-3 smears a stale label over a moving scene. Default K=1: a VQA pair lands on exactly its emission frame, no temporal smear. VQA frames get sparser; the WeightedEpisodeAwareSampler (vqa_target_fraction) is the knob to compensate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:24:36 +02:00
Pepijn	f72b28738a	fix(annotate): default keyframe decode to ffmpeg CLI (thread-safe) The decoder chain tried torchcodec first, then ffmpeg. torchcodec is not thread-safe: under the executor's 16-wide concurrent decode in the interjections phase it SIGSEGVs (exit 139) before the ffmpeg fallback is ever reached — uncatchable, so it kills the whole job. Default the auto chain to ffmpeg only. Per-frame ffmpeg decode runs in an isolated child process: crash-safe and concurrency-safe (the plan phase already proved 16 parallel ffmpeg subprocesses are fine). torchcodec / pyav remain available via an explicit video_backend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:40:29 +02:00
Pepijn	1bd53cc7da	fix(annotate): decode keyframes via ffmpeg CLI fallback PyAV segfaulted (exit 139) decoding the AV1 streams modern LeRobot datasets use — a SIGSEGV that the per-episode try/except cannot catch, killing the whole job when the interjections phase started. Replace the PyAV fallback with _decode_frames_ffmpeg, which shells out to the ffmpeg CLI: a full ffmpeg build decodes AV1, and a child-process crash is a catchable non-zero exit rather than a segfault. Decoder chain is now torchcodec -> ffmpeg. _decode_frames_av stays available behind video_backend="pyav" for callers that want it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:08:31 +02:00
Pepijn	0f5f0e4091	refactor(recipes): rename recipes, drop pi05_hirobot - hirobot.yaml -> subtasks_vqa.yaml - hirobot_memory.yaml -> subtask_mem_vqa_speech.yaml - pi05_hirobot.yaml -> deleted (stale: uses plan, top-camera names; superseded by the two recipes above) - smolvla2_hirobot.yaml -> deleted (was untracked stale junk) Updated the smolvla2 / pi052 `recipe_path` config defaults, all docstring / comment references, the annotation-pipeline + recipe docs, and the three tests that loaded pi05_hirobot.yaml (repointed to the renamed recipes; the low-level-branch and pipeline-render assertions now accept a flow-only `low_level` stream as valid supervision, since the new recipes' low_level_execution has no text-CE target). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:02:15 +02:00
Pepijn	7128bb1769	fix(annotate): decode keyframes via PyAV directly The pyav fallback routed through lerobot's decode_video_frames(backend= "pyav"), which uses torchvision.io.VideoReader — removed in torchvision 0.23+. On modern torch stacks (e.g. vllm-openai with torchvision 0.26) both torchcodec and that path fail, leaving interjection/vqa prompts without visual context. Add _decode_frames_av: a self-contained PyAV decoder that picks the nearest frame per timestamp. It is the always-available tail of the decoder chain (torchcodec -> pyav) and the target of --video_backend=pyav. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:45:04 +02:00
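The "nearest frame per timestamp" selection mentioned in the commit above can be sketched without any video library. This is not the repository's `_decode_frames_av`; the helper name `nearest_frame_indices` and the assumption that frame presentation times are already available as a sorted list of seconds are illustrative only.

```python
from bisect import bisect_left


def nearest_frame_indices(frame_times, query_times):
    """For each query timestamp, return the index of the closest frame.

    `frame_times` must be sorted ascending (seconds). Ties go to the
    earlier frame. Both argument names are hypothetical.
    """
    if not frame_times:
        raise ValueError("frame_times must not be empty")
    indices = []
    for t in query_times:
        pos = bisect_left(frame_times, t)
        if pos == 0:
            indices.append(0)  # query is before the first frame
        elif pos == len(frame_times):
            indices.append(len(frame_times) - 1)  # query is past the last frame
        else:
            before, after = frame_times[pos - 1], frame_times[pos]
            indices.append(pos if after - t < t - before else pos - 1)
    return indices


frame_times = [0.0, 0.5, 1.0, 1.5]
picked = nearest_frame_indices(frame_times, [0.1, 0.8, 2.0])
```

Clamping out-of-range queries to the first or last frame is a design choice of this sketch; a real decoder might instead raise or skip such timestamps.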
Pepijn	426d48dbbf	fix(pi052): port the smolvla2 text-head fixes to pi052 pi052 had the same text-CE collapse bug smolvla2 had — PaliGemma's embed_prefix flags the language block att=0, so make_att_2d_masks makes it fully bidirectional and the text cross-entropy degenerates into a copy task. Ported the three model-specific fixes: - _mark_target_span_causal: set att=1 on supervised target language positions so the text-CE is genuine causal next-token prediction. Applied in both _compute_all_losses_fused and _compute_text_and_fast_loss. - flow_loss_weight 10.0 -> 5.0: the paper's a=10 swamps the LM head once the flow-only low_level recipe fires often (matches SmolVLA2Config). - _flatten_say_tool_calls in the text tokenizer: serialize `say` tool calls into a <say>...</say> marker so the spoken reply is tokenized and supervised (PaliGemma's flat prompt has no structured calls, so they were dropped entirely). select_message needed no change: pi052's prefix is [images, language] with no trailing state token, so it already decodes from the last language token. Regression tests mirror the smolvla2 attention-masking + tool-call suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:42:19 +02:00
Pepijn	fbcb9225f5	feat: oversample sparse VQA annotations (recipe consumption + weighted sampler) VQA annotations are sparse, so VQA was badly underrepresented in training: its effective share was weight x density, and blend draws that picked an ask_vqa* sub-recipe for a non-VQA frame were wasted entirely. Two pieces: 1. Recipe-side consumption (language_render.py): render_sample now routes any frame that carries a VQA annotation to a matching ask_vqa* sub-recipe, regardless of the weighted blend draw. No VQA annotation is wasted and no draw lands on a non-renderable VQA recipe — VQA's recipe-side share now equals the VQA-annotation density. 2. Dataset-side oversampling (WeightedEpisodeAwareSampler + vqa_target_fraction): a new weighted, episode-aware sampler draws frames with replacement by per-frame weight. When TrainPipelineConfig.vqa_target_fraction is set, the train script scans language_events, weights VQA frames so they make up ~that fraction of the training stream, and uses the weighted sampler. This is what actually lets VQA exceed its natural density. Default None keeps uniform episode-aware sampling unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:30:00 +02:00
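The dataset-side oversampling in the commit above comes down to one piece of arithmetic: if non-VQA frames keep weight 1 and each VQA frame gets weight w, the expected VQA share of draws is w·n_vqa / (w·n_vqa + n_other), so hitting a target fraction f requires w = f·n_other / ((1 − f)·n_vqa). The sketch below shows only that calculation with sampling with replacement. It is not the repository's `WeightedEpisodeAwareSampler`, it ignores episode boundaries, and the name `vqa_frame_weights` is hypothetical.

```python
import random


def vqa_frame_weights(is_vqa, target_fraction):
    """Per-frame weights so VQA frames make up ~target_fraction of draws.

    `is_vqa` is a list of booleans, one per frame. Non-VQA frames keep
    weight 1.0. Hypothetical helper, not the library API.
    """
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    n_vqa = sum(is_vqa)
    n_other = len(is_vqa) - n_vqa
    if n_vqa == 0 or n_other == 0:
        return [1.0] * len(is_vqa)  # nothing to rebalance
    w = target_fraction * n_other / ((1.0 - target_fraction) * n_vqa)
    return [w if flag else 1.0 for flag in is_vqa]


# 1000 frames with a natural VQA density of 5%, rebalanced to ~25%.
is_vqa = [i % 20 == 0 for i in range(1000)]
weights = vqa_frame_weights(is_vqa, target_fraction=0.25)
expected_share = sum(w for w, flag in zip(weights, is_vqa) if flag) / sum(weights)

# Draw frame indices with replacement by weight, as a weighted sampler would.
rng = random.Random(0)
draws = rng.choices(range(len(is_vqa)), weights=weights, k=10_000)
observed_share = sum(is_vqa[i] for i in draws) / len(draws)
```

Sampling with replacement is what allows the VQA share to exceed its natural density: the same annotated frame can appear several times in one pass.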
Pepijn	31e0c15e55	fix(annotate): pyav fallback when torchcodec keyframe decode fails VideoFrameProvider decoded keyframes via torchcodec only. Some containers (e.g. vllm-openai) ship a torchcodec that cannot push packets to the decoder ("Operation not permitted"), silently degrading interjection/vqa prompts to no visual context. _decode now retries with pyav when the default backend raises, and a new `video_backend` config field lets callers pin the backend explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:23:53 +02:00
Pepijn	c5676ef1b3	feat(annotate): add dest_repo_id for separate push target Adds an optional `dest_repo_id` to AnnotationPipelineConfig. When set, `push_to_hub` uploads the annotated dataset there instead of overwriting the source `repo_id`, restoring separate source/destination repos. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:05:23 +02:00
Quentin Lhoest	5ebbdf3d05	Mention the new Lance LeRobotDataset implementation in the docs (#3609) * Enhance documentation with Lance format details Added information about Lance format and `lerobot-lancedb` package for multimodal AI datasets. Signed-off-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>	2026-05-18 14:51:26 +02:00
Pepijn	b319ccf688	fix(smolvla2): only prompt for a camera when a VQA overlay is drawn The VLM already sees every camera, so the operator never needs to name one to ask a question. Move the camera prompt to after generation and only fire it when the answer actually carries a bounding box / point (whose pixel coordinates are camera-specific and need a target frame). Non-spatial answers (count / attribute / spatial / plain text) now skip the prompt entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:50:19 +02:00
Pepijn	3174e14bc0	fix(smolvla2): feed all cameras to VQA generation, not just the chosen one handle_vqa_query filtered the observation down to the single chosen camera before calling the VLM. But training feeds every camera: the ask_vqa_* recipes' image blocks are stripped before tokenization and the frames reach the model via OBS_IMAGES_*, where embed_prefix consumes all config.image_features regardless of the per-camera recipe tag. Filtering to one camera changed the image-token count in the prefix (the dropped camera zero-padded with mask=0) — a prefix shape the model never saw at training. Now the full observation is passed to select_message; the chosen camera is used only to pick which frame the bbox/point overlay is drawn on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:38 +02:00
Pepijn	dc530e10fe	feat(smolvla2): VQA example prompts in the panel; drop quotes from hints Command arguments never needed quotes (`_strip_quotes` only strips a matching pair if present) — `/question point to the yellow cube` works. The hints wrongly implied `""` were required; all hints/help now show `/action <task>` / `/question <text>`. Also adds a reference line to the state panel showing the two overlay-producing VQA prompt shapes: /question point to the yellow cube -> point overlay /question detect the blue cube -> bounding-box overlay plus the same examples in /help. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:42:32 +02:00
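The quoting behaviour described in the commit above (strip a matching pair of quotes if present, otherwise leave the argument alone) can be sketched as follows. The repository's `_strip_quotes` is not shown in this log, so the function below, named `strip_matching_quotes`, is an assumed reimplementation for illustration and may differ in detail, for example in which quote characters it accepts.

```python
def strip_matching_quotes(text):
    """Remove one pair of surrounding quotes only if both ends match.

    Hypothetical stand-in for the runtime's `_strip_quotes`; accepting both
    single and double quotes is an assumption of this sketch.
    """
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return text


bare = strip_matching_quotes("point to the yellow cube")
quoted = strip_matching_quotes('"point to the yellow cube"')
unbalanced = strip_matching_quotes('"detect the blue cube')
```

With this behaviour `/question point to the yellow cube` and `/question "point to the yellow cube"` yield the same argument, which is why the hints no longer need to show quotes.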
Pepijn	e7c5613a39	refactor(smolvla2): command-driven runtime — no startup prompts Replace the startup mode prompt + task picker with a single command-driven prompt. The runtime now comes up immediately at the command line in `paused` mode (robot idle) and the operator drives it: /action "task" run the robot on a task (bare = resume, number = timed burst) /pause stop the action loop — robot holds position /question "..." pause and answer one VQA question (camera prompt + overlay) /help / stop - Removed _select_mode_interactively / _select_task_interactively / _dataset_task_strings (the interactive pickers). - mode value renamed "question" -> "paused"; --mode choices are now action|paused (default paused). - /question takes the question inline and runs it via _handle_slash_command (pauses first, so the policy isn't used concurrently). - The ENTER-to-start gate only fires when starting in action mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:37:51 +02:00
Pepijn	516ffc7687	feat(smolvla2): --mode flag, skip task picker with --task, timed /action Lets the operator skip the interactive startup entirely and go straight to the command line: - New --mode {action,question} arg; when given, the startup mode prompt is skipped. - When --task is passed explicitly on the CLI, the startup task picker is skipped (the dataset-bootstrap task still shows the picker so you can override it). Also adds a timed action burst: /action <seconds> runs the robot for N seconds, then the autonomous loop auto-reverts to question mode and clears the action queue. Plain /action stays unlimited. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:26:12 +02:00
Pepijn	7a68bf13d9	feat(recipes): add hirobot_memory — hirobot + memory + spoken tool-call replies New recipe alongside hirobot.yaml (kept as the lean baseline). Superset that adds two text-supervised sub-recipes: - memory_update: compress progress into a memory note. - user_interjection_response: reply to a user interjection with a `say` tool call only (no plan/subtask text). The SmolVLA2 chat tokenizer flattens the call to a `<say>...</say>` marker the runtime parses back. Plan is intentionally omitted; memory is the only persistent high-level state. Weights: low_level 0.40, subtask 0.25, memory 0.10, interjection 0.10, vqa 0.075 x2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:21:41 +02:00
Pepijn	15229468d0	feat(smolvla2): startup mode prompt; rename /vlm mode to /question Add a mode prompt at startup, shown before the task picker, so the operator chooses action (run the robot) vs question (VQA only) up front instead of having to discover /vlm mid-run. Also rename the VQA mode from "vlm" to the clearer "question": - state["mode"] value is now "action" | "question" - the command is /question (/vlm and /vqa kept as aliases) - panels, hints and help text updated to match handle_vqa_query now reports via both push_log and direct stdout, so VQA answers / overlay paths are visible in autonomous question mode where the panel redraw is suspended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:17:03 +02:00
Pepijn	a9cea3e8dd	fix(smolvla2): make the autonomous REPL usable for slash commands / VQA The autonomous panel redraw cleared the screen every 0.5s, so the "> " prompt and the one-shot command hint vanished — the operator could not see what to type or what they were typing, making /vlm unreachable. - Suspend the timer redraw entirely while in /vlm mode (the action loop is paused, nothing changes in the background) so the VQA question and camera prompt stay on a stable screen. - Re-print the "> " prompt after each redraw so it is always visible. - Show an always-on command hint in the panel (/vlm, /help, /action) instead of relying on the startup line that scrolls away. - Redraw immediately after a slash command so the mode flip is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:10:13 +02:00

1746 Commits