mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-16 00:59:46 +00:00
01e2228b242a094c216aa986efec9ee104ef9d79
1511 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
01e2228b24 |
feat(smolvla2): per-component prompt dropout + augmented training script
Two complementary regularisers to attack the ``text_loss=6e-6 = memorised one dataset`` failure mode that's making the model collapse on real-robot input: 1. **Per-component prompt dropout** (Pi0.7 §V.E / plan's ``feat/pi05-prompt-dropout`` follow-up). ``SmolVLA2ChatTokenizerStep`` gains ``plan_dropout_prob`` / ``memory_dropout_prob`` / ``subtask_dropout_prob`` knobs (default 0.0 — opt-in). At training, non-target messages whose rendered content starts with ``Plan:`` / ``Memory:`` / ``Current subtask:`` etc. are dropped with their respective probability before tokenisation, with a deterministic per-sample RNG keyed off the dataset ``index``. ``target_message_indices`` is re-mapped so the supervision still lands on the right turn. Forces the model to handle missing plan/memory/subtask context — directly attacks the real-robot collapse where a stale or empty plan field puts the prompt OOD. Surfaced on ``SmolVLA2Config`` as three floats so they're ``--policy.<knob>=<value>``-controllable from the train CLI; plumbed through ``make_smolvla2_pre_post_processors``. 2. **Image augmentation** is already wired in lerobot via ``--dataset.image_transforms.enable=true`` (torchvision v2 ColorJitter + SharpnessJitter + RandomAffine, default 3 of 6 sampled per frame). No code change needed — just a CLI flag. ``examples/training/smolvla2_hirobot.slurm`` shows the full training command with both enabled. Drop-in replacement for the ad-hoc SLURM script Pepijn was using locally; same args, plus the three dropout probs and the image-transforms flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c36de3a3e8 |
fix(smolvla2): enqueue full chunk via predict_action_chunk
``LowLevelForward`` was calling ``select_action()`` once per ``chunk_hz`` tick. SmolVLA's ``select_action`` is a thin queue-pop: it returns one action per call and only re-runs the expensive flow-matching forward when its private internal queue empties. Result: we got one action back per chunk_hz tick (1Hz default), ``DispatchAction`` at ctrl_hz=30 popped it instantly, then queue sat empty for ~1s waiting for the next tick. Net throughput was 1 dispatched action/sec instead of the 30 we wanted. Switch to ``predict_action_chunk`` and enqueue every step of the returned ``(batch, n_action_steps, action_dim)`` chunk. Refresh only when the queue is below half a chunk so we don't burn one flow-matching forward per chunk_hz tick — saves ~5x inference cost on this hot path. At ctrl_hz=30, chunk_size=50, the queue drains in ~1.7s before the next refresh, giving smooth dispatch at the control rate the robot was trained on. Side effect: ``state['last_chunk_size']`` records how many actions the most recent chunk produced — useful for the panel later if we want to surface "chunks generated" alongside "dispatched". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
cbfaf2c544 |
feat(smolvla2): action-dispatch counter + tighter gibberish filter
Real-robot run was unreadable for two reasons:
1. The panel surfaced ``queued actions: 0`` (always zero — dispatch
pops faster than chunk_hz generates) and gave no signal that
actions were actually reaching the robot. The only sign of life
was the safety-clamp warning lines scrolling past.
2. The text head consistently collapses to ``the`` / ``Ass``
fragments on real-camera input (memorisation wall). The old
gibberish filter caught ``":":":"`` JSON salad but let
single-token fragments through, and the ``[info] subtask gen
produced no text this tick`` line flooded the panel every second.
Changes:
* ``DispatchAction`` bumps ``state["actions_dispatched"]`` each
tick; panel renders it next to queue depth. Operator can see
the policy IS issuing actions even when text is broken.
* ``_looks_like_gibberish`` now also rejects:
- too few unique alphabetic tokens (``the``, ``the the``, ...)
- chat-template marker leakage (``Assistant:``, ``Ass\\n::``)
catching the actual failure mode on real-robot frames.
* Gibberish rejections log only the first occurrence + every 30th
after that, with a count, so the panel stays legible.
* Empty completions no longer log at all (was every tick).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d0278ea093 |
feat(smolvla2): render state panel in autonomous mode too
Dry-run REPL had a clean ANSI-clear-+-rich-panel layout via ``_redraw`` showing task / subtask / plan / memory / queued-actions / pending-tool-calls; autonomous mode just had bare ``> `` plus log lines scrolling past the user. Same data, two presentations. Extract ``_make_state_panel_renderer(runtime, mode_label=...)`` and use it from both ``_run_repl`` (called per user input) and ``_run_autonomous`` (called both on user input *and* on a 0.5s background timer so subtask / plan / memory refreshes from the runtime's own loop become visible without the user typing anything). Title bar shows ``dry-run`` vs ``autonomous`` so it's obvious which mode you're in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
15f6b08b0e |
fix(smolvla2): use canonical _strip_lerobot_blocks for inference msgs
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects:
* ``list[block]`` → keep text blocks, drop images
* ``None`` → ``[{type:text, text:""}]``
* ``str`` / other → ``[{type:text, text:str(content)}]``
Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there's no prior
memory, which then renders as no-content / role-marker-only and the
model hallucinates ``Assistant:`` fragments. Subtask gen got further
because its prompt always has at least the task string.
Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fc715db4a3 |
fix(smolvla2): coerce str content to list-of-blocks for chat template
SmolVLM's chat template (and many other multimodal templates) declares
``message['content']`` as a list of typed blocks and iterates it
expecting dicts with a ``'type'`` field:
{% for line in message['content'] %}
{% if line['type'] == 'text' %}{{ line['text'] }}
{% elif line['type'] == 'image' %}{{ '<image>' }}
{% endif %}
{% endfor %}
When the caller passes ``content`` as a plain ``str`` (which we did
throughout ``_msgs_for_subtask`` / ``_msgs_for_memory`` etc.), Jinja
silently iterates the string character-by-character. ``'P'['type']``
returns nothing; neither branch fires; *no text tokens get emitted*.
The model receives a prompt containing only role markers
(``User:<end_of_utterance>\nAssistant:``) and predictably continues by
emitting ``Assistant:`` fragments — the gibberish ``subtask: Ass\n::``
on the runtime panel.
Before calling ``apply_chat_template``, walk the messages and rewrite
any string ``content`` into ``[{'type': 'text', 'text': content}]``.
The template's text branch then fires correctly and the model sees
the actual user/assistant text, not just structural tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fe4bd2b6ba |
fix(smolvla2): pass flat batch dict to preprocessor (no manual wrap)
``PolicyProcessorPipeline.__call__`` already wraps its input via ``to_transition`` (defaulting to ``batch_to_transition``) before running the steps, and unwraps via ``to_output`` (defaulting to ``transition_to_batch``) afterwards. The input format is therefore a *flat batch dict* keyed by ``observation.*`` / ``action`` / etc., not an ``EnvTransition``. Previous attempt pre-wrapped the observation into a transition with ``TransitionKey.OBSERVATION`` as the key, then handed *that* to the pipeline — which fed it to ``batch_to_transition``, which looked for top-level ``observation.*`` entries, found none (they were nested inside the enum key), and produced an empty observation. Every step then bailed with ``ObservationProcessorStep requires an observation in the transition.`` Pass the flat dict from ``build_inference_frame`` straight to the preprocessor — it does the wrap/unwrap itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3f7436ff8a |
fix(smolvla2): use TransitionKey enum (not .value) as transition keys
``EnvTransition`` is declared as a ``TypedDict`` keyed by ``TransitionKey.OBSERVATION.value`` (the string ``'observation'``), but every concrete ``ProcessorStep`` in the pipeline indexes the transition with the enum *member* (``transition[TransitionKey. OBSERVATION]`` / ``transition.get(TransitionKey.OBSERVATION)``). Those are two different keys in a Python dict — string key vs enum key — so steps couldn't find the observation we'd placed under the string variant, and bailed every tick with ``ObservationProcessorStep requires an observation in the transition``. Build the transition with the enum members directly. Matches how ``BatchProcessor``, ``RelativeActionProcessor``, ``HilProcessor``, etc. read the dict. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
992d13d4e9 |
fix(smolvla2): use build_inference_frame for raw robot observations
``robot.get_observation()`` on omx_follower (and most lerobot robots)
returns:
* per-joint scalar floats with ``.pos`` suffix
(``shoulder_pan.pos: 0.123``, ``shoulder_lift.pos: 0.456``, ...)
* per-camera ndarrays keyed by the camera config name (``wrist:
ndarray(H,W,3)``)
But the trained policy expects:
* single ``observation.state: tensor[N_joints]`` vector
* image keys prefixed: ``observation.images.<cam_key>:
tensor[1, 3, H, W]``
``prepare_observation_for_inference`` only handles the tensor /
batch-dim / device step — it crashes on scalar floats with
``expected np.ndarray (got float)``. The right helper is
``build_inference_frame`` which uses the dataset's feature schema
(``ds_meta.features``) to:
1. extract the right raw keys per dataset feature,
2. fold ``shoulder_pan.pos`` / ``shoulder_lift.pos`` / ...
into a single ``observation.state`` ndarray,
3. prefix camera keys with ``observation.images.``,
4. delegate to ``prepare_observation_for_inference`` for the
tensor / batch / device step.
Pass ``ds_meta.features`` into the observation provider and switch
to ``build_inference_frame`` when available; fall back to the bare
``prepare_observation_for_inference`` only when no dataset is
provided (rare — autonomous mode already requires it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
afe40a016b |
fix(smolvla2): wrap robot obs in EnvTransition before preprocessor
The policy preprocessor pipeline is transition-shaped — its steps
read ``TransitionKey.OBSERVATION`` off an ``EnvTransition`` dict, not
a flat ``RobotObservation`` dict. Passing the raw observation through
made every step bail with
``ObservationProcessorStep requires an observation in the transition``,
which the runtime swallowed at warning level. ``select_message`` then
got called with no ``observation.images.*`` features and crashed
with ``All image features are missing from the batch``.
Mirror ``lerobot-record``'s preamble:
1. ``prepare_observation_for_inference`` → numpy → torch, ``CHW``
image layout, ``[0,1]`` scaling, add batch dim, move to device.
2. Wrap into an ``EnvTransition`` (``{TransitionKey.OBSERVATION.value:
...}`` plus ``COMPLEMENTARY_DATA: {}`` and ``None``s for the rest)
so transition-aware steps see the keys they expect.
3. Run preprocessor.
4. Unwrap the transition's ``OBSERVATION`` slot to get the final
flat dict the policy's ``select_action`` / ``select_message``
consume.
Image features now reach the policy; the autonomous loop produces
real actions instead of swallowing warnings every tick.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
41095e3cc3 |
fix(smolvla2): instantiate CameraConfig subclasses from JSON dicts
``--robot.cameras`` parses the JSON into ``dict[str, dict]``, but ``RobotConfig`` expects ``dict[str, CameraConfig]`` — each inner value must be the actual ``CameraConfig`` subclass instance for the chosen backend (e.g. ``OpenCVCameraConfig``). Passing raw dicts blew up in ``RobotConfig.__post_init__`` with ``AttributeError: 'dict' object has no attribute 'width'`` when it iterated cameras and tried to read attributes. Look up the right subclass per-camera by its ``"type"`` field via ``CameraConfig.get_choice_class(...)`` (mirroring the lazy-import dance we already do for ``RobotConfig``: eagerly walk ``lerobot.cameras``'s submodules so the registry is populated before lookup). Construct an instance with the rest of the dict's fields. On an unknown camera type, raise a clean ``ValueError`` listing the available choices. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e0fa957569 |
fix(smolvla2): eagerly import robot submodules before get_choice_class
``RobotConfig._choice_registry`` is populated as a side-effect of each robot's ``@RobotConfig.register_subclass`` decorator running, and those decorators only fire when the corresponding ``lerobot.robots.<name>`` module is imported. The package's ``__init__.py`` doesn't import them — instead ``make_robot_from_config`` does it lazily in its big if/elif chain. ``_build_robot`` jumped the gun: called ``RobotConfig.get_choice_class (robot_type)`` before any robot module had been imported, so the registry was empty and every ``--robot.type=<X>`` produced ``KeyError: 'X'`` (e.g. ``KeyError: 'omx_follower'``). Walk ``lerobot.robots``'s submodules via ``pkgutil.iter_modules`` and ``importlib.import_module`` each one before the lookup. ~200ms on the first invocation, negligible for an autonomous run. On a real ``KeyError`` (typo / unsupported robot), raise a clean ``ValueError`` listing the registry's available choices instead of a bare KeyError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c661d81409 |
fix(smolvla2): use RobotConfig.max_relative_target, drop --max_action_norm
The hand-rolled action-norm safety clip duplicated what every ``RobotConfig`` already exposes — ``max_relative_target`` — and at the wrong layer (after postprocess but before send_action, instead of inside the robot driver where every other lerobot entry point puts it). The norm clip also rejected entire actions instead of clipping per-motor relative motion, so a single rogue joint would kill the whole tick. Replace with ``--robot.max_relative_target``: a string parsed as either a bare float (uniform per-motor cap) or a JSON object mapping motor name → cap. Passed through to ``RobotConfig(max_relative_target=...)`` at robot construction; the driver's ``send_action`` clips each commanded joint position relative to the current measured one before issuing it on the bus — same behaviour ``lerobot-record`` ships. Also bump ``--chunk_hz`` default from ``4.0`` to ``1.0``. One new chunk per second is what the trained checkpoint can comfortably keep up with on common hardware and gives smoother motion than sub-second chunk regenerations (no RTC interpolation between chunks yet). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
33a4b4a5a0 |
feat(smolvla2): autonomous robot mode in lerobot-smolvla2-runtime
The runtime CLI was deliberately scoped to dry-run only: it
hard-coded ``robot_executor=None`` and printed a "real-robot
integration is a follow-up" warning even when ``--no_robot`` was
omitted. The runtime *engine* was already structured for real-robot
operation (separate ``LowLevelForward`` chunk-rate generation +
``DispatchAction`` ctrl-rate dispatch with a ``robot_executor``
hook); only the wiring was missing.
Add the wiring:
* ``_load_policy_and_preprocessor`` now also returns the
postprocessor (action denormaliser).
* ``--robot.type`` / ``--robot.port`` / ``--robot.id`` /
``--robot.cameras`` (JSON) build a ``Robot`` via
``make_robot_from_config`` and connect it.
* ``_build_robot_observation_provider`` reads
``robot.get_observation()`` each call, drops the language
columns (runtime drives messages itself), and runs the policy's
preprocessor (rename → batch → device → normalise).
* ``_build_robot_action_executor`` postprocesses the policy's
action tensor (denormalise), converts to the ``{joint: value}``
dict via ``make_robot_action(action, ds_meta.features)``, and
calls ``robot.send_action(...)``. Optional ``--max_action_norm``
safety clip rejects ticks whose action L2 norm exceeds the
threshold (kill-switch when bringing up a new robot).
* ``_run_autonomous`` runs ``runtime.run()`` in a background
thread (the policy must keep generating chunks at chunk_hz and
dispatching at ctrl_hz regardless of stdin) and handles user
interjections / VQA queries from the foreground stdin loop.
Confirmation prompt before start (skip with ``--auto_start``);
Ctrl+C stops the thread and disconnects the robot cleanly.
* Autonomous mode requires ``--dataset.repo_id`` for action stats
/ feature shapes — pass the same dataset the policy was trained
on. The bootstrap path that pulls canonical task / plan / memory
runs in both REPL and autonomous modes so the model's first
prompt matches training distribution.
Dry-run REPL behaviour is unchanged when ``--robot.type`` is not
passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a764c3e1d6 |
fix(datasets,annotate): tag pushed dataset + clean revision error
Two bugs combining to make the brand-new ``_tool3`` dataset unloadable: 1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated dataset folder but never creates a codebase-version tag, so ``api/datasets/<repo>/refs`` returns ``"tags": []``. Then ``LeRobotDatasetMetadata`` → ``get_safe_version`` → ``get_repo_versions`` returns empty and the loader raises ``RevisionNotFoundError``. 2. ``RevisionNotFoundError`` itself was unconstructible: its ``HfHubHTTPError.__init__`` indexes ``response.headers`` unconditionally on current ``huggingface_hub`` versions, so constructing it without a real ``Response`` blew up with ``AttributeError: 'NoneType' object has no attribute 'headers'``, masking the real "no tag" message. Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and ``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset', exist_ok=True)`` so the dataset is loadable straight from the Hub on the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree ``CODEBASE_VERSION`` if info.json is missing/malformed; on tag creation failure, prints the manual one-liner the user needs. Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which inherits HfHubHTTPError) for what is really a config issue, not an HTTP failure. Raise plain ``RuntimeError`` with the same message — the caller actually sees what's wrong instead of an upstream attribute error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b416f287f2 |
fix(datasets): raise readable error when repo has no version tags
``RevisionNotFoundError`` inherits from ``huggingface_hub.HfHubHTTPError`` which made ``response`` a required keyword-only argument on recent versions. Constructing it with just a message string blew up with ``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'`` instead of surfacing the actual problem (the dataset/checkpoint repo doesn't exist on the Hub yet). Pass ``response=None`` explicitly. Fall back to the bare-message form for older ``huggingface_hub`` versions that don't accept the kwarg. Also clarify the message to call out the most common cause: typing a hub repo id that hasn't been pushed yet (instead of just "needs a version tag"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
aa749d4947 |
chore(annotate): throttle Module 3 + executor parallelism to fix vLLM stall
Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and
``executor.episode_parallelism=32``. With 2 cameras per dataset that
produced ~12× the original VQA call volume, all submitted concurrently.
Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's
KV cache pegged at 94% with 800+ in-flight requests, and the
multimodal cache corrupted with ``AssertionError: Expected a cached
item for mm_hash='...'`` (a known vLLM bug under image-heavy
concurrency). Module 1 and 2 ran fine; Module 3 was the bottleneck.
Pull back the multipliers to land in a sustainable spot:
* module_3.K: 3 (kept) — three diverse questions per emission,
where the diversity actually helps the LM head.
* module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original
emission rate. Net VQA volume is now ~3× original (K alone) on
a single camera, ~6× across both cameras — manageable.
* module_2.max_interjections_per_episode: 9 → 6 — still 2× the
default, fewer than the prior 3× to keep total request volume
in check.
* vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the
multimodal request path so the mm_cache doesn't desync.
* executor.episode_parallelism: 32 → 16 — half the episodes
in flight at once, so peak vLLM load is ~half.
n_task_rephrasings stays at 30 (text-only, doesn't load the image
path) and vlm.temperature stays at 0.7. The diversity gains are
preserved; only the throughput knobs come down.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1394a6ab5d |
chore(annotate): bump diversity knobs ~3x to fight memorisation
Following Pi0.7 §V (prompt expansion / diverse context conditioning),
push more atom variants per episode and higher VLM sampling
temperature so the training distribution has enough wording diversity
that the LM head is forced to use its parameters rather than memorise
specific (prompt, target) pairs.
Changes vs prior annotation pass:
* vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call
now produces diverse phrasings; same prompt yields different
completions across emissions.
* module_1.n_task_rephrasings: 10 → 30 — three times as many
``task_aug`` rows in language_persistent. ``${task}`` already
rotates through them deterministically per sample_idx (see
``_resolve_task`` in language_render.py).
* module_2.max_interjections_per_episode: 3 (default) → 9 — more
``user_interjection_response`` training samples + more plan
refresh events.
* module_3.K: 1 → 3 — three VQA pairs per emission tick instead of
one. Combined with the hz bump below, ~6× more VQA samples.
* module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission
rate within each subtask span.
Pushes to a new hub repo (``_tool3``) so the working ``_tool2``
dataset stays intact for comparison. ``${task}`` already wired to
rotate through ``task_aug`` rows, so no renderer change needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
db9118f16f |
fix(smolvla2): reject gibberish high-level generations
Memorised models can collapse to dominant-mode outputs (the JSON-token salad ``":":":":...`` from VQA training) when the prompt drifts even slightly from training distribution. Without a guard, that gibberish lands in ``current_subtask`` / ``current_plan`` / ``current_memory``, which feeds the next tick's prompt and cascades into worse outputs. The user observed exactly this: a clean run followed by a tick that wrote ``" " "`` into plan and memory, then slow recovery several ticks later. Add ``_looks_like_gibberish`` heuristic (alpha density, repeating chars, JSON-prefix sniff) and apply it before mutating state in ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``. Bad generations are logged inline (``[info] subtask gen rejected (gibberish): "":":":..."``) so the user can see what was dropped, but the state stays at its last-known-good value (typically the dataset bootstrap) instead of being polluted. VQA path is intentionally exempt — its training targets *are* JSON-shaped, so the heuristic would false-positive on them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7a945d7bdc |
fix(smolvla2): bootstrap canonical task + plan/memory from dataset
The user-typed task and the dataset's canonical task differ in
wording (capitalisation, ``green box`` vs ``green bin``, etc.). With
``text_loss`` driven down to ~6e-6 across 78 epochs the model is
memorised on the *exact* rendered training prompts: any wording drift
puts the prompt out of distribution and the model collapses to its
dominant training mode (VQA JSON output).
When ``--dataset.repo_id`` is set, automatically:
* read the canonical task string from the chosen episode (and use
it as ``--task`` when the user didn't pass one);
* pull the active ``plan`` / ``memory`` / ``subtask`` rows from the
persistent slice (latest row whose timestamp ≤ start frame's
timestamp — same semantics as the renderer's ``active_at``) and
seed them into the runtime state.
The first prompt the runtime builds at REPL start now mirrors what
the recipe rendered during training (task + active plan + active
memory + optional current subtask). The user can still override any
of these by typing.
Memorisation itself is upstream (training mix collapsed to too few
unique high-level targets); this commit only fixes the inference-side
prompt mismatch that was making the memorisation surface as gibberish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a47e535b02 |
fix(smolvla2): per-recipe inference prompts to match training shape
The four high-level steps shared one generic ``_control_context_messages`` that jammed task + plan + memory + completed_subtask into a single user message. The recipes in ``smolvla2_hirobot.yaml`` each have a *specific* multi-message layout (``memory_update``: ``user(task) → assistant(prev memory) → user(completed subtask)``; ``high_level_subtask``: ``user(task+plan+ memory) → user(current subtask)``; ``user_interjection_response``: ``user(task) → assistant(prev plan) → user(interjection)``). After ``apply_chat_template`` those layouts produce different prompts than the runtime's flattened single-user-turn version, and the model fell back to its dominant training mode (VQA JSON output) — generating ``":":":":":":...`` repetition. Add four per-recipe prompt builders (``_msgs_for_subtask``, ``_msgs_for_memory``, ``_msgs_for_interjection``, ``_msgs_for_vqa``), each mirroring its sub-recipe's exact message structure including the ``if_present`` skips. Wire each high-level step to its matching builder. Inference prompts now line up with what the model saw in training, so generation should produce coherent text instead of repeated tokens. Generic ``_control_context_messages`` is kept (still used by tests and the no-recipe fallback path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
6d9b431b54 |
fix(smolvla2): match training's text-loss forward in select_message
Previous rewrite drove generation through ``vlm.generate()`` (the
standard SmolVLM path), which ignores SmolVLA's custom ``embed_prefix``
that interleaves images + lang + state. Result: the model received a
prompt format it had never been trained on at inference and emitted
JSON-fragment gibberish (``" " " ,",","`` ``cube lift {"...``).
Revert to the cumulative-buffer AR loop driven through
``vlm_with_expert.forward`` — the *same* forward call ``_compute_text_loss``
makes during training (``inputs_embeds=[prefix_embs, None],
use_cache=False, fill_kv_cache=True``). With ``fill_kv_cache=True``,
every layer routes through ``forward_attn_layer``, which gracefully
skips ``None`` expert inputs (``if hidden_states is None or layer is
None: continue``); cross-attention layers — which would otherwise hard-
require a non-None expert input — are bypassed entirely.
Inference now sees the same prefix structure as training: images +
lang + state, with new tokens appended to the lang region. The text
distribution matches what the model was trained to produce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
347e706326 |
fix(smolvla2): drop pixel_values from select_message generate path
SmolVLA's image preprocessor sizes frames to whatever the action expert was trained on, but SmolVLM's standard vision tower expects its own default tile grid (e.g. 384/14 → 27×27 patches). The mismatch surfaces deep in the post-vision reshape as ``RuntimeError: shape '[2, 34, 34, 768]' is invalid for input of size 1843200`` — the model has 1200 patches but expects 34×34=1156. Drop ``pixel_values`` from ``vlm.generate(...)`` so SmolVLM runs as a text-only LM at REPL time. The high-level branches (subtask / plan / memory) are dominated by their text context anyway, so this is acceptable for dry-run inference. VQA loses its image grounding — that will be marked as expected for the dry-run path until a follow-up either re-processes images through SmolVLM's own ``ImageProcessor`` to match its tile grid, or gives ``vlm_with_expert`` a real AR text decode mode that handles state and image embeddings the way training does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
fa8ae1e89b |
fix(smolvla2): drive select_message through SmolVLM.generate
The hand-rolled AR loop in ``select_message`` was fighting the underlying ``vlm_with_expert.forward`` design, which assumes the "prefix-once + suffix-always-via-expert" pattern that ``denoise_step`` uses for action chunks. Cross-attn layers (every other layer with ``attention_mode='cross_attn'`` + ``self_attn_every_n_layers=2``) hard-require an expert input on every call: passing ``inputs_embeds=[current_embs, None]`` crashed at ``expert_layer.input_layernorm(None)`` with ``'NoneType' object has no attribute 'dtype'``. Earlier KV-cache attempts ran into the matching ``[15, 139] vs [15, 1]`` shape mismatch because the cache gets *overwritten*, not appended, on each ``fill_kv_cache=True`` call — there's just no AR-text-decode mode in this forward. Stop fighting it: drive AR text generation through the underlying SmolVLM via ``vlm.generate(input_ids=..., attention_mask=..., pixel_values=...)``. KV caching, sampling/greedy, EOS handling all come from HF's standard implementation. Trade-off: ``state`` drops out of the prefix at inference (no slot for it on the standard SmolVLM path), so high-level generations may drift from training distribution slightly. That's acceptable for the dry-run REPL — the high-level branches (subtask / plan / memory / vqa) are mostly vision+language conditioned anyway, and the action expert (where state actually matters) goes through the unchanged ``select_action`` path. Image features the runtime merged in (``observation.images.*``) are stacked into the ``[B, num_images, C, H, W]`` shape SmolVLM expects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3ff6c6860e |
fix(smolvla2): rewrite select_message decode loop without KV cache
SmolVLA's ``vlm_with_expert.forward`` doesn't actually support incremental KV cache growth — its only ``fill_kv_cache=True`` mode *overwrites* the cache with the latest call's key/value states, and its only ``fill_kv_cache=False`` mode concatenates ``cache + new`` into a local ``key_states`` for one matmul without ever updating the cache itself. The original ``select_message`` decode loop tried to use ``fill_kv_cache=True`` per step, which clobbered the cache to 1 token after the first decode and threw ``Expected size for first two dimensions of batch2 tensor to be: [15, 139] but got: [15, 1]`` — the attention mask still expected 139 keys but the cached + new key_states only had 1. Match the pattern ``denoise_step`` already uses successfully: maintain a cumulative ``(embs, pad, att)`` buffer that starts as the prefix and grows by one bool/embedding row per step. Each step forwards the *full* sequence with ``use_cache=False, fill_kv_cache=False, past_key_values=None`` so the matmul shapes always line up. Generated-token rows are tagged ``pad=1, att=1`` which makes them fully causal among themselves while still able to attend back to the entire prefix (per ``make_att_2d_masks`` semantics: a token can attend to any earlier token whose cumulative ``att`` count is ≤ its own). Image encoding is still done once via the initial ``embed_prefix`` call — the expensive part doesn't repeat. The remaining cost is O(n²) text-only transformer forwards, which is fine for the dry-run REPL's 50–100 token responses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
fd89efb545 |
fix(smolvla2): 3D attention mask in select_message decode loop
SmolVLA's ``eager_attention_forward`` does ``masked = torch.where(attention_mask[:, None, :, :], ...)``, which requires a 3D ``[B, query_len, key_len]`` bool tensor so the broadcast to 4D works. ``select_message``'s prefix forward got this right (passes ``prefix_2d`` from ``make_att_2d_masks``), but the KV-cache decoding loop built ``new_attn = torch.ones((bsize, cur_pos + 1))`` — 2D — and the very first decode step blew up with ``IndexError: too many indices for tensor of dimension 2``. During KV-cache decoding ``query_len = 1`` and ``key_len = cur_pos + 1`` (prefix + every token already generated), so the right shape is ``[B, 1, cur_pos + 1]``. Match the layout SmolVLA's working ``denoise_step`` uses for the equivalent ``prefix_pad_2d_masks`` build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2776b57c9e |
fix(smolvla2): bool attention mask + clean Claude-Code-style REPL
Two issues that combined to make the REPL unusable: 1. ``BatchEncoding.attention_mask`` is a ``Long`` tensor, but SmolVLA's ``eager_attention_forward`` does ``torch.where(attention_mask[..., None, :, :], ...)`` which requires a *bool* condition. Every forward raised ``where expected condition to be a boolean tensor, but got a tensor with dtype Long`` and the diagnostic surfaced it cleanly in the REPL — but generation produced nothing useful. Cast to ``bool`` in ``_build_text_batch`` so the prefix forward goes through. 2. The interactive REPL used ``rich.live.Live`` panels stacked on top of ``logging.basicConfig(level=DEBUG)`` HTTP request lines from ``httpcore`` / ``httpx`` / ``huggingface_hub``. The two rendering loops fought each other in the user's terminal and the output was illegible: hundreds of debug lines interleaved with re-rendered panels. Replace ``Live`` with a simple block redraw — clear screen, print the state block, print any robot log lines, then a single ``> `` prompt. State changes are visible above the prompt, the way Claude Code's REPL renders. No flicker, no re-render races. ``_silence_noisy_loggers`` drops the chatty third-party HTTP / download / model-init loggers to WARNING. ``-v`` still enables DEBUG on the lerobot loggers; if the user needs the HTTP traces, they can flip those individually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0fb5f04965 |
fix(smolvla2): handle BatchEncoding return from apply_chat_template
``tokenizer.apply_chat_template(..., tokenize=True, return_tensors='pt')``
on newer transformers returns a ``BatchEncoding`` (dict-like) rather
than a raw ``Tensor`` — particularly when the underlying call routes
through a processor. ``_build_text_batch`` only handled the ``Tensor``
and ``list`` shapes, so the encoding object reached SmolVLA's
``embed_language_tokens`` and ``F.embedding`` blew up with
``argument 'indices' must be Tensor, not BatchEncoding`` on every
high-level forward.
Normalise the return:
* ``BatchEncoding`` / ``dict`` → take ``input_ids`` (and the encoder's
``attention_mask`` when present, since ``pad_token_id`` can be
``None`` for SmolVLM and the fall-back ``ids != pad_token_id``
breaks then),
* ``list[int]`` / ``list[list[int]]`` → wrap in a long tensor,
* ``Tensor`` → keep as-is.
After unwrapping, ensure shape ``(1, seq)`` and that ``attention_mask``
is a tensor on the same device as ``ids``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7296ac97af |
fix(smolvla2): make silent generation failures visible in REPL
Two failure modes were combining to make the runtime "look dead": 1. ``_build_text_batch`` produced lang tokens via ``apply_chat_template(return_tensors='pt')`` on CPU, but the policy sits on the configured device (mps / cuda). The first prefix-embed inside ``select_message`` then raised a device-mismatch on every call. The bare ``except Exception`` in ``_generate_with_policy`` swallowed it at debug level — no logs, no chat output, no visible sign anything had run. 2. Even when generation succeeded but returned an empty string (greedy EOS, unhappy chat template, etc.), the high-level steps silently no-op'd, so users saw nothing. Move tokens to ``policy.config.device`` in ``_build_text_batch`` so the prefix forward succeeds in the common case. Bump the swallowing log level to ``warning`` (with optional traceback under ``-v``), and when ``state`` is given route the same diagnostic into the REPL log via ``push_log`` so the user sees ``[warn] subtask gen failed: ...`` inline. Also push an ``[info] ... produced no text this tick`` line when generation runs but yields nothing, so empty completions are distinguishable from "step never ran". Apply the same surface to ``LowLevelForward.select_action`` failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9cbbcfb6a2 |
fix(smolvla2): tokenize lang prompt inline before select_action
LowLevelForward was handing the observation provider's output straight to ``policy.select_action``, but SmolVLA's ``_get_action_chunk`` indexes ``batch[OBS_LANGUAGE_TOKENS]`` and crashes with ``KeyError: 'observation.language.tokens'`` when the key isn't there. Our provider deliberately strips the dataset's language columns (the runtime drives messages itself), so nothing else was producing those tokens — the chunk path crashed on the very first tick after task was set. Build a low-level prompt from current runtime state inline (task / plan / memory as the user turn, current subtask appended as a continuation assistant turn when known), tokenize it with the same helper the high-level steps use, and merge ``lang_tokens`` / ``lang_masks`` into the observation before the call. Skip the step when no task is set yet, and swallow ``select_action`` exceptions at debug level so a missing observation feature doesn't kill the REPL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
fea41b29f5 |
fix(datasets): probe parquet for language columns before strict cast
``_load_hf_dataset`` was building the strict cast schema only from ``meta/info.json["features"]``. Datasets annotated by ``lerobot-annotate`` but still tagged at the older codebase version (no ``language_persistent`` / ``language_events`` entry in ``info.json``) carry both columns in the parquet itself but not in the features dict, so ``Dataset.from_parquet`` blew up with ``CastError: column names don't match`` when trying to project a 9-column parquet onto a 7-column schema. Probe one parquet shard's actual schema; if either language column is present in the parquet but missing from ``features``, graft it on using PR 1's ``language_persistent_column_feature`` / ``language_events_column_feature`` helpers. No-op when neither column is present (fully backwards-compatible with v3.0 datasets), no-op when both are already registered (fully forwards-compatible with future v3.1 ``info.json`` writes). This unblocks dry-run inference on PR 2-annotated datasets that weren't re-tagged to v3.1 — including the ones in the field today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7b4d281ef5 |
fix(smolvla2): build preprocessor fresh, don't round-trip the recipe
``PolicyProcessorPipeline.from_pretrained`` reconstructs each saved
step by passing the persisted JSON config back to ``__init__``, but
``RenderMessagesStep.recipe`` (a ``TrainingRecipe``) doesn't survive
the JSON round-trip — the saved entry is ``{}`` and the reconstructor
crashes with ``missing 1 required argument: 'recipe'``.
Bypass the round-trip in the runtime CLI by passing
``pretrained_path=None`` to ``make_pre_post_processors``. That re-runs
``make_smolvla2_pre_post_processors``, which reloads the recipe YAML
referenced by ``cfg.recipe_path`` and wires it back into the step
correctly. ``NormalizerProcessorStep`` still gets stats from
``ds_meta.stats`` so normalization matches training.
Proper fix is to make ``RenderMessagesStep`` serializable (e.g. by
persisting the recipe path / contents); this commit keeps it scoped to
the runtime path so dry-run testing isn't blocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
29bb8bb20e |
fix(tools): unblock pocket-tts resolution (>=1.0.0,<3.0.0)
The previous bound `>=0.1.0,<1.0.0` matched zero published versions — pocket-tts went straight to 1.0.0 on PyPI, with 0.x never released. That made `uv sync --extra tools` (and any sync that pulls the `dev` / `all` superset) fail with "requirements are unsatisfiable" on every Python version uv tried, including 3.12. Bump to `>=1.0.0,<3.0.0` so 1.x and 2.x are reachable. SayTool only touches `TTSModel.load_model()`, `get_state_for_audio_prompt`, `generate_audio`, and `sample_rate` — small enough surface that 1.x and 2.x should both work; tighten if a real API break shows up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3fe686ce9f |
feat(smolvla2): runtime accepts Hub IDs + dataset-driven dry-run
The runtime CLI's loader was broken — it imported a `make_policy_from_path` that doesn't exist in `lerobot.policies.factory` — and the high-level text steps generated plan / subtask / memory / VQA from a text-only batch with no images or state, so dry-runs drifted from the training distribution. Switch to the standard `PreTrainedConfig.from_pretrained` + `make_policy(cfg, ds_meta=...)` flow so `--policy.path` accepts both local directories and Hub repo ids, and add a `--dataset.repo_id` path that walks a chosen episode and feeds preprocessed observations into every forward pass — including the four high-level steps (`HighLevelSubtaskFwd`, `MemoryUpdateFwd`, `UserInterjectionFwd`, `AskVQAFwd`). Frames are routed through the saved preprocessor pipeline with `language_persistent` / `language_events` stripped so the recipe-render step stays a no-op (the runtime supplies its own messages from current state). Also wires the rich-based two-zone REPL layout (`ui.py`) that the script was already importing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
a1b8134ef1 |
fix(smolvla2): train on rendered language batches
Keep annotated language columns through collation, render batched recipe samples, and make SmolVLA2 text loss robust enough for distributed training on the steerable dataset. Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
5f7c6ba61d |
feat(annotate): compact steerable annotation prompts
Co-authored-by: Cursor <cursoragent@cursor.com> |
||
|
|
223cc8a9e2 |
feat(smolvla2): inference runtime — select_message + multi-rate REPL
Closes the loop on PR 3: SmolVLA2 can now be queried interactively at inference, dispatching the same five sub-recipe shapes it was trained on (action chunks, subtask gen, memory updates, plan/speech on interjection, VQA on questions). Modeling fixes + additions -------------------------- - ``_compute_text_loss``: standard next-token CE shift was missing (logits at position t were CE'd against the label at t — identity- mapped, learning nothing). Adds ``logits[:, :-1]`` / ``labels[:, 1:]`` shift to match HuggingFace ``LlamaForCausalLM``. - New ``select_message`` on ``SmolVLA2Policy``: AR text generation with KV caching, mirroring SmolVLA's ``select_action`` pattern. Single prefix forward fills the cache, then per-token forwards reuse it. Greedy + top-p nucleus sampling. Returns the decoded string with the prompt stripped. Runtime package — ``src/lerobot/policies/smolvla2/inference/`` ------------------------------------------------------------- - ``triggers.py`` — ``Trigger`` Protocol + ``HzTrigger`` / ``EventTrigger`` + ``TickClock``. The whole runtime ticks at ``max_rate_hz=50`` and each step gates itself off its own cadence. - ``runtime_state.py`` — runtime state dict factory plus tiny helpers (``take_event``, ``set_if_changed``, ``push_log``). Stable keys are documented at the top of the module. - ``steps.py`` — :class:`InferenceStep` base + concrete steps: ``LowLevelForward`` / ``DispatchAction`` (action path), ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd`` / ``AskVQAFwd`` (text paths), ``DispatchToolCalls`` (tool registry → ``Tool.call``). Each text step builds a chat-template prompt from current ``RuntimeState`` (task / plan / memory / subtask) matching what ``smolvla2_hirobot.yaml`` renders during training. Includes a tiny ``<say>...</say>`` parser for the ``user_interjection_response`` branch's combined plan + speech output. - ``runtime.py`` — :class:`SmolVLA2Runtime` composes the pipeline, drives ticks via ``TickClock``, polls a user-supplied ``event_collector`` per tick, and prints state-change log lines. - ``repl.py`` — :class:`StdinReader` non-blocking line reader with simple intent classification: ``stop`` / ``quit`` / ``exit`` → terminate; ``?`` suffix → ``user_vqa_query`` event; first line → set task; other lines → ``user_interjection``. CLI --- - ``src/lerobot/scripts/lerobot_smolvla2_runtime.py``: console script ``lerobot-smolvla2-runtime`` that loads a checkpoint, optionally instantiates ``SayTool`` (pocket-tts), wires up ``SmolVLA2Runtime`` + ``StdinReader``, and runs. Real-robot wiring (observation_provider / robot_executor) is intentionally left as a follow-up — v1 is dry-run / language- only so the REPL works without robot hardware. Registered in ``pyproject.toml`` ``[project.scripts]``. Known follow-ups ---------------- - Real-robot integration: today ``LowLevelForward`` only fires when an observation_provider is wired. The CLI prints a warning if ``--no_robot`` is omitted. - ``select_message`` runs an extra prefix forward; could share with the action path's prefix when both are needed in the same tick. - Tests: no end-to-end runtime test yet (would need a tiny SmolVLM fixture). The components compile and the public surface is exercised by the CLI's argument-parsing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
af6d8ebd5b |
feat(smolvla2): dual-head forward — flow loss + lm_head text loss
The third and final commit of PR 3's SmolVLA2 work. Wires the actual
training signal through:
* ``predict_actions[i] = True`` → sample i contributes to flow loss
* ``text_labels[i, t] != -100`` → token t of sample i contributes to
LM-head cross-entropy
Both routing knobs come from ``SmolVLA2ChatTokenizerStep`` (previous
commit on this branch), which builds them from the recipe's
``message_streams`` / ``target_message_indices``. The per-sample
``predict_actions`` mask preserves the Pi0.5 convention from the
plan's Section I.7: "True iff any low_level target exists".
Implementation:
- ``forward`` reads ``text_labels`` and ``predict_actions`` from the
batch. When neither is present (vanilla SmolVLA usage with no
recipe), delegates to ``SmolVLAPolicy.forward`` so unannotated
datasets keep training as before — full backward compatibility.
- ``flow_loss``: super().forward(reduction="none") returns the
per-sample (B,) flow loss; we mask non-action samples with the
``predict_actions`` bool and renormalize by the count of action
samples. ``flow_loss_weight = 0`` in the config disables this
branch entirely (text-only training).
- ``text_loss``: a prefix-only forward through the VLM (no action
expert / suffix), slicing the lang-token range out of the
resulting hidden states (``embed_prefix`` orders the prefix as
``[image_blocks..., lang, state]`` so the slice is unambiguous).
Apply ``vlm.lm_head`` to those hidden states, cross-entropy with
``text_labels`` (ignore_index=-100). ``text_loss_weight = 0``
disables this branch (reverts to flow-only behaviour, matching
SmolVLA exactly).
- The two losses are summed with the config-supplied weights.
Mixed-stream samples (one batch containing both action targets and
text-only sub-recipes) are handled correctly: each sample contributes
where its labels are valid and is masked elsewhere.
Limitations / known follow-ups:
- Text loss runs an additional prefix-only forward separate from the
flow path's prefix forward. The forwards could share their prefix
computation; for clarity of this first commit they don't.
Optimization is straightforward when needed.
- Per-sample loss for ``reduction="none"`` is not yet meaningfully
defined for the dual path — we broadcast the scalar to (B,) for
caller compatibility (e.g. RA-BC weighting will need follow-up).
- Inference ``select_action`` is unchanged from SmolVLA today —
it predicts actions only. A separate "generate text"
``select_message`` path is the natural next step for runtime
use of the LM head (memory updates, plan refreshes, VQA answers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
37b1eb218a |
feat(smolvla2): chat-template processor + label mask + predict_actions
Wires PR 1's recipe stack into the SmolVLA2 pipeline so multi-target
sub-recipes (memory_update, ask_vqa, user_interjection_response,
high_level_subtask) carry meaningful supervision through to the model.
- New ``chat_processor_smolvla2.py`` with
``SmolVLA2ChatTokenizerStep``: reads ``messages`` /
``message_streams`` / ``target_message_indices`` from the rendered
sample (PR 1 ``RenderMessagesStep``), calls
``apply_chat_template(messages, tools=DEFAULT_TOOLS, ...)`` on the
SmolVLM tokenizer, and writes:
OBS_LANGUAGE_TOKENS / _ATTENTION_MASK ← chat-templated prompt
text_labels ← -100 except target msg tokens
predict_actions ← True iff any low_level target
Builds the label mask robustly by re-rendering the chat through
each target's prefix and reading off the prefix length — same
tokenizer, same tools, so the prefix tokens are guaranteed to be
a prefix of the full sequence. Image/video content blocks
(LeRobot ``feature``-keyed) are stripped before tokenizing; the
actual image tensors flow through SmolVLA's existing
``OBS_IMAGES_*`` channels and ``embed_prefix`` puts them before
the language embeddings, matching the chat-template-stripped
text order.
- ``processor_smolvla2.py``: when ``config.recipe_path`` is set,
build a new pipeline with ``RenderMessagesStep`` +
``SmolVLA2ChatTokenizerStep`` instead of SmolVLA's plain
``TokenizerProcessorStep``. When ``recipe_path`` is ``None``,
fall back to SmolVLA's pipeline so unannotated datasets still
work unchanged. Resolves recipe paths relative to
``src/lerobot/configs/`` so ``recipes/smolvla2_hirobot.yaml``
works directly.
The next commit on this branch picks up ``text_labels`` and
``predict_actions`` from the batch and routes them through the
SmolVLM ``lm_head`` for the actual dual-loss training.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
52e1fd35cb |
feat(tools): src/lerobot/tools/ — runnable tool registry + SayTool
Ships the runtime side of the OpenAI-style function-calling stack
introduced in PR 1 (catalog in ``meta/info.json["tools"]``) and PR 2
(annotation pipeline writes the catalog after a run). One file per
tool — heavy deps stay isolated.
Layout:
- ``base.py`` — :class:`Tool` Protocol: ``name``, ``schema``,
``call(arguments)``. Runtime-checkable so tests can use
``isinstance(...)``.
- ``registry.py`` — :data:`TOOL_REGISTRY` (name → class) plus
``get_tools(meta, **kwargs)`` that instantiates every entry whose
``function.name`` is registered. Tools whose name is unknown are
silently skipped — the schema still rides through the chat
template, the model just can't actually invoke that tool at
inference.
- ``say.py`` — :class:`SayTool` wrapping Kyutai's pocket-tts
(CPU-only, ~100M params, ~6× real-time on a MacBook Air M4).
Lazy model load: pocket-tts is imported and the voice state
computed on first ``call(...)`` (or eagerly via ``preload()``).
Returns the PCM tensor; optionally writes a ``.wav`` to
``output_dir`` for offline inspection.
- ``__init__.py`` — re-exports the public surface.
Optional install:
pip install lerobot[tools]
The ``[tools]`` extra in ``pyproject.toml`` pulls in ``pocket-tts`` +
``scipy`` (for the wav writer). Adding more tools later means a new
file + a registry entry — no new extras unless the tool brings new
deps.
To add your own tool, follow the three-step guide in
``docs/source/tools.mdx`` (PR 1):
1. Drop ``src/lerobot/tools/<my_tool>.py`` with a ``Tool``-conforming
class.
2. Register the class in ``TOOL_REGISTRY`` (this file).
3. Pre-populate ``meta/info.json["tools"]`` with the schema (or let
``lerobot-annotate`` add it on the next run).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7459dfccb6 |
feat(policies): scaffold smolvla2 (smolvla + lm_head re-enabled)
PR 3 of the steerable-annotation plan retargeted from Pi0.5 to SmolVLA
because the recipe stack (PR 1 + PR 2) outputs HF/TRL-compatible chat
which a chat-pretrained backbone consumes natively. SmolVLA strips the
SmolVLM ``lm_head`` though, so it can only do flow-matching action
prediction. SmolVLA2 keeps the LM head so the same model can train on
the full Hi Robot / MEM / ECoT blend defined in the plan:
* action-only sub-recipes (low_level_execution) flow loss
* text-only sub-recipes (memory_update / ask_vqa / CE loss on
user_interjection_response) lm_head
* mixed sub-recipes both summed
This first commit lays down the structural scaffold:
- ``src/lerobot/policies/smolvla2/`` — new package with thin subclasses
of ``SmolVLAConfig`` / ``SmolVLAPolicy`` so we don't fork the 900-line
modeling code. ``SmolVLA2Config`` adds ``recipe_path``,
``apply_chat_template``, ``text_loss_weight``, ``flow_loss_weight``,
and ``unfreeze_lm_head``. ``SmolVLA2Policy`` unfreezes the SmolVLM
``lm_head`` (and the surrounding norm + last text-model layer SmolVLA
freezes) when ``unfreeze_lm_head=True`` and ``text_loss_weight>0``.
- ``factory.py`` registers ``smolvla2`` in ``get_policy_class``,
``make_policy_config``, and the pre/post-processor builder. Important:
the ``smolvla2`` branch lives BEFORE the ``isinstance(config,
SmolVLAConfig)`` check because ``SmolVLA2Config`` subclasses
``SmolVLAConfig`` — without the ordering, SmolVLA2 would silently
pick up SmolVLA's processor.
- ``configs/recipes/smolvla2_hirobot.yaml`` — canonical Hi Robot blend
for SmolVLA2. Same shape as ``pi05_hirobot.yaml`` (PR 1) so the
recipe stack stays uniform across policy backbones.
Behaviour today is identical to SmolVLA: the modeling forward
delegates to ``SmolVLAPolicy.forward`` and the processor delegates to
``make_smolvla_pre_post_processors``. The next commit on this branch
adds the chat-template processor + ``text_labels`` / ``predict_actions``
batch keys; the commit after that wires the actual text-loss path
through ``vlm.lm_head``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
73740ecf4b |
feat(annotate): write tool catalog to meta/info.json after annotation
After every ``lerobot-annotate`` run, the executor ensures ``meta/info.json["tools"]`` contains at minimum the canonical ``say`` schema, while preserving any tools the user pre-declared on the dataset. Chat-template consumers (PR 3 SmolVLA2 / Pi0.5 / dataset visualizer) read the catalog through ``LeRobotDatasetMetadata.tools`` and pass it to ``apply_chat_template(messages, tools=meta.tools, ...)``. - ``executor.py``: new ``_ensure_tools_in_info`` helper called after the parquet rewrite. Idempotent and additive — merges by ``function.name``, only writes back if the list changed. - ``writer.py``: drops the duplicated ``SAY_TOOL_SCHEMA`` / ``DEFAULT_TOOLS`` constants in favour of importing from ``lerobot.datasets.language`` (PR 1's single source of truth). Re-exported so existing imports keep working. - ``annotation_pipeline.mdx``: replace the "code constant only" note with a pointer to the new Tools doc and a description of the meta/info.json behaviour, including how to pre-declare custom tools before annotation runs. This is the storage half of the tools work; PR 3 ships the runnable implementations under ``src/lerobot/tools/`` (one file per tool, first up: ``say.py`` wired to Kyutai's pocket-tts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1b81e49214 |
feat(annotate): task rephrasings + video-derived task fallback
Module 1 now produces ``task_aug`` rows (registered in PR 1) so the
PR-1 ``${task}`` resolver can rotate phrasings deterministically per
``sample_idx``. Plus an opt-in video-derived task that bypasses the
canonical ``meta/tasks.parquet`` task when it's empty, low-quality, or
explicitly disabled — every downstream Module-1 prompt then uses the
derived task as its grounding.
- ``Module1Config``: adds ``n_task_rephrasings`` (default 10) and
``derive_task_from_video`` ∈ ``{off, if_short, always}`` (default
``if_short``: triggers when canonical is empty, < 3 words, or matches
a placeholder string like ``debug`` / ``unnamed`` / ``tbd``).
- ``plan_subtasks_memory.py``: ``run_episode`` now resolves an
``effective_task`` (canonical OR video-derived) and threads it
through ``_generate_subtasks`` / ``_generate_plan`` /
``_generate_memory`` so subtasks, plans, and memory are all grounded
in the same task string. Then generates ``n`` rephrasings of the
effective task and writes them as ``task_aug`` rows at ``t=0`` with
``role=user``. The effective task itself is included as the first
variant so the rotation is guaranteed to cover the source-of-truth
phrasing.
- New prompts: ``module_1_video_task.txt`` (one-shot video → task),
``module_1_task_rephrasings.txt`` (text-only paraphraser, ``n`` per
call).
- ``meta/tasks.parquet`` is NOT modified — derived tasks live only in
``language_persistent``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d813c75b76 |
fix(annotate): align interjections with the actual demo trajectory
qwen36moe-11 surfaced a deeper semantic problem with mid-episode
interjections: they were generated as *counterfactual* user requests
("actually skip the wipe", "use the blue one instead") but teleop data
is frozen — the robot in the video already executed everything,
including the steps the user "asked to skip". The training signal was
therefore self-contradictory: interjection text said one thing, the
robot's subsequent action stream did the opposite.
Flip the framing. Anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask. The
robot's visible next behavior IS the interjection's effect, so:
interjection text → plan refresh → action stream
are all consistent with the same observed video.
Concretely:
- ``interjections_and_speech.py``: instead of sampling random
timestamps from ``frame_timestamps``, walk Module 1's subtask spans
and sample from the (subtask N → subtask N+1) transitions. Pass both
the just-finished and the upcoming subtask texts into the prompt.
- ``_window_timestamps``: re-center the multi-frame video window on
the boundary itself (half the frames cover the end of the previous
subtask, half cover the start of the next one) so the VLM has the
same visual conditioning the policy will see at training time.
- ``module_2_interjection.txt``: rewritten. The prompt now states
explicitly that this is offline data, the robot already committed to
the next subtask, and the interjection must be a natural request
that aligns with — not contradicts — the next subtask. Removes the
"negative task / situated correction" Hi Robot framing because those
scenarios require online execution to be coherent.
Plan-refresh logic from the previous commit (forwarding interjection
text into the refresh prompt) is unchanged and now reinforces the same
direction: the refreshed plan emphasizes the upcoming subtask the
interjection just asked for.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3434d2ef22 |
fix(annotate): ground interjections in video + propagate text to plan refresh
qwen36moe-10 showed three Module-2 / plan-refresh quality issues that
are not architecture problems — they're prompt-grounding bugs:
1. Interjection prompt passed ``current_subtask = record.episode_task``
(the WHOLE-episode task), not the actual subtask in force at the
chosen timestamp. The VLM had no signal about what was visible at
that moment, so its interjections were generic ("actually skip X"
where X had nothing to do with the visible activity).
2. Interjection prompt only attached a single frame
(``frames_at(record, [t_snap])``). With one frozen image the VLM
couldn't read the ongoing motion. Module 1 already gets the whole
episode video for subtask decomposition, which is why subtasks are
well-grounded; Module 2 was the outlier.
3. The plan-refresh prompt told the model "a plan refresh after a user
interjection at t=X.YZs" but never showed it the interjection
*text*. So the refreshed plan couldn't actually reflect the user's
correction — at best it recombined the same step list.
Fix:
- ``interjections_and_speech.py``: Module 2 reads Module 1's subtask
rows from the same staging tree (executor orders module_1 → module_2
so they're already there) and resolves the actual ``current_subtask``
at each chosen timestamp. Pulls a small clip
(``interjection_window_seconds`` × ``interjection_window_frames``,
defaulting to 4 frames over the leading 2 s) instead of one frame.
Drops the silently-zeroing ``len(candidate_ts) // 4`` cap on the
interjection count.
- ``module_2_interjection.txt``: prompt is rewritten to reference the
multi-frame visual context and require the interjection to mention
something visible OR named in the current subtask, not invented.
- ``plan_subtasks_memory.py``: ``run_plan_updates`` now accepts and
threads through interjection texts. ``_generate_plan(refresh_t,
interjection)`` injects both the current subtask AND the interjection
text into the prompt so the refreshed plan can drop / reorder /
constrain steps to match the user's correction. (Plan still refreshes
ONLY at user interjections — subtask generation runs ~1 Hz at
inference, plan re-emission is event-driven.)
- ``executor.py``: forwards ``interjection_texts`` alongside
``interjection_times`` to ``run_plan_updates``.
- ``Module2Config``: bumps ``max_interjections_per_episode`` default
from 1 to 3 and exposes ``interjection_window_seconds`` /
``interjection_window_frames``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b71e10da6b |
refactor(annotate): drop dataset-level `tools` parquet column
PR 2 used to write a top-level ``tools`` column on every parquet shard holding the JSON schema for the ``say`` tool, broadcast identically across every row. That extends PR 1's schema for no real information gain — the schema is a fixed code constant, parquet's RLE/dict encoding collapses it on disk anyway, and HF/TRL chat-template consumers can just import the constant directly. PR 2 should fill in PR 1's existing schema, not add to it. So: - ``writer.py``: stop emitting the ``tools`` column. Strip any legacy ``tools`` column from older shards on rerun so the schema converges to v3.1. ``SAY_TOOL_SCHEMA`` stays as a public constant (now joined by ``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``); chat-template policies and the visualizer import them directly. - ``test_writer.py``: replace the "tools column present" assertion with one that explicitly checks the column is absent, plus a new test asserting the constant's shape. - ``test_pipeline_recipe_render.py``: drop the tools-column read; assert it's not present in the rewritten parquet. - ``annotation_pipeline.mdx``: update the writer description to note the parquet stays small and the schema lives as a code constant. If multi-tool-set support ever becomes real (datasets with different tool inventories), the right home is ``meta/info.json["tools"]`` — adding it later is non-breaking; ripping out a parquet column already shipped is not. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0f6e3230df |
fix(annotate): decode video frames with PyAV directly
``lerobot.datasets.video_utils.decode_video_frames`` routes ``backend="pyav"`` through ``decode_video_frames_torchvision`` → ``torchvision.io.VideoReader``, but ``VideoReader`` was removed in torchvision >= 0.22 (the vllm/vllm-openai:latest container ships with torchvision 0.25). That made every Module 3 frame decode raise ``AttributeError: module 'torchvision.io' has no attribute 'VideoReader'``, which the previous catch-all silently turned into an empty image list, which then made every Module 3 prompt skip via the ``not _has_image_block(messages)`` branch and produce zero VQA rows. Bypass ``video_utils`` entirely. The annotation pipeline only needs a handful of PIL frames per (episode, ts), so a direct PyAV decode is both simpler and insulated from torchvision API churn. ``av`` is already in the install set, no new dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2f2e42c4aa |
log(annotate): warn loudly on first video decode failure
VideoFrameProvider._decode used to swallow every exception silently and return []. That made Module 3 (VQA) produce zero rows whenever local video decoding broke (codec, backend, missing file, ...) because every prompt got skipped via the ``not _has_image_block(messages)`` branch in general_vqa.py — without any signal in the job log. Log the first failure with full exception info (subsequent failures stay quiet to avoid log spam) so this fast-path is debuggable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
5ee0104739 |
log(annotate): surface resolved frame-provider cameras at startup
Print the default and full camera list once at the top of every run so a silent Module-3-no-op (cam_keys=[]) is visible in the job log instead of only being discoverable by counting parquet rows after upload. Also warn loudly when Module 3 is enabled but no cameras resolved, with a hint about the --vlm.camera_key fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e064cfcb04 |
fix(annotate): seed Module 3 cameras from camera_keys + camera_key fallback
Module 3 fast-pathed out (50 episodes in 0.6s) when ``frame_provider.camera_keys`` came back empty even though Module 1/2 worked, because they use ``frame_provider.camera_key`` (singular) and were happy with the explicit ``--vlm.camera_key=...`` override. Two fixes: - ``frames.py``: read ``meta.camera_keys`` (covers both video- and image-stored cameras) instead of ``meta.video_keys`` (video-only), matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If metadata still surfaces nothing but the caller explicitly passed ``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by definition known to exist on the dataset. - ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees zero cameras so this never silently produces zero VQA again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |