mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-14 16:19:45 +00:00)
fix(smolvla2): use canonical _strip_lerobot_blocks for inference msgs
Training tokenises messages through ``_strip_lerobot_blocks`` (in
``chat_processor_smolvla2.py``), which normalises every variant of
``message['content']`` into the ``[{type:text, text:...}]`` list shape
SmolVLM's chat template expects (sketched below the list):
* ``list[block]`` → keep text blocks, drop images
* ``None`` → ``[{type:text, text:""}]``
* ``str`` / other → ``[{type:text, text:str(content)}]``
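For reference, the whole normalisation amounts to roughly the
following (a sketch, not the verbatim helper; the real one lives in
``chat_processor_smolvla2.py`` and, per the diff below, also strips
``stream`` / ``target`` recipe metadata):

    def strip_lerobot_blocks_sketch(message):
        # Normalise message['content'] into the list-of-text-blocks
        # shape the chat template expects (the cases listed above).
        content = message.get("content")
        if isinstance(content, list):
            # keep text blocks, drop image (and other non-text) blocks
            content = [b for b in content
                       if isinstance(b, dict) and b.get("type") == "text"]
        elif content is None:
            content = [{"type": "text", "text": ""}]
        else:
            content = [{"type": "text", "text": str(content)}]
        # Rebuilding from role/content alone also drops recipe keys.
        return {"role": message.get("role"), "content": content}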
Inference was doing a partial inline conversion that only handled the
``str`` case — ``None`` and pre-formatted ``list`` content slipped
through unchanged. ``memory_update``'s ``Previous memory: ...``
assistant turn ends up with ``None`` content when there is no prior
memory; that turn then renders with no content tokens (role markers
only) and the model hallucinates ``Assistant:`` fragments. Subtask
generation got further only because its prompt always contains at
least the task string.
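Illustrative shape of the failing ``memory_update`` prompt (the
message text here is invented; the ``None`` content is the point):

    messages = [
        {"role": "user",
         "content": [{"type": "text", "text": "Update the memory."}]},
        # the "Previous memory: ..." turn -- None without prior memory
        {"role": "assistant", "content": None},
    ]
    # The old inline conversion only coerced str content, so the None
    # slipped through and the template rendered a role-marker-only turn.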
Reuse ``_strip_lerobot_blocks`` directly. Now the inference prompt
shape matches the exact tokenisation training did — no more "trained
on shape X, asked to predict shape Y" mismatch.
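A quick parity check (sketch; ``prompt_messages`` and ``tokenizer``
are the objects already in scope in ``_build_text_batch``):

    from lerobot.policies.smolvla2.chat_processor_smolvla2 import (
        _strip_lerobot_blocks,
    )

    normalised = [_strip_lerobot_blocks(m) for m in prompt_messages]
    # every turn is now list-of-blocks, so the template always emits
    # content tokens, including for previously-None assistant turns
    assert all(isinstance(m["content"], list) for m in normalised)
    ids = tokenizer.apply_chat_template(normalised, add_generation_prompt=True)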
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -170,18 +170,20 @@ def _build_text_batch(policy: Any, prompt_messages: list[dict[str, Any]]) -> dic
     if tokenizer.pad_token_id is None and tokenizer.eos_token_id is not None:
         tokenizer.pad_token = tokenizer.eos_token
 
-    text_messages = [_strip_recipe_keys(m) for m in prompt_messages]
-    # SmolVLM's chat template iterates ``message['content']`` expecting
-    # a list of typed blocks (``[{type: 'text', text: ...}, ...]``).
-    # When ``content`` is a plain ``str`` it silently iterates characters,
-    # no branch matches, and *no content tokens are emitted* — the model
-    # receives only role markers and starts hallucinating ``Assistant:``
-    # fragments. Coerce string content to the list-of-blocks form the
-    # template expects.
-    for _m in text_messages:
-        _c = _m.get("content")
-        if isinstance(_c, str):
-            _m["content"] = [{"type": "text", "text": _c}]
+    # Reuse the *exact* normaliser that the training-time chat
+    # tokenizer step uses (``_strip_lerobot_blocks``). It handles all
+    # the cases the SmolVLM chat template expects:
+    # * ``content: list[block]`` → keep text blocks, drop images
+    # * ``content: None`` → ``[{type: text, text: ""}]``
+    # * ``content: str`` / anything else → ``[{type: text, text: str(content)}]``
+    # Doing it any other way creates a training/inference mismatch in
+    # exactly the prompt shape the model was supervised on. Also
+    # strips ``stream`` / ``target`` recipe metadata.
+    from lerobot.policies.smolvla2.chat_processor_smolvla2 import (  # noqa: PLC0415
+        _strip_lerobot_blocks,
+    )
+
+    text_messages = [_strip_lerobot_blocks(m) for m in prompt_messages]
     encoded = tokenizer.apply_chat_template(
         text_messages,
         add_generation_prompt=True,