mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-20 19:19:56 +00:00
fix(smolvla2): feed all cameras to VQA generation, not just the chosen one
handle_vqa_query filtered the observation down to the single chosen camera before calling the VLM. But training feeds every camera: the ask_vqa_* recipes' image blocks are stripped before tokenization and the frames reach the model via OBS_IMAGES_*, where embed_prefix consumes all config.image_features regardless of the per-camera recipe tag. Filtering to one camera changed the image-token count in the prefix (the dropped camera zero-padded with mask=0), a prefix shape the model never saw at training.

Now the full observation is passed to select_message; the chosen camera is used only to pick which frame the bbox/point overlay is drawn on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
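The prefix-shape argument can be illustrated with a toy sketch. This is not SmolVLA's actual embed_prefix: the camera keys, the TOKENS_PER_IMAGE value, and the prefix_mask helper are all hypothetical. It also assumes that a camera missing from the observation keeps its slots in the prefix but is masked out, as the message describes.

```python
# Toy illustration only: every name here is hypothetical, not a lerobot API.
IMAGE_FEATURES = ["observation.images.front", "observation.images.wrist"]  # assumed config
TOKENS_PER_IMAGE = 4  # made-up value for the sketch


def prefix_mask(observation: dict) -> list[int]:
    """Return one mask entry per image-token slot in the prefix.

    Every configured camera contributes TOKENS_PER_IMAGE slots; a camera
    missing from the observation is padded and masked out (mask=0).
    """
    mask: list[int] = []
    for key in IMAGE_FEATURES:
        present = key in observation
        mask.extend([1 if present else 0] * TOKENS_PER_IMAGE)
    return mask


full_obs = {key: "frame" for key in IMAGE_FEATURES}
chosen = "observation.images.wrist"
filtered_obs = {chosen: full_obs[chosen]}

full_mask = prefix_mask(full_obs)          # all cameras present, as in training
filtered_mask = prefix_mask(filtered_obs)  # what the old inference path produced

print("unmasked image tokens:", sum(full_mask), "vs", sum(filtered_mask))
```

In this sketch the filtered observation leaves half of the image slots masked, a pattern that a model trained with every camera present would not have encountered.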
@@ -310,19 +310,20 @@ def handle_vqa_query(
     else:
         report(" [info] vqa: no camera available — answering text-only")
 
-    # Ground the question on the chosen camera only — filter the
-    # observation to that one image (+ proprio state) so the VLM
-    # prefix matches the single-image ``ask_vqa_*`` training recipe.
-    vqa_obs: dict | None = None
-    if observation is not None and chosen is not None:
-        vqa_obs = {chosen: observation[chosen]}
-        if "observation.state" in observation:
-            vqa_obs["observation.state"] = observation["observation.state"]
-
+    # Feed the FULL observation (every camera + state) to the VLM. The
+    # ``ask_vqa_*`` recipes look single-camera, but the image *block* is
+    # stripped before tokenization — the actual frames reach the model
+    # via SmolVLA's ``OBS_IMAGES_*`` channels, and ``embed_prefix``
+    # consumes *all* ``config.image_features`` regardless of which
+    # camera the sub-recipe was tagged for. So training always sees
+    # every camera; filtering to one here would change the image-token
+    # count in the prefix (the dropped camera gets zero-padded with
+    # mask=0) — a prefix shape the model never saw. The chosen camera
+    # is used only to pick which frame the overlay is drawn on.
     answer = _generate_with_policy(
         policy,
         _msgs_for_vqa(question),
-        observation=vqa_obs,
+        observation=observation,
         state=state,
         label="vqa gen",
     )
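The behavioral change in this hunk can be sketched in isolation. The two functions below are hypothetical stand-ins written for illustration (old_vqa_observation mirrors the removed lines, new_vqa_observation mirrors the added pass-through); they are not part of lerobot, and the camera keys and frame values are made up.

```python
# Hypothetical stand-ins for the old and new behavior in handle_vqa_query.
STATE_KEY = "observation.state"


def old_vqa_observation(observation, chosen):
    """Pre-fix behavior: keep only the chosen camera plus proprio state."""
    if observation is None or chosen is None:
        return None
    vqa_obs = {chosen: observation[chosen]}
    if STATE_KEY in observation:
        vqa_obs[STATE_KEY] = observation[STATE_KEY]
    return vqa_obs


def new_vqa_observation(observation, chosen):
    """Post-fix behavior: pass the observation through unchanged."""
    return observation


# Made-up observation with two cameras and a state vector.
obs = {
    "observation.images.front": "front-frame",
    "observation.images.wrist": "wrist-frame",
    STATE_KEY: [0.0, 0.1],
}
chosen = "observation.images.wrist"

old_keys = sorted(old_vqa_observation(obs, chosen))
new_keys = sorted(new_vqa_observation(obs, chosen))
print("old:", old_keys)
print("new:", new_keys)
```

One side effect visible in the diff: when no camera is chosen, the old code passed None as the observation, whereas the new code forwards whatever observation it was given.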