fix(smolvla2): feed all cameras to VQA generation, not just the chosen one

handle_vqa_query filtered the observation down to the single chosen
camera before calling the VLM. But training feeds every camera: the
ask_vqa_* recipes' image blocks are stripped before tokenization and
the frames reach the model via OBS_IMAGES_*, where embed_prefix
consumes all config.image_features regardless of the per-camera recipe
tag. Filtering to one camera changed the image-token count in the
prefix (the dropped camera was zero-padded with mask=0), producing a
prefix shape the model never saw during training.

Now the full observation is passed to select_message; the chosen
camera is used only to pick which frame the bbox/point overlay is
drawn on.
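
A minimal sketch of the mismatch described above, not SmolVLA's actual
code. toy_prefix_mask is a hypothetical stand-in for the image part of
embed_prefix; the camera keys and TOKENS_PER_IMAGE are made up for
illustration, and the zero-padding behaviour is assumed to be as stated
in this message.

```python
# Illustrative only: a toy stand-in for the image part of embed_prefix.
# Camera keys and TOKENS_PER_IMAGE are assumptions, not SmolVLA values.
TOKENS_PER_IMAGE = 4
IMAGE_FEATURES = ["observation.images.front", "observation.images.wrist"]


def toy_prefix_mask(observation: dict) -> list:
    """Return the per-token attention mask for the image part of the prefix.

    Every configured camera contributes a block of tokens: mask=1 when the
    frame is present, mask=0 (zero-padded) when it is missing.
    """
    mask = []
    for key in IMAGE_FEATURES:
        present = key in observation
        mask.extend([1 if present else 0] * TOKENS_PER_IMAGE)
    return mask


full_obs = {key: object() for key in IMAGE_FEATURES}
filtered_obs = {IMAGE_FEATURES[0]: full_obs[IMAGE_FEATURES[0]]}

train_like = toy_prefix_mask(full_obs)    # every camera attended, as in training
old_vqa = toy_prefix_mask(filtered_obs)   # second camera padded out with mask=0
```

In this toy setup the two masks have the same length but differ in how
many image tokens are attended, which is the kind of prefix the old
filtered path produced and training never did.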

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pepijn
2026-05-18 14:46:38 +02:00
parent dc530e10fe
commit 3174e14bc0
+11 -10
@@ -310,19 +310,20 @@ def handle_vqa_query(
     else:
         report(" [info] vqa: no camera available — answering text-only")
-    # Ground the question on the chosen camera only — filter the
-    # observation to that one image (+ proprio state) so the VLM
-    # prefix matches the single-image ``ask_vqa_*`` training recipe.
-    vqa_obs: dict | None = None
-    if observation is not None and chosen is not None:
-        vqa_obs = {chosen: observation[chosen]}
-        if "observation.state" in observation:
-            vqa_obs["observation.state"] = observation["observation.state"]
+    # Feed the FULL observation (every camera + state) to the VLM. The
+    # ``ask_vqa_*`` recipes look single-camera, but the image *block* is
+    # stripped before tokenization — the actual frames reach the model
+    # via SmolVLA's ``OBS_IMAGES_*`` channels, and ``embed_prefix``
+    # consumes *all* ``config.image_features`` regardless of which
+    # camera the sub-recipe was tagged for. So training always sees
+    # every camera; filtering to one here would change the image-token
+    # count in the prefix (the dropped camera gets zero-padded with
+    # mask=0) — a prefix shape the model never saw. The chosen camera
+    # is used only to pick which frame the overlay is drawn on.
     answer = _generate_with_policy(
         policy,
         _msgs_for_vqa(question),
-        observation=vqa_obs,
+        observation=observation,
         state=state,
         label="vqa gen",
     )
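
The resulting split of responsibilities can be sketched roughly as
follows. This is a hypothetical helper, not code from this commit:
split_vqa_inputs and the example camera keys are invented for
illustration, and only the names observation and chosen come from the
diff.

```python
from typing import Any, Optional, Tuple


def split_vqa_inputs(
    observation: Optional[dict], chosen: Optional[str]
) -> Tuple[Optional[dict], Any]:
    """Return (observation for the VLM, frame to draw the overlay on).

    Hypothetical helper: the VLM always gets the unfiltered observation,
    while the chosen camera only selects the overlay frame.
    """
    overlay_frame = None
    if observation is not None and chosen is not None:
        overlay_frame = observation.get(chosen)
    return observation, overlay_frame


# Example with made-up camera keys and placeholder frames.
obs = {
    "observation.images.front": "front-frame",
    "observation.images.wrist": "wrist-frame",
    "observation.state": [0.0, 0.1],
}
vlm_obs, frame = split_vqa_inputs(obs, "observation.images.wrist")
```

Here vlm_obs is the same dict that was passed in, so every camera and
the state still reach generation, and frame is only the image the
bbox/point overlay would be drawn on.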