The trained model collapsed to spewing 40+ <loc> tokens for *every*
prompt — subtask, memory, anything — because VQA targets were supervised
to *start* with <loc>. With ~25% of all text samples beginning with a
<loc> token, the LM head learned "Assistant: → <loc>" as a strong
attractor; once one loc is emitted, autoregression chains the rest.
Flip the format so every text target (subtask, memory, speech, AND VQA)
starts with a regular word. The model still learns the <loc>
vocabulary for the spatial portion of the answer, but a <loc> token can
no longer be the first generated token after a clean prompt.
Examples:
  point : "green box <loc0162><loc0759>"
  bbox  : "cube <loc0082>…<loc0409>"
  multi : "blue <locs> ; yellow <locs>"
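A minimal sketch of the label-first layout shown in the examples above.
The helper names (loc_token, label_first_target, multi_target) are
hypothetical and are not taken from the codebase; only the output
format follows the examples.

```python
def loc_token(index: int) -> str:
    """Render one location token, e.g. 162 -> '<loc0162>'."""
    if not 0 <= index <= 1023:
        raise ValueError(f"loc index out of range: {index}")
    return f"<loc{index:04d}>"


def label_first_target(label: str, loc_indices: list[int]) -> str:
    """Build a text target that starts with the label, then the loc tokens."""
    label = label.strip()
    if not label:
        # An empty label would put <loc> back in first position.
        raise ValueError("label must be non-empty")
    return f"{label} " + "".join(loc_token(i) for i in loc_indices)


def multi_target(items: list[tuple[str, list[int]]]) -> str:
    """Join several label-first answers with ' ; ' as in the multi example."""
    return " ; ".join(label_first_target(lbl, locs) for lbl, locs in items)
```

Rejecting empty labels is the design point here: it guarantees the first
supervised token of a VQA target is a regular word.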
The runtime parser (parse_loc_answer) strips loc tokens and uses the
remainder as the label, so it's order-tolerant and works under either
format. Old loc-first checkpoints still parse cleanly at inference;
new training will use label-first.
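To illustrate why stripping loc tokens makes parsing order-tolerant,
here is a rough sketch. parse_loc_answer_sketch is a hypothetical
stand-in, not the actual parse_loc_answer, and it assumes loc tokens
always have exactly four digits.

```python
import re

# Matches one location token and captures its four-digit index.
LOC_RE = re.compile(r"<loc(\d{4})>")


def parse_loc_answer_sketch(text: str) -> tuple[str, list[int]]:
    """Return (label, loc indices) regardless of where the locs appear."""
    locs = [int(digits) for digits in LOC_RE.findall(text)]
    # Whatever remains after removing the loc tokens is treated as the label.
    label = " ".join(LOC_RE.sub(" ", text).split())
    return label, locs
```

Both "<loc0162><loc0759> green box" and "green box <loc0162><loc0759>"
yield the same label and indices under this approach.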
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
THE bug behind the <loc>-salad. PaliGemma's vocab reserves ids
[256000, 257023] for <locDDDD> detection / pointing tokens, but the
stock AutoTokenizer does NOT match them on raw text — it BPE-splits
<loc0162> into SEVEN pieces (<, loc, 0, 1, 6, 2, >). So a VQA target
like "<loc0162><loc0759> green box<eos>" tokenized to 16 pieces, not
5, and training the LM head supervised those generic BPE pieces
instead of one detection-vocab id. The piece logits got pumped up
across ~25% of supervised positions; at inference they dominated
every turn — even subtask prompts produced <loc>-salad followed by
the actual answer.
Register the 1024 <locDDDD> tokens via tokenizer.add_tokens once on
load, in every path the policy uses: PI052TextTokenizerStep (training
encode), _build_text_batch_pi052 (runtime encode), and
select_message's default tokenizer (runtime decode). Verified
empirically with the real PaliGemma tokenizer: the VQA target now
tokenizes to 5 ids, with the <loc> ids in the loc-vocab range
(256162, 256759, ...) and correct offset_mapping.
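A sketch of the registration step, assuming a Hugging Face style
tokenizer that exposes add_tokens and convert_tokens_to_ids. The
function names and the sanity check are hypothetical; the id range is
the one quoted above.

```python
LOC_BASE_ID = 256000   # first id of the reserved <locDDDD> range
NUM_LOC_TOKENS = 1024  # ids 256000..257023 inclusive


def build_loc_tokens() -> list[str]:
    """All 1024 location token strings, '<loc0000>' .. '<loc1023>'."""
    return [f"<loc{i:04d}>" for i in range(NUM_LOC_TOKENS)]


def register_loc_tokens(tokenizer) -> int:
    """Register the loc tokens so raw text matches them as single pieces.

    Intended to be called once, right after the tokenizer is loaded.
    Returns whatever count the tokenizer reports from add_tokens.
    """
    return tokenizer.add_tokens(build_loc_tokens())


def check_loc_ids(tokenizer, samples=(0, 162, 759, 1023)) -> bool:
    """Confirm sample tokens resolve to ids inside the reserved range."""
    return all(
        tokenizer.convert_tokens_to_ids(f"<loc{i:04d}>") == LOC_BASE_ID + i
        for i in samples
    )
```

Running check_loc_ids after registration in each of the three paths is
a cheap way to catch a path that still uses an unpatched tokenizer.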
This unlocks PaliGemma's actual detection prior; this cause of
<loc>-salad is removed because each <locDDDD> is now a single class on
the LM head, not a character sequence the head accidentally learns to
extend.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Confirmed empirically on the published dataset: VQA bbox/keypoint
coordinates are Qwen2.5-VL's 0–1000 normalized grounding output, NOT
pixels. Scanning 8207 samples showed x and y both spanning 0..1000
with ~30% of values exceeding the camera's pixel dimensions (which is
impossible if they were pixels).
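The scan can be approximated as below. This is a sketch of the check,
not the script that was actually run; the function name and the
(x, y) tuple input format are assumptions.

```python
def fraction_exceeding_frame(points, width: int, height: int) -> float:
    """Fraction of coordinate values that cannot be pixels in a WxH frame.

    `points` is an iterable of (x, y) pairs. Each x is compared against
    the frame width and each y against the frame height.
    """
    points = list(points)
    if not points:
        raise ValueError("no points to scan")
    exceeding = sum(x > width for x, _ in points) + sum(y > height for _, y in points)
    return exceeding / (2 * len(points))
```

A non-trivial fraction on a 640x480 camera means the values cannot all
be pixel coordinates for that frame.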
_vqa_answer_to_loc was dividing by the observation image's H/W, so
e.g. point [742, 158] on a 640x480 wrist cam clamped x to <loc1023>
(the far-right edge) instead of mapping to <loc0760> (~74% across).
Fix: divide by 1000 — the actual Qwen scale. The conversion is now
camera-resolution-independent, so _camera_image_shapes and the
image_shapes plumbing through __call__ / _encode_messages /
_messages_vqa_to_loc are dropped. Tests updated to the new signature
and the 0–1000 round-trip.
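A sketch of the corrected conversion next to the old behaviour. The
binning rule (scale to 0..1023 and round) and the y-before-x token
order are assumptions, chosen to agree with the "<loc0162><loc0759>"
example; the real _vqa_answer_to_loc may differ in detail.

```python
QWEN_SCALE = 1000.0  # Qwen grounding coordinates are normalized to 0..1000
NUM_BINS = 1024      # <loc0000> .. <loc1023>


def coord_to_loc_index(value: float, scale: float = QWEN_SCALE) -> int:
    """Map one coordinate to a loc bin, clamping to the valid range."""
    frac = min(max(value / scale, 0.0), 1.0)
    return round(frac * (NUM_BINS - 1))


def point_to_loc(x: float, y: float) -> str:
    """Encode a point as two loc tokens, y first (assumed order)."""
    return f"<loc{coord_to_loc_index(y):04d}><loc{coord_to_loc_index(x):04d}>"


def old_x_index(x: float, image_width: int) -> int:
    """The buggy path: dividing by the image width instead of 1000."""
    return coord_to_loc_index(x, scale=float(image_width))
```

Under this rule point_to_loc(742, 158) gives "<loc0162><loc0759>",
while old_x_index(742, 640) clamps to 1023. Scaling by 1024 instead of
1023 would put x in bin 760, so the exact bin depends on the rounding
convention.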
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate
JSON, which fights PaliGemma's detection prior and leaks <loc>-token
salad at inference. Convert them to PaliGemma's native <locNNNN>
vocabulary instead so the LM head reuses that prior.
Training side (text_processor_pi052.py): a target turn whose content
parses as a bbox/keypoint answer is rewritten to <loc> text, using the
camera frame's native (H, W) from the observation and the preceding
image block. Non-spatial answers, subtask/memory targets and SmolVLA2
keep their JSON form — the dataset stays backbone-agnostic.
Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects
<loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized
[0,1] coords with a normalized flag; draw_vqa_overlay denormalizes
against the chosen camera frame's pixel size.
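The runtime side can be sketched as follows. The function names, the
dict layout, the division by 1023 and the y-first coordinate order are
all assumptions for illustration; this is not the actual
parse_vqa_answer or draw_vqa_overlay code.

```python
import re

LOC_RE = re.compile(r"<loc(\d{4})>")


def parse_loc_geometry(text: str):
    """Detect a <loc> answer: 2 locs -> keypoint, 4 locs -> bbox.

    Coordinates are returned normalized to [0, 1]. Returns None when the
    text does not contain exactly 2 or 4 loc tokens.
    """
    coords = [int(d) / 1023 for d in LOC_RE.findall(text)]
    kinds = {2: "keypoint", 4: "bbox"}
    if len(coords) not in kinds:
        return None
    return {"kind": kinds[len(coords)], "coords": coords, "normalized": True}


def denormalize(coords, width: int, height: int):
    """Scale normalized coords to pixels, assuming y, x, y, x ordering."""
    return [
        value * (height if i % 2 == 0 else width)
        for i, value in enumerate(coords)
    ]
```

Keeping the parsed result normalized and only scaling at draw time
means the same answer can be overlaid on any camera frame.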
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>