mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-20 19:19:56 +00:00
34269a5d78
THE bug behind the <loc>-salad.

PaliGemma's vocab reserves ids [256000, 257023] for <locDDDD> detection / pointing tokens, but the stock AutoTokenizer does NOT match them on raw text: it BPE-splits <loc0162> into SEVEN pieces (<, loc, 0, 1, 6, 2, >). So a VQA target like "<loc0162><loc0759> green box<eos>" tokenized to 16 pieces, not 5, and LM-head training supervised those generic BPE pieces instead of one detection-vocab id per <locDDDD>. The piece logits got pumped up across ~25% of supervised positions; at inference they dominated every turn, so even subtask prompts produced <loc>-salad followed by the actual answer.

Fix: register the 1024 <locDDDD> tokens via tokenizer.add_tokens once on load, in every path the policy uses: PI052TextTokenizerStep (training encode), _build_text_batch_pi052 (runtime encode), and select_message's default tokenizer (runtime decode).

Verified empirically with the real PaliGemma tokenizer: the VQA target now tokenizes to 5 ids matching the loc-vocab range (256162, 256759, ...) with correct offset_mapping. This unlocks PaliGemma's actual detection prior; <loc>-salad cannot recur because each <locDDDD> is a single class on the LM head, not a character sequence the head accidentally learns to extend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
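
For reference, a minimal sketch of the registration step described above. It is not the code from this commit: the helper names (loc_token_strings, expected_loc_id, register_loc_tokens) are hypothetical, and the id arithmetic only assumes what the message states, namely that <loc0000>..<loc1023> occupy ids 256000..257023 in order. The helper accepts any object exposing add_tokens(list_of_str), which Hugging Face tokenizers do.

```python
import re

# Values taken from the commit message: 1024 tokens at ids [256000, 257023].
LOC_TOKEN_COUNT = 1024
LOC_ID_START = 256000

_LOC_RE = re.compile(r"<loc(\d{4})>")


def loc_token_strings(count: int = LOC_TOKEN_COUNT) -> list[str]:
    """Return ["<loc0000>", "<loc0001>", ..., "<loc1023>"]."""
    return [f"<loc{i:04d}>" for i in range(count)]


def expected_loc_id(token: str) -> int:
    """Map "<locDDDD>" to the vocab id it should encode to.

    Assumes the contiguous, in-order layout implied by the message
    (e.g. <loc0162> -> 256162).
    """
    match = _LOC_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a <locDDDD> token: {token!r}")
    index = int(match.group(1))
    if index >= LOC_TOKEN_COUNT:
        raise ValueError(f"loc index out of range: {index}")
    return LOC_ID_START + index


def register_loc_tokens(tokenizer) -> int:
    """Register every <locDDDD> string so raw text matches it as one token.

    Hypothetical helper: call it once, right after loading the tokenizer,
    in each code path that encodes or decodes text.
    """
    return tokenizer.add_tokens(loc_token_strings())


# Usage sketch (not executed here; the checkpoint name is an assumption):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
#   register_loc_tokens(tok)
#   ids = tok("<loc0162><loc0759> green box", add_special_tokens=False)["input_ids"]
#   assert ids[:2] == [expected_loc_id("<loc0162>"), expected_loc_id("<loc0759>")]
```

Checking the encoded ids against expected_loc_id after loading is a cheap guard against a code path that forgot to register the tokens.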