lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-13 13:01:58 +00:00

Files

Pepijn c026aed8f8 feat(pi052): train VQA spatial answers in PaliGemma <loc> format

Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate
JSON, which fights PaliGemma's detection prior and leaks garbled
<loc> tokens at inference. Convert them to PaliGemma's native <locNNNN>
vocabulary instead so the LM head reuses that prior.

Training side (text_processor_pi052.py): a target turn whose content
parses as a bbox/keypoint answer is rewritten to <loc> text, using the
camera frame's native (H, W) from the observation and the preceding
image block. Non-spatial answers, subtask/memory targets and SmolVLA2
keep their JSON form — the dataset stays backbone-agnostic.
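
A minimal sketch of the pixel-to-<loc> rewrite described above, not the code in text_processor_pi052.py. The helper names (pixel_to_loc, bbox_to_loc, keypoint_to_loc) are hypothetical. It assumes 1024 bins and PaliGemma's (y_min, x_min, y_max, x_max) token order for boxes; the (y, x) order for keypoints and the round-to-nearest quantization are also assumptions.

```python
def pixel_to_loc(value: float, size: int, bins: int = 1024) -> str:
    """Quantize one pixel coordinate into a <locNNNN> token (hypothetical helper)."""
    # Assumption: scale to [0, bins - 1] and round to the nearest bin.
    idx = int(round(value / size * (bins - 1)))
    idx = max(0, min(bins - 1, idx))  # clamp coordinates that fall outside the frame
    return f"<loc{idx:04d}>"


def bbox_to_loc(x_min: float, y_min: float, x_max: float, y_max: float,
                height: int, width: int) -> str:
    """Render a pixel bbox as four <loc> tokens in y_min, x_min, y_max, x_max order."""
    return "".join([
        pixel_to_loc(y_min, height),
        pixel_to_loc(x_min, width),
        pixel_to_loc(y_max, height),
        pixel_to_loc(x_max, width),
    ])


def keypoint_to_loc(x: float, y: float, height: int, width: int) -> str:
    """Render a pixel keypoint as two <loc> tokens (assumed y, x order)."""
    return pixel_to_loc(y, height) + pixel_to_loc(x, width)


# A box covering a full 640x480 frame maps to the first and last bins.
full_frame = bbox_to_loc(0, 0, 640, 480, height=480, width=640)
```

Because the coordinates are divided by the frame's own (H, W) before quantization, the same target text is valid at any camera resolution.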

Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects
<loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized
[0,1] coords with a normalized flag; draw_vqa_overlay denormalizes
against the chosen camera frame's pixel size.
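
The runtime behaviour described above could look roughly like the sketch below. It is not the actual parse_vqa_answer or draw_vqa_overlay: the function names, the returned dictionary layout, the 1024-bin assumption and the (y, x) token order are all illustrative assumptions.

```python
import re

LOC_RE = re.compile(r"<loc(\d{4})>")


def parse_loc_answer(text: str, bins: int = 1024):
    """Parse <locNNNN> tokens into normalized [0, 1] coords (hypothetical helper).

    Returns None when the text is not a 2-token keypoint or 4-token bbox answer.
    """
    values = [int(m) / (bins - 1) for m in LOC_RE.findall(text)]
    if len(values) == 2:
        y, x = values  # assumed token order: y, x
        return {"kind": "keypoint", "coords": (x, y), "normalized": True}
    if len(values) == 4:
        y0, x0, y1, x1 = values  # assumed token order: y_min, x_min, y_max, x_max
        return {"kind": "bbox", "coords": (x0, y0, x1, y1), "normalized": True}
    return None


def denormalize(coords, height: int, width: int):
    """Scale alternating (x, y, ...) normalized coords to pixels for one frame."""
    return tuple(c * (width if i % 2 == 0 else height) for i, c in enumerate(coords))


answer = parse_loc_answer("<loc0000><loc0000><loc1023><loc1023> cup")
pixels = denormalize(answer["coords"], height=480, width=640)
```

Keeping the parsed coordinates normalized and deferring the pixel scaling to draw time means the overlay can target whichever camera frame is selected, regardless of its resolution.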

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 20:23:46 +02:00

test_pi052_attention_masking.py

fix(pi052): avoid dense CE over padded tokens

2026-05-18 18:40:34 +00:00

test_pi052_fast_action_loss.py

fix(pi052): avoid dense CE over padded tokens

2026-05-18 18:40:34 +00:00

test_pi052_text_processor.py

fix(pi052): supervise an EOS token at the end of each text target

2026-05-19 17:22:22 +02:00

test_pi052_vqa_loc.py

feat(pi052): train VQA spatial answers in PaliGemma <loc> format

2026-05-19 20:23:46 +02:00