lerobot

admin/lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-05-26 14:09:47 +00:00

Commit Graph

Author

SHA1

Message

Date

Pepijn

c026aed8f8

feat(pi052): train VQA spatial answers in PaliGemma <loc> format

Spatial VQA answers (bbox / keypoint) were trained as pixel-coordinate
JSON, which conflicts with PaliGemma's detection prior and causes stray
<loc> tokens to leak into answers at inference. Convert them to
PaliGemma's native <locNNNN> vocabulary instead so the LM head reuses
that prior.

Training side (text_processor_pi052.py): a target turn whose content
parses as a bbox/keypoint answer is rewritten to <loc> text, using the
camera frame's native (H, W) from the observation and the preceding
image block. Non-spatial answers, subtask/memory targets and SmolVLA2
keep their JSON form, so the dataset stays backbone-agnostic.
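
A minimal sketch of the training-side rewrite described above, not the code
in text_processor_pi052.py. The function names, the JSON schema
({"bbox": [x_min, y_min, x_max, y_max]} / {"keypoint": [x, y]} in pixels),
the rounding rule and the (y, x) keypoint order are all assumptions made for
illustration. The 1024-bin vocabulary and the y_min, x_min, y_max, x_max
box order follow PaliGemma's detection convention.

```python
import json
from typing import Optional

NUM_BINS = 1024  # PaliGemma's location vocabulary spans <loc0000>..<loc1023>


def _loc(value: float, size: int) -> str:
    """Quantize one pixel coordinate into a <locNNNN> token (assumed rounding)."""
    bin_index = int(value / size * NUM_BINS)
    bin_index = max(0, min(NUM_BINS - 1, bin_index))  # clamp to the valid bins
    return f"<loc{bin_index:04d}>"


def spatial_answer_to_loc(content: str, height: int, width: int) -> Optional[str]:
    """Return <loc> text for a bbox/keypoint JSON answer, else None.

    The JSON schema handled here is hypothetical; the real dataset may differ.
    """
    try:
        answer = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return None  # not JSON: the caller leaves the target untouched
    if not isinstance(answer, dict):
        return None
    if "bbox" in answer and len(answer["bbox"]) == 4:
        x0, y0, x1, y1 = answer["bbox"]
        # PaliGemma detection order: y_min, x_min, y_max, x_max
        return (_loc(y0, height) + _loc(x0, width)
                + _loc(y1, height) + _loc(x1, width))
    if "keypoint" in answer and len(answer["keypoint"]) == 2:
        x, y = answer["keypoint"]
        return _loc(y, height) + _loc(x, width)  # (y, x) order is an assumption
    return None  # non-spatial answer: keep its JSON form
```

Returning None for anything that is not a bbox or keypoint lets the caller
keep non-spatial targets unchanged, which matches the stated goal of leaving
the dataset itself in JSON.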

Runtime side (smolvla2/inference/vqa.py): parse_vqa_answer detects
<loc> answers (2 locs -> keypoint, 4 -> bbox), returning normalized
[0,1] coords with a normalized flag; draw_vqa_overlay denormalizes
against the chosen camera frame's pixel size.
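
The runtime side can be sketched in the same spirit. This is an illustrative
outline only: parse_loc_answer and denormalize are hypothetical names, not
the parse_vqa_answer / draw_vqa_overlay functions in smolvla2/inference/vqa.py,
and the returned dict layout, the division by 1024 and the (y, x) keypoint
order are assumptions.

```python
import re
from typing import Optional

LOC_RE = re.compile(r"<loc(\d{4})>")
NUM_BINS = 1024  # <loc0000>..<loc1023>


def parse_loc_answer(text: str) -> Optional[dict]:
    """Detect a <loc> answer: 2 locs -> keypoint, 4 locs -> bbox."""
    locs = [int(m) / NUM_BINS for m in LOC_RE.findall(text)]
    if len(locs) == 2:
        y, x = locs  # assumed (y, x) order
        return {"type": "keypoint", "xy": (x, y), "normalized": True}
    if len(locs) == 4:
        y0, x0, y1, x1 = locs  # PaliGemma order: y_min, x_min, y_max, x_max
        return {"type": "bbox", "xyxy": (x0, y0, x1, y1), "normalized": True}
    return None  # any other count is not treated as a spatial answer


def denormalize(parsed: dict, height: int, width: int) -> tuple:
    """Scale normalized [0, 1] coords to pixels of the chosen camera frame."""
    key = "xy" if parsed["type"] == "keypoint" else "xyxy"
    coords = parsed[key]
    scale = (width, height) * (len(coords) // 2)  # x pairs with width, y with height
    return tuple(round(c * s) for c, s in zip(coords, scale))
```

Keeping the parsed result normalized until draw time means the same answer
can be overlaid on a frame of any resolution.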

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 20:23:46 +02:00

Pepijn

26cb38a7d0

feat(smolvla2): startup task picker, /vlm mode toggle, interactive VQA overlay

Three additions to the SmolVLA2 interactive runtime:

1. Startup task picker — when no --task is given, the runtime lists the
   dataset's task strings as a numbered menu (plus a custom-task option)
   instead of silently waiting for the first stdin line.

2. Mode toggle — /action and /vlm slash commands flip a persistent run
   mode. /vlm pauses the whole action loop (HighLevelSubtaskFwd,
   LowLevelForward and DispatchAction gate on state["mode"]) and clears
   the action queue so the robot holds position; /action resumes it.
   The mode is shown in the state panel.

3. Interactive VQA — in /vlm mode a typed line is a VQA question. The
   new inference/vqa.py module asks which camera to ground on, runs the
   VLM on that single camera, and when the answer is a bbox/keypoint it
   draws the overlay, saves a PNG to ./vqa_overlays/ and auto-opens it.
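
The first two additions above can be sketched as follows. This is a
simplified illustration under stated assumptions, not the SmolVLA2 runtime
itself: pick_task, handle_line and dispatch_action are hypothetical helpers,
the state dict is assumed to hold "mode" and an "action_queue" deque, and
the return labels are invented for the example.

```python
from collections import deque
from typing import Callable, Sequence


def pick_task(tasks: Sequence[str], read_line: Callable[[str], str] = input) -> str:
    """Show a numbered menu of dataset tasks plus a custom-task option."""
    for i, task in enumerate(tasks, start=1):
        print(f"{i}. {task}")
    custom_index = len(tasks) + 1
    print(f"{custom_index}. <custom task>")
    while True:
        choice = read_line("Select a task: ").strip()
        if choice.isdecimal() and 1 <= int(choice) <= len(tasks):
            return tasks[int(choice) - 1]
        if choice == str(custom_index):
            return read_line("Enter task: ").strip()
        print("Invalid choice, try again.")


def handle_line(line: str, state: dict) -> str:
    """Route one typed line: slash commands flip the persistent mode."""
    command = line.strip()
    if command == "/vlm":
        state["mode"] = "vlm"
        state["action_queue"].clear()  # drop pending actions so the robot holds position
        return "mode"
    if command == "/action":
        state["mode"] = "action"
        return "mode"
    # In VLM mode a plain line is a VQA question; otherwise it is ordinary input.
    return "vqa_question" if state["mode"] == "vlm" else "user_input"


def dispatch_action(state: dict):
    """One action-loop step that gates on state["mode"]."""
    if state["mode"] != "action" or not state["action_queue"]:
        return None  # paused in /vlm mode, or nothing queued
    return state["action_queue"].popleft()
```

Passing read_line as a parameter keeps the menu testable without a real
terminal; in interactive use the default input is enough.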

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 11:20:57 +02:00

2 Commits