From 8615f3f613b8c54900a100daa4880c0dd0b88d09 Mon Sep 17 00:00:00 2001
From: pepijn
Date: Tue, 26 May 2026 08:31:37 +0000
Subject: [PATCH] annotate(vqa): tighten bbox + keypoint quality bar
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Low-confidence VLM detections were producing many overlapping, loose
boxes per frame (oven + toaster oven + counter + drawer + ...) and
coarse keypoints, hurting downstream policy grounding. Two surgical
fixes:

- module_3_vqa prompt: cap bbox at most 3 high-confidence detections
  (prefer 1 tight box), require specific labels and ≤10% padding,
  allow empty detections list when nothing meets the bar; keypoint
  must be a single pixel-precise feature (handle / button / gripper
  tip) rather than a coarse "somewhere on object" point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are
  coordinate-regression tasks where sampling noise directly degrades
  localization; question phrasing still varies enough at 0.2.

No new config knobs — the count cap lives in the prompt since
"top-N by confidence" is best picked by the VLM itself. Validator
already accepts empty detections.

Co-authored-by: Cursor
---
 examples/annotations/run_hf_job.py            |  6 ++++-
 .../prompts/module_3_vqa.txt                  | 27 ++++++++++++++++++-
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/examples/annotations/run_hf_job.py b/examples/annotations/run_hf_job.py
index be48185f8..6cb253e96 100644
--- a/examples/annotations/run_hf_job.py
+++ b/examples/annotations/run_hf_job.py
@@ -55,7 +55,11 @@ CMD = (
     "--vlm.serve_ready_timeout_s=1800 "
     "--vlm.client_concurrency=256 "
     "--vlm.max_new_tokens=512 "
-    "--vlm.temperature=0.7 "
+    # Low temperature for VQA: bbox + keypoint are coordinate-regression
+    # tasks where sampling noise directly degrades localization
+    # (overlapping boxes, drifted points). 0.2 keeps the model decisive
+    # while still letting question/label phrasing vary across frames.
+    "--vlm.temperature=0.2 "
     "--executor.episode_parallelism=64 "
     "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
     # Whole-scene agentview is the right choice for subtask reasoning +
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt b/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
index 23590b381..c65d2c583 100644
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
@@ -5,15 +5,40 @@ pixel coordinates, keypoints, counts, attributes, and spatial relations.
 
 The frame shows a robot working on: "{episode_task}".
 
+QUALITY BAR — read before answering:
+
+- Only label objects you are highly confident about. If you are not
+  sure what an object is, do NOT include it. A short, certain answer
+  beats a long, speculative one.
+- For coordinate-grounded answers (bbox, keypoint) only emit a label
+  when you can localize the object *tightly and precisely*. If the
+  object is occluded, ambiguous, off-frame, or you can't pin its
+  extent, return an empty detections list / pick a different object
+  rather than guessing.
+- Prefer task-relevant objects (the thing the robot is manipulating
+  or interacting with) over background clutter.
+
 Question types and the EXACT answer JSON shape required for each:
 
 bbox     => {{"detections": [{{"label": "", "bbox_format": "xyxy", "bbox": [x1, y1, x2, y2]}}, ...]}}
-            bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
+            Pixel coordinates (x_min, y_min, x_max, y_max). Emit
+            AT MOST 3 detections, and *only* the highest-confidence
+            ones — 1 tight, certain detection is preferred over 3
+            loose ones. Each box must be tight (no >10% padding
+            around the object) and the label must be specific
+            ("red mug" not "object"). Return an empty list if no
+            object meets the bar.
             ECoT example: "a white cup [124, 25, 176, 113]".
 keypoint => {{"label": "", "point_format": "xy", "point": [x, y]}}
+            Pick ONE high-confidence, precisely-localizable point
+            (e.g. a graspable handle, a button center, the gripper
+            tip). The point must land within a few pixels of the
+            feature. Do not emit a coarse "somewhere on the object"
+            point — pick a different question type if no such
+            point exists in this frame.
 count    => {{"label": "", "count": , "note": ""}}