From 8615f3f613b8c54900a100daa4880c0dd0b88d09 Mon Sep 17 00:00:00 2001
From: pepijn <pepijn@huggingface.co>
Date: Tue, 26 May 2026 08:31:37 +0000
Subject: [PATCH] annotate(vqa): tighten bbox + keypoint quality bar
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Low-confidence VLM detections were producing many overlapping, loose
boxes per frame (oven + toaster oven + counter + drawer + ...) and
coarse keypoints, hurting downstream policy grounding. Two surgical
fixes:

- module_3_vqa prompt: cap bbox output at 3 high-confidence detections
  (prefer 1 tight box), require specific labels and ≤10% padding,
  allow an empty detections list when nothing meets the bar; a keypoint
  must be a single pixel-precise feature (handle / button / gripper
  tip) rather than a coarse "somewhere on object" point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are
  coordinate-regression tasks where sampling noise directly degrades
  localization; question phrasing still varies enough at 0.2.

No new config knobs: the count cap lives in the prompt, since "top-N
by confidence" is best picked by the VLM itself. The validator already
accepts an empty detections list.
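
For spot-checking outputs against the new bar, a minimal sketch is
shown here. It is not part of this patch and is not the pipeline's
validator: check_bbox_answer, MAX_DETECTIONS and GENERIC_LABELS are
hypothetical names, and the generic-label set is an assumption. It
covers only the mechanically checkable rules (count cap, xyxy shape,
non-degenerate box, non-generic label); the ≤10% padding rule needs
ground truth and is not checked.

```python
MAX_DETECTIONS = 3  # mirrors the "AT MOST 3" cap stated in the prompt
GENERIC_LABELS = {"object", "thing", "item"}  # assumed examples of vague labels


def check_bbox_answer(answer: dict) -> list[str]:
    """Return quality-bar violations for a bbox answer; [] means it passes."""
    detections = answer.get("detections")
    if not isinstance(detections, list):
        return ["'detections' must be a list"]

    problems = []
    if len(detections) > MAX_DETECTIONS:
        problems.append(f"{len(detections)} detections, cap is {MAX_DETECTIONS}")

    for i, det in enumerate(detections):
        label = str(det.get("label", "")).strip().lower()
        if not label or label in GENERIC_LABELS:
            problems.append(f"detection {i}: label is missing or generic")
        if det.get("bbox_format") != "xyxy":
            problems.append(f"detection {i}: bbox_format must be 'xyxy'")
        bbox = det.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4):
            problems.append(f"detection {i}: bbox must be [x1, y1, x2, y2]")
            continue
        x1, y1, x2, y2 = bbox
        if not (x1 < x2 and y1 < y2):
            problems.append(f"detection {i}: box is empty or inverted")
    return problems
```

An empty detections list passes, matching the prompt's "return an
empty list if no object meets the bar".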

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 examples/annotations/run_hf_job.py            |  6 ++++-
 .../prompts/module_3_vqa.txt                  | 27 ++++++++++++++++++-
 2 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/examples/annotations/run_hf_job.py b/examples/annotations/run_hf_job.py
index be48185f8..6cb253e96 100644
--- a/examples/annotations/run_hf_job.py
+++ b/examples/annotations/run_hf_job.py
@@ -55,7 +55,11 @@ CMD = (
     "--vlm.serve_ready_timeout_s=1800 "
     "--vlm.client_concurrency=256 "
     "--vlm.max_new_tokens=512 "
-    "--vlm.temperature=0.7 "
+    # Low temperature for VQA: bbox + keypoint are coordinate-regression
+    # tasks where sampling noise directly degrades localization
+    # (overlapping boxes, drifted points). 0.2 keeps the model decisive
+    # while still letting question/label phrasing vary across frames.
+    "--vlm.temperature=0.2 "
     "--executor.episode_parallelism=64 "
     "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
     # Whole-scene agentview is the right choice for subtask reasoning +
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt b/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
index 23590b381..c65d2c583 100644
--- a/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/module_3_vqa.txt
@@ -5,15 +5,40 @@ pixel coordinates, keypoints, counts, attributes, and spatial relations.
 
 The frame shows a robot working on: "{episode_task}".
 
+QUALITY BAR — read before answering:
+
+- Only label objects you are highly confident about. If you are not
+  sure what an object is, do NOT include it. A short, certain answer
+  beats a long, speculative one.
+- For coordinate-grounded answers (bbox, keypoint) only emit a label
+  when you can localize the object *tightly and precisely*. If the
+  object is occluded, ambiguous, off-frame, or you can't pin its
+  extent, return an empty detections list / pick a different object
+  rather than guessing.
+- Prefer task-relevant objects (the thing the robot is manipulating
+  or interacting with) over background clutter.
+
 Question types and the EXACT answer JSON shape required for each:
 
   bbox       => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
                                     "bbox": [x1, y1, x2, y2]}}, ...]}}
-                bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
+                Pixel coordinates (x_min, y_min, x_max, y_max). Emit
+                AT MOST 3 detections, and *only* the highest-confidence
+                ones — 1 tight, certain detection is preferred over 3
+                loose ones. Each box must be tight (no >10% padding
+                around the object) and the label must be specific
+                ("red mug" not "object"). Return an empty list if no
+                object meets the bar.
                 ECoT example: "a white cup [124, 25, 176, 113]".
 
   keypoint   => {{"label": "<point>", "point_format": "xy",
                   "point": [x, y]}}
+                Pick ONE high-confidence, precisely-localizable point
+                (e.g. a graspable handle, a button center, the gripper
+                tip). The point must land within a few pixels of the
+                feature. Do not emit a coarse "somewhere on the object"
+                point — pick a different question type if no such
+                point exists in this frame.
 
   count      => {{"label": "<obj>", "count": <int>,
                   "note": "<optional short note>"}}