annotate(vqa): tighten bbox + keypoint quality bar

Low-confidence VLM detections were producing many overlapping, loose
boxes per frame (oven + toaster oven + counter + drawer + ...) and
coarse keypoints, hurting downstream policy grounding. Two surgical
fixes:

- module_3_vqa prompt: cap bbox answers at 3 high-confidence
  detections (prefer 1 tight box), require specific labels and ≤10%
  padding, and allow an empty detections list when nothing meets the
  bar; a keypoint must be a single pixel-precise feature (handle /
  button / gripper tip) rather than a coarse "somewhere on object"
  point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are
  coordinate-regression tasks where sampling noise directly degrades
  localization; question phrasing still varies enough at 0.2.

No new config knobs: the count cap lives in the prompt, since
"top-N by confidence" is best picked by the VLM itself. The validator
already accepts empty detections.

Co-authored-by: Cursor <cursoragent@cursor.com>
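
For reference, the bbox rules described above could be spot-checked on model output with something like the sketch below. This is an illustration only, not the repository's validator: `check_bbox_answer`, `MAX_DETECTIONS`, and `GENERIC_LABELS` are hypothetical names, and the generic-label list is an assumption. The ≤10% padding rule is not checked because it would need ground-truth object extents.

```python
from typing import Any, Dict, List

MAX_DETECTIONS = 3  # mirrors the "AT MOST 3" cap stated in the prompt
GENERIC_LABELS = {"object", "thing", "item"}  # assumption: illustrative only


def check_bbox_answer(answer: Dict[str, Any], width: int, height: int) -> List[str]:
    """Return a list of problems; an empty list means the answer passes."""
    detections = answer.get("detections")
    if not isinstance(detections, list):
        return ["'detections' must be a list"]

    problems: List[str] = []
    if len(detections) > MAX_DETECTIONS:
        problems.append(
            f"{len(detections)} detections exceeds the cap of {MAX_DETECTIONS}"
        )

    for i, det in enumerate(detections):
        if not isinstance(det, dict):
            problems.append(f"detection {i}: not a JSON object")
            continue
        label = str(det.get("label", "")).strip().lower()
        if not label or label in GENERIC_LABELS:
            problems.append(f"detection {i}: label is missing or generic")
        if det.get("bbox_format") != "xyxy":
            problems.append(f"detection {i}: bbox_format must be 'xyxy'")
        bbox = det.get("bbox")
        if not (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, (int, float)) for v in bbox)
        ):
            problems.append(f"detection {i}: bbox must be four numbers")
            continue
        x1, y1, x2, y2 = bbox
        # xyxy boxes must be non-degenerate and lie inside the frame.
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            problems.append(f"detection {i}: bbox is degenerate or off-frame")
    return problems
```

An empty detections list passes this check, consistent with the note that empty detections are already accepted.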
2026-07-15 14:02:14 +00:00 · 2026-05-26 08:31:37 +00:00
parent 2686450d68
commit 8615f3f613
2 changed files with 31 additions and 2 deletions
@@ -55,7 +55,11 @@ CMD = (
    "--vlm.serve_ready_timeout_s=1800 "
    "--vlm.client_concurrency=256 "
    "--vlm.max_new_tokens=512 "
-    "--vlm.temperature=0.7 "
+    # Low temperature for VQA: bbox + keypoint are coordinate-regression
+    # tasks where sampling noise directly degrades localization
+    # (overlapping boxes, drifted points). 0.2 keeps the model decisive
+    # while still letting question/label phrasing vary across frames.
+    "--vlm.temperature=0.2 "
    "--executor.episode_parallelism=64 "
    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
    # Whole-scene agentview is the right choice for subtask reasoning +
@@ -5,15 +5,40 @@ pixel coordinates, keypoints, counts, attributes, and spatial relations.

 The frame shows a robot working on: "{episode_task}".

+QUALITY BAR — read before answering:
+
+- Only label objects you are highly confident about. If you are not
+  sure what an object is, do NOT include it. A short, certain answer
+  beats a long, speculative one.
+- For coordinate-grounded answers (bbox, keypoint) only emit a label
+  when you can localize the object *tightly and precisely*. If the
+  object is occluded, ambiguous, off-frame, or you can't pin its
+  extent, return an empty detections list / pick a different object
+  rather than guessing.
+- Prefer task-relevant objects (the thing the robot is manipulating
+  or interacting with) over background clutter.
+
 Question types and the EXACT answer JSON shape required for each:

  bbox       => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
                                    "bbox": [x1, y1, x2, y2]}}, ...]}}
-                bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
+                Pixel coordinates (x_min, y_min, x_max, y_max). Emit
+                AT MOST 3 detections, and *only* the highest-confidence
+                ones — 1 tight, certain detection is preferred over 3
+                loose ones. Each box must be tight (no >10% padding
+                around the object) and the label must be specific
+                ("red mug" not "object"). Return an empty list if no
+                object meets the bar.
                ECoT example: "a white cup [124, 25, 176, 113]".

  keypoint   => {{"label": "<point>", "point_format": "xy",
                  "point": [x, y]}}
+                Pick ONE high-confidence, precisely-localizable point
+                (e.g. a graspable handle, a button center, the gripper
+                tip). The point must land within a few pixels of the
+                feature. Do not emit a coarse "somewhere on the object"
+                point — pick a different question type if no such
+                point exists in this frame.

  count      => {{"label": "<obj>", "count": <int>,
                  "note": "<optional short note>"}}