annotate(vqa): tighten bbox + keypoint quality bar

Low-confidence VLM detections were producing many overlapping, loose boxes per frame (oven + toaster oven + counter + drawer + ...) and coarse keypoints, hurting downstream policy grounding. Two surgical fixes:

- module_3_vqa prompt: cap bbox output at 3 high-confidence detections (prefer 1 tight box), require specific labels and ≤10% padding, and allow an empty detections list when nothing meets the bar; the keypoint must be a single pixel-precise feature (handle / button / gripper tip) rather than a coarse "somewhere on object" point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are coordinate-regression tasks where sampling noise directly degrades localization; question phrasing still varies enough at 0.2.

No new config knobs: the count cap lives in the prompt since "top-N by confidence" is best picked by the VLM itself. Validator already accepts empty detections.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-15 14:02:14 +00:00 · 2026-05-26 08:31:37 +00:00
parent 2686450d68
commit 8615f3f613
2 changed files with 31 additions and 2 deletions
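The quality bar from the commit message is enforced in the prompt, not in code. As a rough illustration only, the sketch below shows how the same bar could be spot-checked offline on parsed detections. Everything in it is an assumption made for the example: the `quality_issues` helper, the `[x_min, y_min, x_max, y_max]` box layout, the `"bbox"` key, and the 0.5 IoU threshold do not come from the repository. The ≤10% padding rule is not checked because it would need ground-truth object extents.

```python
from typing import Dict, List, Sequence

MAX_BOXES = 3  # count cap stated in the commit message


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection-over-union of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def quality_issues(detections: List[Dict], max_iou: float = 0.5) -> List[str]:
    """Return human-readable issues; an empty detections list is valid."""
    issues = []
    if len(detections) > MAX_BOXES:
        issues.append(f"{len(detections)} boxes exceeds cap of {MAX_BOXES}")
    for i, det in enumerate(detections):
        x0, y0, x1, y1 = det["bbox"]
        if not (x0 < x1 and y0 < y1):
            issues.append(f"box {i} is degenerate")
    # Flag heavily overlapping pairs (the oven + toaster oven + counter case).
    for i in range(len(detections)):
        for j in range(i + 1, len(detections)):
            if iou(detections[i]["bbox"], detections[j]["bbox"]) > max_iou:
                issues.append(f"boxes {i} and {j} overlap heavily")
    return issues
```

A check like this would only be useful for auditing a sample of annotated frames before and after the prompt change; it does not replace the in-prompt cap.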
@@ -55,7 +55,11 @@ CMD = (
    "--vlm.serve_ready_timeout_s=1800 "
    "--vlm.client_concurrency=256 "
    "--vlm.max_new_tokens=512 "
-    "--vlm.temperature=0.7 "
+    # Low temperature for VQA: bbox + keypoint are coordinate-regression
+    # tasks where sampling noise directly degrades localization
+    # (overlapping boxes, drifted points). 0.2 keeps the model decisive
+    # while still letting question/label phrasing vary across frames.
+    "--vlm.temperature=0.2 "
    "--executor.episode_parallelism=64 "
    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
    # Whole-scene agentview is the right choice for subtask reasoning +