annotate(vqa): tighten bbox + keypoint quality bar

Low-confidence VLM detections were producing many overlapping, loose
boxes per frame (oven + toaster oven + counter + drawer + ...) and
coarse keypoints, hurting downstream policy grounding. Two surgical
fixes:

- module_3_vqa prompt: cap bbox output at 3 high-confidence detections
  (prefer 1 tight box), require specific labels and ≤10% padding,
  allow an empty detections list when nothing meets the bar; the
  keypoint must be a single pixel-precise feature (handle / button /
  gripper tip) rather than a coarse "somewhere on object" point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are
  coordinate-regression tasks where sampling noise directly degrades
  localization; question phrasing still varies enough at 0.2.
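
A minimal sketch of why a lower temperature makes coordinate tokens more
deterministic. The logits are made up for illustration and the helper
name is hypothetical; this is not code from the annotation pipeline.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three competing coordinate tokens.
logits = [2.0, 1.0, 0.0]
p_old = softmax_with_temperature(logits, 0.7)
p_new = softmax_with_temperature(logits, 0.2)

print(f"T=0.7: top-token probability {p_old[0]:.3f}")
print(f"T=0.2: top-token probability {p_new[0]:.3f}")
```

With these illustrative logits, the top token takes a clearly larger
share of the probability mass at 0.2 than at 0.7, so sampled
coordinates should drift less often onto a neighbouring token.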

No new config knobs — the count cap lives in the prompt since "top-N
by confidence" is best picked by the VLM itself. Validator already
accepts empty detections.
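
For spot-checking annotated frames against the new bar, a hypothetical
audit helper could look like the sketch below. It is not part of this
commit and is not the existing validator; the detection schema (a dict
with "label" and "bbox" as [x1, y1, x2, y2]) is an assumption.

```python
MAX_DETECTIONS = 3  # mirrors the cap stated in the prompt (assumed here)


def meets_quality_bar(detections):
    """Return True if a detections list respects the count cap and
    every box is well-formed. An empty list is accepted."""
    if len(detections) > MAX_DETECTIONS:
        return False
    for det in detections:
        label = str(det.get("label", "")).strip()
        bbox = det.get("bbox")
        if not label or bbox is None or len(bbox) != 4:
            return False
        x1, y1, x2, y2 = bbox
        if x2 <= x1 or y2 <= y1:  # reject degenerate or inverted boxes
            return False
    return True
```

The ≤10% padding requirement is left out on purpose: checking it would
need a reference tight box per object, which this sketch does not have.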

Co-authored-by: Cursor <cursoragent@cursor.com>
pepijn
2026-05-26 08:31:37 +00:00
parent 2686450d68
commit 8615f3f613
2 changed files with 31 additions and 2 deletions
+5 -1
@@ -55,7 +55,11 @@ CMD = (
"--vlm.serve_ready_timeout_s=1800 "
"--vlm.client_concurrency=256 "
"--vlm.max_new_tokens=512 "
-"--vlm.temperature=0.7 "
+# Low temperature for VQA: bbox + keypoint are coordinate-regression
+# tasks where sampling noise directly degrades localization
+# (overlapping boxes, drifted points). 0.2 keeps the model decisive
+# while still letting question/label phrasing vary across frames.
+"--vlm.temperature=0.2 "
"--executor.episode_parallelism=64 "
"--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
# Whole-scene agentview is the right choice for subtask reasoning +