mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-28 15:09:51 +00:00)
annotate(vqa): tighten bbox + keypoint quality bar
Low-confidence VLM detections were producing many overlapping, loose
boxes per frame (oven + toaster oven + counter + drawer + ...) and
coarse keypoints, hurting downstream policy grounding. Two surgical
fixes:

- module_3_vqa prompt: cap bbox at most 3 high-confidence detections
  (prefer 1 tight box), require specific labels and ≤10% padding,
  allow empty detections list when nothing meets the bar; keypoint
  must be a single pixel-precise feature (handle / button / gripper
  tip) rather than a coarse "somewhere on object" point.
- run_hf_job: lower vlm.temperature 0.7 → 0.2. Bbox + keypoint are
  coordinate-regression tasks where sampling noise directly degrades
  localization; question phrasing still varies enough at 0.2.

No new config knobs — the count cap lives in the prompt since
"top-N by confidence" is best picked by the VLM itself. Validator
already accepts empty detections.

Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -55,7 +55,11 @@ CMD = (
     "--vlm.serve_ready_timeout_s=1800 "
     "--vlm.client_concurrency=256 "
     "--vlm.max_new_tokens=512 "
-    "--vlm.temperature=0.7 "
+    # Low temperature for VQA: bbox + keypoint are coordinate-regression
+    # tasks where sampling noise directly degrades localization
+    # (overlapping boxes, drifted points). 0.2 keeps the model decisive
+    # while still letting question/label phrasing vary across frames.
+    "--vlm.temperature=0.2 "
     "--executor.episode_parallelism=64 "
     "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
     # Whole-scene agentview is the right choice for subtask reasoning +
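The temperature change above can be illustrated with a small, generic sketch of temperature-scaled softmax sampling. This is not code from the repository: the function name and the three logits are hypothetical, and the sketch only shows the general mechanism by which a lower temperature concentrates probability on the top-scoring token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate coordinate tokens (illustrative only).
logits = [2.0, 1.0, 0.5]

p_old = softmax_with_temperature(logits, 0.7)  # previous setting
p_new = softmax_with_temperature(logits, 0.2)  # setting after this commit

print("top-token probability at 0.7:", round(p_old[0], 3))
print("top-token probability at 0.2:", round(p_new[0], 3))
```

With these made-up logits, the top candidate's probability rises from roughly 0.74 at temperature 0.7 to roughly 0.99 at 0.2, which is the sense in which lower temperature reduces sampling noise in emitted coordinates.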
@@ -5,15 +5,40 @@ pixel coordinates, keypoints, counts, attributes, and spatial relations.
 
 The frame shows a robot working on: "{episode_task}".
 
+QUALITY BAR — read before answering:
+
+- Only label objects you are highly confident about. If you are not
+  sure what an object is, do NOT include it. A short, certain answer
+  beats a long, speculative one.
+- For coordinate-grounded answers (bbox, keypoint) only emit a label
+  when you can localize the object *tightly and precisely*. If the
+  object is occluded, ambiguous, off-frame, or you can't pin its
+  extent, return an empty detections list / pick a different object
+  rather than guessing.
+- Prefer task-relevant objects (the thing the robot is manipulating
+  or interacting with) over background clutter.
+
 Question types and the EXACT answer JSON shape required for each:
 
 bbox => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
         "bbox": [x1, y1, x2, y2]}}, ...]}}
-        bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
+        Pixel coordinates (x_min, y_min, x_max, y_max). Emit
+        AT MOST 3 detections, and *only* the highest-confidence
+        ones — 1 tight, certain detection is preferred over 3
+        loose ones. Each box must be tight (no >10% padding
+        around the object) and the label must be specific
+        ("red mug" not "object"). Return an empty list if no
+        object meets the bar.
         ECoT example: "a white cup [124, 25, 176, 113]".
 
 keypoint => {{"label": "<point>", "point_format": "xy",
         "point": [x, y]}}
+        Pick ONE high-confidence, precisely-localizable point
+        (e.g. a graspable handle, a button center, the gripper
+        tip). The point must land within a few pixels of the
+        feature. Do not emit a coarse "somewhere on the object"
+        point — pick a different question type if no such
+        point exists in this frame.
 
 count => {{"label": "<obj>", "count": <int>,
         "note": "<optional short note>"}}
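The answer shapes in the prompt above lend themselves to a structural check. The following is a minimal sketch of such a check, not the repository's validator (whose code is not shown in this commit): the function names, the `MAX_DETECTIONS` constant, and the 640x480 frame size are all assumptions made for illustration. It accepts an empty detections list, as the commit message says the real validator does.

```python
import json

MAX_DETECTIONS = 3  # cap stated in the prompt; the constant name is hypothetical


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_bbox_answer(raw, width, height):
    """Return a list of problems in a bbox answer; an empty list means it passes."""
    answer = json.loads(raw)
    if not isinstance(answer, dict) or not isinstance(answer.get("detections"), list):
        return ["answer must be an object with a 'detections' list"]
    detections = answer["detections"]
    problems = []
    if len(detections) > MAX_DETECTIONS:
        problems.append(f"too many detections: {len(detections)}")
    for i, det in enumerate(detections):
        if not isinstance(det, dict):
            problems.append(f"detection {i}: must be an object")
            continue
        label = det.get("label")
        if not isinstance(label, str) or not label.strip():
            problems.append(f"detection {i}: missing label")
        if det.get("bbox_format") != "xyxy":
            problems.append(f"detection {i}: bbox_format must be 'xyxy'")
        box = det.get("bbox")
        if not (isinstance(box, list) and len(box) == 4 and all(map(_is_number, box))):
            problems.append(f"detection {i}: bbox must be four numbers")
            continue
        x1, y1, x2, y2 = box
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            problems.append(f"detection {i}: bbox is degenerate or outside the frame")
    return problems


def check_keypoint_answer(raw, width, height):
    """Return a list of problems in a keypoint answer; an empty list means it passes."""
    answer = json.loads(raw)
    if not isinstance(answer, dict):
        return ["answer must be an object"]
    problems = []
    label = answer.get("label")
    if not isinstance(label, str) or not label.strip():
        problems.append("missing label")
    if answer.get("point_format") != "xy":
        problems.append("point_format must be 'xy'")
    point = answer.get("point")
    if not (isinstance(point, list) and len(point) == 2 and all(map(_is_number, point))):
        problems.append("point must be two numbers")
    elif not (0 <= point[0] <= width and 0 <= point[1] <= height):
        problems.append("point is outside the frame")
    return problems


# Usage with the prompt's own example box, on an assumed 640x480 frame.
good = json.dumps({"detections": [
    {"label": "a white cup", "bbox_format": "xyxy", "bbox": [124, 25, 176, 113]}
]})
print(check_bbox_answer(good, 640, 480))
print(check_bbox_answer('{"detections": []}', 640, 480))
```

A check like this can only enforce shape, count, and bounds. Tightness (the 10% padding limit) and label specificity depend on the image content, which is presumably why the commit places those requirements in the prompt.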