mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-22 03:59:42 +00:00
tune(annotations): VQA emission anchors a single frame (K 3 -> 1)
Module 3 anchored each VQA emission tick to K=3 consecutive frames (~0.1s at 30fps). The VLM grounds the answer (bbox/keypoint coordinates especially) against the first frame's image, so copying it onto frames 2-3 smears a stale label over a moving scene.

Default K=1: a VQA pair lands on exactly its emission frame, no temporal smear. VQA frames get sparser; the WeightedEpisodeAwareSampler (vqa_target_fraction) is the knob to compensate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
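The sparsity trade-off described in the message can be checked with a little arithmetic. The sketch below is illustrative only: `vqa_frame_fraction` is a hypothetical helper, not part of lerobot, and it assumes emission ticks are evenly spaced at `vqa_emission_hz` with each tick labelling `k` frames.

```python
def vqa_frame_fraction(fps: float, vqa_emission_hz: float, k: int) -> float:
    """Approximate fraction of frames that carry a VQA pair.

    Hypothetical helper for illustration; assumes evenly spaced ticks.
    """
    frames_per_tick = fps / vqa_emission_hz  # frames between two emission ticks
    return min(1.0, k / frames_per_tick)     # k labelled frames per tick, capped at 100%


def anchored_span_seconds(fps: float, k: int) -> float:
    """Duration covered by the k frames that share one answer."""
    return k / fps


if __name__ == "__main__":
    for k in (3, 1):
        frac = vqa_frame_fraction(fps=30.0, vqa_emission_hz=1.0, k=k)
        span = anchored_span_seconds(fps=30.0, k=k)
        print(f"K={k}: {frac:.1%} of frames labelled, answer spans {span:.3f}s")
```

Under these assumptions, moving from K=3 to K=1 at 30 fps and 1 Hz cuts the labelled share of frames to a third, which is the gap the sampler's `vqa_target_fraction` is meant to offset.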
@@ -114,7 +114,14 @@ class Module3Config:
     enabled: bool = True
     vqa_emission_hz: float = 1.0
-    K: int = 3
+    K: int = 1
+    """How many *consecutive* frames each emission tick anchors a VQA pair
+    to. The VLM grounds its answer (bbox / keypoint coordinates, count, …)
+    against the *first* anchored frame's image, so anchoring K>1 frames
+    copies that same answer onto later frames where the scene has already
+    moved — stale labels. Default ``1``: a VQA pair lands on exactly its
+    emission frame, no temporal smear. Raise it only to trade label
+    precision for more (noisier) VQA frames."""
     question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
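To make the effect of `K` concrete, here is a minimal sketch of how an emission tick might map onto frame indices. It is not the lerobot implementation: `anchored_frames` is a hypothetical function, and the choices to start ticks at frame 0 and to round the tick spacing to whole frames are assumptions made for illustration.

```python
def anchored_frames(
    num_frames: int, fps: float, vqa_emission_hz: float, k: int
) -> dict[int, list[int]]:
    """Map each emission frame to the frame indices that receive its answer.

    Hypothetical sketch: ticks start at frame 0 and repeat every
    round(fps / vqa_emission_hz) frames; each tick labels k consecutive frames.
    """
    step = max(1, round(fps / vqa_emission_hz))
    mapping: dict[int, list[int]] = {}
    for start in range(0, num_frames, step):
        # The answer is grounded on frame `start`; any later index in this
        # list reuses that answer even though the scene may have moved.
        mapping[start] = list(range(start, min(start + k, num_frames)))
    return mapping


def stale_frames(mapping: dict[int, list[int]]) -> list[int]:
    """Frames that carry an answer grounded on an earlier frame."""
    return [f for start, frames in mapping.items() for f in frames if f != start]


if __name__ == "__main__":
    old = anchored_frames(num_frames=90, fps=30.0, vqa_emission_hz=1.0, k=3)
    new = anchored_frames(num_frames=90, fps=30.0, vqa_emission_hz=1.0, k=1)
    print("K=3 stale frames:", stale_frames(old))
    print("K=1 stale frames:", stale_frames(new))
```

With `k=1` every labelled frame is its own grounding frame, so the stale list is empty; with `k=3` two of every three labelled frames reuse coordinates computed for an earlier image.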