tune(annotations): VQA emission anchors a single frame (K 3 -> 1)

Module 3 anchored each VQA emission tick to K=3 consecutive frames (~0.1s at 30fps). The VLM grounds its answer (bbox/keypoint coordinates especially) against the first frame's image, so copying that answer onto frames 2-3 smears a stale label over a moving scene. The default is now K=1: a VQA pair lands on exactly its emission frame, with no temporal smear. VQA frames get sparser as a result; the WeightedEpisodeAwareSampler (vqa_target_fraction) is the knob to compensate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-08 02:22:02 +00:00 · 2026-05-18 17:24:36 +02:00
parent 0f5f0e4091
commit 474c5478d9
1 changed file with 8 additions and 1 deletion
@@ -114,7 +114,14 @@ class Module3Config:

    enabled: bool = True
    vqa_emission_hz: float = 1.0
-    K: int = 3
+    K: int = 1
+    """How many *consecutive* frames each emission tick anchors a VQA pair
+    to. The VLM grounds its answer (bbox / keypoint coordinates, count, …)
+    against the *first* anchored frame's image, so anchoring K>1 frames
+    copies that same answer onto later frames where the scene has already
+    moved — stale labels. Default ``1``: a VQA pair lands on exactly its
+    emission frame, no temporal smear. Raise it only to trade label
+    precision for more (noisier) VQA frames."""
    question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
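
To make the sparsity trade-off concrete, here is a minimal sketch of how K expands emission ticks into anchored frame indices. It is not Module 3's actual implementation: the helpers `anchored_frames` and `vqa_fraction` are hypothetical, and the tick placement (first tick at frame 0, subsequent ticks every `fps / vqa_emission_hz` frames, rounded) is an assumption made for illustration only.

```python
def anchored_frames(num_frames: int, fps: float, vqa_emission_hz: float, K: int) -> list[int]:
    """Return the frame indices that would carry a VQA pair.

    Hypothetical helper: assumes the first tick is at frame 0 and that each
    tick anchors K consecutive frames starting at the tick's own frame.
    """
    if K < 1:
        raise ValueError("K must be >= 1")
    if fps <= 0 or vqa_emission_hz <= 0:
        raise ValueError("fps and vqa_emission_hz must be positive")

    step = fps / vqa_emission_hz  # frames between consecutive emission ticks
    frames: set[int] = set()
    tick = 0
    while True:
        start = round(tick * step)
        if start >= num_frames:
            break
        # Every anchored frame reuses the answer grounded on frame `start`.
        for offset in range(K):
            if start + offset < num_frames:
                frames.add(start + offset)
        tick += 1
    return sorted(frames)


def vqa_fraction(num_frames: int, fps: float, vqa_emission_hz: float, K: int) -> float:
    """Fraction of frames in an episode that carry a VQA pair."""
    return len(anchored_frames(num_frames, fps, vqa_emission_hz, K)) / num_frames


if __name__ == "__main__":
    # A 3-second episode at 30 fps with one emission tick per second.
    print(anchored_frames(90, 30.0, 1.0, K=3))
    print(anchored_frames(90, 30.0, 1.0, K=1))
    print(vqa_fraction(90, 30.0, 1.0, K=3), vqa_fraction(90, 30.0, 1.0, K=1))
```

Under these assumptions, going from K=3 to K=1 cuts the share of VQA-bearing frames to a third, which is the gap that `vqa_target_fraction` on the sampler is meant to make up.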