annotate: cap embedded-frame budget to fit VLM context (fix 32k overflow)

Switching the plan module to embedded frames (use_video_url=false) exposed a context overflow: at frames_per_second=2.0 with the old max_video_frames=128 default, a 480x640 episode embeds ~128 frames ≈ 33-39k vision tokens, over the model's 32768 context. Every plan call died with 'Input length exceeds maximum context length' (HTTP 400), crashing the whole annotation job.

The video_url path never hit this because the server downsampled; the embedded path sends every sampled frame, so the frame count is a hard token budget.

Fix:

* config default max_video_frames 128 -> 32 (~8-10k vision tokens, comfortable headroom for the prompt + describe/verify passes). Frames are still sampled UNIFORMLY across the whole episode, so longer episodes are subsampled, not truncated: full temporal coverage is preserved, just at coarser density.
* run_hf_job.py: frames_per_second 2.0 -> 1.0, explicit --plan.max_video_frames=32, with a comment explaining the token budget and the 'do not raise toward 128 with embedded frames' rule.

Only the plan module embeds the full episode; VQA (1 frame/tick) and interjections (4-frame window) were never at risk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
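
The budget arithmetic above can be checked with a rough sketch. It is not the code from this commit: the helper names (vision_token_range, fits_context, uniform_frame_indices) and the 4096-token reserve for the prompt are hypothetical, and the per-frame cost of ~250-320 tokens is the estimate quoted in the run_hf_job.py comment, so the 128-frame range comes out slightly wider than the ~33-39k quoted above.

```python
# Rough token-budget check for embedded frames. All names here are
# illustrative assumptions, not identifiers from the repository.

CONTEXT_LEN = 32768            # model context, per --max-model-len
TOKENS_PER_FRAME = (250, 320)  # estimated low/high vision tokens per frame


def vision_token_range(num_frames):
    """Return the (low, high) vision-token estimate for num_frames frames."""
    low, high = TOKENS_PER_FRAME
    return num_frames * low, num_frames * high


def fits_context(num_frames, reserved_tokens=4096):
    """True if the worst-case frame cost plus a reserve fits the context.

    reserved_tokens is an assumed allowance for the prompt and replies.
    """
    _, high = vision_token_range(num_frames)
    return high + reserved_tokens <= CONTEXT_LEN


def uniform_frame_indices(total_frames, max_frames):
    """Pick at most max_frames indices spread evenly over the episode.

    Longer episodes are subsampled rather than truncated: the first and
    last frames are always kept, only the spacing gets coarser.
    """
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    if total_frames <= max_frames:
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


if __name__ == "__main__":
    for n in (32, 128):
        low, high = vision_token_range(n)
        print(f"{n} frames: {low}-{high} vision tokens, fits={fits_context(n)}")
    picked = uniform_frame_indices(total_frames=1000, max_frames=32)
    print(f"first={picked[0]} last={picked[-1]} count={len(picked)}")
```

Under these assumptions 32 frames cost 8000-10240 vision tokens and fit, while 128 frames cost 32000-40960 and do not once any prompt reserve is added.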
2026-08-02 22:49:45 +00:00 · 2026-06-02 16:02:25 +02:00
parent 79f9a84407
commit 1fb46ab300
2 changed files with 23 additions and 5 deletions
@@ -58,10 +58,18 @@ CMD = (
    # handing the server a file:// clip. The embedded path is more
    # reliable: if clip extraction ever fails, the video_url path would
    # silently send NO video and the VLM would hallucinate subtasks from
-    # the task text alone. 2 fps gives dense visual grounding so the VLM
-    # labels what actually happens.
-    "--plan.frames_per_second=2.0 "
+    # the task text alone.
+    #
+    # CONTEXT BUDGET: with embedded frames, each frame is ~250-320 vision
+    # tokens. The model's context is 32768 (see --max-model-len). 32
+    # frames sampled uniformly across the episode (~8-10k tokens) fits
+    # comfortably alongside the prompt and the describe/verify passes.
+    # Do NOT raise max_video_frames toward 128 with embedded frames — that
+    # is ~33-39k tokens and overflows the context (BadRequestError 400,
+    # "Input length exceeds maximum context length").
    "--plan.use_video_url=false "
+    "--plan.frames_per_second=1.0 "
+    "--plan.max_video_frames=32 "
    # IMPORTANT for RoboCasa: the dataset's task string ("Navigate to the
    # stove", "Pick the mug...") is authoritative and is what eval uses.
    # ``derive_task_from_video=off`` keeps that canonical task driving