annotate: enable subtask describe->segment->verify chain by default

Flip the PlanConfig.subtask_describe_first and subtask_verify defaults
from False to True. Every subtask annotation now runs the 3-call
describe -> segment -> verify chain (grounding, then pruning) by
default, because the single-call path reliably hallucinates steps from
the task text. This costs 2 extra VLM calls/episode. To opt out, pass
--plan.subtask_describe_first=false and/or
--plan.subtask_verify=false on easy datasets where fewer calls matter
more than label fidelity.
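
A minimal sketch of how the two flags gate the chain and where the call
count comes from. This is not the code in this commit: annotate_subtasks,
fake_vlm, and the prompt strings are hypothetical, and the VLM is assumed
to be a plain callable that returns a list of strings (empty when the
call returns nothing). Only the two PlanConfig field names and their new
defaults are taken from the diff.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PlanConfig:
    # Only the two fields touched by this commit; all others omitted.
    subtask_describe_first: bool = True
    subtask_verify: bool = True


# Hypothetical VLM interface: prompt in, list of strings out.
VLMCall = Callable[[str], List[str]]


def annotate_subtasks(cfg: PlanConfig, task: str, vlm: VLMCall) -> Tuple[List[str], int]:
    """Return (subtasks, number of VLM calls made)."""
    calls = 0

    # 1. describe: narrate what is visible, no subtask JSON yet.
    description = ""
    if cfg.subtask_describe_first:
        description = " ".join(vlm("describe: narrate only what is visible"))
        calls += 1

    # 2. segment: always runs; the description is injected when present.
    prompt = f"segment: task={task!r}"
    if description:
        prompt += f" observed={description!r}"
    subtasks = vlm(prompt)
    calls += 1

    # 3. verify: prune-only and fail-open.
    if cfg.subtask_verify:
        seen = vlm("verify: " + "; ".join(subtasks))
        calls += 1
        if seen:  # an empty reply keeps every proposed subtask
            keep = set(seen)
            # Filtering the proposals means verify can drop, never add.
            subtasks = [s for s in subtasks if s in keep]

    return subtasks, calls


def fake_vlm(prompt: str) -> List[str]:
    """Stand-in model with canned replies, for illustration only."""
    if prompt.startswith("describe:"):
        return ["arm picks up the mug"]
    if prompt.startswith("segment:"):
        return ["pick up mug", "open drawer"]
    if prompt.startswith("verify:"):
        # "wipe table" was never proposed, so it must not appear in the result.
        return ["pick up mug", "wipe table"]
    return []


default_result = annotate_subtasks(PlanConfig(), "put mug in drawer", fake_vlm)
single_call_result = annotate_subtasks(
    PlanConfig(subtask_describe_first=False, subtask_verify=False),
    "put mug in drawer",
    fake_vlm,
)
```

With the canned replies above, the default config makes 3 calls and keeps
only "pick up mug", while the opt-out config makes 1 call and returns both
proposed subtasks unverified.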

run_hf_job.py: drop the now-redundant explicit flags and leave a note
that the chain is on by default, with the flags to opt out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pepijn
2026-06-02 15:13:50 +02:00
parent dcd368e1f8
commit 1fe1463ae0
2 changed files with 14 additions and 11 deletions
+4 -7
@@ -82,13 +82,10 @@ CMD = (
# tasks. Leave off for RoboCasa atomic / navigation.
# Keep subtask decomposition tight for atomic tasks:
"--plan.plan_max_steps=6 "
-# Multi-call quality chain (3 VLM calls/episode for subtasks):
-# 1. describe-first: narrate ONLY what is visible before segmenting
-# — the strongest fix for subtasks invented from the task text.
-# 2. (segment)
-# 3. verify: re-watch and prune any subtask not actually seen.
-"--plan.subtask_describe_first=true "
-"--plan.subtask_verify=true "
+# NOTE: the multi-call subtask quality chain (describe -> segment ->
+# verify, 3 VLM calls/episode) is ON BY DEFAULT now. Pass
+# --plan.subtask_describe_first=false / --plan.subtask_verify=false to
+# disable on datasets you've verified are easy and want fewer calls.
# Phase 2 — interjections + speech.
"--interjections.max_interjections_per_episode=6 "
# Phase 4 — general VQA.
@@ -51,21 +51,27 @@ class PlanConfig:
min_subtask_seconds: float = 1.5
plan_max_steps: int = 8
-# Multi-call subtask quality chain (opt-in, more VLM calls, higher
-# quality). Both off by default → single-call behaviour unchanged.
+# Multi-call subtask quality chain. ON by default — the single-call
+# 'watch video -> emit subtask JSON' pattern makes the VLM commit to
+# structured output before reasoning about the video, so it
+# pattern-matches the task text and hallucinates steps. The chain
+# costs 2 extra VLM calls/episode (3 total for subtasks) but is the
+# difference between trustworthy and fabricated labels. Set either to
+# False to trade quality for fewer calls on datasets you've verified
+# are easy.
#
# ``subtask_describe_first``: run a grounding pass that narrates ONLY
# what is visible in the video (no subtask JSON yet), then inject that
# description into the segmentation prompt. Forces the model to
# observe before committing to structured output — the strongest
# lever against subtasks invented from the task text. +1 VLM call/ep.
-subtask_describe_first: bool = False
+subtask_describe_first: bool = True
# ``subtask_verify``: after segmentation, re-watch the video and drop
# any proposed subtask that can't be verified as visible. Prunes
# hallucinations; can only remove subtasks, never add/rewrite them.
# Fail-open (keeps un-verified spans if the verify call returns
# nothing). +1 VLM call/ep.
-subtask_verify: bool = False
+subtask_verify: bool = True
# When True (and backend supports it, e.g. ``openai``), the ``plan``
# module sends a ``video_url`` block pointing at a per-episode mp4