fix(annotate): align interjections with the actual demo trajectory

qwen36moe-11 surfaced a deeper semantic problem with mid-episode interjections: they were generated as *counterfactual* user requests ("actually skip the wipe", "use the blue one instead") but teleop data is frozen — the robot in the video already executed everything, including the steps the user "asked to skip". The training signal was therefore self-contradictory: interjection text said one thing, the robot's subsequent action stream did the opposite. Flip the framing. Anchor every interjection at a subtask boundary and write it as a natural user request for the *upcoming* subtask. The robot's visible next behavior IS the interjection's effect, so: interjection text → plan refresh → action stream are all consistent with the same observed video. Concretely: - ``interjections_and_speech.py``: instead of sampling random timestamps from ``frame_timestamps``, walk Module 1's subtask spans and sample from the (subtask N → subtask N+1) transitions. Pass both the just-finished and the upcoming subtask texts into the prompt. - ``_window_timestamps``: re-center the multi-frame video window on the boundary itself (half the frames cover the end of the previous subtask, half cover the start of the next one) so the VLM has the same visual conditioning the policy will see at training time. - ``module_2_interjection.txt``: rewritten. The prompt now states explicitly that this is offline data, the robot already committed to the next subtask, and the interjection must be a natural request that aligns with — not contradicts — the next subtask. Removes the "negative task / situated correction" Hi Robot framing because those scenarios require online execution to be coherent. Plan-refresh logic from the previous commit (forwarding interjection text into the refresh prompt) is unchanged and now reinforces the same direction: the refreshed plan emphasizes the upcoming subtask the interjection just asked for. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-24 10:16:09 +00:00 · 2026-04-30 16:32:43 +02:00
parent 3434d2ef22
commit d813c75b76
2 changed files with 88 additions and 48 deletions
@@ -120,33 +120,56 @@ class InterjectionsAndSpeechModule:
        record: EpisodeRecord,
        subtask_spans: Sequence[dict[str, Any]],
    ) -> list[dict[str, Any]]:
+        """Generate interjections aligned with the actual demo trajectory.
+
+        Teleop data is frozen — the robot already executed every step in
+        the video. A *counterfactual* interjection like "actually skip
+        the wipe" contradicts what then happens in the video, which is
+        what qwen36moe-10/11 surfaced as low-quality interjections.
+
+        Instead, anchor every interjection at a subtask boundary and
+        write it as a natural user request for the *upcoming* subtask.
+        The robot's visible next behavior IS the interjection's effect,
+        so the training signal stays consistent: interjection text →
+        plan refresh → action stream all line up.
+        """
        if self.config.max_interjections_per_episode <= 0:
            return []
+        if len(subtask_spans) < 2:
+            # Need at least one transition (subtask 0 → subtask 1).
+            return []
        # Deterministic per-episode RNG so reruns are stable across SLURM jobs.
        rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
-        candidate_ts = [t for t in record.frame_timestamps if t >= self.config.interjection_min_t]
-        if not candidate_ts:
+
+        # Boundaries: the start time of every subtask except the first
+        # (which is just t0 and is covered by the initial-task speech atom).
+        boundaries: list[tuple[float, str, str]] = []
+        for i in range(1, len(subtask_spans)):
+            ts = float(subtask_spans[i]["start"])
+            if ts < self.config.interjection_min_t:
+                continue
+            prev_text = (subtask_spans[i - 1].get("text") or "").strip()
+            next_text = (subtask_spans[i].get("text") or "").strip()
+            if not next_text:
+                continue
+            boundaries.append((ts, prev_text, next_text))
+        if not boundaries:
            return []
-        # Pick at most ``max_interjections_per_episode`` distinct timestamps.
-        # Previously capped at ``len(candidate_ts) // 4`` — that floor was
-        # only relevant for very short episodes; for any real ~20-30s
-        # episode it had no effect, but it silently set the count to 0 on
-        # short fixtures. Just take ``min(max, len)`` directly.
-        n = min(self.config.max_interjections_per_episode, len(candidate_ts))
-        if n <= 0:
-            return []
-        chosen = sorted(rng.sample(candidate_ts, n))
+
+        n = min(self.config.max_interjections_per_episode, len(boundaries))
+        chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])

        out: list[dict[str, Any]] = []
-        for t in chosen:
+        for t, prev_subtask, next_subtask in chosen:
            t_snap = _snap_to_frame(t, record.frame_timestamps)
+            # Window straddles the boundary so the VLM sees the end of the
+            # previous subtask and the start of the next one — same
+            # conditioning the policy will see at training time.
            window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
-            current_subtask = (
-                self._subtask_at(subtask_spans, t_snap) or record.episode_task
-            )
            prompt = load_prompt("module_2_interjection").format(
                episode_task=record.episode_task,
-                current_subtask=current_subtask,
+                prev_subtask=prev_subtask or "(starting from initial state)",
+                next_subtask=next_subtask,
                timestamp=t_snap,
                window_seconds=self.config.interjection_window_seconds,
            )
@@ -177,13 +200,14 @@ class InterjectionsAndSpeechModule:
    def _window_timestamps(
        self, t_anchor: float, frame_timestamps: Sequence[float]
    ) -> list[float]:
-        """Return a small set of frame timestamps spanning the lead-up to ``t``.
+        """Return a small set of frame timestamps centered on ``t_anchor``.

-        The VLM receives roughly ``num_frames`` frames over the
-        ``window_seconds`` immediately before ``t_anchor``, snapped to
-        actual source frame timestamps. This gives the interjection
-        prompt enough temporal context to read what's visibly happening
-        instead of looking at one frozen frame.
+        The window straddles the subtask boundary the interjection sits
+        on: roughly half the frames cover the end of the previous
+        subtask, half cover the start of the next one. The VLM therefore
+        sees BOTH what just finished AND what's about to start, which is
+        the conditioning we need to write a natural "now please do X"
+        request that matches the visible upcoming behavior.
        """
        if not frame_timestamps:
            return [t_anchor]
@@ -192,11 +216,15 @@ class InterjectionsAndSpeechModule:
            return [t_anchor]
        window = float(self.config.interjection_window_seconds)
        step = window / max(1, n - 1)
-        targets = [t_anchor - step * (n - 1 - i) for i in range(n)]
+        # Center the window on the anchor so half lands before, half after.
+        start_offset = -window / 2.0
+        targets = [t_anchor + start_offset + step * i for i in range(n)]
+        last_ts = float(frame_timestamps[-1])
        snapped: list[float] = []
        seen: set[float] = set()
        for tgt in targets:
-            t = _snap_to_frame(max(0.0, tgt), frame_timestamps)
+            clamped = min(last_ts, max(0.0, tgt))
+            t = _snap_to_frame(clamped, frame_timestamps)
            if t not in seen:
                seen.add(t)
                snapped.append(t)
@@ -1,34 +1,46 @@
-You are simulating a user mid-episode interruption for a robot doing:
-"{episode_task}".
+You are generating training data for a Hi Robot-style hierarchical
+robot policy. The robot in this demonstration has ALREADY executed
+every step shown in the video — we cannot retroactively change the
+action stream. To keep training data consistent with the video, the
+"interjection" must align with what the robot is *about to do next* in
+the demonstration, framed as a natural mid-task user request.

-The images above show roughly the last {window_seconds:.1f} seconds of the
-demonstration in chronological order. Read what the robot is actually
-doing right now and write an interruption that responds to that exact
-visible activity — not a generic one.
+The episode's overall task: "{episode_task}".

-Current subtask the robot is executing: {current_subtask}
-Time into episode: {timestamp:.2f}s
+The images above show roughly {window_seconds:.1f} seconds straddling a
+subtask boundary in the demonstration:

-Synthesize ONE realistic interruption the user might say at this moment,
-plus the robot's verbal acknowledgement.
+- Subtask the robot just finished: "{prev_subtask}"
+- Subtask the robot is about to start: "{next_subtask}"
+- Time into episode: {timestamp:.2f}s

-Context (Hi Robot, Shi 2025) — interjections fall into one of these
-scenario types:
- negative task: "actually skip X" (where X is the visible current step)
- situated correction: "that's not the right one, use the blue one"
- specific constraint: "be more careful with that one"
- preference: "could you also do Y after this"
+Write ONE interjection the user would naturally say at this moment to
+prompt / confirm / encourage the robot to do "{next_subtask}". Phrase it
+like a real human mid-task remark — conversational, varied, sometimes
+just a nudge, sometimes a clarification, sometimes a small constraint
+that the upcoming motion happens to satisfy. Plus the robot's verbal
+acknowledgement.

-Interruption rules:
- Must reference an object, motion, or sub-step that is visible in the
-  attached frames OR explicitly named in the current subtask. Do not
-  invent objects that aren't there.
- Must change the plan in a non-trivial way (a new constraint, skipped
-  step, or correction).
+Hard rules:
+
+- The interjection MUST be consistent with the next subtask. The user
+  cannot ask for something different from what the robot then does in
+  the video. If you're tempted to say "actually skip X" or "do Y
+  instead", DO NOT — those would contradict the demonstration.
+- The interjection must reference an object, location, or action that
+  is plausible given the visible scene and the next subtask text.
 - One sentence each. Conversational, not robotic.

+Style examples (vary the phrasing — don't reuse these verbatim):
+  - "Now go ahead and {next_subtask}."
+  - "Great, can you {next_subtask} next?"
+  - "{next_subtask}, please."
+  - "Before you continue, please {next_subtask}."
+  - "Looking good — {next_subtask} now."
+  - "Okay, {next_subtask}."
+
 Output strictly valid JSON:
  {{
-    "interjection": "<single sentence the user says about what is visible right now>",
-    "speech":       "<single sentence the robot speaks back, acknowledging the change>"
+    "interjection": "<single sentence the user says, asking for the next subtask>",
+    "speech":       "<single sentence the robot speaks back, confirming and starting>"
  }}