mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-17 09:39:47 +00:00
docs(recipes): trim header comments, drop diversity-knobs note in run_hf_job
Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a short
header plus a one-line note on the boundary-frame memory tail.

Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about a
since-merged experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
examples/annotation/run_hf_job.py
@@ -23,18 +23,6 @@ token = os.environ.get("HF_TOKEN") or get_token()
 if not token:
     raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
 
-# --- Diversity knobs (Pi0.7-style prompt expansion) -----------------------
-# Bumped roughly 3x across the board to fight memorization on small datasets.
-# A single dataset trained for many epochs with deterministic atom wording
-# converges to perfect recall on training prompts but produces JSON-token
-# garbage at inference for any wording that drifts slightly. More atom
-# variants per episode + higher sampling temperature widens the training
-# distribution so the model has to actually use its language head, not
-# just memorize.
-#
-# Pushes to a *new* hub repo (``_tool3``) so the previous annotation pass
-# (``_tool2``) stays intact — re-train from scratch on the new dataset and
-# compare loss-curve shapes to verify the diversity bump is doing something.
 CMD = (
     "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
     "pip install --no-deps "
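The token guard kept in the hunk above is self-contained enough to sketch in isolation. A minimal, hypothetical rendition — the `resolve_hf_token` name and the `fallback` parameter (standing in for `huggingface_hub`'s cached CLI login) are illustrative, not from the repo:

```python
import os


def resolve_hf_token(env=None, fallback=lambda: None):
    """Prefer HF_TOKEN from the environment, then a cached CLI login.

    `fallback` stands in for huggingface_hub's get_token() so this
    sketch stays dependency-free.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN") or fallback()
    if not token:
        raise RuntimeError(
            "No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`"
        )
    return token
```

The `or` chain means an empty-string `HF_TOKEN` also falls through to the fallback, matching the truthiness check in the original guard.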
@@ -1,51 +1,13 @@
-# π0.5 v2 (pi052) — Hi-Robot / MEM / ECoT blend.
+# π0.5 v2 (pi052) Hi-Robot blend.
 #
-# Architecturally mirrors ``smolvla2_hirobot.yaml`` — same two
-# flavors, same sub-recipes — but the rendered messages are fed
-# to PaliGemma (PaliGemma is not chat-pretrained, so the
-# ``PI052TextTokenizerStep`` concatenates them as ``Role: content``
-# plain text rather than calling ``apply_chat_template``).
-#
-# Two flavors
-# -----------
-#
-# Flavor 1 — ``action_execution`` (~60% weight)
-#     The main always-on recipe. Fuses all available context
-#     (task + plan + memory) into a unified user prompt, and
-#     uses the current subtask as the assistant target. This
-#     single recipe supervises *both*:
-#       * subtask prediction (text CE on the assistant span,
-#         lm_head), and
-#       * action chunks (flow MSE on the action expert via
-#         ``stream: low_level, target: true``, plus the FAST
-#         CE on the action tokens when enabled).
-#     Pi 0.7 §V.A — subtask in the prompt + flow on actions.
-#
-# Flavor 2 — event-driven text-only recipes
-#     ``ask_vqa_*``. Each handles a specific high-level event
-#     with a TEXT output. ``if_present`` guards keep them from
-#     firing on frames without the relevant annotation.
-#
-# Memory updates are folded INTO ``action_execution`` as a
-# conditional second target gated on boundary frames — see
-# ``smolvla2_hirobot.yaml`` for the rationale. The
-# ``user_interjection_response`` recipe was dropped — the
-# current datasets don't include interjection / say() annotations.
+# Same shape as ``smolvla2_hirobot.yaml`` — see that file for the
+# flavor breakdown. The only difference here is the backbone:
+# PaliGemma isn't chat-pretrained, so ``PI052TextTokenizerStep``
+# concatenates messages as ``Role: content`` plain text instead
+# of calling ``apply_chat_template``.
 
 blend:
 
-  # ----------------------------------------------------------
-  # FLAVOR 1: action_execution (main path)
-  #
-  # Bundles memory updates inline. On most frames the binding
-  # ``new_memory: emitted_at(t, style=memory)`` returns None and
-  # only the subtask is supervised. On *boundary* frames (the
-  # exact timestamp a new memory was annotated — i.e. when a
-  # subtask just completed) the binding fires and the recipe
-  # supervises the new memory as a follow-up assistant turn,
-  # with a "Completed subtask: …" user message in between to
-  # separate the two outputs in the rendered prefix.
-  # ----------------------------------------------------------
   action_execution:
     weight: 0.85
     bindings:
@@ -55,17 +17,10 @@ blend:
         stream: high_level
         content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
       - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-      # Memory-update tail — only renders at boundary frames where
-      # ``new_memory`` fires. The new memory is appended as a second
-      # assistant turn right after the subtask, with no intervening
-      # user filler: at a subtask boundary the model emits the new
-      # subtask AND the updated memory in one forward pass.
+      # Boundary-frame tail: at a subtask transition, predict the
+      # new memory as a second assistant turn (same forward pass).
       - {role: assistant, content: "${new_memory}", stream: high_level, target: true, if_present: new_memory}
 
-  # ----------------------------------------------------------
-  # FLAVOR 2: event-driven text-only paths
-  # ----------------------------------------------------------
-
   ask_vqa_top:
     weight: 0.075
     bindings:
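Two mechanisms in the hunks above are worth seeing concretely: the `${…}` binding substitution with `if_present` guards, and the `Role: content` plain-text flattening that replaces `apply_chat_template` for PaliGemma. A minimal sketch, assuming dict-shaped turns — the `render_turns` name is hypothetical; the real logic lives in `PI052TextTokenizerStep` in the repo:

```python
import string


def render_turns(turns, bindings):
    """Drop turns whose if_present binding is None, substitute ${...}
    bindings, and flatten to 'Role: content' lines (no chat template)."""
    lines = []
    for turn in turns:
        guard = turn.get("if_present")
        if guard is not None and bindings.get(guard) is None:
            continue  # e.g. new_memory only fires on boundary frames
        content = string.Template(turn["content"]).safe_substitute(
            {k: v for k, v in bindings.items() if v is not None}
        )
        lines.append(f"{turn['role'].capitalize()}: {content}")
    return "\n".join(lines)
```

On a non-boundary frame (`new_memory` is None) the memory tail is skipped and only the user prompt and subtask turn are rendered; `string.Template` handles the `${task}`-style placeholders from the YAML.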
@@ -1,68 +1,13 @@
-# SmolVLA2 canonical training recipe — Hi Robot / MEM / ECoT blend.
+# SmolVLA2 Hi-Robot blend — two flavors:
 #
-# Inspired by Pi 0.7 §V (Diversifying the Prompt) and Pi 0.5's
-# hierarchical subtask training. The blend has **two flavors**:
-#
-# Flavor 1 — ``action_execution`` (~60% weight)
-#     The main always-on recipe. Fuses all available context
-#     (task + plan + memory) into a unified user prompt, and
-#     uses the current subtask as the assistant target. This
-#     single recipe supervises *both*:
-#       * subtask prediction (text CE on the assistant span,
-#         lm_head), and
-#       * action chunks (flow MSE on the action expert via
-#         ``stream: low_level, target: true``, plus the FAST
-#         CE on the action tokens when enabled).
-#     At inference, the same prompt structure is used:
-#       * the high-level loop calls ``select_message`` with the
-#         user prompt only → generates the next subtask.
-#       * the low-level loop calls ``predict_action_chunk`` with
-#         the user prompt + the generated subtask as the
-#         assistant turn → generates the action chunk.
-#     Replaces what used to be three separate recipes
-#     (``high_level_subtask`` + ``low_level_execution`` + the
-#     implicit subtask-in-prompt context) in earlier drafts.
-#     Pi 0.7's §V.A "Subtask instructions" pattern.
-#
-# Flavor 2 — event-driven text-only recipes
-#     Each handles a specific high-level event with a TEXT
-#     output (no action supervision). They fire when the
-#     binding for the event resolves to non-None:
-#       * ``ask_vqa_top`` / ``ask_vqa_wrist``: answer a
-#         camera-grounded visual question.
-#     All use ``stream: high_level`` (no flow loss) and rely on
-#     ``if_present`` guards so they only fire on frames where
-#     the relevant event annotation is present.
-#
-# ``memory_update`` is folded into Flavor 1 (gated on the
-# ``new_memory`` binding at boundary frames).
-# ``user_interjection_response`` was dropped — the current
-# datasets don't include interjection / say() annotations.
-#
-# How the chat tokenizer interprets the flavor split
-# ---------------------------------------------------
-# * predict_actions = bool(targets_by_stream.get("low_level"))
-#   → True only for Flavor 1 (action_execution).
-# * text_labels supervises whatever assistant turns are marked
-#   target=true. For action_execution, this is the subtask
-#   string. For Flavor 2, it's the corresponding text output.
+# 1. action_execution — fused (task + plan + memory) prompt;
+#    supervises the current subtask (low_level: flow + text CE)
+#    and, at memory-boundary frames, the new memory too.
+# 2. ask_vqa_{top,wrist} — text-only VQA on a camera image,
+#    gated by ``if_present`` so they only fire on annotated frames.
 
 blend:
 
-  # ----------------------------------------------------------
-  # FLAVOR 1: action_execution (main path)
-  #
-  # Bundles memory updates inline. On most frames the binding
-  # ``new_memory: emitted_at(t, style=memory)`` returns None and
-  # only the subtask is supervised. On *boundary* frames (the
-  # exact timestamp a new memory was annotated — i.e. when a
-  # subtask just completed) the binding fires and the recipe
-  # supervises the new memory as a follow-up assistant turn,
-  # with a "Completed subtask: …" user message in between to
-  # separate the two outputs in the chat sequence. Mirrors the
-  # behaviour of the old standalone ``memory_update`` recipe
-  # but keeps everything inside the unified action_execution.
-  # ----------------------------------------------------------
   action_execution:
     weight: 0.85
     bindings:
@@ -72,17 +17,10 @@ blend:
         stream: high_level
         content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
       - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-      # Memory-update tail — only renders at boundary frames where
-      # ``new_memory`` fires. The new memory is appended as a second
-      # assistant turn right after the subtask, with no intervening
-      # user filler: at a subtask boundary the model emits the new
-      # subtask AND the updated memory in one forward pass.
+      # Boundary-frame tail: at a subtask transition, predict the
+      # new memory as a second assistant turn (same forward pass).
       - {role: assistant, content: "${new_memory}", stream: high_level, target: true, if_present: new_memory}
 
-  # ----------------------------------------------------------
-  # FLAVOR 2: event-driven text-only paths
-  # ----------------------------------------------------------
-
   ask_vqa_top:
     weight: 0.075
     bindings:
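The removed "How the chat tokenizer interprets the flavor split" note is the one piece of deleted documentation that encodes actual behavior: `predict_actions` is derived from whether any `target: true` turn sits on the `low_level` stream. A minimal sketch of that rule — the `flavor_split` name is hypothetical, with turn shapes assumed from the YAML above:

```python
def flavor_split(turns):
    """Group target turns by stream; action prediction is enabled only
    when a low_level target exists (Flavor 1, action_execution)."""
    targets_by_stream = {}
    for turn in turns:
        if turn.get("target"):
            targets_by_stream.setdefault(turn["stream"], []).append(turn["content"])
    # predict_actions = bool(targets_by_stream.get("low_level")), per the
    # removed header note: True only for action_execution.
    predict_actions = bool(targets_by_stream.get("low_level"))
    # Text CE supervises every target=true assistant span, whatever the stream.
    text_labels = [c for spans in targets_by_stream.values() for c in spans]
    return predict_actions, text_labels
```

An `ask_vqa_*` recipe has only a `high_level` target, so it yields `predict_actions == False` and a text label; `action_execution` yields both the subtask label and action prediction.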