docs(recipes): trim header comments, drop diversity-knobs note in run_hf_job

Recipes were over-commented (paper citations, history of removed
sub-recipes, inference-time loop walkthroughs). Stripped down to a
short header + a brief note on the boundary-frame memory tail.

Also removed the ``_tool3`` diversity-knobs comment block in
``examples/annotation/run_hf_job.py`` — it was a personal note about
a since-merged experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: Pepijn
Date: 2026-05-13 12:55:03 +02:00
parent b2aa372fcf
commit 9ff62cb08c
3 changed files with 16 additions and 135 deletions
examples/annotation/run_hf_job.py
-12
@@ -23,18 +23,6 @@ token = os.environ.get("HF_TOKEN") or get_token()
 if not token:
     raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
-# --- Diversity knobs (Pi0.7-style prompt expansion) -----------------------
-# Bumped roughly 3x across the board to fight memorization on small datasets.
-# A single dataset trained for many epochs with deterministic atom wording
-# converges to perfect recall on training prompts but produces JSON-token
-# garbage at inference for any wording that drifts slightly. More atom
-# variants per episode + higher sampling temperature widens the training
-# distribution so the model has to actually use its language head, not
-# just memorize.
-#
-# Pushes to a *new* hub repo (``_tool3``) so the previous annotation pass
-# (``_tool2``) stays intact — re-train from scratch on the new dataset and
-# compare loss-curve shapes to verify the diversity bump is doing something.
 CMD = (
     "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
     "pip install --no-deps "
+8 -53
@@ -1,51 +1,13 @@
-# π0.5 v2 (pi052) Hi-Robot / MEM / ECoT blend.
+# π0.5 v2 (pi052) Hi-Robot blend.
 #
-# Architecturally mirrors ``smolvla2_hirobot.yaml`` — same two
-# flavors, same sub-recipes — but the rendered messages are fed
-# to PaliGemma (PaliGemma is not chat-pretrained, so the
-# ``PI052TextTokenizerStep`` concatenates them as ``Role: content``
-# plain text rather than calling ``apply_chat_template``).
-#
-# Two flavors
-# -----------
-#
-# Flavor 1 — ``action_execution`` (~60% weight)
-# The main always-on recipe. Fuses all available context
-# (task + plan + memory) into a unified user prompt, and
-# uses the current subtask as the assistant target. This
-# single recipe supervises *both*:
-# * subtask prediction (text CE on the assistant span,
-# lm_head), and
-# * action chunks (flow MSE on the action expert via
-# ``stream: low_level, target: true``, plus the FAST
-# CE on the action tokens when enabled).
-# Pi 0.7 §V.A — subtask in the prompt + flow on actions.
-#
-# Flavor 2 — event-driven text-only recipes
-# ``ask_vqa_*``. Each handles a specific high-level event
-# with a TEXT output. ``if_present`` guards keep them from
-# firing on frames without the relevant annotation.
-#
-# Memory updates are folded INTO ``action_execution`` as a
-# conditional second target gated on boundary frames — see
-# ``smolvla2_hirobot.yaml`` for the rationale. The
-# ``user_interjection_response`` recipe was dropped — the
-# current datasets don't include interjection / say() annotations.
+# Same shape as ``smolvla2_hirobot.yaml`` — see that file for the
+# flavor breakdown. The only difference here is the backbone:
+# PaliGemma isn't chat-pretrained, so ``PI052TextTokenizerStep``
+# concatenates messages as ``Role: content`` plain text instead
+# of calling ``apply_chat_template``.
 blend:
-# ----------------------------------------------------------
-# FLAVOR 1: action_execution (main path)
-#
-# Bundles memory updates inline. On most frames the binding
-# ``new_memory: emitted_at(t, style=memory)`` returns None and
-# only the subtask is supervised. On *boundary* frames (the
-# exact timestamp a new memory was annotated — i.e. when a
-# subtask just completed) the binding fires and the recipe
-# supervises the new memory as a follow-up assistant turn,
-# with a "Completed subtask: …" user message in between to
-# separate the two outputs in the rendered prefix.
-# ----------------------------------------------------------
 action_execution:
 weight: 0.85
 bindings:
@@ -55,17 +17,10 @@ blend:
 stream: high_level
 content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
 - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-# Memory-update tail — only renders at boundary frames where
-# ``new_memory`` fires. The new memory is appended as a second
-# assistant turn right after the subtask, with no intervening
-# user filler: at a subtask boundary the model emits the new
-# subtask AND the updated memory in one forward pass.
+# Boundary-frame tail: at a subtask transition, predict the
+# new memory as a second assistant turn (same forward pass).
 - {role: assistant, content: "${new_memory}", stream: high_level, target: true, if_present: new_memory}
-# ----------------------------------------------------------
-# FLAVOR 2: event-driven text-only paths
-# ----------------------------------------------------------
 ask_vqa_top:
 weight: 0.075
 bindings:
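The backbone note in the pi052 header above says PaliGemma is not chat-pretrained, so ``PI052TextTokenizerStep`` renders messages as plain ``Role: content`` text instead of calling ``apply_chat_template``. A minimal sketch of the two rendering paths follows. It is an illustration, not the actual tokenizer step: the role casing and separators are assumptions, and ``render_chat`` simply assumes any Hugging Face tokenizer that ships a chat template.

# Two ways to turn the recipe's rendered messages into model input text.
messages = [
    {"role": "user", "content": "clear the table\nPlan: mug, then plate\nMemory: mug is in the sink"},
    {"role": "assistant", "content": "pick up the red mug"},
]

def render_chat(messages, tokenizer):
    # Chat-pretrained backbone: defer to the tokenizer's own chat template
    # (e.g. a tokenizer loaded via transformers.AutoTokenizer.from_pretrained).
    return tokenizer.apply_chat_template(messages, tokenize=False)

def render_plain(messages):
    # Non-chat backbone (the PaliGemma path): flat "Role: content" lines,
    # no special turn tokens.
    return "\n".join(f"{m['role'].capitalize()}: {m['content']}" for m in messages)

print(render_plain(messages))
# User: clear the table
# Plan: mug, then plate
# Memory: mug is in the sink
# Assistant: pick up the red mug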
smolvla2_hirobot.yaml
@@ -1,68 +1,13 @@
-# SmolVLA2 canonical training recipe — Hi Robot / MEM / ECoT blend.
+# SmolVLA2 Hi-Robot blend — two flavors:
 #
-# Inspired by Pi 0.7 §V (Diversifying the Prompt) and Pi 0.5's
-# hierarchical subtask training. The blend has **two flavors**:
-#
-# Flavor 1 — ``action_execution`` (~60% weight)
-# The main always-on recipe. Fuses all available context
-# (task + plan + memory) into a unified user prompt, and
-# uses the current subtask as the assistant target. This
-# single recipe supervises *both*:
-# * subtask prediction (text CE on the assistant span,
-# lm_head), and
-# * action chunks (flow MSE on the action expert via
-# ``stream: low_level, target: true``, plus the FAST
-# CE on the action tokens when enabled).
-# At inference, the same prompt structure is used:
-# * the high-level loop calls ``select_message`` with the
-# user prompt only → generates the next subtask.
-# * the low-level loop calls ``predict_action_chunk`` with
-# the user prompt + the generated subtask as the
-# assistant turn → generates the action chunk.
-# Replaces what used to be three separate recipes
-# (``high_level_subtask`` + ``low_level_execution`` + the
-# implicit subtask-in-prompt context) in earlier drafts.
-# Pi 0.7's §V.A "Subtask instructions" pattern.
-#
-# Flavor 2 — event-driven text-only recipes
-# Each handles a specific high-level event with a TEXT
-# output (no action supervision). They fire when the
-# binding for the event resolves to non-None:
-# * ``ask_vqa_top`` / ``ask_vqa_wrist``: answer a
-# camera-grounded visual question.
-# All use ``stream: high_level`` (no flow loss) and rely on
-# ``if_present`` guards so they only fire on frames where
-# the relevant event annotation is present.
-#
-# ``memory_update`` is folded into Flavor 1 (gated on the
-# ``new_memory`` binding at boundary frames).
-# ``user_interjection_response`` was dropped — the current
-# datasets don't include interjection / say() annotations.
-#
-# How the chat tokenizer interprets the flavor split
-# ---------------------------------------------------
-# * predict_actions = bool(targets_by_stream.get("low_level"))
-# → True only for Flavor 1 (action_execution).
-# * text_labels supervises whatever assistant turns are marked
-# target=true. For action_execution, this is the subtask
-# string. For Flavor 2, it's the corresponding text output.
+# 1. action_execution — fused (task + plan + memory) prompt;
+# supervises the current subtask (low_level: flow + text CE)
+# and, at memory-boundary frames, the new memory too.
+# 2. ask_vqa_{top,wrist} — text-only VQA on a camera image,
+# gated by ``if_present`` so they only fire on annotated frames.
 blend:
-# ----------------------------------------------------------
-# FLAVOR 1: action_execution (main path)
-#
-# Bundles memory updates inline. On most frames the binding
-# ``new_memory: emitted_at(t, style=memory)`` returns None and
-# only the subtask is supervised. On *boundary* frames (the
-# exact timestamp a new memory was annotated — i.e. when a
-# subtask just completed) the binding fires and the recipe
-# supervises the new memory as a follow-up assistant turn,
-# with a "Completed subtask: …" user message in between to
-# separate the two outputs in the chat sequence. Mirrors the
-# behaviour of the old standalone ``memory_update`` recipe
-# but keeps everything inside the unified action_execution.
-# ----------------------------------------------------------
 action_execution:
 weight: 0.85
 bindings:
@@ -72,17 +17,10 @@ blend:
 stream: high_level
 content: "${task}\nPlan: ${plan}\nMemory: ${memory}"
 - {role: assistant, content: "${subtask}", stream: low_level, target: true, if_present: subtask}
-# Memory-update tail — only renders at boundary frames where
-# ``new_memory`` fires. The new memory is appended as a second
-# assistant turn right after the subtask, with no intervening
-# user filler: at a subtask boundary the model emits the new
-# subtask AND the updated memory in one forward pass.
+# Boundary-frame tail: at a subtask transition, predict the
+# new memory as a second assistant turn (same forward pass).
 - {role: assistant, content: "${new_memory}", stream: high_level, target: true, if_present: new_memory}
-# ----------------------------------------------------------
-# FLAVOR 2: event-driven text-only paths
-# ----------------------------------------------------------
 ask_vqa_top:
 weight: 0.075
 bindings:
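Both trimmed headers rely on the same gating idea: a message whose ``if_present`` binding resolves to None is dropped, so the ``${new_memory}`` assistant turn only renders at boundary frames, and actions are supervised only when a ``low_level``-stream target survives (the rule the old SmolVLA2 header spelled out as ``predict_actions = bool(targets_by_stream.get("low_level"))``). The sketch below is a plain-Python guess at that gating, reusing the message dicts from the YAML; the renderer, the helper name, and the binding values are illustrative, not the project's real implementation.

from typing import Any

def render_recipe(messages: list[dict[str, Any]], bindings: dict[str, Any]) -> dict[str, Any]:
    """Drop messages whose if_present binding is None, substitute ${...}
    placeholders, and report which assistant turns get supervised."""
    rendered: list[dict[str, str]] = []
    targets_by_stream: dict[str, list[str]] = {}
    for msg in messages:
        guard = msg.get("if_present")
        if guard is not None and bindings.get(guard) is None:
            continue  # e.g. new_memory is None on every non-boundary frame
        content = msg["content"]
        for key, value in bindings.items():
            if value is not None:
                content = content.replace("${%s}" % key, str(value))
        rendered.append({"role": msg["role"], "content": content})
        if msg.get("target"):
            targets_by_stream.setdefault(msg["stream"], []).append(content)
    # Old header's rule: actions are supervised only when a low_level target exists.
    predict_actions = bool(targets_by_stream.get("low_level"))
    return {"messages": rendered,
            "targets_by_stream": targets_by_stream,
            "predict_actions": predict_actions}

messages = [
    {"role": "user", "stream": "high_level",
     "content": "${task}\nPlan: ${plan}\nMemory: ${memory}"},
    {"role": "assistant", "content": "${subtask}", "stream": "low_level",
     "target": True, "if_present": "subtask"},
    {"role": "assistant", "content": "${new_memory}", "stream": "high_level",
     "target": True, "if_present": "new_memory"},
]

common = {"task": "clear the table", "plan": "mug, then plate", "memory": "mug is in the sink"}

# Ordinary frame: new_memory is None, so only the subtask turn is supervised.
ordinary = render_recipe(messages, {**common, "subtask": "pick up the plate", "new_memory": None})
# Boundary frame: the new_memory binding fires and adds a second assistant target.
boundary = render_recipe(messages, {**common, "subtask": "wipe the table",
                                    "new_memory": "plate is in the sink"})
print(len(ordinary["messages"]), ordinary["predict_actions"])   # 2 True
print(len(boundary["messages"]), boundary["predict_actions"])   # 3 True

On ``action_execution`` both frames keep ``predict_actions`` True; a text-only ``ask_vqa_*`` recipe would come out False because its only target sits on the ``high_level`` stream.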