fix(datasets,annotate): tag pushed dataset + clean revision error

Two bugs combined to make the brand-new ``_tool3`` dataset unloadable:

1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated dataset
   folder but never creates a codebase-version tag, so
   ``api/datasets/<repo>/refs`` returns ``"tags": []``. Then
   ``LeRobotDatasetMetadata`` → ``get_safe_version`` →
   ``get_repo_versions`` returns empty and the loader raises
   ``RevisionNotFoundError``.
2. ``RevisionNotFoundError`` itself was unconstructible: its
   ``HfHubHTTPError.__init__`` indexes ``response.headers``
   unconditionally on current ``huggingface_hub`` versions, so
   constructing it without a real ``Response`` blew up with
   ``AttributeError: 'NoneType' object has no attribute 'headers'``,
   masking the real "no tag" message.

Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and
call ``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset',
exist_ok=True)`` so the dataset is loadable straight from the Hub on
the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree
``CODEBASE_VERSION`` if info.json is missing/malformed; on tag creation
failure, prints the manual one-liner the user needs.

Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which
inherits ``HfHubHTTPError``) for what is really a config issue, not an
HTTP failure. Raise a plain ``RuntimeError`` with the same message, so
the caller actually sees what's wrong instead of an upstream attribute
error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
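
A minimal sketch of Fix #1, assuming a local dataset root that contains ``meta/info.json``. ``resolve_version_tag`` and ``tag_dataset`` are hypothetical helper names, and the ``"v3.0"`` fallback only stands in for the in-tree ``CODEBASE_VERSION``; ``HfApi.create_tag`` is the one real ``huggingface_hub`` call.

```python
import json
from pathlib import Path

FALLBACK_VERSION = "v3.0"  # assumption: stands in for the in-tree CODEBASE_VERSION


def resolve_version_tag(root: Path, fallback: str = FALLBACK_VERSION) -> str:
    """Return the codebase_version recorded in meta/info.json, or the fallback."""
    info_path = root / "meta" / "info.json"
    try:
        info = json.loads(info_path.read_text())
    except (OSError, ValueError):
        # Missing or malformed info.json: tag with the fallback instead.
        return fallback
    version = info.get("codebase_version") if isinstance(info, dict) else None
    if isinstance(version, str) and version.startswith("v"):
        return version
    return fallback


def tag_dataset(repo_id: str, root: Path) -> str:
    """Create the version tag on the Hub (needs network access and a token)."""
    from huggingface_hub import HfApi  # imported lazily; only needed for the push

    tag = resolve_version_tag(root)
    # exist_ok=True keeps a repeated push from failing on an existing tag.
    HfApi().create_tag(repo_id, tag=tag, repo_type="dataset", exist_ok=True)
    return tag
```

Keeping the version lookup separate from the Hub call means the fallback logic can be exercised without a token or network access.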
fix(datasets): raise readable error when repo has no version tags
2026-05-21 19:49:49 +00:00 · 2026-05-05 18:23:18 +02:00 · 2026-05-05 18:12:40 +02:00 · 2026-05-05 15:07:18 +02:00 · 2026-05-05 14:32:05 +02:00 · 2026-05-05 14:07:25 +02:00
5 changed files with 215 additions and 15 deletions
@@ -23,6 +23,18 @@ token = os.environ.get("HF_TOKEN") or get_token()
 if not token:
    raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")

+# --- Diversity knobs (Pi0.7-style prompt expansion) -----------------------
+# Bumped roughly 3x across the board to fight memorization on small datasets.
+# A single dataset trained for many epochs with deterministic atom wording
+# converges to perfect recall on training prompts but produces JSON-token
+# garbage at inference for any wording that drifts slightly. More atom
+# variants per episode and a higher sampling temperature widen the training
+# distribution so the model has to actually use its language head, not
+# just memorize.
+#
+# Pushes to a *new* hub repo (``_tool3``) so the previous annotation pass
+# (``_tool2``) stays intact — re-train from scratch on the new dataset and
+# compare loss-curve shapes to verify the diversity bump is doing something.
 CMD = (
    "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
    "pip install --no-deps "
@@ -41,19 +53,21 @@ CMD = (
    "--tensor-parallel-size 1 --max-model-len 32768 "
    '--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
    "--vlm.serve_ready_timeout_s=1800 "
-    "--vlm.client_concurrency=256 "
+    "--vlm.client_concurrency=128 "
    "--vlm.max_new_tokens=512 "
-    "--executor.episode_parallelism=32 "
+    "--vlm.temperature=0.7 "
+    "--executor.episode_parallelism=16 "
    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
    "--vlm.camera_key=observation.images.wrist "
    "--module_1.frames_per_second=1.0 "
    "--module_1.use_video_url=true "
    "--module_1.use_video_url_fps=1.0 "
    "--module_1.derive_task_from_video=always "
-    "--module_1.n_task_rephrasings=10 "
-    "--module_3.K=1 "
+    "--module_1.n_task_rephrasings=30 "
+    "--module_2.max_interjections_per_episode=6 "
+    "--module_3.K=3 "
    "--module_3.vqa_emission_hz=1.0 "
-    "--push_to_hub=pepijn223/super_poulain_full_tool2"
+    "--push_to_hub=pepijn223/super_poulain_full_tool3"
 )

 job = run_job(
@@ -237,17 +237,24 @@ def get_safe_version(repo_id: str, version: str | packaging.version.Version) ->
    hub_versions = get_repo_versions(repo_id)

    if not hub_versions:
-        raise RevisionNotFoundError(
-            f"""Your dataset must be tagged with a codebase version.
-            Assuming _version_ is the codebase_version value in the info.json, you can run this:
-            ```python
-            from huggingface_hub import HfApi
-
-            hub_api = HfApi()
-            hub_api.create_tag("{repo_id}", tag="_version_", repo_type="dataset")
-            ```
-            """
+        msg = (
+            f"Repo {repo_id!r} has no codebase-version tags. The dataset "
+            f"either doesn't exist on the Hub yet, or it was uploaded "
+            f"without a ``v3.x``-style tag. To tag an existing dataset run:\n"
+            f"  from huggingface_hub import HfApi\n"
+            f"  HfApi().create_tag({repo_id!r}, tag='v3.0', repo_type='dataset', exist_ok=True)"
        )
+        # ``RevisionNotFoundError`` extends ``HfHubHTTPError`` whose
+        # ``__init__`` indexes ``response.headers`` unconditionally on
+        # current ``huggingface_hub`` versions. Constructing it without
+        # a real ``Response`` object crashes with either
+        # ``TypeError: missing 1 required keyword-only argument`` (old
+        # builds) or ``AttributeError: 'NoneType' object has no attribute
+        # 'headers'`` (new builds). Skip that path entirely — this isn't
+        # really an HTTP error, it's a configuration issue — and raise a
+        # plain ``RuntimeError`` so the message actually reaches the
+        # caller.
+        raise RuntimeError(msg)

    if target_version in hub_versions:
        return f"v{target_version}"
@@ -271,6 +271,9 @@ class HighLevelSubtaskFwd(InferenceStep):
        msg = _generate_with_policy(
            self.policy, ctx, observation=observation, state=state, label="subtask gen"
        )
+        if msg and _looks_like_gibberish(msg):
+            push_log(state, f"  [info] subtask gen rejected (gibberish): {msg[:60]!r}")
+            return None
        if msg:
            changed = set_if_changed(state, "current_subtask", msg, label="subtask")
            if changed:
@@ -307,6 +310,9 @@ class MemoryUpdateFwd(InferenceStep):
        new_memory = _generate_with_policy(
            self.policy, ctx, observation=observation, state=state, label="memory gen"
        )
+        if new_memory and _looks_like_gibberish(new_memory):
+            push_log(state, f"  [info] memory gen rejected (gibberish): {new_memory[:60]!r}")
+            return None
        if new_memory:
            set_if_changed(state, "current_memory", new_memory, label="memory")
        return None
@@ -340,11 +346,16 @@ class UserInterjectionFwd(InferenceStep):
        if not out:
            push_log(state, "  [info] plan/say gen produced no text this tick")
            return None
+        if _looks_like_gibberish(out):
+            push_log(state, f"  [info] plan/say gen rejected (gibberish): {out[:60]!r}")
+            return None
        # Heuristic split: model is trained to emit one assistant turn
        # carrying both plan text AND a `say` tool call. Look for a
        # "<say>...</say>" or "say(...)" marker; fall back to whole
        # text → plan, no speech.
        plan_text, speech_text = _split_plan_and_say(out)
+        if plan_text and _looks_like_gibberish(plan_text):
+            plan_text = ""
        if plan_text:
            set_if_changed(state, "current_plan", plan_text, label="plan")
        if speech_text:
@@ -390,6 +401,9 @@ class AskVQAFwd(InferenceStep):
        answer = _generate_with_policy(
            self.policy, ctx, observation=observation, state=state, label="vqa gen"
        )
+        # VQA answers are intentionally JSON-like during training, so
+        # ``_looks_like_gibberish`` would false-positive on them. Keep
+        # the answer as-is — the VQA panel line lets the user judge.
        if answer:
            push_log(state, f"  vqa: {answer}")
        state["recent_vqa_query"] = None
@@ -432,6 +446,38 @@ class DispatchToolCalls(InferenceStep):
 # ---------------------------------------------------------------------------


+def _looks_like_gibberish(text: str) -> bool:
+    """Heuristically detect generation that's clearly off the rails.
+
+    Memorised models can collapse to dominant-mode outputs (often the
+    JSON-token salad ``":":":":...`` from VQA training) when the prompt
+    drifts even slightly from training distribution. If we accept those
+    as new state, they pollute the next tick's prompt and cascade into
+    worse outputs. Reject anything that looks pathological:
+
+    * empty / whitespace-only
+    * mostly punctuation (``"``, ``:``, ``,``)
+    * at most two distinct characters in a string longer than four
+    * starts with ``":`` and has more double quotes than spaces
+
+    The thresholds are intentionally lenient — a real subtask like
+    ``"close the gripper"`` has ~70%+ alpha characters, while gibberish
+    like ``":":":"`` has ~0%.
+    """
+    if not text or not text.strip():
+        return True
+    stripped = text.strip()
+    alpha = sum(1 for c in stripped if c.isalpha())
+    if alpha < max(3, len(stripped) // 8):
+        return True
+    if stripped.startswith('":') and stripped.count('"') > stripped.count(" "):
+        return True
+    # At most two distinct characters, e.g. ``aaaaaa`` or ``ababab``
+    if len(set(stripped)) <= 2 and len(stripped) > 4:
+        return True
+    return False
+
+
 def _control_context_messages(
    state: dict[str, Any],
    *,
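
To make the thresholds concrete, the sketch below restates the heuristic from the hunk above as a standalone function and runs it on a few inputs. The name ``looks_like_gibberish`` (without the leading underscore) and the sample strings are made up for illustration.

```python
def looks_like_gibberish(text: str) -> bool:
    """Standalone copy of the heuristic, for experimentation only."""
    if not text or not text.strip():
        return True
    stripped = text.strip()
    alpha = sum(1 for c in stripped if c.isalpha())
    # Fewer than three letters, or letters under ~1/8 of the string.
    if alpha < max(3, len(stripped) // 8):
        return True
    # JSON-prefix sniff: leading '":' with more quotes than spaces.
    if stripped.startswith('":') and stripped.count('"') > stripped.count(" "):
        return True
    # At most two distinct characters in a string longer than four.
    if len(set(stripped)) <= 2 and len(stripped) > 4:
        return True
    return False


samples = [
    "close the gripper",      # ordinary subtask: accepted
    "pick up the green bin",  # ordinary task wording: accepted
    '":":":":',               # no letters at all: rejected
    '" " "',                  # quotes and spaces only: rejected
    "aaaaaa",                 # one distinct character: rejected
    "ok",                     # only two letters: rejected by the alpha floor
]
for s in samples:
    print(f"{s!r:28} -> {looks_like_gibberish(s)}")
```

The alpha floor of three letters also rejects very short but valid outputs such as ``"ok"``; whether that matters depends on how terse real subtasks get.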
@@ -141,6 +141,43 @@ def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
    )
    print(f"[lerobot-annotate] uploaded to https://huggingface.co/datasets/{repo_id}", flush=True)

+    # Tag the upload with the codebase version. ``LeRobotDatasetMetadata``
+    # resolves the dataset revision via ``get_safe_version`` which scans
+    # for tags like ``v3.0``; without a tag the loader refuses to open
+    # the dataset. Read the version straight from the
+    # dataset's own ``meta/info.json`` so we tag whatever the writer
+    # actually wrote (no accidental drift if the codebase floor moves).
+    from lerobot.datasets.dataset_metadata import CODEBASE_VERSION  # noqa: PLC0415
+
+    info_path = root / "meta" / "info.json"
+    version_tag = CODEBASE_VERSION
+    if info_path.exists():
+        try:
+            from lerobot.utils.io_utils import load_json  # noqa: PLC0415
+
+            info = load_json(info_path)
+            ds_version = info.get("codebase_version")
+            if isinstance(ds_version, str) and ds_version.startswith("v"):
+                version_tag = ds_version
+        except Exception as exc:  # noqa: BLE001
+            print(f"[lerobot-annotate] could not read codebase_version from info.json ({exc}); falling back to {version_tag}", flush=True)
+    try:
+        api.create_tag(
+            repo_id=repo_id,
+            tag=version_tag,
+            repo_type="dataset",
+            exist_ok=True,
+        )
+        print(f"[lerobot-annotate] tagged {repo_id} as {version_tag}", flush=True)
+    except Exception as exc:  # noqa: BLE001
+        print(
+            f"[lerobot-annotate] WARNING: could not create tag {version_tag!r} on {repo_id}: {exc}. "
+            "Dataset is uploaded but ``LeRobotDataset`` won't be able to load it until it's tagged. "
+            "Run: from huggingface_hub import HfApi; "
+            f"HfApi().create_tag({repo_id!r}, tag={version_tag!r}, repo_type='dataset', exist_ok=True)",
+            flush=True,
+        )
+

 def main() -> None:
    annotate()
@@ -307,6 +307,71 @@ def _build_observation_provider(
    return _provider


+def _bootstrap_state_from_dataset(
+    *,
+    dataset_repo_id: str,
+    episode: int,
+    start_frame: int,
+) -> dict[str, str]:
+    """Pull task / active plan / active memory / active subtask at ``start_frame``.
+
+    The model is heavily memorised on the exact training prompts the
+    recipe rendered from this dataset (canonical task wording,
+    persistent atoms emitted earlier in the episode). Reconstructing
+    that state at REPL startup lets the runtime's first prompt line
+    up with what training looked like — without it the model sees an
+    out-of-distribution prompt and falls back to its dominant
+    training mode (VQA JSON spam).
+    """
+    from lerobot.datasets.lerobot_dataset import LeRobotDataset  # noqa: PLC0415
+
+    ds = LeRobotDataset(dataset_repo_id, episodes=[episode])
+    if len(ds) == 0:
+        return {}
+    idx = max(0, min(start_frame, len(ds) - 1))
+    sample = ds[idx]
+
+    out: dict[str, str] = {}
+    task = sample.get("task")
+    if isinstance(task, str) and task.strip():
+        out["task"] = task
+
+    persistent = sample.get("language_persistent") or []
+    # ``persistent`` is the broadcast slice of the episode; pick the
+    # *latest* row of each style whose ``timestamp`` is ≤ the
+    # frame's timestamp (matches the renderer's ``active_at``
+    # semantics).
+    try:
+        frame_ts = (
+            float(sample["timestamp"])
+            if not hasattr(sample["timestamp"], "item")
+            else sample["timestamp"].item()
+        )
+    except Exception:  # noqa: BLE001
+        frame_ts = float("inf")
+
+    by_style: dict[str, tuple[float, str]] = {}
+    for row in persistent:
+        style = row.get("style")
+        ts = row.get("timestamp")
+        content = row.get("content")
+        if not (style and content) or ts is None:
+            continue
+        try:
+            ts_f = float(ts)
+        except (TypeError, ValueError):
+            continue
+        if ts_f > frame_ts:
+            continue
+        prev = by_style.get(style)
+        if prev is None or ts_f >= prev[0]:
+            by_style[style] = (ts_f, content)
+    for style, (_, content) in by_style.items():
+        if style in {"plan", "memory", "subtask"}:
+            out[style] = content
+    return out
+
+
 def _build_tools(no_tts: bool, tts_voice: str) -> dict[str, Any]:
    """Instantiate the tools declared on this dataset/policy."""
    if no_tts:
@@ -364,6 +429,7 @@ def main(argv: list[str] | None = None) -> int:
    )

    observation_provider: Callable[[], dict | None] | None = None
+    bootstrap_state: dict[str, str] = {}
    if args.dataset_repo_id is not None:
        print(
            f"[smolvla2] streaming observations from {args.dataset_repo_id} "
@@ -379,6 +445,25 @@ def main(argv: list[str] | None = None) -> int:
            preprocessor=preprocessor,
            device=str(getattr(policy.config, "device", "cpu")),
        )
+        # Pull the dataset's canonical task + the persistent atoms in
+        # force at the chosen start frame. The model is heavily
+        # memorised on the *exact* training prompts (task wording,
+        # current plan, current memory) — feeding ad-hoc user
+        # alternatives gives it nothing to recall against, so it
+        # collapses to its dominant training mode (VQA JSON). Reading
+        # the canonical state straight from the dataset gives the
+        # runtime a starting point that lines up with training.
+        bootstrap_state = _bootstrap_state_from_dataset(
+            dataset_repo_id=args.dataset_repo_id,
+            episode=args.dataset_episode,
+            start_frame=args.dataset_start_frame,
+        )
+        if bootstrap_state.get("task") and not args.task:
+            args.task = bootstrap_state["task"]
+            print(
+                f"[smolvla2] using canonical task from dataset: {args.task!r}",
+                flush=True,
+            )

    tools = _build_tools(args.no_tts, args.tts_voice)
    if tools:
@@ -411,6 +496,17 @@ def main(argv: list[str] | None = None) -> int:
    )
    if args.task:
        runtime.set_task(args.task)
+    # Bootstrap plan/memory/subtask from the dataset so the first prompt
+    # the runtime builds matches what training rendered (task + active
+    # plan + active memory). Without this the runtime starts with
+    # plan/memory empty, which only matched the very-early frames in
+    # training and is an out-of-distribution prompt for the rest.
+    if bootstrap_state.get("plan"):
+        runtime.state["current_plan"] = bootstrap_state["plan"]
+    if bootstrap_state.get("memory"):
+        runtime.state["current_memory"] = bootstrap_state["memory"]
+    if bootstrap_state.get("subtask"):
+        runtime.state["current_subtask"] = bootstrap_state["subtask"]

    return _run_repl(runtime, initial_task=args.task, max_ticks=args.max_ticks)
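
The row selection in ``_bootstrap_state_from_dataset`` can be tried in isolation. The sketch below assumes ``language_persistent`` rows are plain dicts with ``style``, ``timestamp`` and ``content`` keys, as the hunk above reads them; ``active_rows`` and the sample rows are hypothetical.

```python
def active_rows(rows, frame_ts, styles=("plan", "memory", "subtask")):
    """Pick, per style, the latest row whose timestamp is <= frame_ts."""
    latest = {}
    for row in rows:
        style = row.get("style")
        ts = row.get("timestamp")
        content = row.get("content")
        if not (style and content) or ts is None:
            continue
        try:
            ts = float(ts)
        except (TypeError, ValueError):
            continue
        if ts > frame_ts:
            continue  # row only becomes active later in the episode
        if style not in latest or ts >= latest[style][0]:
            latest[style] = (ts, content)
    return {s: c for s, (_, c) in latest.items() if s in styles}


rows = [
    {"style": "plan", "timestamp": 0.0, "content": "pick up the cube, then place it in the bin"},
    {"style": "subtask", "timestamp": 0.0, "content": "reach for the cube"},
    {"style": "subtask", "timestamp": 2.5, "content": "close the gripper"},
    {"style": "subtask", "timestamp": 6.0, "content": "move to the bin"},
    {"style": "memory", "timestamp": 3.0, "content": "cube grasped"},
    {"style": "task_aug", "timestamp": 0.0, "content": "put the cube away"},
]

state = active_rows(rows, frame_ts=4.0)
print(state["subtask"])  # close the gripper
print(state["memory"])   # cube grasped
```

With ``frame_ts=1.0`` the same rows yield the first subtask and no memory entry, which is the empty-memory prompt the early frames of an episode were trained on.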
Author	SHA1	Message	Date
Pepijn	a764c3e1d6	fix(datasets,annotate): tag pushed dataset + clean revision error Two bugs combining to make the brand-new ``_tool3`` dataset unloadable: 1. ``lerobot_annotate.py:_push_to_hub`` uploads the annotated dataset folder but never creates a codebase-version tag, so ``api/datasets/<repo>/refs`` returns ``"tags": []``. Then ``LeRobotDatasetMetadata`` → ``get_safe_version`` → ``get_repo_versions`` returns empty and the loader raises ``RevisionNotFoundError``. 2. ``RevisionNotFoundError`` itself was unconstructible: its ``HfHubHTTPError.__init__`` indexes ``response.headers`` unconditionally on current ``huggingface_hub`` versions, so constructing it without a real ``Response`` blew up with ``AttributeError: 'NoneType' object has no attribute 'headers'``, masking the real "no tag" message. Fix #1: after upload, read ``meta/info.json["codebase_version"]`` and ``HfApi.create_tag(..., tag=<v3.x>, repo_type='dataset', exist_ok=True)`` so the dataset is loadable straight from the Hub on the next ``LeRobotDataset(repo_id)`` call. Falls back to the in-tree ``CODEBASE_VERSION`` if info.json is missing/malformed; on tag creation failure, prints the manual one-liner the user needs. Fix #2: stop trying to instantiate ``RevisionNotFoundError`` (which inherits HfHubHTTPError) for what is really a config issue, not an HTTP failure. Raise plain ``RuntimeError`` with the same message — the caller actually sees what's wrong instead of an upstream attribute error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 18:23:18 +02:00
Pepijn	b416f287f2	fix(datasets): raise readable error when repo has no version tags ``RevisionNotFoundError`` inherits from ``huggingface_hub.HfHubHTTPError`` which made ``response`` a required keyword-only argument on recent versions. Constructing it with just a message string blew up with ``TypeError: HfHubHTTPError.__init__() missing 1 required keyword-only argument: 'response'`` instead of surfacing the actual problem (the dataset/checkpoint repo doesn't exist on the Hub yet). Pass ``response=None`` explicitly. Fall back to the bare-message form for older ``huggingface_hub`` versions that don't accept the kwarg. Also clarify the message to call out the most common cause: typing a hub repo id that hasn't been pushed yet (instead of just "needs a version tag"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 18:12:40 +02:00
Pepijn	aa749d4947	chore(annotate): throttle Module 3 + executor parallelism to fix vLLM stall Last bump combined ``module_3.K=3`` with ``vqa_emission_hz=2.0`` and ``executor.episode_parallelism=32``. With 2 cameras per dataset that produced ~12× the original VQA call volume, all submitted concurrently. Module 3 latency went from ~30s/phase to ~490s per episode, vLLM's KV cache pegged at 94% with 800+ in-flight requests, and the multimodal cache corrupted with ``AssertionError: Expected a cached item for mm_hash='...'`` (a known vLLM bug under image-heavy concurrency). Module 1 and 2 ran fine; Module 3 was the bottleneck. Pull back the multipliers to land in a sustainable spot: * module_3.K: 3 (kept) — three diverse questions per emission, where the diversity actually helps the LM head. * module_3.vqa_emission_hz: 2.0 → 1.0 — back to the original emission rate. Net VQA volume is now ~3× original (K alone) on a single camera, ~6× across both cameras — manageable. * module_2.max_interjections_per_episode: 9 → 6 — still 2× the default, fewer than the prior 3× to keep total request volume in check. * vlm.client_concurrency: 256 → 128 — gives vLLM headroom on the multimodal request path so the mm_cache doesn't desync. * executor.episode_parallelism: 32 → 16 — half the episodes in flight at once, so peak vLLM load is ~half. n_task_rephrasings stays at 30 (text-only, doesn't load the image path) and vlm.temperature stays at 0.7. The diversity gains are preserved; only the throughput knobs come down. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 15:07:18 +02:00
Pepijn	1394a6ab5d	chore(annotate): bump diversity knobs ~3x to fight memorisation Following Pi0.7 §V (prompt expansion / diverse context conditioning), push more atom variants per episode and higher VLM sampling temperature so the training distribution has enough wording diversity that the LM head is forced to use its parameters rather than memorise specific (prompt, target) pairs. Changes vs prior annotation pass: * vlm.temperature: 0.2 (default) → 0.7 — every Module-1/2/3 call now produces diverse phrasings; same prompt yields different completions across emissions. * module_1.n_task_rephrasings: 10 → 30 — three times as many ``task_aug`` rows in language_persistent. ``${task}`` already rotates through them deterministically per sample_idx (see ``_resolve_task`` in language_render.py). * module_2.max_interjections_per_episode: 3 (default) → 9 — more ``user_interjection_response`` training samples + more plan refresh events. * module_3.K: 1 → 3 — three VQA pairs per emission tick instead of one. Combined with the hz bump below, ~6× more VQA samples. * module_3.vqa_emission_hz: 1.0 → 2.0 — double the VQA emission rate within each subtask span. Pushes to a new hub repo (``_tool3``) so the working ``_tool2`` dataset stays intact for comparison. ``${task}`` already wired to rotate through ``task_aug`` rows, so no renderer change needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 14:32:05 +02:00
Pepijn	db9118f16f	fix(smolvla2): reject gibberish high-level generations Memorised models can collapse to dominant-mode outputs (the JSON-token salad ``":":":":...`` from VQA training) when the prompt drifts even slightly from training distribution. Without a guard, that gibberish lands in ``current_subtask`` / ``current_plan`` / ``current_memory``, which feeds the next tick's prompt and cascades into worse outputs. The user observed exactly this: a clean run followed by a tick that wrote ``" " "`` into plan and memory, then slow recovery several ticks later. Add ``_looks_like_gibberish`` heuristic (alpha density, repeating chars, JSON-prefix sniff) and apply it before mutating state in ``HighLevelSubtaskFwd`` / ``MemoryUpdateFwd`` / ``UserInterjectionFwd``. Bad generations are logged inline (``[info] subtask gen rejected (gibberish): "":":":..."``) so the user can see what was dropped, but the state stays at its last-known-good value (typically the dataset bootstrap) instead of being polluted. VQA path is intentionally exempt — its training targets are JSON-shaped, so the heuristic would false-positive on them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 14:07:25 +02:00
Pepijn	7a945d7bdc	fix(smolvla2): bootstrap canonical task + plan/memory from dataset The user-typed task and the dataset's canonical task differ in wording (capitalisation, ``green box`` vs ``green bin``, etc.). With ``text_loss`` driven down to ~6e-6 across 78 epochs the model is memorised on the exact rendered training prompts: any wording drift puts the prompt out of distribution and the model collapses to its dominant training mode (VQA JSON output). When ``--dataset.repo_id`` is set, automatically: * read the canonical task string from the chosen episode (and use it as ``--task`` when the user didn't pass one); * pull the active ``plan`` / ``memory`` / ``subtask`` rows from the persistent slice (latest row whose timestamp ≤ start frame's timestamp — same semantics as the renderer's ``active_at``) and seed them into the runtime state. The first prompt the runtime builds at REPL start now mirrors what the recipe rendered during training (task + active plan + active memory + optional current subtask). The user can still override any of these by typing. Memorisation itself is upstream (training mix collapsed to too few unique high-level targets); this commit only fixes the inference-side prompt mismatch that was making the memorisation surface as gibberish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 14:00:36 +02:00
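
The volume figures quoted in the throttling commit (aa749d4947) follow from a simple product. The sketch below assumes VQA call volume scales linearly with ``K``, ``vqa_emission_hz`` and the number of cameras, relative to a baseline of ``K=1``, 1.0 Hz and one camera. ``relative_vqa_volume`` is a hypothetical helper, not part of the codebase.

```python
def relative_vqa_volume(k: int, emission_hz: float, cameras: int) -> float:
    """VQA call volume relative to K=1, 1.0 Hz, one camera (linear model)."""
    baseline = 1 * 1.0 * 1
    return (k * emission_hz * cameras) / baseline


# Before throttling: K=3, 2.0 Hz, two cameras.
print(relative_vqa_volume(3, 2.0, 2))  # 12.0
# After throttling: K=3, 1.0 Hz, on one camera and across both.
print(relative_vqa_volume(3, 1.0, 1))  # 3.0
print(relative_vqa_volume(3, 1.0, 2))  # 6.0
```

These match the ~12×, ~3× and ~6× figures in the commit message; actual vLLM load also depends on ``client_concurrency`` and ``episode_parallelism``, which this model ignores.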