test(groot): regression coverage and CI guards for the N1.7 review fixes

Adds/updates unit tests for the N1.5 removal surfaces (config, checkpoint markers, removed processor steps, v2 action-unpack registration), the legacy-default remap warnings, action_decode_transform auto/none resolution, 2-D action decoding, the per-instance raw-state cache and pack/decode reconnection, raw-checkpoint stats fallback and override handling, camera-match warnings, bf16 handling, and backbone loading kwargs.

Adds pytest.importorskip guards so the fast_tests tiers pass without transformers, and asserts the training forward pass returns a finite loss.

Note: these tests exercise symbols introduced by the GR00T N1.7 source PRs (source-core, backbone); merge those for green CI on this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 14:59:57 +00:00 · 2026-06-12 23:38:08 +02:00
4 changed files with 952 additions and 99 deletions
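
The `pytest.importorskip` guards mentioned in the commit message could look roughly like the sketch below. The test name and the `AutoConfig` attribute check are illustrative assumptions, not code taken from this commit; only `pytest.importorskip` itself is the real pytest API.

```python
import pytest


def test_backbone_kwargs_smoke():
    # Skip (rather than error) when the optional dependency is missing,
    # e.g. in a fast test tier installed without transformers.
    transformers = pytest.importorskip("transformers")

    # Hypothetical check: only reached when transformers is importable.
    assert hasattr(transformers, "AutoConfig")
```

Placing the same call at module level instead would skip every test in the file when the import fails, which suits modules that cannot be collected at all without the dependency.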
@@ -4,9 +4,6 @@ GR00T is an NVIDIA foundation model family for generalized humanoid robot reason

 LeRobot integrates GR00T N1.7 through the `groot` policy type.

-> [!WARNING]
-> **Breaking change:** GR00T N1.5 support was removed from LeRobot, and current releases support GR00T N1.7 only. N1.5 checkpoints, configs, and `--policy.model_version=n1.5` are rejected with a clear error. To keep using an N1.5 checkpoint, pin the last release that supports it: `pip install 'lerobot==0.5.1'`. To use the current release, migrate to GR00T N1.7 (`model_version='n1.7'`, base model [`nvidia/GR00T-N1.7-3B`](https://huggingface.co/nvidia/GR00T-N1.7-3B)).
-
 ## Model Overview

 GR00T N1.7 uses a Cosmos-Reason2/Qwen3-VL backbone and provides checkpoints for SimplerEnv, DROID, and LIBERO.
@@ -136,7 +133,7 @@ Replace the `XX` placeholders with final eval artifacts before merge.
 Download the suite checkpoint locally, then point `--policy.base_model_path` at the downloaded subdirectory. `--policy.path` is reserved for LeRobot checkpoints that contain a LeRobot `config.json` with a `type` field.

 ```bash
-hf download nvidia/GR00T-N1.7-LIBERO \
+huggingface-cli download nvidia/GR00T-N1.7-LIBERO \
  --include "libero_spatial/*" \
  --local-dir ./GR00T-N1.7-LIBERO

@@ -1,13 +1,6 @@
 ## Research Paper

-GR00T N1 technical report (covers the GR00T N1.x family, including N1.7): https://arxiv.org/abs/2503.14734
-
-GR00T N1.7 model card: https://huggingface.co/nvidia/GR00T-N1.7-3B
-
-GR00T N1.5 research page (earlier version): https://research.nvidia.com/labs/gear/gr00t-n1_5/
-
-> GR00T N1.5 support was removed from LeRobot; the last release supporting it is `lerobot==0.5.1`.
-> Current releases support GR00T N1.7 only.
+Paper: https://research.nvidia.com/labs/gear/gr00t-n1_5/

 ## Repository

@@ -38,22 +31,12 @@ Hugging Face Models:

 ## Original-vs-LeRobot parity test

-`tests/policies/groot/test_groot_vs_original.py` verifies this LeRobot
+`tests/policies/groot/test_groot_vs_original.py` verifies that this LeRobot
 reimplementation of GR00T N1.7 (Qwen3-VL backbone + flow-matching action head)
-against NVIDIA's original `gr00t` package with two comparisons, each parametrized
-over every embodiment tag present in the checkpoint:
-
-1. **Model parity** — given byte-identical pre-processed inputs and the same
-   flow-matching seed (recorded in each artifact), both implementations must produce
-   the **same raw model output** (`get_action(...)["action_pred"]`, the normalized
-   flow-matching prediction). Output shapes must match exactly; any action-horizon
-   or action-dim mismatch fails the test.
-2. **Preprocessor parity** — given the identical raw observations (per-camera
-   frames, state vectors, language instruction), LeRobot's own preprocessor pipeline
-   (real Qwen3-VL chat template / tokenizer / image packing + checkpoint-driven
-   state normalization, no mocks) must produce the **same collated model inputs**
-   (`input_ids`, `attention_mask`, `pixel_values`, `image_grid_thw`, `state`,
-   `embodiment_id`) as the original package's processor.
+produces the **same raw model output** (`get_action(...)["action_pred"]`, the
+normalized flow-matching prediction) as NVIDIA's original `gr00t` package, given
+byte-identical pre-processed inputs and the same flow-matching seed. It is
+parametrized over every embodiment tag present in the checkpoint.

 ### Why two environments

@@ -65,37 +48,25 @@ is itself a defaulted dataclass, so the original config dataclasses fail to impo

 So the test uses a **producer / consumer** split across two venvs:

-1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the _original_
+1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the *original*
   gr00t venv. For each embodiment it builds dummy inputs generically from the
   checkpoint metadata (state dims from `statistics.json`; camera/language keys from
-   the processor modality configs), runs the original model, and saves to one `.npz`
-   per tag: the raw observations (`raw::` keys), the exact collated inputs
-   (`in::` keys), the seed, and the raw `action_pred`.
-2. **Consumer** — the pytest above, run in the _LeRobot_ venv. It discovers every
-   `.npz`; the model-parity case replays the byte-identical collated inputs through
-   the LeRobot model with the recorded seed and asserts the outputs match, and the
-   preprocessor-parity case replays the raw observations through LeRobot's full
-   preprocessor pipeline and asserts the collated tensors match.
-
-> Artifacts generated by older versions of the dump script contain no `raw::`
-> fields; the preprocessor-parity case then **skips** with a regeneration hint.
-> Re-run the producer to refresh them.
+   the processor modality configs), runs the original model, and saves the exact
+   collated inputs + raw `action_pred` to one `.npz` per tag.
+2. **Consumer** — the pytest above, run in the *LeRobot* venv. It discovers every
+   `.npz`, replays the byte-identical inputs through the LeRobot model with the same
+   seed, and asserts the outputs match.

 ### Fairness controls

-- **Same pre-processed inputs (model parity)** — the original processor's `input_ids`,
+- **Same pre-processed inputs** — the original processor's `input_ids`,
  `pixel_values`, `image_grid_thw`, `attention_mask`, `state`, `embodiment_id` are
-  fed verbatim to the LeRobot model (no re-tokenization / re-normalization), so the
-  model comparison isolates the model. LeRobot's own tokenization / image packing is
-  covered separately by the preprocessor-parity case, which compares its output
-  against those same collated tensors from identical raw observations.
+  fed verbatim to the LeRobot model (no re-tokenization / re-normalization).
 - **Same precision + attention kernel** — both sides run **fp32 + SDPA**. The
  original defaults to `use_flash_attention=True` (flash_attention_2 + bf16); the
  producer forces SDPA + fp32. (With the defaults the gap is ~3e-2 — pure
  kernel/rounding noise, not an implementation difference.)
-- **Same flow-matching seed** — fixed right before sampling on both sides; the
-  producer records it in each artifact (`--seed`, default 42) and the consumer
-  replays the recorded value.
+- **Same flow-matching seed** — fixed (42) right before sampling on both sides.

 ### How to run

@@ -119,15 +90,15 @@ CUDA_VISIBLE_DEVICES=0 GROOT_PARITY_DEVICE=cuda \
    uv run pytest tests/policies/groot/test_groot_vs_original.py -v -s
 ```

-The `.npz` artifacts are local-only (gitignored, ~6–10 MB each) and are regenerated by
-the producer; they are never committed. The tests **skip** (do not fail) on CI or
+The `.npz` artifacts are local-only (gitignored, ~6–9 MB each) and are regenerated by
+the producer; they are never committed. The test **skips** (does not fail) on CI or
 when the checkpoint / artifacts are absent.

 #### Env knobs (all optional)

-| Var                                       | Default                          | Purpose                               |
-| ----------------------------------------- | -------------------------------- | ------------------------------------- |
-| `GROOT_N1_7_PARITY_DIR`                   | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
-| `GROOT_N1_7_LIBERO_CKPT`                  | auto (HF cache)                  | override checkpoint dir               |
-| `GROOT_PARITY_DEVICE`                     | `cuda` if available              | `cpu` or `cuda`                       |
-| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3`                           | comparison tolerance                  |
+| Var | Default | Purpose |
+|---|---|---|
+| `GROOT_N1_7_PARITY_DIR` | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
+| `GROOT_N1_7_LIBERO_CKPT` | auto (HF cache) | override checkpoint dir |
+| `GROOT_PARITY_DEVICE` | `cuda` if available | `cpu` or `cuda` |
+| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3` | comparison tolerance |
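
How the tolerance knobs in the table might be consumed is sketched below. It assumes the variables are parsed as floats and that the comparison follows the `numpy.allclose` criterion `|a - b| <= atol + rtol * |b|`. The helper names `read_tolerances` and `outputs_match` are hypothetical, and the real test may compare tensors differently.

```python
import os


def read_tolerances(env=None):
    """Return (atol, rtol) from the environment, defaulting to 1e-3 each."""
    env = os.environ if env is None else env
    atol = float(env.get("GROOT_PARITY_ATOL", "1e-3"))
    rtol = float(env.get("GROOT_PARITY_RTOL", "1e-3"))
    return atol, rtol


def outputs_match(actual, expected, atol, rtol):
    """Elementwise closeness check over two flat sequences of floats."""
    if len(actual) != len(expected):
        return False  # a length mismatch is a failure, not a near-miss
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))
```

Keeping the defaults in one place makes it easy to loosen the tolerance for a single run, for example by exporting `GROOT_PARITY_ATOL=1e-2` before invoking pytest.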
@@ -207,6 +207,11 @@ def test_lerobot_groot_forward_pass():
    with torch.no_grad():
        lerobot_loss, lerobot_metrics = lerobot_policy.forward(batch_lerobot_processed)

+    assert isinstance(lerobot_loss, torch.Tensor)
+    assert torch.isfinite(lerobot_loss).all()
+    assert "loss" in lerobot_metrics
+    assert np.isfinite(lerobot_metrics["loss"])
+
    print("\nForward pass successful.")
    print(f"  - Loss: {lerobot_loss.item():.6f}")
    print(f"  - Metrics: {lerobot_metrics}")
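
The finite-loss assertions added in this hunk guard against a forward pass that silently returns NaN or infinity. A dependency-free sketch of the same idea follows, using a hypothetical `check_finite_loss` helper on plain floats rather than tensors:

```python
import math


def check_finite_loss(loss, metrics):
    """Raise AssertionError unless the loss and metrics['loss'] are finite."""
    assert math.isfinite(loss), f"non-finite loss: {loss}"
    assert "loss" in metrics, "metrics must report a 'loss' entry"
    assert math.isfinite(metrics["loss"]), f"non-finite metric: {metrics['loss']}"


check_finite_loss(0.25, {"loss": 0.25})  # passes silently
```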