docs(groot): document the N1.5 removal and the N1.7 parity test

- groot.mdx: breaking-change warning and migration path (pin lerobot==0.5.1 to keep N1.5, or move to N1.7); the dead `huggingface-cli download` is replaced with `hf download`.
- policy_groot_README.md: N1.5 removal note, updated paper / model-card links, and the two-comparison (model parity + preprocessor parity) description of the original-vs-LeRobot test, including the raw-observation artifacts and recorded seed.
2026-06-17 16:27:04 +00:00 · 2026-06-12 23:40:36 +02:00
2 changed files with 56 additions and 24 deletions
@@ -4,6 +4,9 @@ GR00T is an NVIDIA foundation model family for generalized humanoid robot reason
 LeRobot integrates GR00T N1.7 through the `groot` policy type.
 > [!WARNING]
 > **Breaking change:** GR00T N1.5 support was removed from LeRobot, and current releases support GR00T N1.7 only. N1.5 checkpoints, configs, and `--policy.model_version=n1.5` are rejected with a clear error. To keep using an N1.5 checkpoint, pin the last release that supports it: `pip install 'lerobot==0.5.1'`. To use the current release, migrate to GR00T N1.7 (`model_version='n1.7'`, base model [`nvidia/GR00T-N1.7-3B`](https://huggingface.co/nvidia/GR00T-N1.7-3B)).
 ## Model Overview
 GR00T N1.7 uses a Cosmos-Reason2/Qwen3-VL backbone and provides checkpoints for SimplerEnv, DROID, and LIBERO.
@@ -133,7 +136,7 @@ Replace the `XX` placeholders with final eval artifacts before merge.
 Download the suite checkpoint locally, then point `--policy.base_model_path` at the downloaded subdirectory. `--policy.path` is reserved for LeRobot checkpoints that contain a LeRobot `config.json` with a `type` field.
 ```bash
-huggingface-cli download nvidia/GR00T-N1.7-LIBERO \
+hf download nvidia/GR00T-N1.7-LIBERO \
  --include "libero_spatial/*" \
  --local-dir ./GR00T-N1.7-LIBERO
@@ -1,6 +1,13 @@
 ## Research Paper
-Paper: https://research.nvidia.com/labs/gear/gr00t-n1_5/
+GR00T N1 technical report (covers the GR00T N1.x family, including N1.7): https://arxiv.org/abs/2503.14734
 GR00T N1.7 model card: https://huggingface.co/nvidia/GR00T-N1.7-3B
 GR00T N1.5 research page (earlier version): https://research.nvidia.com/labs/gear/gr00t-n1_5/
 > GR00T N1.5 support was removed from LeRobot; the last release supporting it is `lerobot==0.5.1`.
 > Current releases support GR00T N1.7 only.
 ## Repository
@@ -31,12 +38,22 @@ Hugging Face Models:
 ## Original-vs-LeRobot parity test
-`tests/policies/groot/test_groot_vs_original.py` verifies that this LeRobot
+`tests/policies/groot/test_groot_vs_original.py` verifies this LeRobot
 reimplementation of GR00T N1.7 (Qwen3-VL backbone + flow-matching action head)
-produces the **same raw model output** (`get_action(...)["action_pred"]`, the
+against NVIDIA's original `gr00t` package with two comparisons, each parametrized
-normalized flow-matching prediction) as NVIDIA's original `gr00t` package, given
+over every embodiment tag present in the checkpoint:
-byte-identical pre-processed inputs and the same flow-matching seed. It is
+
-parametrized over every embodiment tag present in the checkpoint.
+1. **Model parity** — given byte-identical pre-processed inputs and the same
   flow-matching seed (recorded in each artifact), both implementations must produce
   the **same raw model output** (`get_action(...)["action_pred"]`, the normalized
   flow-matching prediction). Output shapes must match exactly; any action-horizon
   or action-dim mismatch fails the test.
 2. **Preprocessor parity** — given the identical raw observations (per-camera
   frames, state vectors, language instruction), LeRobot's own preprocessor pipeline
   (real Qwen3-VL chat template / tokenizer / image packing + checkpoint-driven
   state normalization, no mocks) must produce the **same collated model inputs**
   (`input_ids`, `attention_mask`, `pixel_values`, `image_grid_thw`, `state`,
   `embodiment_id`) as the original package's processor.
 ### Why two environments
@@ -48,25 +65,37 @@ is itself a defaulted dataclass, so the original config dataclasses fail to impo
 So the test uses a **producer / consumer** split across two venvs:
-1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the *original*
+1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the _original_
   gr00t venv. For each embodiment it builds dummy inputs generically from the
   checkpoint metadata (state dims from `statistics.json`; camera/language keys from
-   the processor modality configs), runs the original model, and saves the exact
+   the processor modality configs), runs the original model, and saves to one `.npz`
-   collated inputs + raw `action_pred` to one `.npz` per tag.
+   per tag: the raw observations (`raw::` keys), the exact collated inputs
-2. **Consumer** — the pytest above, run in the *LeRobot* venv. It discovers every
+   (`in::` keys), the seed, and the raw `action_pred`.
-   `.npz`, replays the byte-identical inputs through the LeRobot model with the same
+2. **Consumer** — the pytest above, run in the _LeRobot_ venv. It discovers every
-   seed, and asserts the outputs match.
+   `.npz`; the model-parity case replays the byte-identical collated inputs through
   the LeRobot model with the recorded seed and asserts the outputs match, and the
   preprocessor-parity case replays the raw observations through LeRobot's full
   preprocessor pipeline and asserts the collated tensors match.
 > Artifacts generated by older versions of the dump script contain no `raw::`
 > fields; the preprocessor-parity case then **skips** with a regeneration hint.
 > Re-run the producer to refresh them.
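
The artifact layout described above can be pictured with a small sketch. It assumes only what the text states: `raw::` keys hold raw observations and `in::` keys hold collated inputs. The helper names and the sample key names in the usage note are hypothetical and are not taken from the test itself.

```python
def split_artifact_keys(keys):
    """Group artifact keys by their `raw::` / `in::` prefix (hypothetical helper)."""
    groups = {"raw": [], "in": [], "other": []}
    for key in keys:
        prefix, sep, name = key.partition("::")
        if sep and prefix in ("raw", "in"):
            groups[prefix].append(name)
        else:
            # Anything without a known prefix (e.g. seed, prediction) lands here.
            groups["other"].append(key)
    return groups


def has_raw_observations(keys):
    """True when the artifact can drive the preprocessor-parity case."""
    return bool(split_artifact_keys(keys)["raw"])
```

For an `.npz` opened with `numpy.load`, the key list is available as `.files`; an artifact such as `["in::input_ids", "in::state"]` has no `raw::` entries, which is the situation in which the preprocessor-parity case skips.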
 ### Fairness controls
-- **Same pre-processed inputs** — the original processor's `input_ids`,
+- **Same pre-processed inputs (model parity)** — the original processor's `input_ids`,
  `pixel_values`, `image_grid_thw`, `attention_mask`, `state`, `embodiment_id` are
-  fed verbatim to the LeRobot model (no re-tokenization / re-normalization).
+  fed verbatim to the LeRobot model (no re-tokenization / re-normalization), so the
  model comparison isolates the model. LeRobot's own tokenization / image packing is
  covered separately by the preprocessor-parity case, which compares its output
  against those same collated tensors from identical raw observations.
 - **Same precision + attention kernel** — both sides run **fp32 + SDPA**. The
  original defaults to `use_flash_attention=True` (flash_attention_2 + bf16); the
  producer forces SDPA + fp32. (With the defaults the gap is ~3e-2 — pure
  kernel/rounding noise, not an implementation difference.)
-- **Same flow-matching seed** — fixed (42) right before sampling on both sides.
+- **Same flow-matching seed** — fixed right before sampling on both sides; the
  producer records it in each artifact (`--seed`, default 42) and the consumer
  replays the recorded value.
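
To make the pass/fail criterion concrete, here is a minimal sketch of a shape-then-tolerance comparison. It assumes the element-wise rule `|candidate - reference| <= atol + rtol * |reference|` (the rule used by `numpy.allclose`); the test's actual comparison call is not shown here, and `shape_of`, `flatten`, and `outputs_match` are hypothetical names operating on plain nested lists.

```python
def shape_of(nested):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


def flatten(nested):
    """Yield the scalar leaves of a nested list in order."""
    if isinstance(nested, list):
        for item in nested:
            yield from flatten(item)
    else:
        yield nested


def outputs_match(reference, candidate, atol=1e-3, rtol=1e-3):
    """Shapes must be identical, then every element must be within tolerance."""
    if shape_of(reference) != shape_of(candidate):
        return False  # an action-horizon or action-dim mismatch fails outright
    return all(
        abs(c - r) <= atol + rtol * abs(r)
        for r, c in zip(flatten(reference), flatten(candidate))
    )
```

Under this rule and the default `1e-3` tolerances, a gap of about `3e-2` such as the one reported for the flash-attention + bf16 defaults would not pass, which is why both sides are pinned to fp32 + SDPA.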
 ### How to run
@@ -90,15 +119,15 @@ CUDA_VISIBLE_DEVICES=0 GROOT_PARITY_DEVICE=cuda \
    uv run pytest tests/policies/groot/test_groot_vs_original.py -v -s
 ```
-The `.npz` artifacts are local-only (gitignored, ~6–9 MB each) and are regenerated by
+The `.npz` artifacts are local-only (gitignored, ~6–10 MB each) and are regenerated by
-the producer; they are never committed. The test **skips** (does not fail) on CI or
+the producer; they are never committed. The tests **skip** (do not fail) on CI or
 when the checkpoint / artifacts are absent.
 #### Env knobs (all optional)
-| Var | Default | Purpose |
+| Var                                       | Default                          | Purpose                               |
-|---|---|---|
+| ----------------------------------------- | -------------------------------- | ------------------------------------- |
-| `GROOT_N1_7_PARITY_DIR` | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
+| `GROOT_N1_7_PARITY_DIR`                   | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
-| `GROOT_N1_7_LIBERO_CKPT` | auto (HF cache) | override checkpoint dir |
+| `GROOT_N1_7_LIBERO_CKPT`                  | auto (HF cache)                  | override checkpoint dir               |
-| `GROOT_PARITY_DEVICE` | `cuda` if available | `cpu` or `cuda` |
+| `GROOT_PARITY_DEVICE`                     | `cuda` if available              | `cpu` or `cuda`                       |
-| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3` | comparison tolerance |
+| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3`                           | comparison tolerance                  |
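
As an illustration of how the knobs in the table might be resolved, the sketch below reads them with the documented defaults. The function name `read_parity_settings` and the returned dictionary keys are hypothetical; the checkpoint and device entries are left as `None` when unset because their automatic resolution (HF cache lookup, CUDA detection) is outside this sketch.

```python
import os


def read_parity_settings(environ=os.environ):
    """Resolve the optional parity-test env vars (hypothetical helper)."""
    return {
        "artifact_dir": environ.get(
            "GROOT_N1_7_PARITY_DIR", "tests/policies/groot/artifacts"
        ),
        # None means "resolve automatically from the HF cache".
        "checkpoint_dir": environ.get("GROOT_N1_7_LIBERO_CKPT"),
        # None means "use cuda if available".
        "device": environ.get("GROOT_PARITY_DEVICE"),
        "atol": float(environ.get("GROOT_PARITY_ATOL", "1e-3")),
        "rtol": float(environ.get("GROOT_PARITY_RTOL", "1e-3")),
    }
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise, for example `read_parity_settings({"GROOT_PARITY_DEVICE": "cpu"})`.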