fix gpu saw

Merge commit 'refs/groot/docs'; commit 'refs/groot/backbone'; commit 'refs/groot/core' into fix/groot_training_experiment
fix(groot): N1.7 backbone loading and DiT parameter-count logging
2026-06-14 14:59:57 +00:00 · 2026-06-13 23:09:18 +02:00 · 2026-06-13 19:59:57 +02:00 · 2026-06-12 23:55:33 +02:00 · 2026-06-12 23:40:36 +02:00
5 changed files with 227 additions and 54 deletions
@@ -4,6 +4,9 @@ GR00T is an NVIDIA foundation model family for generalized humanoid robot reason

 LeRobot integrates GR00T N1.7 through the `groot` policy type.

+> [!WARNING]
+> **Breaking change:** GR00T N1.5 support was removed from LeRobot, and current releases support GR00T N1.7 only. N1.5 checkpoints, configs, and `--policy.model_version=n1.5` are rejected with a clear error. To keep using an N1.5 checkpoint, pin the last release that supports it: `pip install 'lerobot==0.5.1'`. To use the current release, migrate to GR00T N1.7 (`model_version='n1.7'`, base model [`nvidia/GR00T-N1.7-3B`](https://huggingface.co/nvidia/GR00T-N1.7-3B)).
+
 ## Model Overview

 GR00T N1.7 uses a Cosmos-Reason2/Qwen3-VL backbone and provides checkpoints for SimplerEnv, DROID, and LIBERO.
@@ -133,7 +136,7 @@ Replace the `XX` placeholders with final eval artifacts before merge.
 Download the suite checkpoint locally, then point `--policy.base_model_path` at the downloaded subdirectory. `--policy.path` is reserved for LeRobot checkpoints that contain a LeRobot `config.json` with a `type` field.

 ```bash
-huggingface-cli download nvidia/GR00T-N1.7-LIBERO \
+hf download nvidia/GR00T-N1.7-LIBERO \
  --include "libero_spatial/*" \
  --local-dir ./GR00T-N1.7-LIBERO

@@ -1,6 +1,13 @@
 ## Research Paper

-Paper: https://research.nvidia.com/labs/gear/gr00t-n1_5/
+GR00T N1 technical report (covers the GR00T N1.x family, including N1.7): https://arxiv.org/abs/2503.14734
+
+GR00T N1.7 model card: https://huggingface.co/nvidia/GR00T-N1.7-3B
+
+GR00T N1.5 research page (earlier version): https://research.nvidia.com/labs/gear/gr00t-n1_5/
+
+> GR00T N1.5 support was removed from LeRobot; the last release supporting it is `lerobot==0.5.1`.
+> Current releases support GR00T N1.7 only.

 ## Repository

@@ -31,12 +38,22 @@ Hugging Face Models:

 ## Original-vs-LeRobot parity test

-`tests/policies/groot/test_groot_vs_original.py` verifies that this LeRobot
+`tests/policies/groot/test_groot_vs_original.py` verifies this LeRobot
 reimplementation of GR00T N1.7 (Qwen3-VL backbone + flow-matching action head)
-produces the **same raw model output** (`get_action(...)["action_pred"]`, the
-normalized flow-matching prediction) as NVIDIA's original `gr00t` package, given
-byte-identical pre-processed inputs and the same flow-matching seed. It is
-parametrized over every embodiment tag present in the checkpoint.
+against NVIDIA's original `gr00t` package with two comparisons, each parametrized
+over every embodiment tag present in the checkpoint:
+
+1. **Model parity** — given byte-identical pre-processed inputs and the same
+   flow-matching seed (recorded in each artifact), both implementations must produce
+   the **same raw model output** (`get_action(...)["action_pred"]`, the normalized
+   flow-matching prediction). Output shapes must match exactly; any action-horizon
+   or action-dim mismatch fails the test.
+2. **Preprocessor parity** — given the identical raw observations (per-camera
+   frames, state vectors, language instruction), LeRobot's own preprocessor pipeline
+   (real Qwen3-VL chat template / tokenizer / image packing + checkpoint-driven
+   state normalization, no mocks) must produce the **same collated model inputs**
+   (`input_ids`, `attention_mask`, `pixel_values`, `image_grid_thw`, `state`,
+   `embodiment_id`) as the original package's processor.

 ### Why two environments

@@ -48,25 +65,37 @@ is itself a defaulted dataclass, so the original config dataclasses fail to impo

 So the test uses a **producer / consumer** split across two venvs:

-1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the *original*
+1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the _original_
   gr00t venv. For each embodiment it builds dummy inputs generically from the
   checkpoint metadata (state dims from `statistics.json`; camera/language keys from
-   the processor modality configs), runs the original model, and saves the exact
-   collated inputs + raw `action_pred` to one `.npz` per tag.
-2. **Consumer** — the pytest above, run in the *LeRobot* venv. It discovers every
-   `.npz`, replays the byte-identical inputs through the LeRobot model with the same
-   seed, and asserts the outputs match.
+   the processor modality configs), runs the original model, and saves to one `.npz`
+   per tag: the raw observations (`raw::` keys), the exact collated inputs
+   (`in::` keys), the seed, and the raw `action_pred`.
+2. **Consumer** — the pytest above, run in the _LeRobot_ venv. It discovers every
+   `.npz`; the model-parity case replays the byte-identical collated inputs through
+   the LeRobot model with the recorded seed and asserts the outputs match, and the
+   preprocessor-parity case replays the raw observations through LeRobot's full
+   preprocessor pipeline and asserts the collated tensors match.
+
+> Artifacts generated by older versions of the dump script contain no `raw::`
+> fields; the preprocessor-parity case then **skips** with a regeneration hint.
+> Re-run the producer to refresh them.

 ### Fairness controls

- **Same pre-processed inputs** — the original processor's `input_ids`,
+- **Same pre-processed inputs (model parity)** — the original processor's `input_ids`,
  `pixel_values`, `image_grid_thw`, `attention_mask`, `state`, `embodiment_id` are
-  fed verbatim to the LeRobot model (no re-tokenization / re-normalization).
+  fed verbatim to the LeRobot model (no re-tokenization / re-normalization), so the
+  model comparison isolates the model. LeRobot's own tokenization / image packing is
+  covered separately by the preprocessor-parity case, which compares its output
+  against those same collated tensors from identical raw observations.
 - **Same precision + attention kernel** — both sides run **fp32 + SDPA**. The
  original defaults to `use_flash_attention=True` (flash_attention_2 + bf16); the
  producer forces SDPA + fp32. (With the defaults the gap is ~3e-2 — pure
  kernel/rounding noise, not an implementation difference.)
- **Same flow-matching seed** — fixed (42) right before sampling on both sides.
+- **Same flow-matching seed** — fixed right before sampling on both sides; the
+  producer records it in each artifact (`--seed`, default 42) and the consumer
+  replays the recorded value.

 ### How to run

@@ -90,15 +119,15 @@ CUDA_VISIBLE_DEVICES=0 GROOT_PARITY_DEVICE=cuda \
    uv run pytest tests/policies/groot/test_groot_vs_original.py -v -s
 ```

-The `.npz` artifacts are local-only (gitignored, ~6–9 MB each) and are regenerated by
-the producer; they are never committed. The test **skips** (does not fail) on CI or
+The `.npz` artifacts are local-only (gitignored, ~6–10 MB each) and are regenerated by
+the producer; they are never committed. The tests **skip** (do not fail) on CI or
 when the checkpoint / artifacts are absent.

 #### Env knobs (all optional)

-| Var | Default | Purpose |
-|---|---|---|
-| `GROOT_N1_7_PARITY_DIR` | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
-| `GROOT_N1_7_LIBERO_CKPT` | auto (HF cache) | override checkpoint dir |
-| `GROOT_PARITY_DEVICE` | `cuda` if available | `cpu` or `cuda` |
-| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3` | comparison tolerance |
+| Var                                       | Default                          | Purpose                               |
+| ----------------------------------------- | -------------------------------- | ------------------------------------- |
+| `GROOT_N1_7_PARITY_DIR`                   | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
+| `GROOT_N1_7_LIBERO_CKPT`                  | auto (HF cache)                  | override checkpoint dir               |
+| `GROOT_PARITY_DEVICE`                     | `cuda` if available              | `cpu` or `cuda`                       |
+| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3`                           | comparison tolerance                  |
@@ -14,6 +14,7 @@
 # limitations under the License.


+import logging
 from typing import TYPE_CHECKING

 import torch
@@ -42,6 +43,9 @@ else:
    Timesteps = None


+logger = logging.getLogger(__name__)
+
+
 class TimestepEncoder(nn.Module):
    def __init__(self, embedding_dim, compute_dtype=torch.float32):
        require_package("diffusers", extra="groot")
@@ -265,8 +269,8 @@ class DiT(ModelMixin, ConfigMixin):
        self.norm_out = nn.LayerNorm(self.inner_dim, elementwise_affine=False, eps=1e-6)
        self.proj_out_1 = nn.Linear(self.inner_dim, 2 * self.inner_dim)
        self.proj_out_2 = nn.Linear(self.inner_dim, self.config.output_dim)
-        print(
-            "Total number of DiT parameters: ",
+        logger.debug(
+            "Total number of DiT parameters: %d",
            sum(p.numel() for p in self.parameters() if p.requires_grad),
        )

@@ -426,8 +430,8 @@ class SelfAttentionTransformer(ModelMixin, ConfigMixin):
                for _ in range(self.config.num_layers)
            ]
        )
-        print(
-            "Total number of SelfAttentionTransformer parameters: ",
+        logger.debug(
+            "Total number of SelfAttentionTransformer parameters: %d",
            sum(p.numel() for p in self.parameters() if p.requires_grad),
        )

@@ -71,7 +71,7 @@ GR00T_N1_7_DEFAULTS: dict[str, Any] = {
    "backbone_embedding_dim": 2048,
    "tune_llm": False,
    "tune_visual": False,
-    "select_layer": 12,
+    "select_layer": 16,
    "reproject_vision": False,
    "use_flash_attention": True,
    "load_bf16": False,
@@ -819,11 +819,14 @@ def _cosmos_reason2_qwen3_vl_config() -> PretrainedConfig:


 def get_backbone_cls(config: GR00TN17Config):
-    if (
-        config.backbone_model_type == "qwen"
-        or "nvidia/Cosmos-Reason2" in config.model_name
-        or "Qwen/Qwen3-VL" in config.model_name
-    ):
+    if "nvidia/Cosmos-Reason2" in config.model_name or "Qwen/Qwen3-VL" in config.model_name:
+        return Qwen3Backbone
+    if config.backbone_model_type == "qwen":
+        logger.warning(
+            "Unrecognized GR00T N1.7 backbone model name '%s'; assuming a Qwen3-VL-compatible "
+            "backbone because backbone_model_type='qwen'.",
+            config.model_name,
+        )
        return Qwen3Backbone
    raise ValueError(f"Unsupported GR00T N1.7 backbone model: {config.model_name}")

@@ -909,7 +912,7 @@ class GR00TN17(PreTrainedModel):
            "trust_remote_code": True
        }
        load_backbone_weights = kwargs.pop("load_backbone_weights", False)
-        for key in ("revision", "cache_dir", "local_files_only", "token"):
+        for key in ("cache_dir", "local_files_only", "token"):
            if key in kwargs:
                transformers_loading_kwargs.setdefault(key, kwargs[key])

@@ -23,8 +23,10 @@ from typing import TYPE_CHECKING, Any

 import numpy as np
 import torch
+import torchvision.transforms.v2.functional as tv_functional
 from einops import rearrange
 from PIL import Image
+from torchvision.transforms import InterpolationMode

 from lerobot.utils.import_utils import _transformers_available

@@ -57,6 +59,7 @@ from lerobot.utils.constants import (
    POLICY_POSTPROCESSOR_DEFAULT_NAME,
    POLICY_PREPROCESSOR_DEFAULT_NAME,
 )
+from lerobot.utils.device_utils import get_safe_torch_device

 from .configuration_groot import (
    GROOT_ACTION_DECODE_TRANSFORM_LIBERO,
@@ -744,6 +747,10 @@ def make_groot_pre_post_processors(
            use_albumentations=checkpoint_assets.use_albumentations
            if checkpoint_assets is not None
            else False,
+            # Run the image resize/normalize/patchify on the training device when
+            # possible instead of the single CPU main-loop thread (the dominant
+            # cost folded into dataloading_s).
+            device=config.device,
        ),
        DeviceProcessorStep(device=config.device),
    ]
@@ -982,6 +989,61 @@ def _transform_n1_7_image_for_vlm(
    return image


+def _transform_n1_7_image_for_vlm_torch(
+    image: torch.Tensor,
+    *,
+    image_crop_size: list[int] | None,
+    image_target_size: list[int] | None,
+    shortest_image_edge: int | None,
+    crop_fraction: float | None,
+) -> torch.Tensor:
+    """Torch/torchvision port of the non-albumentations branch of
+    :func:`_transform_n1_7_image_for_vlm`.
+
+    Operates on a ``(C, H, W)`` uint8 tensor and keeps the result on the input
+    tensor's device so the resize/crop run on GPU when the tensor is. Bicubic
+    interpolation with antialiasing matches PIL's ``Image.Resampling.BICUBIC``
+    closely (sub-``2/255`` per-pixel on worst-case inputs). The ``use_albumentations``
+    cv2/INTER_AREA path has no torch equivalent and stays on the PIL helper.
+    """
+    if image_target_size is None:
+        return image
+
+    target_h, target_w = image_target_size
+    _, height, width = image.shape
+
+    square_edge = max(height, width)
+    if height != width:
+        left = (square_edge - width) // 2
+        top = (square_edge - height) // 2
+        image = tv_functional.pad(
+            image, [left, top, square_edge - width - left, square_edge - height - top], fill=0
+        )
+
+    resize_edge = shortest_image_edge or target_h
+    image = tv_functional.resize(
+        image, [resize_edge, resize_edge], interpolation=InterpolationMode.BICUBIC, antialias=True
+    )
+
+    if crop_fraction is None and image_crop_size is not None:
+        crop_fraction = image_crop_size[0] / float(target_h)
+    if crop_fraction is not None and 0.0 < crop_fraction < 1.0:
+        # Match the PIL helper's center crop exactly: round() the crop size but
+        # floor() the offset (torchvision.center_crop rounds the offset, which
+        # shifts the region by 1px when (edge - crop) is odd).
+        crop_h = max(1, int(round(image.shape[-2] * crop_fraction)))
+        crop_w = max(1, int(round(image.shape[-1] * crop_fraction)))
+        top = max(0, (image.shape[-2] - crop_h) // 2)
+        left = max(0, (image.shape[-1] - crop_w) // 2)
+        image = image[..., top : top + crop_h, left : left + crop_w]
+
+    if tuple(image.shape[-2:]) != (target_h, target_w):
+        image = tv_functional.resize(
+            image, [target_h, target_w], interpolation=InterpolationMode.BICUBIC, antialias=True
+        )
+    return image
+
+
@dataclass
@ProcessorStepRegistry.register(name="groot_n1_7_pack_inputs_v1")
 class GrootN17PackInputsStep(ProcessorStep):
@@ -1280,6 +1342,12 @@ class GrootN17VLMEncodeStep(ProcessorStep):
    The packed video has shape ``(B, T, V, H, W, C)``. Each frame/view becomes
    an image item in the same chat message so the resulting image tokens match
    the temporal VLM packing used by Isaac-GR00T.
+
+    Images are handed to the torchvision-backed Qwen3-VL processor as ``(C, H, W)``
+    uint8 tensors (no per-frame PIL roundtrip), and, when ``device`` resolves to a
+    CUDA device, the resize/rescale/normalize/patchify run there instead of on the
+    single CPU main-loop thread. This keeps the output bit-identical on CPU and
+    moves the dominant preprocessing cost off the critical path on GPU.
    """

    model_name: str = GROOT_N1_7_BACKBONE_MODEL
@@ -1288,6 +1356,7 @@ class GrootN17VLMEncodeStep(ProcessorStep):
    shortest_image_edge: int | None = None
    crop_fraction: float | None = None
    use_albumentations: bool = False
+    device: str | None = None
    _proc: ProcessorMixin | None = field(default=None, init=False, repr=False)

    @property
@@ -1296,6 +1365,70 @@ class GrootN17VLMEncodeStep(ProcessorStep):
            self._proc = _build_n1_7_processor(self.model_name)
        return self._proc

+    def _target_device(self) -> torch.device | None:
+        # The albumentations path is cv2/PIL only, so it cannot run on GPU.
+        if self.device is None or self.use_albumentations:
+            return None
+        try:
+            return get_safe_torch_device(self.device)
+        except (AssertionError, RuntimeError):
+            # A device serialized at train time (e.g. "cuda") may be unavailable
+            # when the processor is reloaded elsewhere (e.g. CPU-only eval), and
+            # this step is not in the standard device-override set. Fall back to
+            # the CPU path, which is bit-identical, instead of crashing.
+            return None
+
+    def _build_sample_images(
+        self, video: Any, batch_size: int, target_device: torch.device | None
+    ) -> list[list[Any]]:
+        """Return, per batch item, its ordered ``(timestep, view)`` frames.
+
+        ``use_albumentations`` keeps the legacy per-frame PIL/cv2 transform;
+        otherwise frames are ``(C, H, W)`` uint8 tensors (moved to
+        ``target_device`` when set) for the torchvision-backed Qwen processor.
+        """
+        if self.use_albumentations:
+            video_np = np.asarray(video)
+            return [
+                [
+                    _transform_n1_7_image_for_vlm(
+                        Image.fromarray(video_np[batch_idx, timestep, view_idx]),
+                        image_crop_size=self.image_crop_size,
+                        image_target_size=self.image_target_size,
+                        shortest_image_edge=self.shortest_image_edge,
+                        crop_fraction=self.crop_fraction,
+                        use_albumentations=True,
+                    )
+                    for timestep in range(video_np.shape[1])
+                    for view_idx in range(video_np.shape[2])
+                ]
+                for batch_idx in range(batch_size)
+            ]
+
+        video_t = video if torch.is_tensor(video) else torch.from_numpy(np.ascontiguousarray(video))
+        # (B, T, V, H, W, C) uint8 -> (B, T, V, C, H, W)
+        video_t = video_t.permute(0, 1, 2, 5, 3, 4).contiguous()
+        if target_device is not None and video_t.device != target_device:
+            video_t = video_t.to(target_device, non_blocking=(target_device.type == "cuda"))
+
+        frames_per_sample: list[list[Any]] = []
+        for batch_idx in range(batch_size):
+            sample = video_t[batch_idx]  # (T, V, C, H, W)
+            frames_per_sample.append(
+                [
+                    _transform_n1_7_image_for_vlm_torch(
+                        sample[timestep, view_idx],
+                        image_crop_size=self.image_crop_size,
+                        image_target_size=self.image_target_size,
+                        shortest_image_edge=self.shortest_image_edge,
+                        crop_fraction=self.crop_fraction,
+                    )
+                    for timestep in range(sample.shape[0])
+                    for view_idx in range(sample.shape[1])
+                ]
+            )
+        return frames_per_sample
+
    def __call__(self, transition: EnvTransition) -> EnvTransition:
        obs = transition.get(TransitionKey.OBSERVATION, {}) or {}
        comp = transition.get(TransitionKey.COMPLEMENTARY_DATA, {}) or {}
@@ -1303,33 +1436,25 @@ class GrootN17VLMEncodeStep(ProcessorStep):
        if video is None:
            return transition

+        batch_size = int(video.shape[0])
        languages = _prepare_n1_7_language_batch(
            comp.get("language"),
-            video.shape[0],
+            batch_size,
            formalize_language=False,
        )

+        target_device = self._target_device()
+        sample_images = self._build_sample_images(video, batch_size, target_device)
+
        texts: list[str] = []
-        images: list[Image.Image] = []
-        for batch_idx in range(video.shape[0]):
-            sample = video[batch_idx]  # (T, V, H, W, C)
-            sample_images = [
-                _transform_n1_7_image_for_vlm(
-                    Image.fromarray(sample[timestep, view_idx]),
-                    image_crop_size=self.image_crop_size,
-                    image_target_size=self.image_target_size,
-                    shortest_image_edge=self.shortest_image_edge,
-                    crop_fraction=self.crop_fraction,
-                    use_albumentations=self.use_albumentations,
-                )
-                for timestep in range(sample.shape[0])
-                for view_idx in range(sample.shape[1])
-            ]
+        images: list[Any] = []
+        for batch_idx in range(batch_size):
+            frames = sample_images[batch_idx]
            conversation = [
                {
                    "role": "user",
                    "content": [
-                        *[{"type": "image", "image": image} for image in sample_images],
+                        *[{"type": "image", "image": image} for image in frames],
                        {"type": "text", "text": languages[batch_idx]},
                    ],
                }
@@ -1341,9 +1466,17 @@ class GrootN17VLMEncodeStep(ProcessorStep):
                    add_generation_prompt=False,
                )
            )
-            images.extend(sample_images)
+            images.extend(frames)

-        encoded = self.proc(text=texts, images=images, return_tensors="pt", padding=True)
+        proc_kwargs: dict[str, Any] = {
+            "text": texts,
+            "images": images,
+            "return_tensors": "pt",
+            "padding": True,
+        }
+        if target_device is not None:
+            proc_kwargs["device"] = str(target_device)
+        encoded = self.proc(**proc_kwargs)
        for key, value in encoded.items():
            comp[key] = value
        obs.pop("video", None)
@@ -1362,6 +1495,7 @@ class GrootN17VLMEncodeStep(ProcessorStep):
            "shortest_image_edge": self.shortest_image_edge,
            "crop_fraction": self.crop_fraction,
            "use_albumentations": self.use_albumentations,
+            "device": self.device,
        }
Author	SHA1	Message	Date
Steven Palma	404751ba8b	fix gpu saw	2026-06-13 23:09:18 +02:00
Steven Palma	559cba212d	Merge commit 'refs/groot/docs'; commit 'refs/groot/backbone'; commit 'refs/groot/core' into fix/groot_training_experiment	2026-06-13 19:59:57 +02:00
Steven Palma	895eaf0d7c	fix(groot): N1.7 backbone loading and DiT parameter-count logging - select_layer default tracks the N1.7-3B checkpoint value (16); real checkpoint loads still override it from config.json. - get_backbone_cls recognizes Cosmos-Reason2 / Qwen3-VL backbones by name and warns (instead of silently assuming) when an unrecognized backbone is loaded only on the strength of backbone_model_type='qwen'. - 'revision' pins the GR00T checkpoint repo only and is no longer forwarded into the unrelated backbone repo load; pin the backbone via transformers_loading_kwargs instead. - DiT / SelfAttentionTransformer parameter counts go through logging.debug instead of print().	2026-06-12 23:55:33 +02:00
Steven Palma	edda8552ec	docs(groot): document the N1.5 removal and the N1.7 parity test - groot.mdx: breaking-change warning and migration path (pin lerobot==0.5.1 to keep N1.5, or move to N1.7); the dead `huggingface-cli download` is replaced with `hf download`. - policy_groot_README.md: N1.5 removal note, updated paper / model-card links, and the two-comparison (model parity + preprocessor parity) description of the original-vs-LeRobot test, including the raw-observation artifacts and recorded seed.	2026-06-12 23:40:36 +02:00