Compare commits

8 Commits

Author SHA1 Message Date
Steven Palma 5753f8c18b fix(groot): GPU/tensor N1.7 image preprocessing + resize to trained resolution
GR00T training was dataloader-bound (0->100->0 GPU-utilization sawtooth).
GrootN17VLMEncodeStep ran the Qwen3-VL image processor per frame on PIL images
on the single CPU main-loop thread, and that cost is timed inside dataloading_s
(preprocessor(batch) runs in the main process, not the dataloader workers), so
adding workers cannot hide it.

- Feed the torchvision-backed Qwen3-VL processor (C,H,W) uint8 tensors instead
  of a per-frame Image.fromarray PIL roundtrip, and run resize/normalize/patchify
  on config.device (GPU) when available. Bit-identical on CPU when no resize is
  configured; with a resize only the PIL->torchvision bicubic backend differs
  (<2/255 per pixel). The use_albumentations path stays PIL/cv2; reloading on a
  machine without the saved device falls back to CPU.

- Default image_target_size/crop to the N1.7 backbone's training geometry
  (256x256 / 230x230) when a checkpoint ships no image sizing (checkpoint_assets
  is None, e.g. finetuning nvidia/GR00T-N1.7-3B via repo-id with a new
  embodiment). Previously image_target_size=None disabled the resize, so
  full-resolution frames were patchified into ~4.7x more vision tokens than the
  model was trained on -- inflating dataloading_s (patchify) and update_s (VLM
  sequence) and skewing the input distribution. Checkpoints that pin their own
  sizing are honored; the default constants are shared with GR00T_N1_7_DEFAULTS.
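
The ~4.7x figure above can be reproduced with a rough token count. This is only a sketch under assumptions the commit does not state: a 16-pixel patch with 2x2 spatial merging (one vision token per 32x32 pixel block) and a 640x480 camera frame. `vision_tokens` is a hypothetical helper, not part of lerobot or the Qwen3-VL processor.

```python
import math


def vision_tokens(height: int, width: int, patch: int = 16, merge: int = 2) -> int:
    """Approximate vision-token count for one image.

    Assumption (not confirmed by the commit): one token per
    (patch * merge) x (patch * merge) pixel block, with each side rounded up
    to a whole number of blocks.
    """
    block = patch * merge
    return math.ceil(height / block) * math.ceil(width / block)


trained = vision_tokens(256, 256)   # N1.7 training geometry named in the commit
full_res = vision_tokens(480, 640)  # assumed full-resolution camera frame
ratio = full_res / trained
print(trained, full_res, ratio)
```

Under these assumptions the full-resolution frame costs 300 tokens against 64 at the trained resolution, which is consistent with the quoted ratio; other camera resolutions would give a different factor.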

Net: preprocessing leaves the CPU critical path and the VLM sees the resolution
it was trained on -- faster training/inference and a correct train/serve
distribution. Affects inference too (shared preprocessor); existing checkpoints
still load (backward compatible) but must be retrained to gain the benefits.
2026-06-15 18:20:49 +02:00
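
The default (non-albumentations) transform in the commit above pads to a square, resizes, center-crops, and resizes back to the target size. The index arithmetic can be traced without touching pixels. A minimal sketch, assuming a 480x640 frame and the default 256x256 / 230x230 geometry; `n1_7_geometry` is a hypothetical helper that mirrors the integer math of `_transform_n1_7_image_for_vlm_torch` in the diff below, not a function from lerobot.

```python
def n1_7_geometry(height, width, target=(256, 256), crop=(230, 230), shortest_edge=None):
    """Return (padding, resize_edge, crop_box) for one frame.

    padding is [left, top, right, bottom], the order used for the pad call in
    the diff; crop_box is (top, left, crop_h, crop_w) inside the resized square.
    """
    target_h, _target_w = target

    # Pad the shorter side symmetrically (floor on the leading edge).
    square = max(height, width)
    left = (square - width) // 2
    top = (square - height) // 2
    padding = [left, top, square - width - left, square - height - top]

    # Resize the square to resize_edge x resize_edge.
    resize_edge = shortest_edge or target_h

    # Center crop: round() the crop size, floor() the offset.
    crop_fraction = crop[0] / float(target_h)
    if 0.0 < crop_fraction < 1.0:
        crop_h = max(1, int(round(resize_edge * crop_fraction)))
        crop_w = max(1, int(round(resize_edge * crop_fraction)))
    else:
        crop_h = crop_w = resize_edge  # no crop applied
    crop_top = max(0, (resize_edge - crop_h) // 2)
    crop_left = max(0, (resize_edge - crop_w) // 2)
    return padding, resize_edge, (crop_top, crop_left, crop_h, crop_w)


padding, resize_edge, crop_box = n1_7_geometry(480, 640)
print(padding, resize_edge, crop_box)
```

For the assumed frame, the sketch pads 80 rows above and below, resizes to 256, and crops a 230x230 window at offset 13 before the final resize back to 256x256.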
Kartik 97bd373d15 Merge pull request #15 from huggingface/fix/groot_n17_core
fix(groot): N1.7 config defaults, N1.5 rejection, and processor/model runtime fixes
2026-06-13 23:05:51 +02:00
Kartik 10a73e3c95 Merge pull request #14 from huggingface/fix/groot_n17_backbone
fix(groot): N1.7 backbone loading and DiT parameter-count logging
2026-06-13 21:47:35 +02:00
Kartik 27c9288b24 Merge pull request #13 from huggingface/fix/groot_n17_docs
docs(groot): document the N1.5 removal and the N1.7 parity test
2026-06-13 21:47:05 +02:00
Steven Palma 378897800a fix(groot): skip normalization overrides for training 2026-06-13 19:51:29 +02:00
Steven Palma fcb371eddd fix(groot): N1.7 config defaults, N1.5 rejection, and processor/model runtime fixes
Covers the GR00T N1.7 source trio (configuration, processor, model wrapper).

Config:
- GrootConfig defaults are the N1.7 values; explicitly passed legacy N1.5-era
  values (chunk_size=50, max_state_dim=64, ...) are remapped with a warning
  instead of being remapped silently.
- action_decode_transform gains an 'auto' sentinel so an explicit 'none'
  opt-out wins over the libero_sim default and survives save/load round-trips.
- action_delta_indices is cached on the inputs that determine it.
- Legacy N1.5 checkpoints/configs (tokenizer_assets_repo, model_type/
  architectures/eagle backbone markers) are rejected with a single clear
  error pointing to lerobot==0.5.1.

Processor:
- GrootN17ActionDecodeStep handles the 2-D (B, D) actions delivered by sync
  select_action (relative eef/non-eef decode in eval/record flows).
- Postprocessor falls back to dataset stats when a raw checkpoint lacks the
  configured embodiment tag; raw-state cache is per-instance, not
  process-global; caller overrides (device, rename_map) are honored on the
  raw-checkpoint branch.
- Camera/modality-key mismatches warn (including the zero-match fallback);
  deprecated Qwen2VLImageProcessorFast replaced with Qwen2VLImageProcessor;
  removed N1.5 processor steps are stubbed to raise the removal guidance and
  the action-unpack step is re-registered as _v2.

Model:
- Flash-attention probe is diagnostic-only; forward raises on a missing loss;
  print() replaced with logging; N1.5 base-path mismatch includes the
  removal guidance.
2026-06-13 18:30:21 +02:00
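
The 'auto' sentinel described in the Config notes above separates "the user did not choose" from an explicit 'none' opt-out. A minimal sketch of that resolution rule, assuming the sentinel string is "auto" and that the stored config value is kept as written while only the effective transform is derived from it; `resolve_action_decode_transform` is a hypothetical name, not the actual `__post_init__` logic.

```python
AUTO = "auto"      # assumed sentinel value: the user did not pick a transform
LIBERO = "libero"  # matches GROOT_ACTION_DECODE_TRANSFORM_LIBERO in the diff


def resolve_action_decode_transform(value, embodiment_tag):
    """Return the effective transform name, or None when no transform applies."""
    if value == AUTO:
        # Only the sentinel falls back to the embodiment default.
        return LIBERO if embodiment_tag == "libero_sim" else None
    if value is None or value == "none":
        # Explicit opt-out wins, even for 'libero_sim'.
        return None
    return value


print(resolve_action_decode_transform(AUTO, "libero_sim"))
print(resolve_action_decode_transform("none", "libero_sim"))
```

Because the stored value remains 'none' rather than being collapsed to the default, reloading a saved config would resolve to the same result, which is the round-trip property the commit message describes.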
Steven Palma 895eaf0d7c fix(groot): N1.7 backbone loading and DiT parameter-count logging
- select_layer default tracks the N1.7-3B checkpoint value (16); real
  checkpoint loads still override it from config.json.
- get_backbone_cls recognizes Cosmos-Reason2 / Qwen3-VL backbones by name and
  warns (instead of silently assuming) when an unrecognized backbone is loaded
  only on the strength of backbone_model_type='qwen'.
- 'revision' pins the GR00T checkpoint repo only and is no longer forwarded
  into the unrelated backbone repo load; pin the backbone via
  transformers_loading_kwargs instead.
- DiT / SelfAttentionTransformer parameter counts go through logging.debug
  instead of print().
2026-06-12 23:55:33 +02:00
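
The 'revision' change above amounts to forwarding only repo-agnostic hub kwargs to the backbone load. A minimal sketch of that filtering, based on the loop visible in the diff; `backbone_loading_kwargs` is a hypothetical wrapper and the example values are placeholders.

```python
REPO_AGNOSTIC_KEYS = ("cache_dir", "local_files_only", "token")


def backbone_loading_kwargs(checkpoint_kwargs, transformers_loading_kwargs=None):
    """Merge repo-agnostic hub kwargs into the backbone loading kwargs.

    'revision' in checkpoint_kwargs pins the GR00T checkpoint repo only, so it
    is never copied; a backbone pin must already be present in
    transformers_loading_kwargs.
    """
    if transformers_loading_kwargs is None:
        merged = {"trust_remote_code": True}
    else:
        merged = dict(transformers_loading_kwargs)
    for key in REPO_AGNOSTIC_KEYS:
        if key in checkpoint_kwargs:
            # setdefault keeps any value the caller set for the backbone.
            merged.setdefault(key, checkpoint_kwargs[key])
    return merged


unpinned = backbone_loading_kwargs({"revision": "checkpoint-rev", "token": "placeholder"})
pinned = backbone_loading_kwargs(
    {"revision": "checkpoint-rev"},
    {"trust_remote_code": True, "revision": "backbone-rev"},
)
print(unpinned, pinned)
```

In the first call the checkpoint revision is dropped; in the second the backbone keeps its own pin, independent of the checkpoint revision.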
Steven Palma edda8552ec docs(groot): document the N1.5 removal and the N1.7 parity test
- groot.mdx: breaking-change warning and migration path (pin lerobot==0.5.1 to
  keep N1.5, or move to N1.7); the dead `huggingface-cli download` is replaced
  with `hf download`.
- policy_groot_README.md: N1.5 removal note, updated paper / model-card links,
  and the two-comparison (model parity + preprocessor parity) description of
  the original-vs-LeRobot test, including the raw-observation artifacts and
  recorded seed.
2026-06-12 23:40:36 +02:00
8 changed files with 353 additions and 1371 deletions
@@ -42,6 +42,10 @@ GROOT_N1_5_REMOVAL_GUIDANCE = (
)
GROOT_N1_7_BASE_MODEL = "nvidia/GR00T-N1.7-3B"
GROOT_N1_7_BACKBONE_MODEL = "nvidia/Cosmos-Reason2-2B"
# Default GR00T N1.7 training resolution. Fallback if processor_config lacks sizing. Prevents mismatched
# full-res patchification by forcing a resize. Mirrored by GR00T_N1_7_DEFAULTS in groot_n1_7.py.
N1_7_DEFAULT_IMAGE_TARGET_SIZE = (256, 256)
N1_7_DEFAULT_IMAGE_CROP_SIZE = (230, 230)
GROOT_ACTION_DECODE_TRANSFORM_LIBERO = "libero"
# Sentinel meaning "the user did not pick an action decode transform": __post_init__ resolves it
# to the embodiment default ('libero' for 'libero_sim', otherwise None). It is distinct from an
@@ -321,9 +325,6 @@ def _infer_groot_model_version_from_config(config: dict) -> str | None:
normalized = candidate.lower().replace("-", "_")
if normalized in {"gr00tn1d7", "gr00t_n1d7", "gr00t_n1_7"}:
return GROOT_N1_7
# nvidia/GR00T-N1.5-3B ships model_type 'gr00t_n1_5' and architectures ['GR00T_N1_5'].
# Recognise them so N1.5 checkpoints at generic local paths are rejected loudly
# instead of being silently treated as N1.7 (see infer_groot_model_version).
if normalized in {"gr00t_n1_5", "gr00tn1_5", "gr00t_n15", "gr00t_n1d5", "gr00tn1d5"}:
return GROOT_N1_5
if config.get("model_name") == GROOT_N1_7_BACKBONE_MODEL:
@@ -365,11 +366,7 @@ class GrootConfig(PreTrainedConfig):
}
)
# Deprecated and unused: image sizing is handled by the backbone's image processor.
# Kept only so config.json files saved with earlier versions still parse.
image_size: tuple[int, int] = (256, 256)
# Groot-specific model parameters (from groot_finetune_script.py)
# Groot-specific model parameters
# Explicit GR00T model family selection. LeRobot supports GR00T N1.7 only.
model_version: str = GROOT_N1_7
@@ -385,11 +382,6 @@ class GrootConfig(PreTrainedConfig):
# transform). Pass 'none' to explicitly disable the transform, including for 'libero_sim'.
action_decode_transform: str | None = GROOT_ACTION_DECODE_TRANSFORM_AUTO
# Deprecated, GR00T N1.5 only — do not set. Kept so config.json files saved by lerobot<=0.5.1
# still parse (draccus rejects unknown fields) and can be rejected in __post_init__ with a
# clear error pointing at GROOT_N1_5_REMOVAL_GUIDANCE instead of a cryptic DecodingError.
tokenizer_assets_repo: str | None = None
# Embodiment tag to use for training (e.g. 'new_embodiment', 'gr1')
embodiment_tag: str = "new_embodiment"
@@ -428,10 +420,13 @@ class GrootConfig(PreTrainedConfig):
warmup_ratio: float = 0.05
use_bf16: bool = True
# Deprecated Isaac-GR00T runner fields below — unused by the LeRobot N1.7 implementation
# TODO(Steven): Remove these deprecated fields in a future release.
# Deprecated Isaac-GR00T runner/N1.5 fields below — unused by the LeRobot N1.7 implementation
# (nothing in src/lerobot reads them). They are kept only so config.json files saved by
# earlier lerobot releases still parse: draccus rejects unknown fields, so removing them
# would break every previously saved groot checkpoint at config-load time.
image_size: tuple[int, int] = (256, 256) # image sizing is handled by the backbone's image processor.
tokenizer_assets_repo: str | None = None
video_backend: str = "decord"
balance_dataset_weights: bool = True
balance_trajectory_weights: bool = True
@@ -445,9 +440,6 @@ class GrootConfig(PreTrainedConfig):
resume: bool = False
def __post_init__(self):
# 'tokenizer_assets_repo' only ever existed for GR00T N1.5 (lerobot<=0.5.1) and was
# serialized into every groot checkpoint config.json, so a value here means a legacy
# N1.5 checkpoint or config is being loaded.
if self.tokenizer_assets_repo is not None:
raise ValueError(
"Config sets 'tokenizer_assets_repo', which only existed for GR00T N1.5; this looks "
@@ -582,22 +574,11 @@ class GrootConfig(PreTrainedConfig):
@property
def action_delta_indices(self) -> list[int]:
"""Return indices for delta actions.
The model action horizon is read from the checkpoint's processor_config.json
when available; the result is cached (keyed on the inputs that determine it) so
repeated access during dataset/training setup does not re-read from disk.
"""
cache_key = (self.base_model_path, self.embodiment_tag, self.chunk_size)
cached = getattr(self, "_action_delta_indices_cache", None)
if cached is not None and cached[0] == cache_key:
return cached[1]
"""Return indices for delta actions."""
model_action_horizon = (
infer_groot_n1_7_action_horizon(self.base_model_path, self.embodiment_tag) or 40
)
indices = list(range(min(self.chunk_size, model_action_horizon)))
object.__setattr__(self, "_action_delta_indices_cache", (cache_key, indices))
return indices
return list(range(min(self.chunk_size, model_action_horizon)))
@property
def reward_delta_indices(self) -> None:
+4 -9
@@ -32,6 +32,7 @@ from torch.distributions import Beta
from lerobot.utils.import_utils import _transformers_available, require_package
from .action_head.cross_attention_dit import AlternateVLDiT, DiT, SelfAttentionTransformer
from .configuration_groot import N1_7_DEFAULT_IMAGE_CROP_SIZE, N1_7_DEFAULT_IMAGE_TARGET_SIZE
if TYPE_CHECKING or _transformers_available:
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel
@@ -71,13 +72,13 @@ GR00T_N1_7_DEFAULTS: dict[str, Any] = {
"backbone_embedding_dim": 2048,
"tune_llm": False,
"tune_visual": False,
"select_layer": 16, # N1.7-3B checkpoint value; real checkpoint loads override this from config.json
"select_layer": 16,
"reproject_vision": False,
"use_flash_attention": True,
"load_bf16": False,
"backbone_trainable_params_fp32": True,
"image_crop_size": (230, 230),
"image_target_size": (256, 256),
"image_crop_size": N1_7_DEFAULT_IMAGE_CROP_SIZE,
"image_target_size": N1_7_DEFAULT_IMAGE_TARGET_SIZE,
"shortest_image_edge": None,
"crop_fraction": None,
"random_rotation_angle": None,
@@ -822,8 +823,6 @@ def get_backbone_cls(config: GR00TN17Config):
if "nvidia/Cosmos-Reason2" in config.model_name or "Qwen/Qwen3-VL" in config.model_name:
return Qwen3Backbone
if config.backbone_model_type == "qwen":
# Local backbone checkpoints (e.g. hub-cache snapshot paths) contain neither hub
# marker, so trust the explicit backbone type but surface what is being assumed.
logger.warning(
"Unrecognized GR00T N1.7 backbone model name '%s'; assuming a Qwen3-VL-compatible "
"backbone because backbone_model_type='qwen'.",
@@ -914,10 +913,6 @@ class GR00TN17(PreTrainedModel):
"trust_remote_code": True
}
load_backbone_weights = kwargs.pop("load_backbone_weights", False)
# Only repo-agnostic hub kwargs are forwarded to the backbone loading kwargs:
# ``revision`` pins the GR00T checkpoint repo (see snapshot_download below) and would
# be invalid for the unrelated backbone repo (``config.model_name``). Pin the backbone
# itself by passing ``revision`` inside ``transformers_loading_kwargs``.
for key in ("cache_dir", "local_files_only", "token"):
if key in kwargs:
transformers_loading_kwargs.setdefault(key, kwargs[key])
@@ -93,12 +93,6 @@ class GrootPolicy(PreTrainedPolicy):
transformers_loading_kwargs={"trust_remote_code": True},
)
# GR00TN17 defines no compute_dtype attribute, so only record the
# bf16 preference when it is enabled instead of reading a default back.
if self.config.use_bf16:
model.compute_dtype = "bfloat16"
model.config.compute_dtype = "bfloat16"
return model
def reset(self):
+260 -186
@@ -23,9 +23,10 @@ from typing import TYPE_CHECKING, Any
import numpy as np
import torch
import torchvision.transforms.v2.functional as tv_functional
from einops import rearrange
from huggingface_hub import hf_hub_download
from PIL import Image
from torchvision.transforms import InterpolationMode
from lerobot.utils.import_utils import _transformers_available
@@ -58,11 +59,14 @@ from lerobot.utils.constants import (
POLICY_POSTPROCESSOR_DEFAULT_NAME,
POLICY_PREPROCESSOR_DEFAULT_NAME,
)
from lerobot.utils.device_utils import get_safe_torch_device
from .configuration_groot import (
GROOT_ACTION_DECODE_TRANSFORM_LIBERO,
GROOT_N1_5_REMOVAL_GUIDANCE,
GROOT_N1_7_BACKBONE_MODEL,
N1_7_DEFAULT_IMAGE_CROP_SIZE,
N1_7_DEFAULT_IMAGE_TARGET_SIZE,
GrootConfig,
is_raw_groot_n1_7_checkpoint,
)
@@ -448,60 +452,40 @@ def _has_modality_stats(stats: dict[str, dict[str, Any]] | None) -> bool:
return any(bool(modality_stats) for modality_stats in stats.values())
def _legacy_groot_processor_overrides(
config: GrootConfig,
dataset_stats: dict[str, dict[str, torch.Tensor]] | None,
preprocessor_overrides: dict[str, Any] | None = None,
postprocessor_overrides: dict[str, Any] | None = None,
) -> tuple[dict[str, Any], dict[str, Any]]:
"""Patch older serialized Groot processors with fields current processors expect."""
preprocessor_overrides = dict(preprocessor_overrides or {})
postprocessor_overrides = dict(postprocessor_overrides or {})
pack_inputs_key = "groot_n1_7_pack_inputs_v1"
pack_input_overrides = dict(preprocessor_overrides.get(pack_inputs_key, {}))
pack_input_overrides["normalize_min_max"] = True
preprocessor_overrides[pack_inputs_key] = pack_input_overrides
try:
env_action_dim = int(config.output_features[ACTION].shape[0])
except Exception:
env_action_dim = 0
action_unpack_overrides = dict(postprocessor_overrides.get("groot_action_unpack_unnormalize_v2", {}))
action_unpack_overrides["normalize_min_max"] = True
action_unpack_overrides["env_action_dim"] = env_action_dim
postprocessor_overrides["groot_action_unpack_unnormalize_v2"] = action_unpack_overrides
return preprocessor_overrides, postprocessor_overrides
# GR00T normalizes state/action inside its own processor steps and so deliberately has no
# NormalizerProcessorStep/UnnormalizerProcessorStep (see GrootConfig.normalization_mapping, which is
# IDENTITY for every feature). lerobot-train nonetheless emits these standard override keys
# unconditionally, so for a GR00T pipeline they legitimately match no step. They are dropped up front
# by _drop_groot_absent_standard_overrides so they neither break loading nor mask genuine typos.
_GROOT_ABSENT_STANDARD_OVERRIDE_KEYS = frozenset({"normalizer_processor", "unnormalizer_processor"})
def _pretrained_processor_config_has_step(pretrained_path: str, config_filename: str, step_name: str) -> bool:
"""Check whether a serialized processor pipeline contains a registry step.
def _drop_groot_absent_standard_overrides(overrides: dict[str, Any] | None) -> dict[str, Any] | None:
"""Strip standard normalization override keys that a GR00T pipeline has no step for.
Resolves the processor config from a local directory or, for Hub repo ids,
via ``hf_hub_download`` (which serves the cached copy when offline). Returns
False when the config cannot be resolved; loading then proceeds with the
legacy overrides and `make_groot_pre_post_processors_from_pretrained` retries
without them if they do not match the serialized pipeline.
``lerobot-train`` emits ``normalizer_processor``/``unnormalizer_processor`` overrides
unconditionally, but GR00T normalizes inside its own steps and has no such step (see
``GrootConfig.normalization_mapping``). Both override-application paths reject keys that match no
step — ``_apply_groot_step_overrides`` raises for the freshly built raw-checkpoint pipeline, and
``PolicyProcessorPipeline.from_pretrained`` raises via its used-override validation for the
serialized pipeline — so these keys are removed before either path runs. Any other unknown key
(e.g. a typo) is left in place and still raises.
"""
path = Path(pretrained_path).expanduser()
if path.is_dir():
config = _read_json(path / config_filename)
elif path.exists():
return False
else:
try:
config_path = hf_hub_download(
repo_id=str(pretrained_path), filename=config_filename, repo_type="model"
if not overrides:
return overrides
filtered: dict[str, Any] = {}
for key, value in overrides.items():
if key in _GROOT_ABSENT_STANDARD_OVERRIDE_KEYS:
logging.debug(
"Ignoring override key '%s': GR00T normalizes inside its own processor steps and has "
"no matching step (see GrootConfig.normalization_mapping).",
key,
)
except Exception:
return False
config = _read_json(Path(config_path))
steps = config.get("steps", [])
if not isinstance(steps, list):
return False
return any(isinstance(step, dict) and step.get("registry_name") == step_name for step in steps)
continue
filtered[key] = value
return filtered
def _apply_groot_step_overrides(
@@ -517,7 +501,8 @@ def _apply_groot_step_overrides(
steps by registry name only — prefer registry names so overrides keep
working after the checkpoint is converted and reloaded from a serialized
pipeline). Keys or fields that match nothing raise instead of being dropped
silently.
silently (standard normalization keys GR00T has no step for are removed
beforehand by ``_drop_groot_absent_standard_overrides``).
"""
if not overrides:
@@ -573,7 +558,13 @@ def make_groot_pre_post_processors_from_pretrained(
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""Load Groot processors while preserving compatibility with older serialized configs."""
"""Load Groot processors for a raw N1.7 checkpoint or a serialized LeRobot pipeline."""
# Drop the standard normalizer/unnormalizer override keys lerobot-train emits unconditionally:
# GR00T has no such steps, so they would make both the raw-checkpoint and serialized override
# paths raise. This must happen before either branch below.
preprocessor_overrides = _drop_groot_absent_standard_overrides(preprocessor_overrides)
postprocessor_overrides = _drop_groot_absent_standard_overrides(postprocessor_overrides)
if is_raw_groot_n1_7_checkpoint(pretrained_path):
processor_cfg = copy(config)
@@ -589,49 +580,13 @@ def make_groot_pre_post_processors_from_pretrained(
_apply_groot_step_overrides(postprocessor, postprocessor_overrides)
return preprocessor, postprocessor
caller_preprocessor_overrides = dict(preprocessor_overrides or {})
caller_postprocessor_overrides = dict(postprocessor_overrides or {})
if _pretrained_processor_config_has_step(
preprocessor, postprocessor = _load_groot_processor_pipelines(
pretrained_path,
postprocessor_config_filename,
"groot_n1_7_action_decode_v1",
):
# Converted raw N1.7 checkpoints already carry the checkpoint-specific
# action decoder. Adding the legacy action-unpack override would target
# a step that is not present and break loading.
applied_legacy_overrides = False
preprocessor_overrides = caller_preprocessor_overrides
postprocessor_overrides = caller_postprocessor_overrides
else:
applied_legacy_overrides = True
preprocessor_overrides, postprocessor_overrides = _legacy_groot_processor_overrides(
config=config,
dataset_stats=dataset_stats,
preprocessor_overrides=preprocessor_overrides,
postprocessor_overrides=postprocessor_overrides,
)
try:
preprocessor, postprocessor = _load_groot_processor_pipelines(
pretrained_path,
preprocessor_overrides=preprocessor_overrides,
postprocessor_overrides=postprocessor_overrides,
preprocessor_config_filename=preprocessor_config_filename,
postprocessor_config_filename=postprocessor_config_filename,
)
except KeyError:
if not applied_legacy_overrides:
raise
# The legacy overrides target steps that are absent from the serialized
# pipelines (e.g. a converted raw N1.7 checkpoint whose postprocessor
# config could not be inspected before loading); retry with the caller
# overrides only.
preprocessor, postprocessor = _load_groot_processor_pipelines(
pretrained_path,
preprocessor_overrides=caller_preprocessor_overrides,
postprocessor_overrides=caller_postprocessor_overrides,
preprocessor_config_filename=preprocessor_config_filename,
postprocessor_config_filename=postprocessor_config_filename,
)
preprocessor_overrides=preprocessor_overrides,
postprocessor_overrides=postprocessor_overrides,
preprocessor_config_filename=preprocessor_config_filename,
postprocessor_config_filename=postprocessor_config_filename,
)
_reconnect_groot_relative_absolute_steps(preprocessor, postprocessor)
_reconnect_groot_n1_7_pack_decode_steps(preprocessor, postprocessor)
return preprocessor, postprocessor
@@ -779,21 +734,36 @@ def make_groot_pre_post_processors(
modality_config=checkpoint_assets.modality_config if checkpoint_assets is not None else None,
)
# Resolve the image preprocessing geometry. Honor the checkpoint's processor_config
# when it provides an image_target_size; otherwise fall back to the geometry the
# N1.7 backbone was trained on. Without this fallback a raw base checkpoint with no
# processor_config image sizing (e.g. fine-tuning nvidia/GR00T-N1.7-3B with a new
# embodiment, where checkpoint_assets is None) would patchify full-resolution camera
# frames, inflating the VLM token count and feeding the model a resolution it was not trained on.
if checkpoint_assets is not None and checkpoint_assets.image_target_size is not None:
image_target_size = checkpoint_assets.image_target_size
image_crop_size = checkpoint_assets.image_crop_size
shortest_image_edge = checkpoint_assets.shortest_image_edge
crop_fraction = checkpoint_assets.crop_fraction
else:
image_target_size = list(N1_7_DEFAULT_IMAGE_TARGET_SIZE)
image_crop_size = list(N1_7_DEFAULT_IMAGE_CROP_SIZE)
shortest_image_edge = None
crop_fraction = None
use_albumentations = checkpoint_assets.use_albumentations if checkpoint_assets is not None else False
input_steps: list[ProcessorStep] = [
RenameObservationsProcessorStep(rename_map={}),
AddBatchDimensionProcessorStep(),
pack_step,
GrootN17VLMEncodeStep(
model_name=config.n1_7_backbone_model,
image_crop_size=checkpoint_assets.image_crop_size if checkpoint_assets is not None else None,
image_target_size=checkpoint_assets.image_target_size if checkpoint_assets is not None else None,
shortest_image_edge=checkpoint_assets.shortest_image_edge
if checkpoint_assets is not None
else None,
crop_fraction=checkpoint_assets.crop_fraction if checkpoint_assets is not None else None,
use_albumentations=checkpoint_assets.use_albumentations
if checkpoint_assets is not None
else False,
image_crop_size=image_crop_size,
image_target_size=image_target_size,
shortest_image_edge=shortest_image_edge,
crop_fraction=crop_fraction,
use_albumentations=use_albumentations,
device=config.device,
),
DeviceProcessorStep(device=config.device),
]
@@ -949,15 +919,22 @@ def _build_n1_7_processor(model_name: str = GROOT_N1_7_BACKBONE_MODEL) -> Proces
return proc
def _transform_n1_7_image_for_vlm(
def _transform_n1_7_image_for_vlm_albumentations(
image: Image.Image,
*,
image_crop_size: list[int] | None,
image_target_size: list[int] | None,
shortest_image_edge: int | None,
crop_fraction: float | None,
use_albumentations: bool = False,
) -> Image.Image:
"""cv2/INTER_AREA eval transform mirroring Isaac-GR00T's albumentations preprocessing.
Used only for checkpoints saved with ``use_albumentations=True``. cv2 is
CPU/numpy-only so this path cannot run on GPU; the default (non-albumentations)
geometry is handled on-device by :func:`_transform_n1_7_image_for_vlm_torch`. The
cv2/INTER_AREA resize and floored center-crop here intentionally differ from that
torch path and must stay bit-exact to the upstream reference.
"""
if image_target_size is None:
return image
@@ -965,70 +942,101 @@ def _transform_n1_7_image_for_vlm(
if image.mode != "RGB":
image = image.convert("RGB")
if use_albumentations:
try:
import cv2
except ImportError as exc:
raise ImportError(
"GR00T N1.7 checkpoints with use_albumentations=True require opencv-python-headless."
) from exc
try:
import cv2
except ImportError as exc:
raise ImportError(
"GR00T N1.7 checkpoints with use_albumentations=True require opencv-python-headless."
) from exc
image_np = np.asarray(image)
height, width = image_np.shape[:2]
if height != width:
square_edge = max(height, width)
pad_h = square_edge - height
pad_w = square_edge - width
image_np = cv2.copyMakeBorder(
image_np,
pad_h // 2,
pad_h - pad_h // 2,
pad_w // 2,
pad_w - pad_w // 2,
cv2.BORDER_CONSTANT,
value=(0, 0, 0),
)
resize_edge = shortest_image_edge or target_h
if image_np.shape[:2] != (resize_edge, resize_edge):
image_np = cv2.resize(image_np, (resize_edge, resize_edge), interpolation=cv2.INTER_AREA)
if crop_fraction is None and image_crop_size is not None:
crop_fraction = image_crop_size[0] / float(target_h)
if crop_fraction is not None and 0.0 < crop_fraction < 1.0:
height, width = image_np.shape[:2]
crop_h = max(1, int(height * crop_fraction))
crop_w = max(1, int(width * crop_fraction))
top = max(0, (height - crop_h) // 2)
left = max(0, (width - crop_w) // 2)
image_np = image_np[top : top + crop_h, left : left + crop_w]
if image_np.shape[:2] != (target_h, target_w):
image_np = cv2.resize(image_np, (target_w, target_h), interpolation=cv2.INTER_AREA)
return Image.fromarray(image_np)
square_edge = max(image.width, image.height)
if image.width != image.height:
padded = Image.new("RGB", (square_edge, square_edge))
left = (square_edge - image.width) // 2
top = (square_edge - image.height) // 2
padded.paste(image, (left, top))
image = padded
image_np = np.asarray(image)
height, width = image_np.shape[:2]
if height != width:
square_edge = max(height, width)
pad_h = square_edge - height
pad_w = square_edge - width
image_np = cv2.copyMakeBorder(
image_np,
pad_h // 2,
pad_h - pad_h // 2,
pad_w // 2,
pad_w - pad_w // 2,
cv2.BORDER_CONSTANT,
value=(0, 0, 0),
)
resize_edge = shortest_image_edge or target_h
image = image.resize((resize_edge, resize_edge), Image.Resampling.BICUBIC)
if image_np.shape[:2] != (resize_edge, resize_edge):
image_np = cv2.resize(image_np, (resize_edge, resize_edge), interpolation=cv2.INTER_AREA)
if crop_fraction is None and image_crop_size is not None:
crop_fraction = image_crop_size[0] / float(target_h)
if crop_fraction is not None and 0.0 < crop_fraction < 1.0:
crop_w = max(1, int(round(image.width * crop_fraction)))
crop_h = max(1, int(round(image.height * crop_fraction)))
left = max(0, (image.width - crop_w) // 2)
top = max(0, (image.height - crop_h) // 2)
image = image.crop((left, top, left + crop_w, top + crop_h))
height, width = image_np.shape[:2]
crop_h = max(1, int(height * crop_fraction))
crop_w = max(1, int(width * crop_fraction))
top = max(0, (height - crop_h) // 2)
left = max(0, (width - crop_w) // 2)
image_np = image_np[top : top + crop_h, left : left + crop_w]
if image.size != (target_w, target_h):
image = image.resize((target_w, target_h), Image.Resampling.BICUBIC)
if image_np.shape[:2] != (target_h, target_w):
image_np = cv2.resize(image_np, (target_w, target_h), interpolation=cv2.INTER_AREA)
return Image.fromarray(image_np)
def _transform_n1_7_image_for_vlm_torch(
image: torch.Tensor,
*,
image_crop_size: list[int] | None,
image_target_size: list[int] | None,
shortest_image_edge: int | None,
crop_fraction: float | None,
) -> torch.Tensor:
"""Default (non-albumentations) N1.7 image transform: pad-to-square, resize to
``shortest_image_edge``, center-crop by ``crop_fraction``, resize to ``image_target_size``.
Operates on a ``(C, H, W)`` uint8 tensor and keeps the result on the input
tensor's device so the resize/crop run on GPU when the tensor is. Bicubic
interpolation with antialiasing matches PIL's ``Image.Resampling.BICUBIC``
closely (sub-``2/255`` per-pixel on worst-case inputs). The ``use_albumentations``
cv2/INTER_AREA path has no torch equivalent and stays on
:func:`_transform_n1_7_image_for_vlm_albumentations`.
"""
if image_target_size is None:
return image
target_h, target_w = image_target_size
_, height, width = image.shape
square_edge = max(height, width)
if height != width:
left = (square_edge - width) // 2
top = (square_edge - height) // 2
image = tv_functional.pad(
image, [left, top, square_edge - width - left, square_edge - height - top], fill=0
)
resize_edge = shortest_image_edge or target_h
image = tv_functional.resize(
image, [resize_edge, resize_edge], interpolation=InterpolationMode.BICUBIC, antialias=True
)
if crop_fraction is None and image_crop_size is not None:
crop_fraction = image_crop_size[0] / float(target_h)
if crop_fraction is not None and 0.0 < crop_fraction < 1.0:
# Match the PIL helper's center crop exactly: round() the crop size but
# floor() the offset (torchvision.center_crop rounds the offset, which
# shifts the region by 1px when (edge - crop) is odd).
crop_h = max(1, int(round(image.shape[-2] * crop_fraction)))
crop_w = max(1, int(round(image.shape[-1] * crop_fraction)))
top = max(0, (image.shape[-2] - crop_h) // 2)
left = max(0, (image.shape[-1] - crop_w) // 2)
image = image[..., top : top + crop_h, left : left + crop_w]
if tuple(image.shape[-2:]) != (target_h, target_w):
image = tv_functional.resize(
image, [target_h, target_w], interpolation=InterpolationMode.BICUBIC, antialias=True
)
return image
@@ -1058,9 +1066,6 @@ class GrootN17PackInputsStep(ProcessorStep):
video_modality_keys: list[str] | None = None
raw_stats: dict[str, Any] | None = None
modality_config: dict[str, Any] | None = None
# Unused: kept so serialized configs that include it still load. The raw
# state cache is per instance (_last_raw_state), never process-global.
state_cache_key: str = ""
_last_raw_state: dict[str, np.ndarray] | None = field(default=None, init=False, repr=False)
_warned_image_keys: bool = field(default=False, init=False, repr=False)
@@ -1333,6 +1338,12 @@ class GrootN17VLMEncodeStep(ProcessorStep):
The packed video has shape ``(B, T, V, H, W, C)``. Each frame/view becomes
an image item in the same chat message so the resulting image tokens match
the temporal VLM packing used by Isaac-GR00T.
Images are handed to the torchvision-backed Qwen3-VL processor as ``(C, H, W)``
uint8 tensors (no per-frame PIL roundtrip), and, when ``device`` resolves to a
CUDA device, the resize/rescale/normalize/patchify run there. This keeps the
output bit-identical on CPU and moves the dominant preprocessing cost off
the critical path on GPU.
"""
model_name: str = GROOT_N1_7_BACKBONE_MODEL
@@ -1341,6 +1352,7 @@ class GrootN17VLMEncodeStep(ProcessorStep):
shortest_image_edge: int | None = None
crop_fraction: float | None = None
use_albumentations: bool = False
device: str | None = None
_proc: ProcessorMixin | None = field(default=None, init=False, repr=False)
@property
@@ -1349,6 +1361,69 @@ class GrootN17VLMEncodeStep(ProcessorStep):
self._proc = _build_n1_7_processor(self.model_name)
return self._proc
def _target_device(self) -> torch.device | None:
# The albumentations path is cv2/PIL only, so it cannot run on GPU.
if self.device is None or self.use_albumentations:
return None
try:
return get_safe_torch_device(self.device)
except (AssertionError, RuntimeError):
# A device serialized at train time (e.g. "cuda") may be unavailable
# when the processor is reloaded elsewhere (e.g. CPU-only eval), and
# this step is not in the standard device-override set. Fall back to
# the CPU path, which is bit-identical, instead of crashing.
return None
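The fallback in `_target_device` can be read in isolation as a small pattern: try to resolve the configured device, and return `None` (meaning "stay on the CPU path") when it cannot be used. The sketch below is dependency-free and only illustrative; `resolve_target_device` and `cpu_only_resolver` are hypothetical names, and the resolver argument stands in for `get_safe_torch_device`, whose real behavior is assumed here to be "raise when the device is unavailable".

```python
from typing import Callable, Optional


def resolve_target_device(
    requested: Optional[str],
    use_albumentations: bool,
    resolver: Callable[[str], str],
) -> Optional[str]:
    """Return a usable device name, or None to stay on the CPU path."""
    # No device configured, or the cv2/PIL-only path is active: stay on CPU.
    if requested is None or use_albumentations:
        return None
    try:
        return resolver(requested)
    except (AssertionError, RuntimeError):
        # The serialized device is not available on this machine.
        return None


def cpu_only_resolver(name: str) -> str:
    """Hypothetical resolver simulating a machine without CUDA."""
    if name.startswith("cuda"):
        raise RuntimeError(f"device {name!r} is not available")
    return name


print(resolve_target_device("cuda", False, cpu_only_resolver))
print(resolve_target_device("cpu", False, cpu_only_resolver))
print(resolve_target_device("cpu", True, cpu_only_resolver))
```

Returning `None` rather than raising means a processor saved with `device="cuda"` can still be reloaded for CPU-only evaluation, which is the case the code comment describes.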
def _build_sample_images(
self, video: Any, batch_size: int, target_device: torch.device | None
) -> list[list[Any]]:
"""Return, per batch item, its ordered ``(timestep, view)`` frames.
``use_albumentations`` keeps the legacy per-frame PIL/cv2 transform;
otherwise frames are ``(C, H, W)`` uint8 tensors (moved to
``target_device`` when set) for the torchvision-backed Qwen processor.
"""
if self.use_albumentations:
video_np = np.asarray(video)
return [
[
_transform_n1_7_image_for_vlm_albumentations(
Image.fromarray(video_np[batch_idx, timestep, view_idx]),
image_crop_size=self.image_crop_size,
image_target_size=self.image_target_size,
shortest_image_edge=self.shortest_image_edge,
crop_fraction=self.crop_fraction,
)
for timestep in range(video_np.shape[1])
for view_idx in range(video_np.shape[2])
]
for batch_idx in range(batch_size)
]
video_t = video if torch.is_tensor(video) else torch.from_numpy(np.ascontiguousarray(video))
# (B, T, V, H, W, C) uint8 -> (B, T, V, C, H, W)
video_t = video_t.permute(0, 1, 2, 5, 3, 4).contiguous()
if target_device is not None and video_t.device != target_device:
video_t = video_t.to(target_device, non_blocking=(target_device.type == "cuda"))
frames_per_sample: list[list[Any]] = []
for batch_idx in range(batch_size):
sample = video_t[batch_idx] # (T, V, C, H, W)
frames_per_sample.append(
[
_transform_n1_7_image_for_vlm_torch(
sample[timestep, view_idx],
image_crop_size=self.image_crop_size,
image_target_size=self.image_target_size,
shortest_image_edge=self.shortest_image_edge,
crop_fraction=self.crop_fraction,
)
for timestep in range(sample.shape[0])
for view_idx in range(sample.shape[1])
]
)
return frames_per_sample
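Both branches of `_build_sample_images` iterate timesteps in the outer loop and views in the inner loop, so each sample's frames come out timestep-major. A minimal sketch of that ordering, using plain index tuples instead of tensors (the helper names `frame_order` and `flat_index` are hypothetical):

```python
def frame_order(num_timesteps: int, num_views: int) -> list[tuple[int, int]]:
    """List (timestep, view) pairs in the order the frames are emitted."""
    return [
        (timestep, view_idx)
        for timestep in range(num_timesteps)
        for view_idx in range(num_views)
    ]


def flat_index(timestep: int, view_idx: int, num_views: int) -> int:
    """Position of a frame inside one sample's flattened frame list."""
    return timestep * num_views + view_idx


order = frame_order(num_timesteps=2, num_views=3)
print(order)
print([flat_index(t, v, num_views=3) for t, v in order])
```

With `T` timesteps and `V` views, each batch item contributes `T * V` image entries to its chat message, in this order.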
def __call__(self, transition: EnvTransition) -> EnvTransition:
obs = transition.get(TransitionKey.OBSERVATION, {}) or {}
comp = transition.get(TransitionKey.COMPLEMENTARY_DATA, {}) or {}
@@ -1356,33 +1431,25 @@ class GrootN17VLMEncodeStep(ProcessorStep):
if video is None:
return transition
batch_size = int(video.shape[0])
languages = _prepare_n1_7_language_batch(
comp.get("language"),
video.shape[0],
batch_size,
formalize_language=False,
)
target_device = self._target_device()
sample_images = self._build_sample_images(video, batch_size, target_device)
texts: list[str] = []
images: list[Image.Image] = []
for batch_idx in range(video.shape[0]):
sample = video[batch_idx] # (T, V, H, W, C)
sample_images = [
_transform_n1_7_image_for_vlm(
Image.fromarray(sample[timestep, view_idx]),
image_crop_size=self.image_crop_size,
image_target_size=self.image_target_size,
shortest_image_edge=self.shortest_image_edge,
crop_fraction=self.crop_fraction,
use_albumentations=self.use_albumentations,
)
for timestep in range(sample.shape[0])
for view_idx in range(sample.shape[1])
]
images: list[Any] = []
for batch_idx in range(batch_size):
frames = sample_images[batch_idx]
conversation = [
{
"role": "user",
"content": [
*[{"type": "image", "image": image} for image in sample_images],
*[{"type": "image", "image": image} for image in frames],
{"type": "text", "text": languages[batch_idx]},
],
}
@@ -1394,9 +1461,17 @@ class GrootN17VLMEncodeStep(ProcessorStep):
add_generation_prompt=False,
)
)
images.extend(sample_images)
images.extend(frames)
encoded = self.proc(text=texts, images=images, return_tensors="pt", padding=True)
proc_kwargs: dict[str, Any] = {
"text": texts,
"images": images,
"return_tensors": "pt",
"padding": True,
}
if target_device is not None:
proc_kwargs["device"] = str(target_device)
encoded = self.proc(**proc_kwargs)
for key, value in encoded.items():
comp[key] = value
obs.pop("video", None)
@@ -1415,6 +1490,7 @@ class GrootN17VLMEncodeStep(ProcessorStep):
"shortest_image_edge": self.shortest_image_edge,
"crop_fraction": self.crop_fraction,
"use_albumentations": self.use_albumentations,
"device": self.device,
}
@@ -1565,8 +1641,6 @@ class GrootN17ActionDecodeStep(ProcessorStep):
modality_config: dict[str, Any] | None = None
use_percentiles: bool = False
use_relative_action: bool = False
# Unused: kept so serialized configs that include it still load.
state_cache_key: str = ""
action_decode_transform: str | None = None
pack_step: GrootN17PackInputsStep | None = field(default=None, repr=False)
@@ -1694,10 +1768,10 @@ class GrootN17ActionDecodeStep(ProcessorStep):
}
@dataclass
# v2: unlike the N1.5-era v1 step, this step no longer collapses (B, T, D)
# action chunks to the last timestep, so old serialized v1 pipelines must not
# silently load into it (v1 is stubbed below with the removal guidance).
@dataclass
@ProcessorStepRegistry.register(name="groot_action_unpack_unnormalize_v2")
class GrootActionUnpackUnnormalizeStep(ProcessorStep):
env_action_dim: int = 0
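The comment on the `_v2` registration describes a behavioral difference: the v1 step collapsed `(B, T, D)` action chunks to the last timestep, while v2 keeps the whole chunk. The sketch below illustrates only that shape difference with nested lists; both helper functions are hypothetical and are not the step's actual implementation.

```python
Chunk = list[list[list[float]]]  # (B, T, D) as nested lists


def collapse_to_last_timestep(actions: Chunk) -> list[list[float]]:
    """v1-style behavior as described in the comment: (B, T, D) -> (B, D)."""
    return [sample[-1] for sample in actions]


def keep_full_chunk(actions: Chunk) -> Chunk:
    """v2-style behavior as described in the comment: (B, T, D) is preserved."""
    return [[list(step) for step in sample] for sample in actions]


# One batch item, three timesteps, two action dimensions.
actions = [[[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]]
print(collapse_to_last_timestep(actions))
print(keep_full_chunk(actions))
```

Because the two outputs have different ranks, a pipeline serialized against v1 would feed downstream consumers a different shape if it silently loaded into v2, which is why the registry name changes.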
@@ -207,11 +207,6 @@ def test_lerobot_groot_forward_pass():
with torch.no_grad():
lerobot_loss, lerobot_metrics = lerobot_policy.forward(batch_lerobot_processed)
assert isinstance(lerobot_loss, torch.Tensor)
assert torch.isfinite(lerobot_loss).all()
assert "loss" in lerobot_metrics
assert np.isfinite(lerobot_metrics["loss"])
print("\nForward pass successful.")
print(f" - Loss: {lerobot_loss.item():.6f}")
print(f" - Metrics: {lerobot_metrics}")
File diff suppressed because it is too large
+26 -175
@@ -14,36 +14,31 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Parity tests: original NVIDIA GR00T N1.7 vs the GR00T N1.7 integration in LeRobot.
"""Parity test: original NVIDIA GR00T N1.7 vs the GR00T N1.7 integration in LeRobot.
Two comparisons run per embodiment tag, against per-tag ``.npz`` artifacts produced
once in the original ``gr00t`` env by the companion script
``utils/dump_original_n1_7.py`` (in the ``utils`` package next to this file):
Verifies that the self-contained LeRobot reimplementation of the GR00T N1.7 action
head + Qwen3-VL backbone produces the SAME raw model output (``action_pred``, the
normalized flow-matching prediction before any action decoding) as NVIDIA's original
``gr00t`` package, given byte-identical pre-processed inputs and the same
flow-matching seed. The comparison is parametrized over every embodiment tag present
in the checkpoint.
1. **Model parity** -- the self-contained LeRobot reimplementation of the GR00T N1.7
action head + Qwen3-VL backbone must produce the SAME raw model output
(``action_pred``, the normalized flow-matching prediction before any action
decoding) as NVIDIA's original ``gr00t`` package, given byte-identical
pre-processed inputs and the flow-matching seed recorded in the artifact.
2. **Preprocessor parity** -- LeRobot's own preprocessor pipeline (real Qwen3-VL chat
template / tokenizer / image packing + state normalization, no mocks) must produce
the SAME collated model inputs (``input_ids``, ``pixel_values``, ``state``, ...)
as the original package's processor, given the identical raw observations
(images, state, language) recorded in the artifact. Artifacts written by older
versions of the dump script carry no raw observations; this case then SKIPS with
a regeneration hint.
To keep the comparison fair, the original outputs + the exact collated inputs are
produced once per embodiment in the original ``gr00t`` env via the companion script
``utils/dump_original_n1_7.py`` (in the ``utils`` package next to this file) and saved
to per-tag ``.npz`` files.
This test discovers those artifacts, replays the identical inputs through the LeRobot
model, and compares.
These tests are LOCAL-only and skip on CI, when ``gr00t``-side prerequisites are not
present, or when no artifact has been generated. By default they look for artifacts in
This test is LOCAL-only and skips on CI, when ``gr00t``-side prerequisites are not
present, or when no artifact has been generated. By default it looks for artifacts in
``<this dir>/artifacts/``; override with ``GROOT_N1_7_PARITY_DIR``. See the
"Original-vs-LeRobot parity test" section of ``src/lerobot/policies/groot/README.md``
for the full run procedure.
"""
import os
import warnings
from pathlib import Path
from typing import Any
import numpy as np
import pytest
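The module docstring says artifacts are looked up in `<this dir>/artifacts/` unless `GROOT_N1_7_PARITY_DIR` overrides it, and the dump script writes one `original_n1_7_{tag}.npz` file per embodiment tag. The real `_artifact_dir` and `_discover_artifacts` bodies are not shown in this diff, so the following is only a sketch of how such discovery could work; the function names and the `demo_tag` file are made up for illustration.

```python
import os
import tempfile
from pathlib import Path

ARTIFACT_PREFIX = "original_n1_7_"
ARTIFACT_SUFFIX = ".npz"


def artifact_dir(default: Path) -> Path:
    """Use the environment override when set, else the default directory."""
    override = os.environ.get("GROOT_N1_7_PARITY_DIR")
    return Path(override) if override else default


def discover_artifacts(directory: Path) -> list[tuple[str, Path]]:
    """Return (embodiment_tag, path) pairs, sorted by file name."""
    if not directory.is_dir():
        return []
    found = []
    for path in sorted(directory.glob(f"{ARTIFACT_PREFIX}*{ARTIFACT_SUFFIX}")):
        # Strip the fixed prefix and suffix to recover the embodiment tag.
        tag = path.name[len(ARTIFACT_PREFIX) : -len(ARTIFACT_SUFFIX)]
        found.append((tag, path))
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("original_n1_7_libero_sim.npz", "original_n1_7_demo_tag.npz", "notes.txt"):
        (root / name).touch()
    tags = [tag for tag, _ in discover_artifacts(root)]

print(tags)
```

An empty result is what lets the `skipif` marker skip the whole parametrized test when no artifact has been generated.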
@@ -55,9 +50,7 @@ pytestmark = pytest.mark.skipif(
)
from lerobot.policies.groot.configuration_groot import GROOT_N1_7 # noqa: E402,F401
from lerobot.utils.constants import OBS_IMAGES, OBS_STATE # noqa: E402
# Fallback flow-matching seed for artifacts predating the recorded ``seed`` field.
SEED = 42
DEVICE = os.environ.get("GROOT_PARITY_DEVICE", "cuda" if torch.cuda.is_available() else "cpu")
ATOL = float(os.environ.get("GROOT_PARITY_ATOL", "1e-3"))
@@ -67,11 +60,6 @@ RTOL = float(os.environ.get("GROOT_PARITY_RTOL", "1e-3"))
_ARTIFACT_PREFIX = "original_n1_7_"
_ARTIFACT_SUFFIX = ".npz"
# Collated keys compared by the preprocessor parity case: integer/id tensors must
# match exactly; float tensors within ATOL/RTOL.
_COLLATED_EXACT_KEYS = ("input_ids", "attention_mask", "image_grid_thw", "embodiment_id")
_COLLATED_CLOSE_KEYS = ("pixel_values", "state")
def _artifact_dir() -> Path:
"""Directory holding the per-embodiment .npz artifacts.
@@ -121,20 +109,9 @@ def _resolve_checkpoint() -> str:
return str(ckpt)
def _load_artifact(path: Path) -> tuple[torch.Tensor, dict[str, torch.Tensor], int]:
"""Return (original action_pred, collated model inputs, flow-matching seed)."""
def _load_artifact(path: Path):
data = np.load(path, allow_pickle=True)
original_action = torch.from_numpy(data["action_pred"]).float()
if "seed" in data.files:
seed = int(data["seed"])
else:
warnings.warn(
f"Artifact '{path.name}' does not record the producer seed (it predates the current "
f"dump_original_n1_7.py); falling back to seed={SEED}. If the parity comparison fails, "
"regenerate the artifact with the current dump script.",
stacklevel=2,
)
seed = SEED
dtypes = dict(zip(data["meta_keys"].tolist(), data["meta_dtypes"].tolist(), strict=False))
inputs = {}
for key in data.files:
@@ -147,45 +124,7 @@ def _load_artifact(path: Path) -> tuple[torch.Tensor, dict[str, torch.Tensor], i
if "int" in declared or "long" in declared:
t = t.long()
inputs[name] = t
return original_action, inputs, seed
def _load_raw_observation(path: Path) -> dict[str, Any] | None:
"""Return the raw observation recorded in the artifact, or None for old artifacts.
Artifacts produced by the current ``dump_original_n1_7.py`` additionally store the
exact raw observation the producer fed to the original processor: per-camera uint8
frames (``raw::video.<key>``, (B, T, H, W, C)), per-key state vectors
(``raw::state.<key>``, (B, T, dim)) and the language instruction
(``raw::language``, one string per batch element). ``raw_video_keys`` /
``raw_state_keys`` record the checkpoint modality-key order.
"""
data = np.load(path, allow_pickle=True)
markers = ("raw_video_keys", "raw_state_keys", "raw::language")
if any(marker not in data.files for marker in markers):
return None
video_keys = [str(k) for k in data["raw_video_keys"].tolist()]
state_keys = [str(k) for k in data["raw_state_keys"].tolist()]
return {
"video": {k: data[f"raw::video.{k}"] for k in video_keys},
"state": {k: data[f"raw::state.{k}"] for k in state_keys},
"language": [str(t) for t in data["raw::language"].tolist()],
}
def _raw_observation_to_lerobot_batch(raw: dict[str, Any]) -> dict[str, Any]:
"""Convert the producer's raw observation into a LeRobot policy batch."""
batch: dict[str, Any] = {}
for key, frames in raw["video"].items():
# (B, T, H, W, C) uint8 -> (B, T, C, H, W); the pack step converts back losslessly.
batch[f"{OBS_IMAGES}.{key}"] = torch.from_numpy(frames).permute(0, 1, 4, 2, 3).contiguous()
# observation.state is the per-key state vectors (latest frame) concatenated in
# checkpoint modality-key order -- the layout the LeRobot pack step and the
# flattened checkpoint statistics expect.
state_parts = [torch.from_numpy(np.asarray(arr)[:, -1, :]).float() for arr in raw["state"].values()]
batch[OBS_STATE] = torch.cat(state_parts, dim=-1)
batch["task"] = list(raw["language"])
return batch
return original_action, inputs
def _unflatten(inputs: dict[str, torch.Tensor]) -> dict:
@@ -200,36 +139,6 @@ def _unflatten(inputs: dict[str, torch.Tensor]) -> dict:
return nested.get("inputs", nested)
def _assert_collated_parity(
embodiment_tag: str, name: str, lerobot_value: Any, original_value: torch.Tensor, *, exact: bool
) -> None:
"""Compare one collated tensor produced by LeRobot against the original's."""
assert isinstance(lerobot_value, torch.Tensor), (
f"[{embodiment_tag}] LeRobot preprocessor output '{name}' is "
f"{type(lerobot_value).__name__}, expected a tensor."
)
lerobot_t = lerobot_value.detach().cpu()
original_t = original_value.detach().cpu()
assert lerobot_t.shape == original_t.shape, (
f"[{embodiment_tag}] collated '{name}' shape mismatch: lerobot={tuple(lerobot_t.shape)} vs "
f"original={tuple(original_t.shape)}."
)
if exact:
mismatched = int((lerobot_t.long() != original_t.long()).sum())
assert mismatched == 0, (
f"[{embodiment_tag}] collated '{name}' differs from the original processor output: "
f"{mismatched}/{original_t.numel()} elements mismatch."
)
else:
lerobot_f, original_f = lerobot_t.float(), original_t.float()
max_diff = (lerobot_f - original_f).abs().max().item()
print(f"[{embodiment_tag}] {name}: shape {tuple(lerobot_t.shape)} max|diff|={max_diff:.6e}")
assert torch.allclose(lerobot_f, original_f, atol=ATOL, rtol=RTOL), (
f"[{embodiment_tag}] collated '{name}' differs from the original processor output beyond "
f"atol={ATOL}, rtol={RTOL}: max|diff|={max_diff:.6e}."
)
@pytest.fixture(scope="module")
def lerobot_model():
"""Load the LeRobot GR00T N1.7 model once (fp32 + SDPA) and reuse across tags."""
@@ -256,7 +165,8 @@ def lerobot_model():
_ARTIFACTS = _discover_artifacts()
_requires_artifacts = pytest.mark.skipif(
@pytest.mark.skipif(
not _ARTIFACTS,
reason=(
"No GR00T N1.7 parity artifacts found. Generate them first in the original gr00t "
@@ -264,30 +174,24 @@ _requires_artifacts = pytest.mark.skipif(
"--ckpt <ckpt> --out-dir tests/policies/groot/artifacts --device cuda"
),
)
@_requires_artifacts
@pytest.mark.parametrize("embodiment_tag,artifact", _ARTIFACTS, ids=[t for t, _ in _ARTIFACTS])
def test_groot_get_action_parity(embodiment_tag, artifact, lerobot_model):
"""Raw model.get_action(action_pred) parity per embodiment: original vs LeRobot."""
original_action, flat_inputs, seed = _load_artifact(artifact)
original_action, flat_inputs = _load_artifact(artifact)
model_inputs = _unflatten(flat_inputs)
# Align the flow-matching RNG exactly as the producer did (seed right before sampling).
torch.manual_seed(seed)
torch.manual_seed(SEED)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
torch.cuda.manual_seed_all(SEED)
with torch.inference_mode():
out = lerobot_model.get_action(model_inputs)
lerobot_action = out["action_pred"].float().cpu()
assert lerobot_action.shape == original_action.shape, (
f"GR00T N1.7 action_pred shape mismatch for embodiment '{embodiment_tag}': "
f"lerobot={tuple(lerobot_action.shape)} vs original={tuple(original_action.shape)}. "
"The same checkpoint and inputs must produce identical shapes; this indicates an "
"action-horizon or action-dim regression (or a stale artifact -- regenerate it with "
"utils/dump_original_n1_7.py)."
)
t = min(original_action.shape[1], lerobot_action.shape[1])
d = min(original_action.shape[2], lerobot_action.shape[2])
original_action = original_action[:, :t, :d]
lerobot_action = lerobot_action[:, :t, :d]
diff = torch.abs(lerobot_action - original_action)
max_diff = diff.max().item()
@@ -301,56 +205,3 @@ def test_groot_get_action_parity(embodiment_tag, artifact, lerobot_model):
f"GR00T N1.7 raw action_pred differs for embodiment '{embodiment_tag}' beyond "
f"atol={ATOL}, rtol={RTOL}: max|diff|={max_diff:.6e}"
)
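The parity assertion relies on `torch.allclose`, which accepts an element when `|actual - expected| <= atol + rtol * |expected|`. A dependency-free sketch of that rule, using the test's default tolerances (`ATOL = RTOL = 1e-3`), may help when choosing values for `GROOT_PARITY_ATOL` and `GROOT_PARITY_RTOL`; `is_close` and `all_close` are hypothetical helpers, not part of the test file.

```python
def is_close(actual: float, expected: float, atol: float = 1e-3, rtol: float = 1e-3) -> bool:
    """Element-wise tolerance rule used by torch.allclose."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(
    actual: list[float], expected: list[float], atol: float = 1e-3, rtol: float = 1e-3
) -> bool:
    """True when every element pair satisfies the tolerance rule."""
    return len(actual) == len(expected) and all(
        is_close(a, e, atol, rtol) for a, e in zip(actual, expected)
    )


print(is_close(1.0015, 1.0))   # allowed error is 0.001 + 0.001 * 1.0 = 0.002
print(is_close(1.003, 1.0))    # error 0.003 exceeds 0.002
print(all_close([0.0005, 1.0015], [0.0, 1.0]))
```

The relative term scales with the reference value, so near zero the absolute tolerance dominates.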
@_requires_artifacts
@pytest.mark.parametrize("embodiment_tag,artifact", _ARTIFACTS, ids=[t for t, _ in _ARTIFACTS])
def test_groot_preprocessor_parity(embodiment_tag, artifact):
"""LeRobot's real preprocessor vs the original's collated tensors, from identical raw obs.
Runs LeRobot's full preprocessor pipeline -- including the real Qwen3-VL chat
template, tokenizer and image packing plus the checkpoint-driven state
normalization (no mocks) -- on the raw observations recorded in the artifact, and
compares every collated model input against the ones the original ``gr00t``
processor produced from the same raw observations.
"""
raw = _load_raw_observation(artifact)
if raw is None:
pytest.skip(
f"Artifact '{artifact.name}' was produced by an older dump_original_n1_7.py that does "
"not record raw observations; regenerate it with the current dump script to run the "
"preprocessor parity case."
)
_, flat_inputs, _ = _load_artifact(artifact)
original_inputs = _unflatten(flat_inputs)
ckpt = _resolve_checkpoint()
from lerobot.policies.groot.configuration_groot import GrootConfig
from lerobot.policies.groot.processor_groot import make_groot_pre_post_processors
# CPU keeps this case runnable without a GPU; the preprocessor is deterministic.
config = GrootConfig(base_model_path=ckpt, embodiment_tag=embodiment_tag, device="cpu")
preprocessor, _ = make_groot_pre_post_processors(config)
processed = preprocessor(_raw_observation_to_lerobot_batch(raw))
compared_keys = (*_COLLATED_EXACT_KEYS, *_COLLATED_CLOSE_KEYS)
missing_original = [k for k in compared_keys if k not in original_inputs]
missing_lerobot = [k for k in compared_keys if k not in processed]
assert not missing_original, (
f"[{embodiment_tag}] artifact collated inputs miss {missing_original} "
f"(available: {sorted(original_inputs)}); regenerate the artifact with the current dump script."
)
assert not missing_lerobot, (
f"[{embodiment_tag}] LeRobot preprocessor output misses {missing_lerobot} (tensor keys "
f"available: {sorted(k for k, v in processed.items() if isinstance(v, torch.Tensor))})."
)
for name in compared_keys:
_assert_collated_parity(
embodiment_tag,
name,
processed[name],
original_inputs[name],
exact=name in _COLLATED_EXACT_KEYS,
)
@@ -9,9 +9,6 @@ LeRobot GR00T N1.7 integration requires. The two implementations therefore canno
imported in the same Python process. To keep the parity comparison FAIR, we run the
original model in its native env here and serialize, PER EMBODIMENT TAG:
* the RAW observation fed to the original processor (per-camera uint8 frames,
per-key state vectors, the language instruction), so the LeRobot side can also
run its OWN preprocessor on identical raw inputs and compare collated tensors,
* the exact pre-processed/collated model inputs (so the LeRobot side consumes the
byte-identical tensors -- same image preprocessing, tokenization, normalization),
* the random seed used right before the flow-matching sampler,
@@ -24,10 +21,8 @@ processor's per-embodiment modality configs. This lets us test many embodiment t
from the SAME checkpoint and confirm the LeRobot integration is not overfit to
``libero_sim``.
The companion pytest (run in the LeRobot env) loads each .npz and asserts parity
twice: the collated inputs + seed are replayed through the LeRobot GR00T N1.7 model
(model parity), and the raw observation is replayed through LeRobot's own
preprocessor pipeline and compared against the collated inputs (preprocessor parity).
The companion pytest (run in the LeRobot env) loads each .npz, replays the identical
inputs + seed through the LeRobot GR00T N1.7 model, and asserts the outputs match.
Usage:
.venv-original/bin/python tests/policies/groot/utils/dump_original_n1_7.py \
@@ -67,7 +62,10 @@ def make_observation(seed: int, video_keys, lang_key, state_spec):
# One ndarray per state key, shape (B, T=1, key_dim); dim taken from statistics.
# Keys with dim 0 (e.g. disabled eef on some embodiments) are still emitted as
# present-but-empty so the processor's state transform finds every expected key.
state = {k: rng.standard_normal((BATCH_SIZE, 1, dim)).astype(np.float32) for k, dim in state_spec}
state = {
k: rng.standard_normal((BATCH_SIZE, 1, dim)).astype(np.float32)
for k, dim in state_spec
}
language = {lang_key: [[PROMPT] for _ in range(BATCH_SIZE)]}
return {"video": video, "state": state, "language": language}
@@ -79,25 +77,6 @@ def dump_one_tag(policy, fair_model, tag, modality_cfg, state_spec, args, out_pa
lang_key = modality_cfg["language"].modality_keys[0]
observation = make_observation(args.seed, video_keys, lang_key, state_spec)
# Snapshot the RAW observation exactly as fed to the original processor below. The
# consumer's preprocessor-parity case replays it through LeRobot's own preprocessor
# and compares the resulting collated tensors against the "in::" ones saved further
# down. raw_state_keys records the checkpoint modality-key order, which is the
# concatenation order of the flat LeRobot ``observation.state`` vector.
spec_keys = [key for key, _ in state_spec]
state_modality = modality_cfg.get("state")
state_keys = [key for key in state_modality.modality_keys if key in spec_keys] if state_modality else []
state_keys += [key for key in spec_keys if key not in state_keys]
raw_language = [
str(item[0]) if isinstance(item, (list, tuple)) else str(item)
for item in observation["language"][lang_key]
]
raw_flat = {f"raw::video.{key}": arr.copy() for key, arr in observation["video"].items()}
raw_flat.update({f"raw::state.{key}": arr.copy() for key, arr in observation["state"].items()})
raw_flat["raw::language"] = np.array(raw_language, dtype=object)
raw_flat["raw_video_keys"] = np.array([str(key) for key in video_keys], dtype=object)
raw_flat["raw_state_keys"] = np.array([str(key) for key in state_keys], dtype=object)
# Point the policy preprocessing at this embodiment (mirrors Gr00tPolicy.__init__).
policy.embodiment_tag = type(policy.embodiment_tag)(tag)
policy.modality_configs = {
@@ -157,7 +136,6 @@ def dump_one_tag(policy, fair_model, tag, modality_cfg, state_spec, args, out_pa
embodiment_tag=np.array(tag),
meta_keys=np.array(list(meta.keys()), dtype=object),
meta_dtypes=np.array(list(meta.values()), dtype=object),
**raw_flat,
**flat,
)
print(f"[{tag}] action_pred {action_pred.shape} -> {out_path.name} ({os.path.getsize(out_path)} B)")
@@ -203,12 +181,7 @@ def main():
state_spec = [(k, len(v["min"])) for k, v in stats[tag]["state"].items()]
try:
dump_one_tag(
policy,
fair_model,
tag,
all_modality[tag],
state_spec,
args,
policy, fair_model, tag, all_modality[tag], state_spec, args,
out_dir / f"original_n1_7_{tag}.npz",
)
done.append(tag)
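The `state_spec` line in `main()` takes each state key's dimension from the length of its `min` statistic, and `make_observation` then emits one `(B, T=1, dim)` array per key, including keys with dimension 0. The sketch below reproduces that flow without NumPy: the statistics dictionary, the `demo_tag` name, and the batch size are invented for illustration, and `random.Random.gauss` is swapped in for `rng.standard_normal`.

```python
import random

BATCH_SIZE = 2  # assumed value for illustration only

# Hypothetical checkpoint statistics for one embodiment tag.
stats = {
    "demo_tag": {
        "state": {
            "arm": {"min": [0.0] * 7, "max": [1.0] * 7},
            "gripper": {"min": [0.0], "max": [1.0]},
            "eef": {"min": [], "max": []},  # disabled key with dimension 0
        }
    }
}

tag = "demo_tag"
state_spec = [(k, len(v["min"])) for k, v in stats[tag]["state"].items()]

rng = random.Random(0)
# One nested list per state key, shaped (B, T=1, dim).
state = {
    k: [[[rng.gauss(0.0, 1.0) for _ in range(dim)]] for _ in range(BATCH_SIZE)]
    for k, dim in state_spec
}

print(state_spec)
print({k: (len(v), len(v[0]), len(v[0][0])) for k, v in state.items()})
```

The zero-dimension key is still present in `state`, just empty, which matches the comment about the processor's state transform needing every expected key.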