Mirror of https://github.com/huggingface/lerobot.git, synced 2026-06-16 07:49:48 +00:00
Compare commits
6 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 226a4c5a8c | |
| | 05a9ca274b | |
| | 13ed657056 | |
| | 559cba212d | |
| | 895eaf0d7c | |
| | edda8552ec | |

@@ -4,6 +4,9 @@ GR00T is an NVIDIA foundation model family for generalized humanoid robot reason

 LeRobot integrates GR00T N1.7 through the `groot` policy type.

+> [!WARNING]
+> **Breaking change:** GR00T N1.5 support was removed from LeRobot, and current releases support GR00T N1.7 only. N1.5 checkpoints, configs, and `--policy.model_version=n1.5` are rejected with a clear error. To keep using an N1.5 checkpoint, pin the last release that supports it: `pip install 'lerobot==0.5.1'`. To use the current release, migrate to GR00T N1.7 (`model_version='n1.7'`, base model [`nvidia/GR00T-N1.7-3B`](https://huggingface.co/nvidia/GR00T-N1.7-3B)).
+
 ## Model Overview

 GR00T N1.7 uses a Cosmos-Reason2/Qwen3-VL backbone and provides checkpoints for SimplerEnv, DROID, and LIBERO.
@@ -133,7 +136,7 @@ Replace the `XX` placeholders with final eval artifacts before merge.
 Download the suite checkpoint locally, then point `--policy.base_model_path` at the downloaded subdirectory. `--policy.path` is reserved for LeRobot checkpoints that contain a LeRobot `config.json` with a `type` field.

```bash
-huggingface-cli download nvidia/GR00T-N1.7-LIBERO \
+hf download nvidia/GR00T-N1.7-LIBERO \
   --include "libero_spatial/*" \
   --local-dir ./GR00T-N1.7-LIBERO

@@ -1,6 +1,13 @@
 ## Research Paper

-Paper: https://research.nvidia.com/labs/gear/gr00t-n1_5/
+GR00T N1 technical report (covers the GR00T N1.x family, including N1.7): https://arxiv.org/abs/2503.14734
+
+GR00T N1.7 model card: https://huggingface.co/nvidia/GR00T-N1.7-3B
+
+GR00T N1.5 research page (earlier version): https://research.nvidia.com/labs/gear/gr00t-n1_5/
+
+> GR00T N1.5 support was removed from LeRobot; the last release supporting it is `lerobot==0.5.1`.
+> Current releases support GR00T N1.7 only.

 ## Repository

@@ -31,12 +38,22 @@ Hugging Face Models:

 ## Original-vs-LeRobot parity test

-`tests/policies/groot/test_groot_vs_original.py` verifies that this LeRobot
+`tests/policies/groot/test_groot_vs_original.py` verifies this LeRobot
 reimplementation of GR00T N1.7 (Qwen3-VL backbone + flow-matching action head)
-produces the **same raw model output** (`get_action(...)["action_pred"]`, the
-normalized flow-matching prediction) as NVIDIA's original `gr00t` package, given
-byte-identical pre-processed inputs and the same flow-matching seed. It is
-parametrized over every embodiment tag present in the checkpoint.
+against NVIDIA's original `gr00t` package with two comparisons, each parametrized
+over every embodiment tag present in the checkpoint:
+
+1. **Model parity** — given byte-identical pre-processed inputs and the same
+   flow-matching seed (recorded in each artifact), both implementations must produce
+   the **same raw model output** (`get_action(...)["action_pred"]`, the normalized
+   flow-matching prediction). Output shapes must match exactly; any action-horizon
+   or action-dim mismatch fails the test.
+2. **Preprocessor parity** — given the identical raw observations (per-camera
+   frames, state vectors, language instruction), LeRobot's own preprocessor pipeline
+   (real Qwen3-VL chat template / tokenizer / image packing + checkpoint-driven
+   state normalization, no mocks) must produce the **same collated model inputs**
+   (`input_ids`, `attention_mask`, `pixel_values`, `image_grid_thw`, `state`,
+   `embodiment_id`) as the original package's processor.

 ### Why two environments

@@ -48,25 +65,37 @@ is itself a defaulted dataclass, so the original config dataclasses fail to impo

 So the test uses a **producer / consumer** split across two venvs:

-1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the *original*
+1. **Producer** — `tests/policies/groot/utils/dump_original_n1_7.py`, run in the _original_
    gr00t venv. For each embodiment it builds dummy inputs generically from the
    checkpoint metadata (state dims from `statistics.json`; camera/language keys from
-   the processor modality configs), runs the original model, and saves the exact
-   collated inputs + raw `action_pred` to one `.npz` per tag.
-2. **Consumer** — the pytest above, run in the *LeRobot* venv. It discovers every
-   `.npz`, replays the byte-identical inputs through the LeRobot model with the same
-   seed, and asserts the outputs match.
+   the processor modality configs), runs the original model, and saves to one `.npz`
+   per tag: the raw observations (`raw::` keys), the exact collated inputs
+   (`in::` keys), the seed, and the raw `action_pred`.
+2. **Consumer** — the pytest above, run in the _LeRobot_ venv. It discovers every
+   `.npz`; the model-parity case replays the byte-identical collated inputs through
+   the LeRobot model with the recorded seed and asserts the outputs match, and the
+   preprocessor-parity case replays the raw observations through LeRobot's full
+   preprocessor pipeline and asserts the collated tensors match.
+
+> Artifacts generated by older versions of the dump script contain no `raw::`
+> fields; the preprocessor-parity case then **skips** with a regeneration hint.
+> Re-run the producer to refresh them.

 ### Fairness controls

-- **Same pre-processed inputs** — the original processor's `input_ids`,
+- **Same pre-processed inputs (model parity)** — the original processor's `input_ids`,
   `pixel_values`, `image_grid_thw`, `attention_mask`, `state`, `embodiment_id` are
-  fed verbatim to the LeRobot model (no re-tokenization / re-normalization).
+  fed verbatim to the LeRobot model (no re-tokenization / re-normalization), so the
+  model comparison isolates the model. LeRobot's own tokenization / image packing is
+  covered separately by the preprocessor-parity case, which compares its output
+  against those same collated tensors from identical raw observations.
 - **Same precision + attention kernel** — both sides run **fp32 + SDPA**. The
   original defaults to `use_flash_attention=True` (flash_attention_2 + bf16); the
   producer forces SDPA + fp32. (With the defaults the gap is ~3e-2 — pure
   kernel/rounding noise, not an implementation difference.)
-- **Same flow-matching seed** — fixed (42) right before sampling on both sides.
+- **Same flow-matching seed** — fixed right before sampling on both sides; the
+  producer records it in each artifact (`--seed`, default 42) and the consumer
+  replays the recorded value.

 ### How to run

@@ -90,15 +119,15 @@ CUDA_VISIBLE_DEVICES=0 GROOT_PARITY_DEVICE=cuda \
   uv run pytest tests/policies/groot/test_groot_vs_original.py -v -s
```

-The `.npz` artifacts are local-only (gitignored, ~6–9 MB each) and are regenerated by
-the producer; they are never committed. The test **skips** (does not fail) on CI or
+The `.npz` artifacts are local-only (gitignored, ~6–10 MB each) and are regenerated by
+the producer; they are never committed. The tests **skip** (do not fail) on CI or
 when the checkpoint / artifacts are absent.

 #### Env knobs (all optional)

-| Var | Default | Purpose |
-|---|---|---|
-| `GROOT_N1_7_PARITY_DIR` | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
-| `GROOT_N1_7_LIBERO_CKPT` | auto (HF cache) | override checkpoint dir |
-| `GROOT_PARITY_DEVICE` | `cuda` if available | `cpu` or `cuda` |
-| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3` | comparison tolerance |
+| Var                                       | Default                          | Purpose                               |
+| ----------------------------------------- | -------------------------------- | ------------------------------------- |
+| `GROOT_N1_7_PARITY_DIR`                   | `tests/policies/groot/artifacts` | directory of per-tag `.npz` artifacts |
+| `GROOT_N1_7_LIBERO_CKPT`                  | auto (HF cache)                  | override checkpoint dir               |
+| `GROOT_PARITY_DEVICE`                     | `cuda` if available              | `cpu` or `cuda`                       |
+| `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` | `1e-3`                           | comparison tolerance                  |

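As a rough illustration of how the tolerance knobs in the table above could be consumed, the sketch below reads `GROOT_PARITY_ATOL` / `GROOT_PARITY_RTOL` with the documented `1e-3` default and applies an allclose-style check. The helper names `read_tolerances` and `outputs_match` are hypothetical, and the elementwise rule `|a - e| <= atol + rtol * |e|` (the convention used by `numpy.allclose`) is an assumption about how the test compares outputs, not something the patch states.

```python
import os


def read_tolerances(env=None):
    """Return (atol, rtol) from the environment, defaulting both to 1e-3."""
    env = os.environ if env is None else env
    atol = float(env.get("GROOT_PARITY_ATOL", "1e-3"))
    rtol = float(env.get("GROOT_PARITY_RTOL", "1e-3"))
    return atol, rtol


def outputs_match(actual, expected, atol, rtol):
    """Allclose-style comparison over two flat sequences of floats.

    A length mismatch fails outright, in the spirit of the rule that any
    action-horizon or action-dim mismatch fails the test.
    """
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


atol, rtol = read_tolerances({})  # empty mapping: fall back to the defaults
print(outputs_match([0.5, -1.2], [0.5004, -1.2003], atol, rtol))  # within tolerance
print(outputs_match([0.5, -1.2], [0.54, -1.2], atol, rtol))  # 4e-2 gap, outside tolerance
```

With these defaults, the ~3e-2 gap the document attributes to flash-attention + bf16 would fall outside the tolerance, which is consistent with the producer forcing SDPA + fp32.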
@@ -14,6 +14,7 @@
 # limitations under the License.


+import logging
 from typing import TYPE_CHECKING

 import torch
@@ -42,6 +43,9 @@ else:
     Timesteps = None


+logger = logging.getLogger(__name__)
+
+
 class TimestepEncoder(nn.Module):
     def __init__(self, embedding_dim, compute_dtype=torch.float32):
         require_package("diffusers", extra="groot")
@@ -265,8 +269,8 @@ class DiT(ModelMixin, ConfigMixin):
         self.norm_out = nn.LayerNorm(self.inner_dim, elementwise_affine=False, eps=1e-6)
         self.proj_out_1 = nn.Linear(self.inner_dim, 2 * self.inner_dim)
         self.proj_out_2 = nn.Linear(self.inner_dim, self.config.output_dim)
-        print(
-            "Total number of DiT parameters: ",
+        logger.debug(
+            "Total number of DiT parameters: %d",
             sum(p.numel() for p in self.parameters() if p.requires_grad),
         )

@@ -426,8 +430,8 @@ class SelfAttentionTransformer(ModelMixin, ConfigMixin):
                 for _ in range(self.config.num_layers)
             ]
         )
-        print(
-            "Total number of SelfAttentionTransformer parameters: ",
+        logger.debug(
+            "Total number of SelfAttentionTransformer parameters: %d",
             sum(p.numel() for p in self.parameters() if p.requires_grad),
         )

@@ -42,6 +42,14 @@ GROOT_N1_5_REMOVAL_GUIDANCE = (
 )
 GROOT_N1_7_BASE_MODEL = "nvidia/GR00T-N1.7-3B"
 GROOT_N1_7_BACKBONE_MODEL = "nvidia/Cosmos-Reason2-2B"
+# Image preprocessing geometry the GR00T N1.7 backbone was trained on. The processor
+# falls back to these when a checkpoint ships no image sizing in its processor_config
+# (e.g. fine-tuning the raw nvidia/GR00T-N1.7-3B base with a new embodiment), so frames
+# are resized to the expected resolution instead of being patchified at full camera
+# resolution (which both slows training and is a train/checkpoint distribution mismatch).
+# Mirrored by GR00T_N1_7_DEFAULTS in groot_n1_7.py.
+N1_7_DEFAULT_IMAGE_TARGET_SIZE = (256, 256)
+N1_7_DEFAULT_IMAGE_CROP_SIZE = (230, 230)
 GROOT_ACTION_DECODE_TRANSFORM_LIBERO = "libero"
 # Sentinel meaning "the user did not pick an action decode transform": __post_init__ resolves it
 # to the embodiment default ('libero' for 'libero_sim', otherwise None). It is distinct from an
@@ -381,6 +389,40 @@ class GrootConfig(PreTrainedConfig):
     # Embodiment tag to use for training (e.g. 'new_embodiment', 'gr1')
     embodiment_tag: str = "new_embodiment"

+    # Inference-only override for the number of flow-matching denoising steps used to decode an
+    # action chunk. None = use the model checkpoint default (currently 4). Higher values trade
+    # inference speed for action quality; applied at base-model load via _create_groot_model.
+    num_inference_timesteps: int | None = None
+
+    # If set, caps the number of open-loop actions executed before replanning (inference cadence).
+    # Overrides the value inferred from the checkpoint/embodiment in _resolve_action_queue_steps.
+    execution_horizon: int | None = None
+
+    # Opt-in. Copy a pretrained embodiment category slot's action-head weights into the target
+    # embodiment slot at base-model build (in _create_groot_model), to warm-start a cold
+    # 'new_embodiment' slot. Accepts an embodiment name (e.g.
+    # 'oxe_droid_relative_eef_relative_joint') or an int embodiment id. Runs on every fresh
+    # base-model build (so it applies during lerobot-train, which uses __init__ not
+    # from_pretrained); on a fine-tuned checkpoint reload it is harmlessly overwritten.
+    warm_start_embodiment_slot: int | str | None = None
+
+    # Opt-in relative-action support for the 'new_embodiment' slot (sync-safe, GR00T-native).
+    # When True, GR00T converts absolute->relative inside its own pack step (training) and
+    # reconstructs absolute inside its own flat decode step (inference), using a cached
+    # reference state. The dataset stays absolute; compute relative ACTION stats with
+    # `lerobot-edit-dataset --operation.relative_action true --operation.relative_exclude_joints
+    # "['gripper']"` (this only rewrites stats, not actions).
+    use_relative_actions: bool = False
+
+    # Joint names kept absolute (not converted to relative) when use_relative_actions is True.
+    # Case-insensitive token match against action_feature_names.
+    relative_exclude_joints: list[str] = field(default_factory=lambda: ["gripper"])
+
+    # Action dimension names from dataset metadata; auto-populated by the factory from dataset
+    # meta (see factory.py:528). Used to build the relative-action mask so the gripper can be
+    # identified and kept absolute. When None, the gripper cannot be identified.
+    action_feature_names: list[str] | None = None
+
     # Fine-tuning control arguments

     # Whether to fine-tune the llm backbone

@@ -32,6 +32,7 @@ from torch.distributions import Beta
 from lerobot.utils.import_utils import _transformers_available, require_package

 from .action_head.cross_attention_dit import AlternateVLDiT, DiT, SelfAttentionTransformer
+from .configuration_groot import N1_7_DEFAULT_IMAGE_CROP_SIZE, N1_7_DEFAULT_IMAGE_TARGET_SIZE

 if TYPE_CHECKING or _transformers_available:
     from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel
@@ -71,13 +72,13 @@ GR00T_N1_7_DEFAULTS: dict[str, Any] = {
     "backbone_embedding_dim": 2048,
     "tune_llm": False,
     "tune_visual": False,
-    "select_layer": 12,
+    "select_layer": 16,
     "reproject_vision": False,
     "use_flash_attention": True,
     "load_bf16": False,
     "backbone_trainable_params_fp32": True,
-    "image_crop_size": (230, 230),
-    "image_target_size": (256, 256),
+    "image_crop_size": N1_7_DEFAULT_IMAGE_CROP_SIZE,
+    "image_target_size": N1_7_DEFAULT_IMAGE_TARGET_SIZE,
     "shortest_image_edge": None,
     "crop_fraction": None,
     "random_rotation_angle": None,
@@ -819,11 +820,14 @@ def _cosmos_reason2_qwen3_vl_config() -> PretrainedConfig:


 def get_backbone_cls(config: GR00TN17Config):
-    if (
-        config.backbone_model_type == "qwen"
-        or "nvidia/Cosmos-Reason2" in config.model_name
-        or "Qwen/Qwen3-VL" in config.model_name
-    ):
+    if "nvidia/Cosmos-Reason2" in config.model_name or "Qwen/Qwen3-VL" in config.model_name:
         return Qwen3Backbone
+    if config.backbone_model_type == "qwen":
+        logger.warning(
+            "Unrecognized GR00T N1.7 backbone model name '%s'; assuming a Qwen3-VL-compatible "
+            "backbone because backbone_model_type='qwen'.",
+            config.model_name,
+        )
+        return Qwen3Backbone
     raise ValueError(f"Unsupported GR00T N1.7 backbone model: {config.model_name}")

@@ -909,7 +913,7 @@ class GR00TN17(PreTrainedModel):
             "trust_remote_code": True
         }
         load_backbone_weights = kwargs.pop("load_backbone_weights", False)
-        for key in ("revision", "cache_dir", "local_files_only", "token"):
+        for key in ("cache_dir", "local_files_only", "token"):
             if key in kwargs:
                 transformers_loading_kwargs.setdefault(key, kwargs[key])

@@ -54,6 +54,98 @@ logger = logging.getLogger(__name__)
 T = TypeVar("T", bound="GrootPolicy")


+def _resolve_embodiment_id(value: int | str) -> int:
+    """Resolve an embodiment id from an int or an N1.7 embodiment name.
+
+    Names are looked up in N1_7_EMBODIMENT_MAPPING (e.g. 'new_embodiment' -> 10).
+    Raises ValueError listing the known keys if the name is unknown.
+    """
+    from .processor_groot import N1_7_EMBODIMENT_MAPPING
+
+    if isinstance(value, bool):  # bool is a subclass of int; reject it explicitly.
+        raise ValueError(f"Embodiment id must be an int or embodiment name, got bool {value!r}.")
+    if isinstance(value, int):
+        return value
+    if value in N1_7_EMBODIMENT_MAPPING:
+        return N1_7_EMBODIMENT_MAPPING[value]
+    raise ValueError(
+        f"Unknown GR00T N1.7 embodiment name '{value}'. Known names: "
+        f"{sorted(N1_7_EMBODIMENT_MAPPING.keys())}."
+    )
+
+
+def _warm_start_embodiment_slot(model, source_id: int, target_id: int) -> None:
+    """Copy category-specific action-head weights from one embodiment slot to another.
+
+    Used at base-model load (training only) to warm-start a cold target embodiment slot
+    (e.g. 'new_embodiment') from a pretrained slot. Copies the per-category ``W``/``b``
+    parameters across every CategorySpecificLinear in the action head's state encoder,
+    action encoder, and action decoder. No-ops (with a logged warning) if the ids are out
+    of range or identical.
+    """
+    if source_id == target_id:
+        logger.warning(
+            "GR00T warm_start_embodiment_slot: source and target embodiment id are both %d; "
+            "skipping (nothing to copy).",
+            source_id,
+        )
+        return
+
+    action_head = getattr(model, "action_head", None)
+    if action_head is None:
+        logger.warning("GR00T warm_start_embodiment_slot: model has no action_head; skipping.")
+        return
+
+    # Each entry is (submodule, [CategorySpecificLinear attribute names]).
+    linear_groups = [
+        (getattr(action_head, "state_encoder", None), ["layer1", "layer2"]),
+        (getattr(action_head, "action_encoder", None), ["W1", "W2", "W3"]),
+        (getattr(action_head, "action_decoder", None), ["layer1", "layer2"]),
+    ]
+
+    copied: list[str] = []
+    with torch.no_grad():
+        for submodule, attr_names in linear_groups:
+            if submodule is None:
+                continue
+            submodule_name = type(submodule).__name__
+            for attr_name in attr_names:
+                lin = getattr(submodule, attr_name, None)
+                if lin is None or not hasattr(lin, "W") or not hasattr(lin, "b"):
+                    continue
+                num_categories = lin.W.shape[0]
+                if not (0 <= source_id < num_categories and 0 <= target_id < num_categories):
+                    logger.warning(
+                        "GR00T warm_start_embodiment_slot: source_id=%d/target_id=%d out of range "
+                        "for %s.%s (num_categories=%d); skipping this layer.",
+                        source_id,
+                        target_id,
+                        submodule_name,
+                        attr_name,
+                        num_categories,
+                    )
+                    continue
+                lin.W.data[target_id] = lin.W.data[source_id].clone()
+                lin.b.data[target_id] = lin.b.data[source_id].clone()
+                copied.append(f"{submodule_name}.{attr_name}")
+
+    if copied:
+        logger.info(
+            "GR00T warm_start_embodiment_slot: copied action-head weights from embodiment slot %d "
+            "to slot %d for: %s.",
+            source_id,
+            target_id,
+            ", ".join(copied),
+        )
+    else:
+        logger.warning(
+            "GR00T warm_start_embodiment_slot: no action-head weights were copied "
+            "(source_id=%d, target_id=%d).",
+            source_id,
+            target_id,
+        )
+
+
 class GrootPolicy(PreTrainedPolicy):
     """Wrapper around external Groot model for LeRobot integration."""

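The id resolution and slot copy in the hunk above can be tried in isolation with a small stand-in. This is only a sketch: `resolve_embodiment_id` takes the mapping as a parameter instead of importing `N1_7_EMBODIMENT_MAPPING`, `EXAMPLE_MAPPING` contains just the one entry the docstring mentions (the real table has more), and `copy_slot` is a hypothetical plain-list analogue of the per-category `W`/`b` copy, not the torch code.

```python
def resolve_embodiment_id(value, mapping):
    """Resolve an int id or a known name; bool is rejected even though it subclasses int."""
    if isinstance(value, bool):
        raise ValueError(f"Embodiment id must be an int or embodiment name, got bool {value!r}.")
    if isinstance(value, int):
        return value
    if value in mapping:
        return mapping[value]
    raise ValueError(f"Unknown embodiment name '{value}'. Known names: {sorted(mapping)}.")


def copy_slot(per_category_rows, source_id, target_id):
    """Copy one category's row into another; skip identical or out-of-range ids."""
    n = len(per_category_rows)
    if source_id == target_id or not (0 <= source_id < n and 0 <= target_id < n):
        return False
    per_category_rows[target_id] = list(per_category_rows[source_id])
    return True


# Only 'new_embodiment' -> 10 comes from the docstring; this mapping is illustrative.
EXAMPLE_MAPPING = {"new_embodiment": 10}

print(resolve_embodiment_id("new_embodiment", EXAMPLE_MAPPING))
print(resolve_embodiment_id(3, EXAMPLE_MAPPING))
for bad in (True, "not_a_tag"):
    try:
        resolve_embodiment_id(bad, EXAMPLE_MAPPING)
    except ValueError as err:
        print("rejected:", err)

# Twelve toy "slots" of weights; slot 10 starts cold (zeros) and is warmed from slot 3.
rows = [[float(i)] * 2 for i in range(12)]
rows[10] = [0.0, 0.0]
print(copy_slot(rows, 3, 10), rows[10])
print(copy_slot(rows, 3, 99))
```

Passing an int bypasses the name lookup entirely, so range checking is left to the copy step, which matches the per-layer `num_categories` guard in the patch.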
@@ -93,6 +185,25 @@ class GrootPolicy(PreTrainedPolicy):
             transformers_loading_kwargs={"trust_remote_code": True},
         )

+        # Inference-only override for the number of flow-matching denoising steps. The action
+        # head reads self.num_inference_timesteps in get_action_with_features; dt (1/n) and the
+        # t schedule adapt automatically.
+        if self.config.num_inference_timesteps is not None:
+            n = int(self.config.num_inference_timesteps)
+            model.config.num_inference_timesteps = n
+            model.action_head.num_inference_timesteps = n
+
+        # Opt-in: warm-start a cold embodiment slot (e.g. 'new_embodiment') from a pretrained
+        # slot's action-head weights. Done here (not in from_pretrained) so it applies on every
+        # fresh base-model build -- training via make_policy instantiates GrootPolicy(config)
+        # directly (factory uses __init__ when cfg.pretrained_path is unset), it does NOT go
+        # through from_pretrained. On a fine-tuned checkpoint reload this also runs but is
+        # immediately overwritten by the loaded state_dict, so it is a harmless no-op there.
+        if self.config.warm_start_embodiment_slot is not None:
+            source_id = _resolve_embodiment_id(self.config.warm_start_embodiment_slot)
+            target_id = _resolve_embodiment_id(self.config.embodiment_tag)
+            _warm_start_embodiment_slot(model, source_id, target_id)
+
         return model

     def reset(self):
@@ -260,7 +371,11 @@ class GrootPolicy(PreTrainedPolicy):
             horizons.append(checkpoint_action_horizon)
         if execution_horizon is not None:
             horizons.append(execution_horizon)
-        return min(horizons)
+        # An explicit config override caps the open-loop horizon (inference cadence), overriding
+        # the value inferred from the checkpoint/embodiment.
+        if self.config.execution_horizon is not None:
+            horizons.append(max(1, int(self.config.execution_horizon)))
+        return max(1, min(horizons))

     def _resolve_prediction_horizon(self, actions: Tensor) -> int:
         """Return the policy-facing action horizon for a native GR00T prediction."""
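A standalone sketch of the clamping logic in the hunk above may help when choosing an `execution_horizon`. The function name `resolve_action_queue_steps` and the `inferred_horizons` argument are stand-ins: how the real method assembles its candidate list before this point is not shown in the diff, so the sketch simply assumes a non-empty list of inferred values.

```python
def resolve_action_queue_steps(inferred_horizons, config_execution_horizon=None):
    """Pick the open-loop horizon: the smallest candidate, never below 1.

    `inferred_horizons` is assumed to be a non-empty list of values inferred
    from the checkpoint/embodiment; the config override is added as one more
    candidate, so it can shorten the horizon but not lengthen it.
    """
    horizons = list(inferred_horizons)
    if config_execution_horizon is not None:
        horizons.append(max(1, int(config_execution_horizon)))
    return max(1, min(horizons))


print(resolve_action_queue_steps([16, 8]))      # no override: smallest inferred value
print(resolve_action_queue_steps([16, 8], 4))   # override shortens the horizon
print(resolve_action_queue_steps([16, 8], 32))  # override above the inferred cap has no effect
print(resolve_action_queue_steps([16, 8], 0))   # non-positive override is clamped to 1
```

Because the override enters through `min`, it acts as a cap: under the assumption above, a value larger than the inferred horizon leaves the result unchanged.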
@@ -428,6 +543,16 @@ class GrootPolicy(PreTrainedPolicy):
         """
         self.eval()

+        # Freeze the relative-action reference at the exact chunk-prediction event so every popped
+        # delta of this chunk is reconstructed (in the postprocessor) against this S_T, not the
+        # per-tick latest state. Driven by the predict event, so it is correct under any runtime
+        # n_action_steps/execution_horizon. No-op for non-relative checkpoints (holder absent/unused).
+        from .processor_groot import _GROOT_REF_HOLDER_KEY
+
+        holder = batch.get(_GROOT_REF_HOLDER_KEY)
+        if holder is not None:
+            holder.freeze()
+
         # Preprocessing is handled by the processor pipeline, so we just filter the batch.
         # During inference, we do not pass action because it is predicted.
         # N1.7 still carries a 2-D action horizon mask from its checkpoint processor.

@@ -23,8 +23,10 @@ from typing import TYPE_CHECKING, Any

 import numpy as np
 import torch
+import torchvision.transforms.v2.functional as tv_functional
 from einops import rearrange
 from PIL import Image
+from torchvision.transforms import InterpolationMode

 from lerobot.utils.import_utils import _transformers_available

@@ -45,6 +47,8 @@ from lerobot.processor import (
     RenameObservationsProcessorStep,
     batch_to_transition,
     policy_action_to_transition,
+    to_absolute_actions,
+    to_relative_actions,
     transition_to_batch,
     transition_to_policy_action,
 )
@@ -57,11 +61,14 @@ from lerobot.utils.constants import (
     POLICY_POSTPROCESSOR_DEFAULT_NAME,
     POLICY_PREPROCESSOR_DEFAULT_NAME,
 )
+from lerobot.utils.device_utils import get_safe_torch_device

 from .configuration_groot import (
     GROOT_ACTION_DECODE_TRANSFORM_LIBERO,
     GROOT_N1_5_REMOVAL_GUIDANCE,
     GROOT_N1_7_BACKBONE_MODEL,
+    N1_7_DEFAULT_IMAGE_CROP_SIZE,
+    N1_7_DEFAULT_IMAGE_TARGET_SIZE,
     GrootConfig,
     is_raw_groot_n1_7_checkpoint,
 )
@@ -83,6 +90,30 @@ N1_7_EMBODIMENT_MAPPING = {
 }


+_GROOT_REF_HOLDER_KEY = "_groot_relative_ref_holder"  # private; dropped by _filter_groot_inputs, never reaches the model
+
+
+class _GrootRelativeRefHolder:
+    """Runtime-only carrier shared (by object identity) between the pack step (owner/writer of the
+    live reference), GrootPolicy.predict_action_chunk (freezes it at a real predict event), and the
+    decode step (reads the frozen reference). Not serialized. One instance per pack step."""
+
+    __slots__ = ("reference_state", "raw_state", "frozen_reference", "frozen_raw")
+
+    def __init__(self):
+        self.reference_state = None
+        self.raw_state = None
+        self.frozen_reference = None
+        self.frozen_raw = None
+
+    def freeze(self) -> None:
+        self.frozen_reference = self.reference_state
+        self.frozen_raw = self.raw_state
+
+    def clear(self) -> None:
+        self.reference_state = self.raw_state = self.frozen_reference = self.frozen_raw = None
+
+
 @dataclass
 class _GrootN17CheckpointProcessorAssets:
     """Processor metadata loaded from a raw Isaac-GR00T N1.7 checkpoint.
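The effect of freezing the reference at the predict event can be seen with a toy loop. This is a simplified illustration, not the processor code: `RefHolder` keeps only two of the four slots, states are plain floats, and the two-tick replanning cadence and the constant `delta` are made up for the example.

```python
class RefHolder:
    """Minimal stand-in for the holder: a live reference plus a frozen snapshot."""

    __slots__ = ("reference_state", "frozen_reference")

    def __init__(self):
        self.reference_state = None
        self.frozen_reference = None

    def freeze(self):
        # Snapshot the live reference at a chunk-prediction event.
        self.frozen_reference = self.reference_state


holder = RefHolder()
delta = 0.5  # hypothetical relative action popped from the current chunk
decoded = []
for tick, state in enumerate([10.0, 11.0, 12.0, 13.0]):
    holder.reference_state = state  # pack step: updates the live reference every tick
    if tick % 2 == 0:  # predict event: a new chunk is produced every 2 ticks here
        holder.freeze()
    # Decode step: reconstruct the absolute action against the frozen reference.
    decoded.append(holder.frozen_reference + delta)

print(decoded)  # [10.5, 10.5, 12.5, 12.5]
```

Decoding against the live state instead would shift the second action of each chunk (11.5 and 13.5 here), which is the drift the freeze is meant to avoid. Note that `freeze` stores a reference rather than a copy, so this relies on the pack step assigning a new object each tick instead of mutating the old one in place; whether the real pack step does so is not visible in this diff.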
@@ -112,6 +143,39 @@ class _GrootN17CheckpointProcessorAssets:
     use_albumentations: bool


+def _resolve_base_model_local_dir(base_model_path: str | None) -> str | None:
+    """Resolve a base model path to a local snapshot dir holding its sidecar JSONs.
+
+    ``is_raw_groot_n1_7_checkpoint`` needs a local directory (or config.json) to inspect, so a
+    bare HF repo-id (e.g. ``nvidia/GR00T-N1.7-3B``) would never be recognised as a raw N1.7
+    checkpoint and the processor would fall back to LeRobot default image geometry instead of the
+    checkpoint's processor_config.json geometry. When the path is not already a local dir, this
+    downloads just the JSON sidecars and returns the local snapshot dir. Offline-safe: any failure
+    returns the original string unchanged. Only used on the fresh-build (training) path; inference
+    loads the serialized processor, so no per-inference network call is added.
+    """
+    if base_model_path is None:
+        return None
+    if Path(base_model_path).expanduser().is_dir():
+        return base_model_path
+    try:
+        from huggingface_hub import snapshot_download
+
+        local_dir = snapshot_download(
+            base_model_path,
+            repo_type="model",
+            allow_patterns=["*.json"],
+        )
+        logging.debug(
+            "Resolved GR00T base model '%s' to local snapshot '%s' for processor asset loading.",
+            base_model_path,
+            local_dir,
+        )
+        return local_dir
+    except Exception:  # noqa: BLE001 (offline-safe: fall back to the original path on any failure)
+        return base_model_path
+
+
 def _load_n1_7_checkpoint_processor_assets(config: GrootConfig) -> _GrootN17CheckpointProcessorAssets | None:
     """Load N1.7 processor settings from checkpoint sidecar JSON files.

@@ -119,10 +183,11 @@ def _load_n1_7_checkpoint_processor_assets(config: GrootConfig) -> _GrootN17Chec
     can keep using caller-provided dataset stats and config values.
     """

-    if not is_raw_groot_n1_7_checkpoint(config.base_model_path):
+    resolved_base_model_path = _resolve_base_model_local_dir(config.base_model_path)
+    if not is_raw_groot_n1_7_checkpoint(resolved_base_model_path):
         return None

-    checkpoint_path = Path(config.base_model_path).expanduser()
+    checkpoint_path = Path(resolved_base_model_path).expanduser()
     processor_config = _read_json(checkpoint_path / "processor_config.json")
     processor_kwargs = processor_config.get("processor_kwargs", {})
     if not isinstance(processor_kwargs, dict):
@@ -447,6 +512,40 @@ def _has_modality_stats(stats: dict[str, dict[str, Any]] | None) -> bool:
     return any(bool(modality_stats) for modality_stats in stats.values())


+def _build_relative_action_mask(
+    action_dim: int,
+    exclude_joints: list[str] | None,
+    action_names: list[str] | None,
+) -> list[bool]:
+    """Build the per-dim relative-action mask (True = convert to relative, False = keep absolute).
+
+    Replicates ``RelativeActionsProcessorStep._build_mask`` semantics: dims are excluded
+    (kept absolute) by case-insensitive token match against ``action_names``.
+
+    When ``action_names`` is None we cannot identify the gripper, so this returns all-True
+    (every dim treated as relative). The user should ensure ``config.action_feature_names`` is
+    populated (the factory does this from dataset meta) so the gripper can be kept absolute;
+    arm-relative still works either way, but a missing-name gripper would be treated as relative.
+    """
+    if not exclude_joints or action_names is None:
+        return [True] * action_dim
+
+    exclude_tokens = [str(name).lower() for name in exclude_joints if name]
+    if not exclude_tokens:
+        return [True] * action_dim
+
+    mask: list[bool] = []
+    for name in action_names[:action_dim]:
+        action_name = str(name).lower()
+        is_excluded = any(token == action_name or token in action_name for token in exclude_tokens)
+        mask.append(not is_excluded)
+
+    if len(mask) < action_dim:
+        mask.extend([True] * (action_dim - len(mask)))
+
+    return mask
+
+
 # GR00T normalizes state/action inside its own processor steps and so deliberately has no
 # NormalizerProcessorStep/UnnormalizerProcessorStep (see GrootConfig.normalization_mapping, which is
 # IDENTITY for every feature). lerobot-train nonetheless emits these standard override keys
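To make the mask semantics in the hunk above concrete, here is a condensed re-statement of the same logic with a few sample inputs. The joint names are hypothetical (real names come from dataset metadata via `action_feature_names`), and the function is a simplified copy written for illustration, not an import of the private helper.

```python
def build_relative_action_mask(action_dim, exclude_joints, action_names):
    """True = convert this dim to relative, False = keep it absolute."""
    if not exclude_joints or action_names is None:
        return [True] * action_dim
    tokens = [str(name).lower() for name in exclude_joints if name]
    if not tokens:
        return [True] * action_dim
    # Case-insensitive substring match of each exclude token against each dim name.
    mask = [
        not any(token in str(name).lower() for token in tokens)
        for name in action_names[:action_dim]
    ]
    # Dims without a name are treated as relative.
    mask.extend([True] * (action_dim - len(mask)))
    return mask


names = ["shoulder_pan.pos", "elbow_flex.pos", "Gripper.pos"]  # hypothetical joint names

print(build_relative_action_mask(3, ["gripper"], names))  # [True, True, False]
print(build_relative_action_mask(3, ["gripper"], None))   # [True, True, True]
print(build_relative_action_mask(4, ["gripper"], names))  # [True, True, False, True]
```

The second call shows the caveat from the docstring: without names the gripper cannot be identified, so it is silently treated as relative.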
@@ -648,8 +747,15 @@ def _reconnect_groot_n1_7_pack_decode_steps(
     if pack_step is None:
         return

+    # Both decode steps read the pack step's cached state via a non-serialized ``pack_step`` link:
+    # GrootN17ActionDecodeStep reads the per-modality raw state; the relative-action path
+    # (GrootActionUnpackUnnormalizeStep) reads the cached reference state. Restore both links after
+    # deserialization.
     for step in postprocessor.steps:
-        if isinstance(step, GrootN17ActionDecodeStep) and step.pack_step is None:
+        if (
+            isinstance(step, (GrootN17ActionDecodeStep, GrootActionUnpackUnnormalizeStep))
+            and step.pack_step is None
+        ):
             step.pack_step = pack_step


@@ -727,23 +833,45 @@ def make_groot_pre_post_processors(
         video_modality_keys=video_modality_keys,
         raw_stats=checkpoint_assets.raw_stats if checkpoint_assets is not None else None,
         modality_config=checkpoint_assets.modality_config if checkpoint_assets is not None else None,
+        use_relative_actions=config.use_relative_actions,
+        relative_exclude_joints=config.relative_exclude_joints,
+        action_feature_names=config.action_feature_names,
     )

+    # Resolve the image preprocessing geometry. Honor the checkpoint's processor_config
+    # when it provides an image_target_size; otherwise fall back to the geometry the
+    # N1.7 backbone was trained on. Without this fallback a raw base checkpoint with no
+    # processor_config image sizing (e.g. fine-tuning nvidia/GR00T-N1.7-3B with a new
+    # embodiment, where checkpoint_assets is None) would patchify full-resolution camera
+    # frames, inflating the VLM token count -- slowing both dataloading_s and update_s --
+    # and feeding the model a resolution it was not trained on.
+    if checkpoint_assets is not None and checkpoint_assets.image_target_size is not None:
+        image_target_size = checkpoint_assets.image_target_size
+        image_crop_size = checkpoint_assets.image_crop_size
+        shortest_image_edge = checkpoint_assets.shortest_image_edge
+        crop_fraction = checkpoint_assets.crop_fraction
+    else:
+        image_target_size = list(N1_7_DEFAULT_IMAGE_TARGET_SIZE)
+        image_crop_size = list(N1_7_DEFAULT_IMAGE_CROP_SIZE)
+        shortest_image_edge = None
+        crop_fraction = None
+    use_albumentations = checkpoint_assets.use_albumentations if checkpoint_assets is not None else False
+
     input_steps: list[ProcessorStep] = [
         RenameObservationsProcessorStep(rename_map={}),
         AddBatchDimensionProcessorStep(),
         pack_step,
         GrootN17VLMEncodeStep(
             model_name=config.n1_7_backbone_model,
-            image_crop_size=checkpoint_assets.image_crop_size if checkpoint_assets is not None else None,
-            image_target_size=checkpoint_assets.image_target_size if checkpoint_assets is not None else None,
-            shortest_image_edge=checkpoint_assets.shortest_image_edge
-            if checkpoint_assets is not None
-            else None,
-            crop_fraction=checkpoint_assets.crop_fraction if checkpoint_assets is not None else None,
-            use_albumentations=checkpoint_assets.use_albumentations
-            if checkpoint_assets is not None
-            else False,
+            image_crop_size=image_crop_size,
+            image_target_size=image_target_size,
+            shortest_image_edge=shortest_image_edge,
+            crop_fraction=crop_fraction,
+            use_albumentations=use_albumentations,
+            # Run the image resize/normalize/patchify on the training device when
+            # possible instead of the single CPU main-loop thread (the dominant
+            # cost folded into dataloading_s).
+            device=config.device,
         ),
         DeviceProcessorStep(device=config.device),
     ]
@@ -767,6 +895,10 @@ def make_groot_pre_post_processors(
             stats=padded_stats,
             normalize_min_max=True,
             clip_normalized_action=True,
+            use_relative_actions=config.use_relative_actions,
+            relative_exclude_joints=config.relative_exclude_joints,
+            action_feature_names=config.action_feature_names,
+            pack_step=pack_step,
         )
     else:
         action_decode_step = GrootN17ActionDecodeStep(
@@ -982,6 +1114,61 @@ def _transform_n1_7_image_for_vlm(
     return image


+def _transform_n1_7_image_for_vlm_torch(
+    image: torch.Tensor,
+    *,
+    image_crop_size: list[int] | None,
+    image_target_size: list[int] | None,
+    shortest_image_edge: int | None,
+    crop_fraction: float | None,
+) -> torch.Tensor:
+    """Torch/torchvision port of the non-albumentations branch of
+    :func:`_transform_n1_7_image_for_vlm`.
+
+    Operates on a ``(C, H, W)`` uint8 tensor and keeps the result on the input
+    tensor's device so the resize/crop run on GPU when the tensor is. Bicubic
+    interpolation with antialiasing matches PIL's ``Image.Resampling.BICUBIC``
+    closely (sub-``2/255`` per-pixel on worst-case inputs). The ``use_albumentations``
+    cv2/INTER_AREA path has no torch equivalent and stays on the PIL helper.
+    """
+    if image_target_size is None:
+        return image
+
+    target_h, target_w = image_target_size
+    _, height, width = image.shape
+
+    square_edge = max(height, width)
+    if height != width:
+        left = (square_edge - width) // 2
+        top = (square_edge - height) // 2
+        image = tv_functional.pad(
+            image, [left, top, square_edge - width - left, square_edge - height - top], fill=0
+        )
+
+    resize_edge = shortest_image_edge or target_h
+    image = tv_functional.resize(
+        image, [resize_edge, resize_edge], interpolation=InterpolationMode.BICUBIC, antialias=True
+    )
+
+    if crop_fraction is None and image_crop_size is not None:
+        crop_fraction = image_crop_size[0] / float(target_h)
+    if crop_fraction is not None and 0.0 < crop_fraction < 1.0:
+        # Match the PIL helper's center crop exactly: round() the crop size but
+        # floor() the offset (torchvision.center_crop rounds the offset, which
+        # shifts the region by 1px when (edge - crop) is odd).
+        crop_h = max(1, int(round(image.shape[-2] * crop_fraction)))
+        crop_w = max(1, int(round(image.shape[-1] * crop_fraction)))
+        top = max(0, (image.shape[-2] - crop_h) // 2)
+        left = max(0, (image.shape[-1] - crop_w) // 2)
+        image = image[..., top : top + crop_h, left : left + crop_w]
+
+    if tuple(image.shape[-2:]) != (target_h, target_w):
+        image = tv_functional.resize(
+            image, [target_h, target_w], interpolation=InterpolationMode.BICUBIC, antialias=True
+        )
+    return image
+
+
 @dataclass
 @ProcessorStepRegistry.register(name="groot_n1_7_pack_inputs_v1")
 class GrootN17PackInputsStep(ProcessorStep):
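The pad, resize, and center-crop arithmetic in `_transform_n1_7_image_for_vlm_torch` can be traced without torch. The following is a minimal pure-Python sketch, not code from this change: `plan_vlm_geometry` is a hypothetical helper, and the 480x640 frame, 256 target, and 230 crop are made-up illustration values (the real `N1_7_DEFAULT_*` constants are not shown in this diff).

```python
def plan_vlm_geometry(height, width, target_size, crop_size=None,
                      shortest_edge=None, crop_fraction=None):
    """Hypothetical helper: return the pad/resize/crop plan for one frame."""
    target_h, target_w = target_size

    # 1) Pad the shorter side symmetrically so the frame becomes square.
    square_edge = max(height, width)
    left = (square_edge - width) // 2
    top = (square_edge - height) // 2
    padding = [left, top, square_edge - width - left, square_edge - height - top]

    # 2) Resize the square to shortest_edge, or to target_h when it is unset.
    resize_edge = shortest_edge or target_h

    # 3) Derive the crop fraction from the crop size when it is not given.
    if crop_fraction is None and crop_size is not None:
        crop_fraction = crop_size[0] / float(target_h)

    crop_box = None
    edge = resize_edge
    if crop_fraction is not None and 0.0 < crop_fraction < 1.0:
        crop = max(1, int(round(resize_edge * crop_fraction)))  # round the size
        offset = max(0, (resize_edge - crop) // 2)              # floor the offset
        crop_box = (offset, offset, crop, crop)  # (top, left, height, width)
        edge = crop

    # 4) A final resize is only needed when the crop changed the size.
    needs_final_resize = (edge, edge) != (target_h, target_w)
    return {
        "padding": padding,
        "resize_edge": resize_edge,
        "crop_box": crop_box,
        "needs_final_resize": needs_final_resize,
    }


# Illustration values only: a 480x640 camera frame, 256 target, 230 crop.
plan = plan_vlm_geometry(480, 640, target_size=[256, 256], crop_size=[230, 230])
print(plan)
```

The plan keeps the offset on floor division, which is the detail the hunk's comment calls out as the difference from torchvision's center crop.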
@@ -1008,8 +1195,18 @@ class GrootN17PackInputsStep(ProcessorStep):
     video_modality_keys: list[str] | None = None
     raw_stats: dict[str, Any] | None = None
     modality_config: dict[str, Any] | None = None
+    # Opt-in relative-action support: convert absolute->relative actions inside this pack step
+    # (training) using the cached raw reference state, keeping excluded joints (e.g. gripper)
+    # absolute. The paired GrootActionUnpackUnnormalizeStep reconstructs absolute on decode.
+    use_relative_actions: bool = False
+    relative_exclude_joints: list[str] = field(default_factory=list)
+    action_feature_names: list[str] | None = None
     _last_raw_state: dict[str, np.ndarray] | None = field(default=None, init=False, repr=False)
+    _last_reference_state: torch.Tensor | None = field(default=None, init=False, repr=False)
     _warned_image_keys: bool = field(default=False, init=False, repr=False)
+    _ref_holder: "_GrootRelativeRefHolder" = field(
+        default_factory=_GrootRelativeRefHolder, init=False, repr=False
+    )

     def _ordered_image_keys(self, obs: dict[str, Any]) -> list[str]:
         available = {key for key in obs if key.startswith(OBS_IMAGES)}
@@ -1131,6 +1328,7 @@ class GrootN17PackInputsStep(ProcessorStep):
                 start_idx += dim
             if grouped:
                 self._last_raw_state = grouped
+                self._ref_holder.raw_state = grouped

         img_keys = self._ordered_image_keys(obs)
         if img_keys:
@@ -1150,6 +1348,9 @@ class GrootN17PackInputsStep(ProcessorStep):
             formalize_language=self.formalize_language,
         )

+        # Reference state for relative-action conversion (RAW, pre-normalization, (B, D)). Cached
+        # regardless of whether an action is present so inference caches it too for decode.
+        relative_reference_state: torch.Tensor | None = None
         if OBS_STATE in obs:
             state = obs[OBS_STATE]
             if state.dim() != 2:
@@ -1158,6 +1359,10 @@ class GrootN17PackInputsStep(ProcessorStep):
             if dim > self.max_state_dim:
                 raise ValueError(f"State dimension {dim} exceeds max_state_dim {self.max_state_dim}.")
             _cache_raw_state(state)
+            if self.use_relative_actions:
+                relative_reference_state = state.detach().clone()
+                self._last_reference_state = relative_reference_state
+                self._ref_holder.reference_state = relative_reference_state
             if self.normalize_min_max:
                 state = _min_max_norm(state, OBS_STATE)
             state = state.unsqueeze(1)
@@ -1180,6 +1385,19 @@ class GrootN17PackInputsStep(ProcessorStep):
                 raise ValueError(f"Action horizon {horizon} exceeds action_horizon {self.action_horizon}.")
             if dim > self.max_action_dim:
                 raise ValueError(f"Action dimension {dim} exceeds max_action_dim {self.max_action_dim}.")
+            # Convert absolute->relative BEFORE normalization. The mask keeps excluded joints (e.g.
+            # gripper) absolute; to_relative_actions broadcasts the (B, D) reference state over T.
+            if self.use_relative_actions:
+                if relative_reference_state is None:
+                    raise RuntimeError(
+                        "GrootN17PackInputsStep.use_relative_actions requires observation.state "
+                        "(OBS_STATE) to be present alongside the action to build the relative "
+                        "reference, but no state was found in this transition."
+                    )
+                mask = _build_relative_action_mask(
+                    action.shape[-1], self.relative_exclude_joints, self.action_feature_names
+                )
+                action = to_relative_actions(action, relative_reference_state, mask)
             if self.normalize_min_max:
                 flat = _min_max_norm(action.reshape(bsz * horizon, dim), ACTION)
                 action = flat.view(bsz, horizon, dim)
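The masked absolute-to-relative conversion in the hunk above can be illustrated with plain lists. This is a hedged sketch under stated assumptions: `build_relative_mask`, `to_relative`, and `to_absolute` are hypothetical stand-ins for `_build_relative_action_mask`, `to_relative_actions`, and `to_absolute_actions`, whose bodies are not shown in this diff. In particular, the substring match on joint names is an assumption; only the trailing `mask.extend(...)` padding is visible in the source. The joint names and values are made up.

```python
def build_relative_mask(action_dim, exclude_joints, feature_names):
    """True = convert this dimension to a delta, False = keep it absolute."""
    if not exclude_joints or feature_names is None:
        return [True] * action_dim
    tokens = [token.lower() for token in exclude_joints]
    # Assumption: a joint is excluded when any token is a substring of its name.
    mask = [
        not any(token in name.lower() for token in tokens)
        for name in feature_names[:action_dim]
    ]
    # Dimensions without a name default to relative, as in the visible tail.
    mask.extend([True] * (action_dim - len(mask)))
    return mask


def to_relative(actions, reference, mask):
    """Subtract the (D,) reference from every (T, D) row where the mask is True."""
    return [
        [a - r if m else a for a, r, m in zip(row, reference, mask)]
        for row in actions
    ]


def to_absolute(deltas, reference, mask):
    """Inverse of to_relative: add the reference back on masked dimensions."""
    return [
        [d + r if m else d for d, r, m in zip(row, reference, mask)]
        for row in deltas
    ]


names = ["shoulder.pos", "elbow.pos", "gripper.pos"]  # made-up joint names
mask = build_relative_mask(3, ["gripper"], names)
reference = [0.5, 1.0, 0.25]                          # raw state at chunk start
actions = [[0.75, 1.5, 1.0], [1.0, 2.0, 0.0]]         # (T=2, D=3) absolute targets

relative = to_relative(actions, reference, mask)
restored = to_absolute(relative, reference, mask)
print(mask, relative, restored)
```

The gripper column passes through unchanged in both directions, so the round trip only holds if encode and decode use the same reference and the same mask.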
@@ -1219,6 +1437,12 @@ class GrootN17PackInputsStep(ProcessorStep):
             comp["action_mask"] = action_mask
         comp["embodiment_id"] = torch.full((bsz,), emb_id, dtype=torch.int32, device=device)

+        # Publish the runtime-only reference holder so the policy can freeze it at the predict
+        # event and the decode step can read the frozen reference. It rides in COMPLEMENTARY_DATA,
+        # survives the VLM-encode step and DeviceProcessorStep as a non-tensor, and reaches the
+        # policy via the batch (by object identity) through the pipeline's shallow copies.
+        comp[_GROOT_REF_HOLDER_KEY] = self._ref_holder
+
         transition[TransitionKey.OBSERVATION] = obs
         transition[TransitionKey.COMPLEMENTARY_DATA] = comp
         return transition
@@ -1243,6 +1467,9 @@ class GrootN17PackInputsStep(ProcessorStep):
             "video_modality_keys": self.video_modality_keys,
             "raw_stats": self.raw_stats,
             "modality_config": self.modality_config,
+            "use_relative_actions": self.use_relative_actions,
+            "relative_exclude_joints": self.relative_exclude_joints,
+            "action_feature_names": self.action_feature_names,
         }

     def get_cached_raw_state(self) -> dict[str, np.ndarray] | None:
@@ -1250,6 +1477,23 @@ class GrootN17PackInputsStep(ProcessorStep):

         return self._last_raw_state

+    def get_cached_reference_state(self) -> torch.Tensor | None:
+        """Return the latest RAW (pre-normalization) (B, D) state used for relative-action conversion."""
+
+        return self._last_reference_state
+
+    def get_reference_holder(self) -> "_GrootRelativeRefHolder":
+        """Return the runtime-only holder shared with the policy (writer) and decode step (reader)."""
+
+        return self._ref_holder
+
+    def reset(self) -> None:
+        """Clear cached per-episode relative-action references (sync engine resets on episode boundaries)."""
+
+        self._last_reference_state = None
+        self._last_raw_state = None
+        self._ref_holder.clear()
+
     def state_dict(self) -> dict[str, torch.Tensor]:
         if not self.stats:
             return {}
@@ -1280,6 +1524,12 @@ class GrootN17VLMEncodeStep(ProcessorStep):
     The packed video has shape ``(B, T, V, H, W, C)``. Each frame/view becomes
     an image item in the same chat message so the resulting image tokens match
     the temporal VLM packing used by Isaac-GR00T.
+
+    Images are handed to the torchvision-backed Qwen3-VL processor as ``(C, H, W)``
+    uint8 tensors (no per-frame PIL roundtrip), and, when ``device`` resolves to a
+    CUDA device, the resize/rescale/normalize/patchify run there instead of on the
+    single CPU main-loop thread. This keeps the output bit-identical on CPU and
+    moves the dominant preprocessing cost off the critical path on GPU.
     """

     model_name: str = GROOT_N1_7_BACKBONE_MODEL
@@ -1288,6 +1538,7 @@ class GrootN17VLMEncodeStep(ProcessorStep):
     shortest_image_edge: int | None = None
     crop_fraction: float | None = None
     use_albumentations: bool = False
+    device: str | None = None
     _proc: ProcessorMixin | None = field(default=None, init=False, repr=False)

     @property
@@ -1296,6 +1547,70 @@ class GrootN17VLMEncodeStep(ProcessorStep):
             self._proc = _build_n1_7_processor(self.model_name)
         return self._proc

+    def _target_device(self) -> torch.device | None:
+        # The albumentations path is cv2/PIL only, so it cannot run on GPU.
+        if self.device is None or self.use_albumentations:
+            return None
+        try:
+            return get_safe_torch_device(self.device)
+        except (AssertionError, RuntimeError):
+            # A device serialized at train time (e.g. "cuda") may be unavailable
+            # when the processor is reloaded elsewhere (e.g. CPU-only eval), and
+            # this step is not in the standard device-override set. Fall back to
+            # the CPU path, which is bit-identical, instead of crashing.
+            return None
+
+    def _build_sample_images(
+        self, video: Any, batch_size: int, target_device: torch.device | None
+    ) -> list[list[Any]]:
+        """Return, per batch item, its ordered ``(timestep, view)`` frames.
+
+        ``use_albumentations`` keeps the legacy per-frame PIL/cv2 transform;
+        otherwise frames are ``(C, H, W)`` uint8 tensors (moved to
+        ``target_device`` when set) for the torchvision-backed Qwen processor.
+        """
+        if self.use_albumentations:
+            video_np = np.asarray(video)
+            return [
+                [
+                    _transform_n1_7_image_for_vlm(
+                        Image.fromarray(video_np[batch_idx, timestep, view_idx]),
+                        image_crop_size=self.image_crop_size,
+                        image_target_size=self.image_target_size,
+                        shortest_image_edge=self.shortest_image_edge,
+                        crop_fraction=self.crop_fraction,
+                        use_albumentations=True,
+                    )
+                    for timestep in range(video_np.shape[1])
+                    for view_idx in range(video_np.shape[2])
+                ]
+                for batch_idx in range(batch_size)
+            ]
+
+        video_t = video if torch.is_tensor(video) else torch.from_numpy(np.ascontiguousarray(video))
+        # (B, T, V, H, W, C) uint8 -> (B, T, V, C, H, W)
+        video_t = video_t.permute(0, 1, 2, 5, 3, 4).contiguous()
+        if target_device is not None and video_t.device != target_device:
+            video_t = video_t.to(target_device, non_blocking=(target_device.type == "cuda"))
+
+        frames_per_sample: list[list[Any]] = []
+        for batch_idx in range(batch_size):
+            sample = video_t[batch_idx]  # (T, V, C, H, W)
+            frames_per_sample.append(
+                [
+                    _transform_n1_7_image_for_vlm_torch(
+                        sample[timestep, view_idx],
+                        image_crop_size=self.image_crop_size,
+                        image_target_size=self.image_target_size,
+                        shortest_image_edge=self.shortest_image_edge,
+                        crop_fraction=self.crop_fraction,
+                    )
+                    for timestep in range(sample.shape[0])
+                    for view_idx in range(sample.shape[1])
+                ]
+            )
+        return frames_per_sample
+
     def __call__(self, transition: EnvTransition) -> EnvTransition:
         obs = transition.get(TransitionKey.OBSERVATION, {}) or {}
         comp = transition.get(TransitionKey.COMPLEMENTARY_DATA, {}) or {}
@@ -1303,33 +1618,25 @@ class GrootN17VLMEncodeStep(ProcessorStep):
         if video is None:
             return transition

+        batch_size = int(video.shape[0])
         languages = _prepare_n1_7_language_batch(
             comp.get("language"),
-            video.shape[0],
+            batch_size,
             formalize_language=False,
         )

+        target_device = self._target_device()
+        sample_images = self._build_sample_images(video, batch_size, target_device)
+
         texts: list[str] = []
-        images: list[Image.Image] = []
-        for batch_idx in range(video.shape[0]):
-            sample = video[batch_idx]  # (T, V, H, W, C)
-            sample_images = [
-                _transform_n1_7_image_for_vlm(
-                    Image.fromarray(sample[timestep, view_idx]),
-                    image_crop_size=self.image_crop_size,
-                    image_target_size=self.image_target_size,
-                    shortest_image_edge=self.shortest_image_edge,
-                    crop_fraction=self.crop_fraction,
-                    use_albumentations=self.use_albumentations,
-                )
-                for timestep in range(sample.shape[0])
-                for view_idx in range(sample.shape[1])
-            ]
+        images: list[Any] = []
+        for batch_idx in range(batch_size):
+            frames = sample_images[batch_idx]
             conversation = [
                 {
                     "role": "user",
                     "content": [
-                        *[{"type": "image", "image": image} for image in sample_images],
+                        *[{"type": "image", "image": image} for image in frames],
                         {"type": "text", "text": languages[batch_idx]},
                     ],
                 }
@@ -1341,9 +1648,17 @@ class GrootN17VLMEncodeStep(ProcessorStep):
                     add_generation_prompt=False,
                 )
             )
-            images.extend(sample_images)
+            images.extend(frames)

-        encoded = self.proc(text=texts, images=images, return_tensors="pt", padding=True)
+        proc_kwargs: dict[str, Any] = {
+            "text": texts,
+            "images": images,
+            "return_tensors": "pt",
+            "padding": True,
+        }
+        if target_device is not None:
+            proc_kwargs["device"] = str(target_device)
+        encoded = self.proc(**proc_kwargs)
         for key, value in encoded.items():
             comp[key] = value
         obs.pop("video", None)
@@ -1362,6 +1677,7 @@ class GrootN17VLMEncodeStep(ProcessorStep):
             "shortest_image_edge": self.shortest_image_edge,
             "crop_fraction": self.crop_fraction,
             "use_albumentations": self.use_albumentations,
+            "device": self.device,
         }


@@ -1574,7 +1890,14 @@ class GrootN17ActionDecodeStep(ProcessorStep):
             start_idx += dim

         if self.use_relative_action:
-            raw_state = self.pack_step.get_cached_raw_state() if self.pack_step is not None else None
+            # Prefer the raw state frozen at the chunk-prediction event (see the relative-action
+            # branch of GrootActionUnpackUnnormalizeStep). Falls back to the live cached raw state.
+            holder = self.pack_step.get_reference_holder() if self.pack_step is not None else None
+            raw_state = None
+            if holder is not None:
+                raw_state = holder.frozen_raw if holder.frozen_raw is not None else holder.raw_state
+            if raw_state is None and self.pack_step is not None:
+                raw_state = self.pack_step.get_cached_raw_state()
             if raw_state is None:
                 raise RuntimeError(
                     "GrootN17ActionDecodeStep requires the raw state cached by its connected "
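The frozen-then-live preference used by both decode steps can be sketched in isolation. The class below is a hypothetical stand-in for `_GrootRelativeRefHolder`, whose definition is not part of this section: only the `reference_state` and `frozen_reference` attributes and `clear()` appear in the diff, while the `freeze` method name and its body are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RefHolder:
    """Hypothetical stand-in for the diff's _GrootRelativeRefHolder."""

    reference_state: Optional[Any] = None   # live value, rewritten on every observation
    frozen_reference: Optional[Any] = None  # snapshot taken when a chunk is predicted

    def freeze(self) -> None:
        # Assumed name: the diff only says the policy freezes the holder at predict time.
        self.frozen_reference = self.reference_state

    def clear(self) -> None:
        self.reference_state = None
        self.frozen_reference = None


def pick_reference(holder: RefHolder) -> Optional[Any]:
    # Same preference order as the decode steps: frozen first, live as fallback.
    if holder.frozen_reference is not None:
        return holder.frozen_reference
    return holder.reference_state


holder = RefHolder()
holder.reference_state = [0.0, 0.0]    # observation at chunk start
live_only = pick_reference(holder)     # nothing frozen yet, so the live value is used
holder.freeze()                        # a chunk is predicted here
holder.reference_state = [0.1, 0.2]    # a later control tick overwrites the live value
during_chunk = pick_reference(holder)  # still the chunk-start state
holder.clear()                         # episode boundary
after_reset = pick_reference(holder)
print(live_only, during_chunk, after_reset)
```

Reading the frozen snapshot keeps every action popped from one chunk anchored to that chunk's start state, even though the pack step keeps overwriting the live value on each tick.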
@@ -1652,6 +1975,13 @@ class GrootActionUnpackUnnormalizeStep(ProcessorStep):
     clip_normalized_action: bool = False
     libero_gripper_action: bool = False
     libero_gripper_binarize: bool = True
+    # Opt-in relative-action reconstruction (paired with GrootN17PackInputsStep). After the
+    # min-max inverse, relative deltas (arm) + absolute gripper are converted back to absolute
+    # using the reference state cached by the linked pack_step (re-linked on reload).
+    use_relative_actions: bool = False
+    relative_exclude_joints: list[str] = field(default_factory=list)
+    action_feature_names: list[str] | None = None
+    pack_step: "GrootN17PackInputsStep | None" = field(default=None, repr=False)

     def __call__(self, transition: EnvTransition) -> EnvTransition:
         # Expect model outputs to be in TransitionKey.ACTION as (B, T, D_model)
@@ -1691,6 +2021,35 @@ class GrootActionUnpackUnnormalizeStep(ProcessorStep):
             inv = (action + 1.0) * 0.5 * safe_denom + min_v
             action = torch.where(mask, inv, min_v)

+        # Reconstruct absolute actions from relative deltas (arm) + absolute gripper, using the
+        # reference state cached by the linked pack step. The link is restored on reload by
+        # _reconnect_groot_n1_7_pack_decode_steps.
+        if self.use_relative_actions:
+            if self.pack_step is None:
+                raise RuntimeError(
+                    "GrootActionUnpackUnnormalizeStep.use_relative_actions requires a linked "
+                    "GrootN17PackInputsStep to read the cached reference state, but pack_step is None. "
+                    "Build both pipelines through make_groot_pre_post_processors (or load them together "
+                    "via make_groot_pre_post_processors_from_pretrained)."
+                )
+            # Prefer the reference frozen at the chunk-prediction event (set by
+            # GrootPolicy.predict_action_chunk via the shared holder) so every popped delta of a
+            # chunk reconstructs against that chunk's start state S_T, not the per-tick latest
+            # state. Falls back to the live reference when nothing was frozen (e.g. decode without
+            # a preceding predict event, or RTC/async where frozen == live).
+            holder = self.pack_step.get_reference_holder()
+            ref = holder.frozen_reference if holder.frozen_reference is not None else holder.reference_state
+            if ref is None:
+                raise RuntimeError(
+                    "GrootActionUnpackUnnormalizeStep.use_relative_actions requires the reference state "
+                    "cached by its connected GrootN17PackInputsStep to convert relative actions back to "
+                    "absolute. Run the preprocessor on an observation before decoding actions."
+                )
+            relative_mask = _build_relative_action_mask(
+                action.shape[-1], self.relative_exclude_joints, self.action_feature_names
+            )
+            action = to_absolute_actions(action, ref, relative_mask)
+
         if self.libero_gripper_action and action.shape[-1] >= 7:
             gripper = action[..., -1]
             if self.libero_gripper_binarize:
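The context lines at the top of the hunk above apply the min-max inverse before the relative-action reconstruction. A scalar sketch of that inverse and a matching forward pass follows. It rests on assumptions: `_min_max_norm` is not shown in this section, so the forward formula is inferred from the visible inverse `(action + 1.0) * 0.5 * safe_denom + min_v`, and treating `mask` as "the range is non-zero" is likewise an inference from `torch.where(mask, inv, min_v)`. The function names are hypothetical.

```python
def min_max_norm(x, lo, hi):
    """Assumed forward pass: map [lo, hi] onto [-1, 1]; a zero range maps to 0."""
    denom = hi - lo
    if denom == 0:
        return 0.0
    return 2.0 * (x - lo) / denom - 1.0


def min_max_inverse(y, lo, hi):
    """Inverse as in the diff: (y + 1) * 0.5 * denom + lo, or lo for a zero range."""
    denom = hi - lo
    if denom == 0:
        return lo
    return (y + 1.0) * 0.5 * denom + lo


lo, hi = -2.0, 6.0                         # made-up per-dimension statistics
normalized = min_max_norm(0.0, lo, hi)     # 2 * 2 / 8 - 1
restored = min_max_inverse(normalized, lo, hi)
degenerate = min_max_inverse(0.7, 3.0, 3.0)
print(normalized, restored, degenerate)
```

With relative actions enabled, the value restored here is still a delta for the masked joints, which is why the reconstruction against the reference state runs after this step rather than before it.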
@@ -1718,6 +2077,9 @@ class GrootActionUnpackUnnormalizeStep(ProcessorStep):
             "clip_normalized_action": self.clip_normalized_action,
             "libero_gripper_action": self.libero_gripper_action,
             "libero_gripper_binarize": self.libero_gripper_binarize,
+            "use_relative_actions": self.use_relative_actions,
+            "relative_exclude_joints": self.relative_exclude_joints,
+            "action_feature_names": self.action_feature_names,
         }

     def state_dict(self) -> dict[str, torch.Tensor]: