diff --git a/benchmarks/tokens/README.md b/benchmarks/tokens/README.md new file mode 100644 index 000000000..a864d7580 --- /dev/null +++ b/benchmarks/tokens/README.md @@ -0,0 +1,134 @@ +# Action tokenizer benchmark + +## Questions + +What is the trade-off between: + +- **Compression**: how many tokens are needed to represent an action chunk (e.g. horizon × action_dim floats)? +- **Reconstruction quality**: how well does encode-then-decode preserve the original actions? +- **Speed**: how long does encoding and decoding take per chunk? + +How to choose an action tokenizer? + +- Which tokenizer architecture (e.g. dct + BPE, DCT + BPE)? +- Which **action horizon** and **encoded dimensions** to use? +- Which **normalization** (QUANTILES, MEAN_STD, MIN_MAX) and **delta transform** (relative vs absolute actions)? +- How do reconstruction error and compression ratio vary across datasets and tokenizer settings? + +This benchmark loads action chunks from a LeRobot dataset using the same pipeline as `lerobot-train-tokenizer`, runs a trained action tokenizer in encode/decode mode, and reports reconstruction error, compression stats, and timing. Results are saved as JSON under `outputs/` for comparison and analysis. + +## Variables + +**Dataset & chunking** + +- **repo_id**: LeRobot dataset (e.g. `lerobot/pusht`). Action statistics and normalization are taken from the dataset metadata when available. +- **action_horizon**: Number of future steps per action chunk (must match the tokenizer’s training). +- **encoded_dims**: Dimension ranges to encode (e.g. `0:6` or `0:6,7:14`). Must match the tokenizer. +- **max_episodes**: Cap on episodes to load (default: all). +- **sample_fraction**: Fraction of chunks to sample per episode (default `0.2`) to keep runtime manageable. + +**Transform & normalization** + +- **normalization_mode**: `IDENTITY`, `MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`. Should match the tokenizer’s training. +- **delta_dims**: Comma-separated dimension indices for delta (relative) transform. +- **use_delta_transform**: Whether to convert actions to relative to current state for those dimensions. +- **state_key**: Dataset key for state (e.g. `observation.state`) used when applying delta transform. + +**Tokenizer & evaluation** + +- **action_tokenizer_path**: Path or HuggingFace repo id of the trained tokenizer (e.g. `outputs/wavetoken`). +- **max_chunks_for_reconstruction**: Max number of chunks to use for reconstruction and timing (default `500`) to limit runtime. + +### Main parameters + +| parameter | default | description | +| -------------------------------- | ---------------------------- | ------------------------------------------------ | +| **action_tokenizer_path** | (required) | Path or Hub id of the trained action tokenizer. | +| **repo_id** | (required) | LeRobot dataset repo id. | +| **action_horizon** | `10` | Future steps per chunk. | +| **encoded_dims** | `0:6` | Dimension ranges to encode (e.g. `0:6,7:14`). | +| **normalization_mode** | `QUANTILES` | Normalization mode for actions. | +| **max_episodes** | all | Max episodes to load. | +| **sample_fraction** | `0.2` | Fraction of chunks sampled per episode. | +| **max_chunks_for_reconstruction**| `500` | Chunks used for reconstruction and timing. | +| **output_dir** | `outputs/action_tokenizer_benchmark` | Directory for results JSON. | + +## Metrics + +**Reconstruction (lower is better)** + +- **reconstruction_mae**: Mean absolute error between original and decoded action chunks. +- **reconstruction_mse**: Mean squared error. +- **reconstruction_rmse**: Root mean squared error. +- **reconstruction_max_abs_error**: Maximum absolute error over all dimensions and samples. +- **per_dimension_mae**: MAE per action dimension (list of length `action_dim`). + +**Compression** + +- **compression_ratio**: Ratio (action_horizon × action_dim) / mean number of tokens. Higher means more compression. +- **mean_token_length**, **std_token_length**: Mean and standard deviation of token count per chunk. +- **min_token_length**, **max_token_length**: Min and max token count. +- **p50_token_length**, **p99_token_length**: 50th and 99th percentile token counts. + +**Timing (seconds per chunk)** + +- **mean_encode_time_sec**: Mean time to encode one chunk. +- **mean_decode_time_sec**: Mean time to decode one chunk. + +The JSON output also includes **num_chunks_evaluated** and **total_chunks_available** for context. + +## How the benchmark works + +1. **Load dataset**: LeRobot dataset is loaded for the given `repo_id` and `root`. +2. **Build action chunks**: For each episode (up to `max_episodes`), action chunks are built with the same logic as `lerobot-train-tokenizer`: sliding window of length `action_horizon`, optional delta transform, and per-episode sampling with `sample_fraction`. +3. **Extract and normalize**: Only `encoded_dims` are kept. Normalization is applied using the dataset’s action stats when available, according to `normalization_mode`. +4. **Encode / decode**: A random sample of chunks (size `max_chunks_for_reconstruction`) is encoded and then decoded with the tokenizer. Encode and decode times are recorded per chunk. +5. **Compute metrics**: Reconstruction metrics are computed between original and decoded chunks; compression and timing stats are aggregated. +6. **Save results**: A JSON file is written to `output_dir` with name `{timestamp}_{repo_id}_action_tokenizer_results.json`, containing the full config and all metrics. + +The pipeline (chunking, dimensions, normalization, delta) must match how the tokenizer was trained; otherwise reconstruction error can be large or the tokenizer may raise. + +## Caveats + +- The tokenizer’s **action_horizon** and **action_dim** (and optionally DCT settings) are fixed at training time. The benchmark infers dimensions from the dataset and encoded dims; the tokenizer path must correspond to a model trained with the same horizon and encoded dimensions. +- Reconstruction is evaluated in **normalized space** (the same space the tokenizer sees). For interpretation in raw action space, you would need to invert normalization outside this script. +- Only one tokenizer and one dataset are evaluated per run. To compare tokenizers or datasets, run the script multiple times and compare the saved JSON files. + +## Example + +Quick run with a local tokenizer and a small number of episodes: + +```bash +python benchmarks/tokens/run_action_tokenizer_benchmark.py \ + --action-tokenizer-path=outputs/wavetoken \ + --repo-id=lerobot/pusht \ + --action-horizon=10 \ + --max-episodes=50 \ + --output-dir=outputs/action_tokenizer_benchmark +``` + +With delta transform and custom encoded dimensions: + +```bash +python benchmarks/tokens/run_action_tokenizer_benchmark.py \ + --action-tokenizer-path=outputs/wavetoken \ + --repo-id=lerobot/pusht \ + --action-horizon=10 \ + --encoded-dims=0:6,7:14 \ + --delta-dims=0,1,2,3,4,5 \ + --use-delta-transform \ + --normalization-mode=QUANTILES \ + --max-chunks-for-reconstruction=500 \ + --output-dir=outputs/action_tokenizer_benchmark +``` + +Results are written to e.g. `outputs/action_tokenizer_benchmark/2026-02-12_14-30-00_lerobot_pusht_action_tokenizer_results.json`. + +## Results + +Results are stored as JSON in the directory given by `--output-dir` (default: `outputs/action_tokenizer_benchmark`). Each file contains: + +- **config**: All script arguments (tokenizer path, repo_id, action_horizon, encoded_dims, normalization_mode, etc.) for reproducibility. +- **metrics**: All reconstruction, compression, and timing metrics described above. + +To compare runs, load and diff or aggregate these JSON files with your own scripts or notebooks. diff --git a/benchmarks/tokens/run_action_tokenizer_benchmark.py b/benchmarks/tokens/run_action_tokenizer_benchmark.py new file mode 100644 index 000000000..e2b3c3ad7 --- /dev/null +++ b/benchmarks/tokens/run_action_tokenizer_benchmark.py @@ -0,0 +1,442 @@ +#!/usr/bin/env python +# Copyright 2026 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Benchmark action tokenization: reconstruction error, compression ratio, and timing. + +Loads action chunks from a LeRobot dataset, encodes/decodes them with a trained action +tokenizer, and reports: +- Reconstruction: MAE, MSE, RMSE, max absolute error, per-dimension MAE +- Jerk: mean absolute jerk (original and reconstructed), jerk reconstruction MAE +- Compression: ratio (input size / mean tokens), token length stats +- Timing: mean encode/decode time per chunk + +Results are saved to outputs/action_tokenizer_benchmark/_results.json. + +Example: + +```bash +python benchmarks/tokens/run_action_tokenizer_benchmark.py \ + --action-tokenizer-path=outputs/wavetoken \ + --repo-id=lerobot/pusht \ + --action-horizon=10 \ + --max-episodes=50 \ + --output-dir=outputs/action_tokenizer_benchmark +``` +""" + +import argparse +import json +import time +from pathlib import Path + +import numpy as np + +from lerobot.configs.types import NormalizationMode +from lerobot.datasets.lerobot_dataset import LeRobotDataset +from lerobot.utils.constants import ACTION, OBS_STATE + +# Optional: use same helpers as train script if we want to avoid duplication +from lerobot.scripts.lerobot_train_tokenizer import ( + apply_normalization, + process_episode, +) + + +def load_action_chunks( + repo_id: str, + root: str | None, + action_horizon: int, + max_episodes: int | None, + sample_fraction: float, + encoded_dims: str, + delta_dims: str | None, + use_delta_transform: bool, + state_key: str, + normalization_mode: NormalizationMode, +): + """Load and normalize action chunks from a LeRobot dataset (same pipeline as training).""" + dataset = LeRobotDataset(repo_id=repo_id, root=root) + num_episodes = dataset.num_episodes + if max_episodes is not None: + num_episodes = min(max_episodes, num_episodes) + + # Parse encoded dims + encoded_dim_ranges = [] + for range_str in encoded_dims.split(","): + start, end = map(int, range_str.strip().split(":")) + encoded_dim_ranges.append((start, end)) + total_encoded_dims = sum(end - start for start, end in encoded_dim_ranges) + + delta_dim_list = None + if delta_dims is not None and delta_dims.strip(): + delta_dim_list = [int(d.strip()) for d in delta_dims.split(",")] + + all_chunks = [] + for ep_idx in range(num_episodes): + chunks = process_episode( + ( + dataset, + ep_idx, + action_horizon, + delta_dim_list, + sample_fraction, + state_key, + use_delta_transform, + ) + ) + if chunks is not None: + all_chunks.append(chunks) + + if not all_chunks: + raise ValueError("No action chunks collected. Check action_horizon and dataset.") + + all_chunks = np.concatenate(all_chunks, axis=0) + + # Extract encoded dimensions only + encoded_chunks = [] + for start, end in encoded_dim_ranges: + encoded_chunks.append(all_chunks[:, :, start:end]) + encoded_chunks = np.concatenate(encoded_chunks, axis=-1) + + # Normalize + norm_stats = dataset.meta.stats + if norm_stats is not None and ACTION in norm_stats: + action_stats = norm_stats[ACTION] + encoded_dim_indices = [] + for start, end in encoded_dim_ranges: + encoded_dim_indices.extend(range(start, end)) + encoded_dim_indices = np.array(encoded_dim_indices) + encoded_stats = {} + for stat_name, stat_values in action_stats.items(): + if isinstance(stat_values, (list, np.ndarray)): + stat_array = np.array(stat_values) + if len(stat_array) > max(encoded_dim_indices): + encoded_stats[stat_name] = stat_array[encoded_dim_indices] + if encoded_stats: + try: + encoded_chunks = apply_normalization( + encoded_chunks, encoded_stats, normalization_mode, eps=1e-8 + ) + except ValueError: + pass + + return encoded_chunks, total_encoded_dims, action_horizon, dataset.repo_id + + +def compute_reconstruction_metrics(original: np.ndarray, reconstructed: np.ndarray): + """Compute reconstruction error metrics (original and reconstructed same shape [N, T, D]).""" + diff = reconstructed - original + mae = float(np.mean(np.abs(diff))) + mse = float(np.mean(diff**2)) + rmse = float(np.sqrt(mse)) + max_abs_err = float(np.max(np.abs(diff))) + + # Per-dimension MAE (over N and T) + per_dim_mae = np.mean(np.abs(diff), axis=(0, 1)) + per_dim_mae = per_dim_mae.tolist() + + return { + "reconstruction_mae": mae, + "reconstruction_mse": mse, + "reconstruction_rmse": rmse, + "reconstruction_max_abs_error": max_abs_err, + "per_dimension_mae": per_dim_mae, + } + + +def compute_jerk_metrics(original: np.ndarray, reconstructed: np.ndarray) -> dict: + """Compute jerk (3rd derivative of action w.r.t. time) metrics. + + Args: + original: Action chunks [N, T, D]. + reconstructed: Reconstructed action chunks [N, T, D]. + + Returns: + Dict with mean absolute jerk for original, reconstructed, and jerk reconstruction MAE. + """ + # Jerk = 3rd discrete difference along time axis; need T >= 4 + if original.shape[1] < 4: + return {} + jerk_orig = np.diff(original, n=3, axis=1) # (N, T-3, D) + jerk_recon = np.diff(reconstructed, n=3, axis=1) + mae_jerk_orig = float(np.mean(np.abs(jerk_orig))) + mae_jerk_recon = float(np.mean(np.abs(jerk_recon))) + jerk_reconstruction_mae = float(np.mean(np.abs(jerk_recon - jerk_orig))) + return { + "jerk_mae_original": mae_jerk_orig, + "jerk_mae_reconstructed": mae_jerk_recon, + "jerk_reconstruction_mae": jerk_reconstruction_mae, + } + + +def run_benchmark( + action_chunks: np.ndarray, + action_horizon: int, + action_dim: int, + tokenizer_path: str, + max_chunks_for_reconstruction: int | None = 500, +): + """Encode/decode action chunks and compute metrics.""" + from transformers import AutoProcessor + + processor = AutoProcessor.from_pretrained(tokenizer_path, trust_remote_code=True) + + n_chunks = len(action_chunks) + sample_size = n_chunks + if max_chunks_for_reconstruction is not None: + sample_size = min(max_chunks_for_reconstruction, n_chunks) + rng = np.random.RandomState(42) + indices = rng.choice(n_chunks, size=sample_size, replace=False) + sample_chunks = action_chunks[indices] + + # Encode + token_lengths = [] + encode_times = [] + all_tokens = [] + for i in range(len(sample_chunks)): + chunk = sample_chunks[i : i + 1] + t0 = time.perf_counter() + tokens = processor(chunk)[0] + encode_times.append(time.perf_counter() - t0) + if isinstance(tokens, list): + token_lengths.append(len(tokens)) + all_tokens.append(tokens) + else: + n = tokens.shape[0] if hasattr(tokens, "shape") else len(tokens) + token_lengths.append(n) + all_tokens.append(tokens.tolist() if hasattr(tokens, "tolist") else list(tokens)) + + # Decode (processor keeps time_horizon/action_dim from encode) + decoded_list = [] + decode_times = [] + for i, tok_list in enumerate(all_tokens): + t0 = time.perf_counter() + recon = processor.decode( + [tok_list], + time_horizon=action_horizon, + action_dim=action_dim, + ) + decode_times.append(time.perf_counter() - t0) + decoded_list.append(recon) + decoded = np.concatenate(decoded_list, axis=0) + + # Reconstruction metrics + metrics = compute_reconstruction_metrics(sample_chunks, decoded) + + # Jerk metrics (3rd derivative along time) + jerk_metrics = compute_jerk_metrics(sample_chunks, decoded) + metrics.update(jerk_metrics) + + # Compression + token_lengths = np.array(token_lengths) + input_size = action_horizon * action_dim + compression_ratio = input_size / float(np.mean(token_lengths)) + metrics["compression_ratio"] = compression_ratio + metrics["mean_token_length"] = float(np.mean(token_lengths)) + metrics["std_token_length"] = float(np.std(token_lengths)) + metrics["min_token_length"] = int(np.min(token_lengths)) + metrics["max_token_length"] = int(np.max(token_lengths)) + metrics["p50_token_length"] = float(np.percentile(token_lengths, 50)) + metrics["p99_token_length"] = float(np.percentile(token_lengths, 99)) + + # Timing (seconds per chunk) + metrics["mean_encode_time_sec"] = float(np.mean(encode_times)) + metrics["mean_decode_time_sec"] = float(np.mean(decode_times)) + metrics["num_chunks_evaluated"] = sample_size + metrics["total_chunks_available"] = n_chunks + + return metrics + + +def main( + action_tokenizer_path: str, + repo_id: str, + root: str | None = None, + action_horizon: int = 10, + max_episodes: int | None = 100, + sample_fraction: float = 0.2, + encoded_dims: str = "0:6", + delta_dims: str | None = None, + use_delta_transform: bool = False, + state_key: str = OBS_STATE, + normalization_mode: str = "QUANTILES", + max_chunks_for_reconstruction: int | None = 500, + output_dir: str | None = None, +): + if output_dir is None: + output_dir = "outputs/action_tokenizer_benchmark" + output_path = Path(output_dir) + output_path.mkdir(parents=True, exist_ok=True) + + try: + norm_mode = NormalizationMode(normalization_mode) + except ValueError: + norm_mode = NormalizationMode.QUANTILES + + print("Loading action chunks...") + encoded_chunks, action_dim, horizon, _ = load_action_chunks( + repo_id=repo_id, + root=root, + action_horizon=action_horizon, + max_episodes=max_episodes, + sample_fraction=sample_fraction, + encoded_dims=encoded_dims, + delta_dims=delta_dims, + use_delta_transform=use_delta_transform, + state_key=state_key, + normalization_mode=norm_mode, + ) + print(f"Loaded {len(encoded_chunks)} chunks, shape {encoded_chunks.shape} (H={horizon}, D={action_dim})") + + print("Running tokenizer benchmark...") + metrics = run_benchmark( + action_chunks=encoded_chunks, + action_horizon=horizon, + action_dim=action_dim, + tokenizer_path=action_tokenizer_path, + max_chunks_for_reconstruction=max_chunks_for_reconstruction, + ) + + # Attach config for reproducibility + results = { + "config": { + "action_tokenizer_path": action_tokenizer_path, + "repo_id": repo_id, + "action_horizon": action_horizon, + "max_episodes": max_episodes, + "sample_fraction": sample_fraction, + "encoded_dims": encoded_dims, + "delta_dims": delta_dims, + "use_delta_transform": use_delta_transform, + "state_key": state_key, + "normalization_mode": normalization_mode, + }, + "metrics": metrics, + } + + timestamp = time.strftime("%Y-%m-%d_%H-%M-%S") + safe_repo = repo_id.replace("/", "_") + out_file = output_path / f"{timestamp}_{safe_repo}_action_tokenizer_results.json" + with open(out_file, "w") as f: + json.dump(results, f, indent=2) + + print(f"Results saved to {out_file}") + print("Metrics:") + for k, v in metrics.items(): + if isinstance(v, list): + print(f" {k}: (length {len(v)})") + else: + print(f" {k}: {v}") + + return results + + +if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="Benchmark action tokenization (reconstruction error, compression, timing)." + ) + parser.add_argument( + "--action-tokenizer-path", + type=str, + required=True, + help="Path or HuggingFace repo id of the trained action tokenizer (e.g. outputs/wavetoken).", + ) + parser.add_argument( + "--repo-id", + type=str, + required=True, + help="LeRobot dataset repo id (e.g. lerobot/pusht).", + ) + parser.add_argument( + "--root", + type=str, + default=None, + help="Root directory for LeRobot datasets.", + ) + parser.add_argument( + "--action-horizon", + type=int, + default=10, + help="Number of future steps per action chunk.", + ) + parser.add_argument( + "--max-episodes", + type=int, + default=None, + help="Max episodes to use (default: all).", + ) + parser.add_argument( + "--sample-fraction", + type=float, + default=0.2, + help="Fraction of chunks to sample per episode.", + ) + parser.add_argument( + "--encoded-dims", + type=str, + default="0:6", + help="Dimension ranges to encode (e.g. 0:6,7:14).", + ) + parser.add_argument( + "--delta-dims", + type=str, + default=None, + help="Comma-separated dimensions for delta transform.", + ) + parser.add_argument( + "--use-delta-transform", + action="store_true", + help="Apply delta (relative) transform to specified dimensions.", + ) + parser.add_argument( + "--state-key", + type=str, + default=OBS_STATE, + help="Dataset key for state (for delta transform).", + ) + parser.add_argument( + "--normalization-mode", + type=str, + default="QUANTILES", + choices=[m.value for m in NormalizationMode], + help="Normalization mode for actions.", + ) + parser.add_argument( + "--max-chunks-for-reconstruction", + type=int, + default=500, + help="Max chunks to use for reconstruction metrics (default: 500).", + ) + parser.add_argument( + "--output-dir", + type=str, + default="outputs/action_tokenizer_benchmark", + help="Directory to save results JSON (default: outputs/action_tokenizer_benchmark).", + ) + args = parser.parse_args() + main( + action_tokenizer_path=args.action_tokenizer_path, + repo_id=args.repo_id, + root=args.root, + action_horizon=args.action_horizon, + max_episodes=args.max_episodes, + sample_fraction=args.sample_fraction, + encoded_dims=args.encoded_dims, + delta_dims=args.delta_dims, + use_delta_transform=args.use_delta_transform, + state_key=args.state_key, + normalization_mode=args.normalization_mode, + max_chunks_for_reconstruction=args.max_chunks_for_reconstruction, + output_dir=args.output_dir, + ) diff --git a/src/lerobot/scripts/lerobot_train_tokenizer.py b/src/lerobot/scripts/lerobot_train_tokenizer.py index 1d8f4644b..25dd44f42 100644 --- a/src/lerobot/scripts/lerobot_train_tokenizer.py +++ b/src/lerobot/scripts/lerobot_train_tokenizer.py @@ -166,9 +166,9 @@ def apply_normalization( if q01 is None or q99 is None: raise ValueError("QUANTILES mode requires 'q01' and 'q99' in stats") denom = np.maximum(q99 - q01, eps) - # Clip to quantile range then normalize to [-1, 1] - clipped = np.clip(data, q01, q99) - return 2.0 * (clipped - q01) / denom - 1.0 + # No clipping: match training pipeline NormalizerProcessorStep so tokenizer + # is fit on the full range of normalized values (including tails outside [-1, 1]). + return 2.0 * (data - q01) / denom - 1.0 if mode == NormalizationMode.QUANTILE10: q10 = stats.get("q10") @@ -176,9 +176,8 @@ def apply_normalization( if q10 is None or q90 is None: raise ValueError("QUANTILE10 mode requires 'q10' and 'q90' in stats") denom = np.maximum(q90 - q10, eps) - # Clip to quantile range then normalize to [-1, 1] - clipped = np.clip(data, q10, q90) - return 2.0 * (clipped - q10) / denom - 1.0 + # No clipping: match training pipeline NormalizerProcessorStep. + return 2.0 * (data - q10) / denom - 1.0 raise ValueError(f"Unsupported normalization mode: {mode}") @@ -306,7 +305,7 @@ def train_fast_tokenizer( # download the tokenizer source code (not pretrained weights) # we'll train a new tokenizer on our own data - base_tokenizer = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True) + base_tokenizer = AutoProcessor.from_pretrained("/fsx/jade_choghari/outputs/libero_tokenizer_wavetoken1", trust_remote_code=True) # convert action_chunks array to list of arrays (expected by .fit()) action_data_list = [action_chunks[i] for i in range(len(action_chunks))] @@ -320,6 +319,8 @@ def train_fast_tokenizer( vocab_size=vocab_size, time_horizon=action_chunks.shape[1], # action_horizon action_dim=action_chunks.shape[2], # encoded dimensions + wavelet="dmey", + level=1, ) print("✓ Tokenizer training complete!")