# Action tokenizer benchmark ## Questions What is the trade-off between: - **Compression**: how many tokens are needed to represent an action chunk (e.g. horizon × action_dim floats)? - **Reconstruction quality**: how well does encode-then-decode preserve the original actions? - **Speed**: how long does encoding and decoding take per chunk? How to choose an action tokenizer? - Which tokenizer architecture (e.g. dct + BPE, DCT + BPE)? - Which **action horizon** and **encoded dimensions** to use? - Which **normalization** (QUANTILES, MEAN_STD, MIN_MAX) and **delta transform** (relative vs absolute actions)? - How do reconstruction error and compression ratio vary across datasets and tokenizer settings? This benchmark loads action chunks from a LeRobot dataset using the same pipeline as `lerobot-train-tokenizer`, runs a trained action tokenizer in encode/decode mode, and reports reconstruction error, compression stats, and timing. Results are saved as JSON under `outputs/` for comparison and analysis. ## Variables **Dataset & chunking** - **repo_id**: LeRobot dataset (e.g. `lerobot/pusht`). Action statistics and normalization are taken from the dataset metadata when available. - **action_horizon**: Number of future steps per action chunk (must match the tokenizer’s training). - **encoded_dims**: Dimension ranges to encode (e.g. `0:6` or `0:6,7:14`). Must match the tokenizer. - **max_episodes**: Cap on episodes to load (default: all). - **sample_fraction**: Fraction of chunks to sample per episode (default `0.2`) to keep runtime manageable. **Transform & normalization** - **normalization_mode**: `IDENTITY`, `MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`. Should match the tokenizer’s training. - **delta_dims**: Comma-separated dimension indices for delta (relative) transform. - **use_delta_transform**: Whether to convert actions to relative to current state for those dimensions. - **state_key**: Dataset key for state (e.g. `observation.state`) used when applying delta transform. **Tokenizer & evaluation** - **action_tokenizer_path**: Path or HuggingFace repo id of the trained tokenizer (e.g. `outputs/wavetoken`). - **max_chunks_for_reconstruction**: Max number of chunks to use for reconstruction and timing (default `500`) to limit runtime. ### Main parameters | parameter | default | description | | -------------------------------- | ---------------------------- | ------------------------------------------------ | | **action_tokenizer_path** | (required) | Path or Hub id of the trained action tokenizer. | | **repo_id** | (required) | LeRobot dataset repo id. | | **action_horizon** | `10` | Future steps per chunk. | | **encoded_dims** | `0:6` | Dimension ranges to encode (e.g. `0:6,7:14`). | | **normalization_mode** | `QUANTILES` | Normalization mode for actions. | | **max_episodes** | all | Max episodes to load. | | **sample_fraction** | `0.2` | Fraction of chunks sampled per episode. | | **max_chunks_for_reconstruction**| `500` | Chunks used for reconstruction and timing. | | **output_dir** | `outputs/action_tokenizer_benchmark` | Directory for results JSON. | ## Metrics **Reconstruction (lower is better)** - **reconstruction_mae**: Mean absolute error between original and decoded action chunks. - **reconstruction_mse**: Mean squared error. - **reconstruction_rmse**: Root mean squared error. - **reconstruction_max_abs_error**: Maximum absolute error over all dimensions and samples. - **per_dimension_mae**: MAE per action dimension (list of length `action_dim`). **Compression** - **compression_ratio**: Ratio (action_horizon × action_dim) / mean number of tokens. Higher means more compression. - **mean_token_length**, **std_token_length**: Mean and standard deviation of token count per chunk. - **min_token_length**, **max_token_length**: Min and max token count. - **p50_token_length**, **p99_token_length**: 50th and 99th percentile token counts. **Timing (seconds per chunk)** - **mean_encode_time_sec**: Mean time to encode one chunk. - **mean_decode_time_sec**: Mean time to decode one chunk. The JSON output also includes **num_chunks_evaluated** and **total_chunks_available** for context. ## How the benchmark works 1. **Load dataset**: LeRobot dataset is loaded for the given `repo_id` and `root`. 2. **Build action chunks**: For each episode (up to `max_episodes`), action chunks are built with the same logic as `lerobot-train-tokenizer`: sliding window of length `action_horizon`, optional delta transform, and per-episode sampling with `sample_fraction`. 3. **Extract and normalize**: Only `encoded_dims` are kept. Normalization is applied using the dataset’s action stats when available, according to `normalization_mode`. 4. **Encode / decode**: A random sample of chunks (size `max_chunks_for_reconstruction`) is encoded and then decoded with the tokenizer. Encode and decode times are recorded per chunk. 5. **Compute metrics**: Reconstruction metrics are computed between original and decoded chunks; compression and timing stats are aggregated. 6. **Save results**: A JSON file is written to `output_dir` with name `{timestamp}_{repo_id}_action_tokenizer_results.json`, containing the full config and all metrics. The pipeline (chunking, dimensions, normalization, delta) must match how the tokenizer was trained; otherwise reconstruction error can be large or the tokenizer may raise. ## Caveats - The tokenizer’s **action_horizon** and **action_dim** (and optionally DCT settings) are fixed at training time. The benchmark infers dimensions from the dataset and encoded dims; the tokenizer path must correspond to a model trained with the same horizon and encoded dimensions. - Reconstruction is evaluated in **normalized space** (the same space the tokenizer sees). For interpretation in raw action space, you would need to invert normalization outside this script. - Only one tokenizer and one dataset are evaluated per run. To compare tokenizers or datasets, run the script multiple times and compare the saved JSON files. ## Example Quick run with a local tokenizer and a small number of episodes: ```bash python benchmarks/tokens/run_action_tokenizer_benchmark.py \ --action-tokenizer-path=outputs/wavetoken \ --repo-id=lerobot/pusht \ --action-horizon=10 \ --max-episodes=50 \ --output-dir=outputs/action_tokenizer_benchmark ``` With delta transform and custom encoded dimensions: ```bash python benchmarks/tokens/run_action_tokenizer_benchmark.py \ --action-tokenizer-path=outputs/wavetoken \ --repo-id=lerobot/pusht \ --action-horizon=10 \ --encoded-dims=0:6,7:14 \ --delta-dims=0,1,2,3,4,5 \ --use-delta-transform \ --normalization-mode=QUANTILES \ --max-chunks-for-reconstruction=500 \ --output-dir=outputs/action_tokenizer_benchmark ``` Results are written to e.g. `outputs/action_tokenizer_benchmark/2026-02-12_14-30-00_lerobot_pusht_action_tokenizer_results.json`. ## Results Results are stored as JSON in the directory given by `--output-dir` (default: `outputs/action_tokenizer_benchmark`). Each file contains: - **config**: All script arguments (tokenizer path, repo_id, action_horizon, encoded_dims, normalization_mode, etc.) for reproducibility. - **metrics**: All reconstruction, compression, and timing metrics described above. To compare runs, load and diff or aggregate these JSON files with your own scripts or notebooks.