Files
lerobot/benchmarks/tokens/README.md
T
2026-02-13 10:27:02 +00:00

135 lines
7.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Action tokenizer benchmark
## Questions
What is the trade-off between:
- **Compression**: how many tokens are needed to represent an action chunk (e.g. horizon × action_dim floats)?
- **Reconstruction quality**: how well does encode-then-decode preserve the original actions?
- **Speed**: how long does encoding and decoding take per chunk?
How to choose an action tokenizer?
- Which tokenizer architecture (e.g. dct + BPE, DCT + BPE)?
- Which **action horizon** and **encoded dimensions** to use?
- Which **normalization** (QUANTILES, MEAN_STD, MIN_MAX) and **delta transform** (relative vs absolute actions)?
- How do reconstruction error and compression ratio vary across datasets and tokenizer settings?
This benchmark loads action chunks from a LeRobot dataset using the same pipeline as `lerobot-train-tokenizer`, runs a trained action tokenizer in encode/decode mode, and reports reconstruction error, compression stats, and timing. Results are saved as JSON under `outputs/` for comparison and analysis.
## Variables
**Dataset & chunking**
- **repo_id**: LeRobot dataset (e.g. `lerobot/pusht`). Action statistics and normalization are taken from the dataset metadata when available.
- **action_horizon**: Number of future steps per action chunk (must match the tokenizers training).
- **encoded_dims**: Dimension ranges to encode (e.g. `0:6` or `0:6,7:14`). Must match the tokenizer.
- **max_episodes**: Cap on episodes to load (default: all).
- **sample_fraction**: Fraction of chunks to sample per episode (default `0.2`) to keep runtime manageable.
**Transform & normalization**
- **normalization_mode**: `IDENTITY`, `MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`. Should match the tokenizers training.
- **delta_dims**: Comma-separated dimension indices for delta (relative) transform.
- **use_delta_transform**: Whether to convert actions to relative to current state for those dimensions.
- **state_key**: Dataset key for state (e.g. `observation.state`) used when applying delta transform.
**Tokenizer & evaluation**
- **action_tokenizer_path**: Path or HuggingFace repo id of the trained tokenizer (e.g. `outputs/wavetoken`).
- **max_chunks_for_reconstruction**: Max number of chunks to use for reconstruction and timing (default `500`) to limit runtime.
### Main parameters
| parameter | default | description |
| -------------------------------- | ---------------------------- | ------------------------------------------------ |
| **action_tokenizer_path** | (required) | Path or Hub id of the trained action tokenizer. |
| **repo_id** | (required) | LeRobot dataset repo id. |
| **action_horizon** | `10` | Future steps per chunk. |
| **encoded_dims** | `0:6` | Dimension ranges to encode (e.g. `0:6,7:14`). |
| **normalization_mode** | `QUANTILES` | Normalization mode for actions. |
| **max_episodes** | all | Max episodes to load. |
| **sample_fraction** | `0.2` | Fraction of chunks sampled per episode. |
| **max_chunks_for_reconstruction**| `500` | Chunks used for reconstruction and timing. |
| **output_dir** | `outputs/action_tokenizer_benchmark` | Directory for results JSON. |
## Metrics
**Reconstruction (lower is better)**
- **reconstruction_mae**: Mean absolute error between original and decoded action chunks.
- **reconstruction_mse**: Mean squared error.
- **reconstruction_rmse**: Root mean squared error.
- **reconstruction_max_abs_error**: Maximum absolute error over all dimensions and samples.
- **per_dimension_mae**: MAE per action dimension (list of length `action_dim`).
**Compression**
- **compression_ratio**: Ratio (action_horizon × action_dim) / mean number of tokens. Higher means more compression.
- **mean_token_length**, **std_token_length**: Mean and standard deviation of token count per chunk.
- **min_token_length**, **max_token_length**: Min and max token count.
- **p50_token_length**, **p99_token_length**: 50th and 99th percentile token counts.
**Timing (seconds per chunk)**
- **mean_encode_time_sec**: Mean time to encode one chunk.
- **mean_decode_time_sec**: Mean time to decode one chunk.
The JSON output also includes **num_chunks_evaluated** and **total_chunks_available** for context.
## How the benchmark works
1. **Load dataset**: LeRobot dataset is loaded for the given `repo_id` and `root`.
2. **Build action chunks**: For each episode (up to `max_episodes`), action chunks are built with the same logic as `lerobot-train-tokenizer`: sliding window of length `action_horizon`, optional delta transform, and per-episode sampling with `sample_fraction`.
3. **Extract and normalize**: Only `encoded_dims` are kept. Normalization is applied using the datasets action stats when available, according to `normalization_mode`.
4. **Encode / decode**: A random sample of chunks (size `max_chunks_for_reconstruction`) is encoded and then decoded with the tokenizer. Encode and decode times are recorded per chunk.
5. **Compute metrics**: Reconstruction metrics are computed between original and decoded chunks; compression and timing stats are aggregated.
6. **Save results**: A JSON file is written to `output_dir` with name `{timestamp}_{repo_id}_action_tokenizer_results.json`, containing the full config and all metrics.
The pipeline (chunking, dimensions, normalization, delta) must match how the tokenizer was trained; otherwise reconstruction error can be large or the tokenizer may raise.
## Caveats
- The tokenizers **action_horizon** and **action_dim** (and optionally DCT settings) are fixed at training time. The benchmark infers dimensions from the dataset and encoded dims; the tokenizer path must correspond to a model trained with the same horizon and encoded dimensions.
- Reconstruction is evaluated in **normalized space** (the same space the tokenizer sees). For interpretation in raw action space, you would need to invert normalization outside this script.
- Only one tokenizer and one dataset are evaluated per run. To compare tokenizers or datasets, run the script multiple times and compare the saved JSON files.
## Example
Quick run with a local tokenizer and a small number of episodes:
```bash
python benchmarks/tokens/run_action_tokenizer_benchmark.py \
--action-tokenizer-path=outputs/wavetoken \
--repo-id=lerobot/pusht \
--action-horizon=10 \
--max-episodes=50 \
--output-dir=outputs/action_tokenizer_benchmark
```
With delta transform and custom encoded dimensions:
```bash
python benchmarks/tokens/run_action_tokenizer_benchmark.py \
--action-tokenizer-path=outputs/wavetoken \
--repo-id=lerobot/pusht \
--action-horizon=10 \
--encoded-dims=0:6,7:14 \
--delta-dims=0,1,2,3,4,5 \
--use-delta-transform \
--normalization-mode=QUANTILES \
--max-chunks-for-reconstruction=500 \
--output-dir=outputs/action_tokenizer_benchmark
```
Results are written to e.g. `outputs/action_tokenizer_benchmark/2026-02-12_14-30-00_lerobot_pusht_action_tokenizer_results.json`.
## Results
Results are stored as JSON in the directory given by `--output-dir` (default: `outputs/action_tokenizer_benchmark`). Each file contains:
- **config**: All script arguments (tokenizer path, repo_id, action_horizon, encoded_dims, normalization_mode, etc.) for reproducibility.
- **metrics**: All reconstruction, compression, and timing metrics described above.
To compare runs, load and diff or aggregate these JSON files with your own scripts or notebooks.