mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-18 18:20:08 +00:00
135 lines
7.7 KiB
Markdown
135 lines
7.7 KiB
Markdown
# Action tokenizer benchmark
|
||
|
||
## Questions
|
||
|
||
What is the trade-off between:
|
||
|
||
- **Compression**: how many tokens are needed to represent an action chunk (e.g. horizon × action_dim floats)?
|
||
- **Reconstruction quality**: how well does encode-then-decode preserve the original actions?
|
||
- **Speed**: how long does encoding and decoding take per chunk?
|
||
|
||
How to choose an action tokenizer?
|
||
|
||
- Which tokenizer architecture (e.g. dct + BPE, DCT + BPE)?
|
||
- Which **action horizon** and **encoded dimensions** to use?
|
||
- Which **normalization** (QUANTILES, MEAN_STD, MIN_MAX) and **delta transform** (relative vs absolute actions)?
|
||
- How do reconstruction error and compression ratio vary across datasets and tokenizer settings?
|
||
|
||
This benchmark loads action chunks from a LeRobot dataset using the same pipeline as `lerobot-train-tokenizer`, runs a trained action tokenizer in encode/decode mode, and reports reconstruction error, compression stats, and timing. Results are saved as JSON under `outputs/` for comparison and analysis.
|
||
|
||
## Variables
|
||
|
||
**Dataset & chunking**
|
||
|
||
- **repo_id**: LeRobot dataset (e.g. `lerobot/pusht`). Action statistics and normalization are taken from the dataset metadata when available.
|
||
- **action_horizon**: Number of future steps per action chunk (must match the tokenizer’s training).
|
||
- **encoded_dims**: Dimension ranges to encode (e.g. `0:6` or `0:6,7:14`). Must match the tokenizer.
|
||
- **max_episodes**: Cap on episodes to load (default: all).
|
||
- **sample_fraction**: Fraction of chunks to sample per episode (default `0.2`) to keep runtime manageable.
|
||
|
||
**Transform & normalization**
|
||
|
||
- **normalization_mode**: `IDENTITY`, `MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`. Should match the tokenizer’s training.
|
||
- **delta_dims**: Comma-separated dimension indices for delta (relative) transform.
|
||
- **use_delta_transform**: Whether to convert actions to relative to current state for those dimensions.
|
||
- **state_key**: Dataset key for state (e.g. `observation.state`) used when applying delta transform.
|
||
|
||
**Tokenizer & evaluation**
|
||
|
||
- **action_tokenizer_path**: Path or HuggingFace repo id of the trained tokenizer (e.g. `outputs/wavetoken`).
|
||
- **max_chunks_for_reconstruction**: Max number of chunks to use for reconstruction and timing (default `500`) to limit runtime.
|
||
|
||
### Main parameters
|
||
|
||
| parameter | default | description |
|
||
| -------------------------------- | ---------------------------- | ------------------------------------------------ |
|
||
| **action_tokenizer_path** | (required) | Path or Hub id of the trained action tokenizer. |
|
||
| **repo_id** | (required) | LeRobot dataset repo id. |
|
||
| **action_horizon** | `10` | Future steps per chunk. |
|
||
| **encoded_dims** | `0:6` | Dimension ranges to encode (e.g. `0:6,7:14`). |
|
||
| **normalization_mode** | `QUANTILES` | Normalization mode for actions. |
|
||
| **max_episodes** | all | Max episodes to load. |
|
||
| **sample_fraction** | `0.2` | Fraction of chunks sampled per episode. |
|
||
| **max_chunks_for_reconstruction**| `500` | Chunks used for reconstruction and timing. |
|
||
| **output_dir** | `outputs/action_tokenizer_benchmark` | Directory for results JSON. |
|
||
|
||
## Metrics
|
||
|
||
**Reconstruction (lower is better)**
|
||
|
||
- **reconstruction_mae**: Mean absolute error between original and decoded action chunks.
|
||
- **reconstruction_mse**: Mean squared error.
|
||
- **reconstruction_rmse**: Root mean squared error.
|
||
- **reconstruction_max_abs_error**: Maximum absolute error over all dimensions and samples.
|
||
- **per_dimension_mae**: MAE per action dimension (list of length `action_dim`).
|
||
|
||
**Compression**
|
||
|
||
- **compression_ratio**: Ratio (action_horizon × action_dim) / mean number of tokens. Higher means more compression.
|
||
- **mean_token_length**, **std_token_length**: Mean and standard deviation of token count per chunk.
|
||
- **min_token_length**, **max_token_length**: Min and max token count.
|
||
- **p50_token_length**, **p99_token_length**: 50th and 99th percentile token counts.
|
||
|
||
**Timing (seconds per chunk)**
|
||
|
||
- **mean_encode_time_sec**: Mean time to encode one chunk.
|
||
- **mean_decode_time_sec**: Mean time to decode one chunk.
|
||
|
||
The JSON output also includes **num_chunks_evaluated** and **total_chunks_available** for context.
|
||
|
||
## How the benchmark works
|
||
|
||
1. **Load dataset**: LeRobot dataset is loaded for the given `repo_id` and `root`.
|
||
2. **Build action chunks**: For each episode (up to `max_episodes`), action chunks are built with the same logic as `lerobot-train-tokenizer`: sliding window of length `action_horizon`, optional delta transform, and per-episode sampling with `sample_fraction`.
|
||
3. **Extract and normalize**: Only `encoded_dims` are kept. Normalization is applied using the dataset’s action stats when available, according to `normalization_mode`.
|
||
4. **Encode / decode**: A random sample of chunks (size `max_chunks_for_reconstruction`) is encoded and then decoded with the tokenizer. Encode and decode times are recorded per chunk.
|
||
5. **Compute metrics**: Reconstruction metrics are computed between original and decoded chunks; compression and timing stats are aggregated.
|
||
6. **Save results**: A JSON file is written to `output_dir` with name `{timestamp}_{repo_id}_action_tokenizer_results.json`, containing the full config and all metrics.
|
||
|
||
The pipeline (chunking, dimensions, normalization, delta) must match how the tokenizer was trained; otherwise reconstruction error can be large or the tokenizer may raise.
|
||
|
||
## Caveats
|
||
|
||
- The tokenizer’s **action_horizon** and **action_dim** (and optionally DCT settings) are fixed at training time. The benchmark infers dimensions from the dataset and encoded dims; the tokenizer path must correspond to a model trained with the same horizon and encoded dimensions.
|
||
- Reconstruction is evaluated in **normalized space** (the same space the tokenizer sees). For interpretation in raw action space, you would need to invert normalization outside this script.
|
||
- Only one tokenizer and one dataset are evaluated per run. To compare tokenizers or datasets, run the script multiple times and compare the saved JSON files.
|
||
|
||
## Example
|
||
|
||
Quick run with a local tokenizer and a small number of episodes:
|
||
|
||
```bash
|
||
python benchmarks/tokens/run_action_tokenizer_benchmark.py \
|
||
--action-tokenizer-path=outputs/wavetoken \
|
||
--repo-id=lerobot/pusht \
|
||
--action-horizon=10 \
|
||
--max-episodes=50 \
|
||
--output-dir=outputs/action_tokenizer_benchmark
|
||
```
|
||
|
||
With delta transform and custom encoded dimensions:
|
||
|
||
```bash
|
||
python benchmarks/tokens/run_action_tokenizer_benchmark.py \
|
||
--action-tokenizer-path=outputs/wavetoken \
|
||
--repo-id=lerobot/pusht \
|
||
--action-horizon=10 \
|
||
--encoded-dims=0:6,7:14 \
|
||
--delta-dims=0,1,2,3,4,5 \
|
||
--use-delta-transform \
|
||
--normalization-mode=QUANTILES \
|
||
--max-chunks-for-reconstruction=500 \
|
||
--output-dir=outputs/action_tokenizer_benchmark
|
||
```
|
||
|
||
Results are written to e.g. `outputs/action_tokenizer_benchmark/2026-02-12_14-30-00_lerobot_pusht_action_tokenizer_results.json`.
|
||
|
||
## Results
|
||
|
||
Results are stored as JSON in the directory given by `--output-dir` (default: `outputs/action_tokenizer_benchmark`). Each file contains:
|
||
|
||
- **config**: All script arguments (tokenizer path, repo_id, action_horizon, encoded_dims, normalization_mode, etc.) for reproducibility.
|
||
- **metrics**: All reconstruction, compression, and timing metrics described above.
|
||
|
||
To compare runs, load and diff or aggregate these JSON files with your own scripts or notebooks.
|