Update README

2026-07-23 01:41:54 +00:00 · 2025-11-10 19:04:12 +07:00
parent 70d5ca387e
commit dd39d7a037
8 changed files with 321 additions and 1066 deletions
@@ -16,156 +16,161 @@ Real-Time Chunking addresses the challenge of maintaining consistency and reacti

 ## Scripts

-### 1. `real_time_chunking_evaluate.py`
+### 1. `eval_dataset.py`

-Real-time evaluation on physical robots or simulation environments.
+Offline evaluation on dataset samples with detailed visualization and validation.

 **Features:**

- Run policy with RTC on real robot or simulation
- Compare RTC vs non-RTC actions in real-time
- Multi-threaded action execution and inference
+- Compare RTC vs non-RTC predictions on two random dataset samples
+- Validate RTC behavior (delay region, blend region, post-horizon region)
+- Generate debug visualizations:
+  - Denoising step comparisons (x_t, v_t, x1_t, corrections)
+  - Final action predictions comparison
 - Support for torch.compile() optimization
+- Memory-efficient sequential policy loading for large models

 **Usage:**

 ```bash
-# With real robot
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --robot.type=so100 \
-    --task="pick up the cup"
+# Basic usage with SmolVLA policy
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=helper2424/smolvla_check_rtc_last3 \
+    --dataset.repo_id=helper2424/check_rtc \
+    --rtc.execution_horizon=8 \
+    --device=mps \
+    --rtc.max_guidance_weight=10.0 \
+    --seed=10

-# With simulation environment
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --env.type=pusht \
-    --duration=60.0
+# With Pi0.5 policy on CUDA
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=lerobot/pi05_libero_finetuned \
+    --dataset.repo_id=HuggingFaceVLA/libero \
+    --rtc.execution_horizon=8 \
+    --device=cuda

-# Disable verbose comparison (faster)
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --robot.type=so100 \
-    --verbose_rtc_comparison=false
+# With Pi0 policy
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=lerobot/pi0_libero_finetuned \
+    --dataset.repo_id=HuggingFaceVLA/libero \
+    --rtc.execution_horizon=8 \
+    --device=cuda

-# With policy compilation (CUDA only, not MPS)
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --robot.type=so100 \
-    --compile_policy=true \
-    --compile_mode=max-autotune
-```
+# With torch.compile for faster inference
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=helper2424/smolvla_check_rtc_last3 \
+    --dataset.repo_id=helper2424/check_rtc \
+    --rtc.execution_horizon=8 \
+    --device=cuda \
+    --use_torch_compile=true \
+    --torch_compile_mode=max-autotune

-**Key Parameters:**
-
- `--policy.path`: Path to pretrained policy
- `--robot.type` or `--env.type`: Robot or environment to use
- `--rtc.execution_horizon`: Number of steps to maintain consistency (default: 10)
- `--rtc.max_guidance_weight`: Maximum guidance weight (default: 1.0)
- `--rtc.prefix_attention_schedule`: Schedule type (ZEROS, ONES, LINEAR, EXP)
- `--verbose_rtc_comparison`: Enable detailed RTC comparison logging (default: true)
- `--duration`: How long to run (seconds, default: 30.0)
- `--fps`: Action execution frequency (Hz, default: 10.0)
-
-### 2. `evaluate_rtc_on_dataset.py`
-
-Offline evaluation on dataset samples to measure RTC effectiveness.
-
-**Features:**
-
- Evaluate RTC on dataset without running robot
- Compare RTC vs non-RTC predictions
- Measure consistency and ground truth alignment
- Simulate different inference delays
- Save detailed metrics to JSON
-
-**Usage:**
-
-```bash
-# Basic evaluation
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=100
-
-# Simulate inference delay (every 3rd step)
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=200 \
-    --skip_steps=3
-
-# Custom RTC configuration
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=100 \
-    --rtc.execution_horizon=12 \
-    --rtc.max_guidance_weight=5.0 \
-    --rtc.prefix_attention_schedule=LINEAR
-
-# Save results to file
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=100 \
-    --output_path=results/rtc_evaluation.json
-
-# Verbose mode with detailed logging
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=50 \
-    --verbose=true
+# Enable CUDA graphs (advanced - may cause tensor aliasing errors)
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=helper2424/smolvla_check_rtc_last3 \
+    --dataset.repo_id=helper2424/check_rtc \
+    --use_torch_compile=true \
+    --torch_compile_backend=inductor \
+    --torch_compile_mode=max-autotune \
+    --torch_compile_disable_cudagraphs=false
 ```

 **Key Parameters:**

 - `--policy.path`: Path to pretrained policy
 - `--dataset.repo_id`: Dataset to evaluate on
- `--num_iterations`: Number of samples to evaluate (default: 100)
- `--skip_steps`: Steps to skip between inferences, simulates inference delay (default: 1)
- `--start_episode`: Episode to start from (default: 0)
- `--output_path`: Path to save results JSON
- `--verbose`: Enable detailed per-sample logging
+- `--rtc.execution_horizon`: Number of steps to maintain consistency (default: 20)
+- `--rtc.max_guidance_weight`: Maximum guidance weight (default: 10.0)
+- `--rtc.prefix_attention_schedule`: Schedule type (ZEROS, ONES, LINEAR, EXP)
+- `--inference_delay`: Inference delay for RTC (default: 4)
+- `--seed`: Random seed for reproducibility (default: 42)
+- `--output_dir`: Directory to save visualizations (default: rtc_debug_output)
 - `--device`: Device to use (cuda, cpu, mps, auto)
+- `--use_torch_compile`: Enable torch.compile() for faster inference

-**Metrics Reported:**
+**Output:**

- **RTC vs Ground Truth MSE**: How close RTC predictions are to actual actions
- **No-RTC vs Ground Truth MSE**: Baseline without RTC
- **RTC Improvement**: Absolute and relative improvement over baseline
- **RTC Consistency**: How well RTC maintains consistency in prefix region
-  - Prefix MSE
-  - Mean/Max error in overlap region
+The script writes the following visualization files to the directory given by `--output_dir` (default: `rtc_debug_output/`):

-### 3. `run_dataset_evaluation.sh`
+- `denoising_xt_comparison.png` - Noisy state evolution during denoising
+- `denoising_vt_comparison.png` - Velocity predictions during denoising
+- `denoising_x1t_comparison.png` - Predicted final states during denoising
+- `denoising_correction_comparison.png` - RTC guidance corrections applied
+- `final_actions_comparison.png` - Final action predictions (prev_chunk, no_rtc, rtc)

-Convenience script with multiple evaluation scenarios.
+The script also validates RTC behavior and reports:
+
+- ✅ Delay region [0:inference_delay]: RTC = prev_chunk
+- ✅ Blend region [inference_delay:execution_horizon]: RTC lies between prev_chunk and no_rtc
+- ✅ Post-horizon [execution_horizon:]: RTC = no_rtc
+
+### 2. `eval_with_real_robot.py`
+
+Real-time evaluation on physical robots or simulation environments.
+
+**Features:**
+
+- Run policy with RTC on real robot or simulation
+- Multi-threaded action execution and inference
+- Action queue management with proper timing
+- Latency tracking and adaptive inference delay
+- Support for both robots and gym environments
+- Support for torch.compile() optimization

 **Usage:**

 ```bash
-# Edit the script to set your policy and dataset
-# Then run all examples:
-./examples/rtc/run_dataset_evaluation.sh
+# With real robot
+uv run python examples/rtc/eval_with_real_robot.py \
+    --policy.path=lerobot/smolvla_base \
+    --robot.type=so100 \
+    --task="pick up the cup" \
+    --duration=30.0

-# Or run individual examples from the script
+# With simulation environment
+uv run python examples/rtc/eval_with_real_robot.py \
+    --policy.path=lerobot/smolvla_base \
+    --env.type=pusht \
+    --duration=60.0
+
+# With policy compilation (CUDA only, not MPS)
+uv run python examples/rtc/eval_with_real_robot.py \
+    --policy.path=lerobot/smolvla_base \
+    --robot.type=so100 \
+    --use_torch_compile=true \
+    --torch_compile_mode=max-autotune
 ```

+**Key Parameters:**
+
+- `--policy.path`: Path to pretrained policy
+- `--robot.type` or `--env.type`: Robot or environment to use
+- `--task`: Task description (for VLA models)
+- `--rtc.execution_horizon`: Number of steps to maintain consistency (default: 10)
+- `--rtc.max_guidance_weight`: Maximum guidance weight (default: 1.0)
+- `--rtc.prefix_attention_schedule`: Schedule type (ZEROS, ONES, LINEAR, EXP)
+- `--duration`: How long to run (seconds, default: 30.0)
+- `--fps`: Action execution frequency (Hz, default: 10.0)
+- `--action_queue_size_to_get_new_actions`: Queue size threshold to request new actions (default: 30)
+- `--device`: Device to use (cuda, cpu, mps, auto)
+- `--use_torch_compile`: Enable torch.compile() for faster inference
+
 ## Understanding RTC Parameters

 ### `execution_horizon`

 Number of timesteps from previous chunk to maintain consistency with. Higher values mean more consistency but potentially less reactivity.

-**Typical values:** 8-12 steps
+**Typical values:** 8-12 steps for dataset evaluation, 10 steps for real-time execution

 ### `max_guidance_weight`

 Upper bound on guidance strength. Higher values give stronger consistency but may over-constrain new predictions.

-**Typical values:** 1.0-10.0
+**Typical values:**
+
+- Dataset evaluation: 10.0-100.0 (can be higher for analysis)
+- Real-time execution: 1.0-10.0 (more conservative)

 ### `prefix_attention_schedule`

@@ -178,104 +183,69 @@ How to weight consistency across the overlap region:

 **Recommended:** `EXP`

-### `skip_steps` (evaluation only)
+### `inference_delay`

-Simulates inference delay by evaluating every N-th step. This helps understand how RTC performs with realistic delays.
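+Number of timesteps at the start of a new chunk that are expected to elapse while inference is running. Within this region RTC keeps the actions equal to the previous chunk. In real-time execution the value is typically computed dynamically from measured inference latency; for dataset evaluation it is fixed.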

-**Example:** `skip_steps=3` means policy infers every 3 steps, simulating 3x action execution frequency vs inference frequency.
+**Typical values:** 3-5 steps for dataset evaluation
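
As a rough illustration of how a delay in steps could be derived from measured latency, the sketch below multiplies latency by the control frequency and rounds up. The helper name `inference_delay_steps` and the choice to use the worst recent latency are assumptions made for illustration; they are not taken from `eval_with_real_robot.py`.

```python
import math
from typing import Iterable


def inference_delay_steps(recent_latencies_s: Iterable[float], fps: float) -> int:
    """Convert measured inference latencies (seconds) into a delay in action steps.

    Hypothetical helper: uses the worst recent latency and rounds up, so the
    delay region covers every action executed while inference is running.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    worst = max(recent_latencies_s, default=0.0)
    if worst < 0:
        raise ValueError("latencies must be non-negative")
    return math.ceil(worst * fps)


# Example: inference takes up to 0.35 s while actions are executed at 10 Hz.
delay = inference_delay_steps([0.28, 0.35, 0.31], fps=10.0)
print(delay)
```

With these example numbers the delay comes to 4 steps, which falls inside the 3-5 step range quoted for dataset evaluation.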

-## Output Format (Dataset Evaluation)
+### `action_queue_size_to_get_new_actions` (real-time only)

-When using `--output_path`, results are saved in JSON format:
+Threshold for requesting new action chunks. Should be higher than `inference_delay + execution_horizon` to ensure smooth operation.

-```json
-{
-  "summary": {
-    "rtc_vs_ground_truth_mse": {
-      "mean": 0.00123,
-      "std": 0.00045,
-      "min": 0.00012,
-      "max": 0.00456
-    },
-    "improvement": {
-      "absolute": 0.00034,
-      "relative_percent": 12.5
-    },
-    ...
-  },
-  "config": {
-    "num_iterations": 100,
-    "skip_steps": 3,
-    "execution_horizon": 10,
-    ...
-  },
-  "detailed_results": [
-    {
-      "sample_idx": 0,
-      "rtc_vs_ground_truth_mse": 0.00112,
-      "no_rtc_vs_ground_truth_mse": 0.00145,
-      ...
-    },
-    ...
-  ]
-}
-```
+**Typical values:** 20-30 steps
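
The threshold rule above is simple arithmetic, sketched here as a small check. The function name `queue_margin` is hypothetical, and the numbers combine defaults quoted in this README (threshold 30, execution horizon 10) with the dataset-evaluation delay of 4 steps purely as an example.

```python
def queue_margin(queue_threshold: int, inference_delay: int, execution_horizon: int) -> int:
    """Steps of slack between the refill threshold and the RTC overlap window.

    Hypothetical helper: a positive result means the threshold is higher than
    inference_delay + execution_horizon, as the text recommends.
    """
    return queue_threshold - (inference_delay + execution_horizon)


# Example values only; substitute the settings used for your own run.
margin = queue_margin(queue_threshold=30, inference_delay=4, execution_horizon=10)
if margin <= 0:
    print("threshold too low: raise --action_queue_size_to_get_new_actions")
else:
    print(f"threshold leaves {margin} steps of slack")
```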
+
+## Validation Rules (Dataset Evaluation)
+
+The dataset evaluation script validates that RTC behavior matches expectations:
+
+1. **Delay Region [0:inference_delay]**: RTC actions should equal previous chunk
+   - Ensures consistency during the inference delay period
+
+2. **Blend Region [inference_delay:execution_horizon]**: RTC should be between prev_chunk and no_rtc
+   - Smooth transition from previous plan to new predictions
+
+3. **Post-Horizon [execution_horizon:]**: RTC should equal no_rtc
+   - Full adoption of new predictions after execution horizon
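
The three rules can be expressed as element-wise checks over the action chunks. The following is a minimal sketch under stated assumptions, not the validation code in `eval_dataset.py`: the function name `validate_rtc_regions`, the plain-list chunk representation, and the tolerance `tol=1e-3` are all hypothetical, and `prev_chunk` is assumed to cover at least `execution_horizon` steps.

```python
from typing import Dict, Sequence

Chunk = Sequence[Sequence[float]]  # timesteps x action dimensions


def validate_rtc_regions(
    prev_chunk: Chunk,
    no_rtc: Chunk,
    rtc: Chunk,
    inference_delay: int,
    execution_horizon: int,
    tol: float = 1e-3,  # assumed tolerance, not a value from the script
) -> Dict[str, bool]:
    """Check the delay, blend, and post-horizon rules (illustrative only)."""

    def close(a: float, b: float) -> bool:
        return abs(a - b) <= tol

    def between(x: float, a: float, b: float) -> bool:
        # Order-independent: x must lie within [min(a, b), max(a, b)] up to tol.
        return min(a, b) - tol <= x <= max(a, b) + tol

    dims = range(len(rtc[0]))
    # Rule 1: RTC equals the previous chunk while inference is still running.
    delay_ok = all(
        close(rtc[t][d], prev_chunk[t][d])
        for t in range(inference_delay)
        for d in dims
    )
    # Rule 2: RTC lies between the previous chunk and the unguided prediction.
    blend_ok = all(
        between(rtc[t][d], prev_chunk[t][d], no_rtc[t][d])
        for t in range(inference_delay, execution_horizon)
        for d in dims
    )
    # Rule 3: RTC equals the unguided prediction after the execution horizon.
    post_ok = all(
        close(rtc[t][d], no_rtc[t][d])
        for t in range(execution_horizon, len(rtc))
        for d in dims
    )
    return {"delay": delay_ok, "blend": blend_ok, "post_horizon": post_ok}


# Toy one-dimensional chunks: delay of 2 steps, execution horizon of 4 steps.
prev_chunk = [[1.0], [1.0], [1.0], [1.0]]
no_rtc = [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0]]
rtc = [[1.0], [1.0], [0.6], [0.3], [0.0], [0.0]]
result = validate_rtc_regions(prev_chunk, no_rtc, rtc, inference_delay=2, execution_horizon=4)
print(result)
```

The blend check is deliberately order-independent, because the previous chunk can be above or below the unguided prediction in any given action dimension.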

 ## Tips

-1. **Start with dataset evaluation** to understand RTC behavior before running on robot
-2. **Use verbose mode** for debugging unexpected behavior
+1. **Start with dataset evaluation** (`eval_dataset.py`) to understand RTC behavior and tune parameters before running on a robot
+2. **Use visualizations** to debug unexpected behavior: check the denoising steps and the final actions
 3. **Tune execution_horizon** based on your inference latency and action frequency
-4. **Monitor consistency metrics** - very low consistency might indicate execution_horizon is too small
+4. **Monitor validation output** - failures indicate potential implementation issues or misconfigured parameters
 5. **Compare different schedules** - EXP usually works best but LINEAR can be more interpretable

 ## Troubleshooting

-### High RTC vs No-RTC difference but no improvement
+### Validation fails in delay region

- Try reducing `max_guidance_weight`
- Check if `execution_horizon` is too large
+- Check that `prev_chunk_left_over` is properly passed to the policy
+- Verify RTC guidance is being applied during denoising
+- Look at denoising visualizations to see where guidance diverges

-### Poor consistency metrics
+### Validation fails in post-horizon region

- Increase `execution_horizon`
- Check that `skip_steps` is not larger than your action chunk size
- Verify episodes are being reset correctly
+- RTC and no_rtc may have been sampled from different initial noise; verify that the same noise is used for both when comparing
+- Check that weights are correctly zeroed out after execution horizon
+- Review prefix_attention_schedule visualization

-### RTC worse than No-RTC
+### Poor performance on real robot

- RTC may not help if inference is faster than action execution
- Try different `prefix_attention_schedule`
- Ensure `execution_horizon` matches your use case
+- Increase `action_queue_size_to_get_new_actions` if you see warnings
+- Reduce `max_guidance_weight` if the robot sticks too closely to the previous chunk and reacts slowly
+- Try different `prefix_attention_schedule` values
+- Enable torch.compile() for faster inference (CUDA only)

-## Examples Results
+### Memory issues with large models

-Example output from dataset evaluation:
-
-```
-================================================================================
-EVALUATION SUMMARY
-================================================================================
-
-Ground Truth Alignment:
-  RTC MSE:        0.001234 ± 0.000456
-  No-RTC MSE:     0.001567 ± 0.000512
-
-RTC Improvement:
-  Absolute:       0.000333
-  Relative:       21.23%
-
-RTC vs No-RTC Difference:
-  MSE:            0.000112 ± 0.000034
-
-RTC Consistency (Prefix Region):
-  MSE:            0.000089 ± 0.000023
-  Mean Error:     0.007654 ± 0.002341
-  Max Error:      0.023456 ± 0.008765
-```
+- The dataset evaluation script loads policies sequentially to minimize memory usage
+- For real-time execution, only one policy is loaded
+- Use smaller batch sizes if needed

 ## Related Documentation

 - [RTC Implementation](../../src/lerobot/policies/rtc/modeling_rtc.py)
 - [RTC Configuration](../../src/lerobot/policies/rtc/configuration_rtc.py)
+- [Action Queue](../../src/lerobot/policies/rtc/action_queue.py)
 - [Physical Intelligence Paper](https://www.physicalintelligence.company/download/real_time_chunking.pdf)