# RTC Profiling Toolkit Complete toolkit for profiling Pi0 with RTC to identify performance bottlenecks. ## 📦 What's Included ### Scripts 1. **`eval_with_real_robot_profiled.py`** - Profiled version of the real robot eval script - Adds timing measurements throughout execution - Works with actual robot hardware - Same usage as original but with profiling output 2. **`profile_rtc_comparison.py`** - Side-by-side comparison of RTC vs no-RTC - No robot needed (uses mock observations) - Shows clear verdict on whether RTC is helping - Great for quick performance checks 3. **`profile_pi0_rtc_detailed.py`** - Most detailed profiling available - Can enable RTC method-level profiling - Provides insights and recommendations - Perfect for deep-dive investigations 4. **`add_rtc_profiling.py`** - Monkey-patching utility for RTC internals - Profiles individual RTC operations - Can be applied without modifying source - Shows exactly where RTC spends time ### Utilities 5. **`src/lerobot/utils/profiling.py`** - Core profiling utilities - Decorators for method profiling - Context managers for code blocks - Statistics collection and reporting ### Documentation 6. **`PROFILING_GUIDE.md`** - Comprehensive guide 7. **`PROFILING_QUICK_START.md`** - Quick reference ## 🚀 Quick Start ### Step 1: Compare Performance Run this first to see if RTC is actually slower: ```bash uv run examples/rtc/profile_rtc_comparison.py \ --policy_path=helper2424/pi05_check_rtc \ --device=mps \ --num_iterations=50 \ --execution_horizon=20 ``` **Expected output:** ``` COMPARISON SUMMARY ================================================================================ Metric Without RTC With RTC Difference -------------------------------------------------------------------------------- Mean time (ms) 150.23 165.45 +15.22 Throughput (iter/s) 6.66 6.05 -0.61 ================================================================================ VERDICT ✗ RTC is SLOWER by 10.1% Mean time increased by 15.22 ms Possible reasons: - RTC overhead exceeds benefits at current execution horizon - No torch.compile enabled ``` ### Step 2: Identify Bottleneck If RTC is slower, find out why: ```bash uv run examples/rtc/profile_pi0_rtc_detailed.py \ --policy_path=helper2424/pi05_check_rtc \ --device=mps \ --num_iterations=20 \ --execution_horizon=20 \ --enable_rtc_profiling ``` **Expected output:** ``` PROFILING SUMMARY ================================================================================ Function Count Mean (ms) Total (s) ------------------------------------------------------------------------------------ iteration.policy_inference 20 150.23 3.00 rtc.denoise_step.guidance_computation 200 15.67 3.13 rtc.denoise_step.autograd_correction 200 8.23 1.65 iteration.preprocessing 20 12.45 0.25 ================================================================================ KEY INSIGHTS ================================================================================ Time breakdown: Policy inference: 150.23 ms (87.2%) Preprocessing: 12.45 ms (7.2%) Postprocessing: 2.10 ms (1.2%) RTC breakdown: Base denoising: 120.45 ms Guidance compute: 15.67 ms Autograd correct: 8.23 ms RTC overhead: 23.90 ms (19.8% of base) Recommendations: ⚠ RTC autograd overhead is significant → This is expected, but consider increasing execution_horizon → Try torch.compile if not already enabled 💡 torch.compile not enabled → Try --use_torch_compile for potential speedup ================================================================================ ``` ### Step 3: Try Optimizations Based on recommendations: ```bash # Try with torch.compile uv run examples/rtc/profile_rtc_comparison.py \ --policy_path=helper2424/pi05_check_rtc \ --device=mps \ --num_iterations=50 \ --execution_horizon=20 \ --use_torch_compile # Try larger execution horizon uv run examples/rtc/profile_rtc_comparison.py \ --policy_path=helper2424/pi05_check_rtc \ --device=mps \ --num_iterations=50 \ --execution_horizon=30 ``` ### Step 4: Profile Real Robot (Optional) Test with actual hardware: ```bash uv run examples/rtc/eval_with_real_robot_profiled.py \ --policy.path=helper2424/pi05_check_rtc \ --policy.device=mps \ --rtc.enabled=true \ --rtc.execution_horizon=20 \ --robot.type=so100_follower \ --robot.port=/dev/tty.usbmodem58FA0834591 \ --robot.cameras="{...}" \ --task="Pick up object" \ --duration=30 ``` ## 🎯 Common Scenarios ### "RTC is 2x slower!" This usually means: - RTC overhead is high but not getting benefits - Need to enable torch.compile - Execution horizon too small - Inference delay not calculated correctly **Try:** 1. `--use_torch_compile` 2. Increase `--execution_horizon` to 30+ 3. Check inference_delay calculation ### "RTC is only slightly slower" This is expected! RTC overhead is about 10-30% typically. The benefit comes during **execution**, not single inference: - Actions are reused across chunks - Overall system latency is reduced - Robot gets smoother actions ### "Want to optimize specific part" Use the profiling utilities: ```python from lerobot.utils.profiling import enable_profiling, profile_section, print_profiling_summary enable_profiling() with profile_section("my_custom_operation"): # Your code here pass print_profiling_summary() ``` ## 📊 Understanding Results ### Key Metrics **Policy Inference Time** - Time for forward pass through model - Should be largest component (70-90%) - Includes RTC guidance if enabled **Preprocessing Time** - Image normalization, resizing - Should be < 20% of total - If high: reduce image resolution **RTC Guidance Overhead** - Extra time for RTC guidance computation - Typically 10-30% of base inference - If > 50%: RTC may not be beneficial at current settings **Autograd Correction** - Time computing gradients for RTC - Usually 5-15% of base inference - Can be reduced with torch.compile ### Expected Ranges (Apple Silicon MPS) | Metric | Good | Acceptable | Poor | |--------|------|------------|------| | Policy inference | 100-150ms | 150-250ms | >250ms | | Preprocessing | <20ms | 20-50ms | >50ms | | RTC overhead | 10-30% | 30-50% | >50% | ## 🔧 Optimization Guide ### If RTC overhead is too high: 1. **Enable compilation:** ```bash --use_torch_compile ``` Expected improvement: 20-40% faster 2. **Increase execution horizon:** ```bash --execution_horizon=30 # or higher ``` Amortizes RTC cost over more actions 3. **Check guidance weight:** ```python # In config rtc.max_guidance_weight=1.0 # try 0.5 for less overhead ``` ### If preprocessing is slow: 1. **Reduce image resolution:** ```python # In robot config cameras={ "gripper": {"width": 320, "height": 240} # instead of 640x480 } ``` 2. **Use fewer cameras:** - Profile which cameras are essential - Remove unnecessary views ### If inference is generally slow: 1. Use torch.compile (if not already) 2. Check device is correct (MPS vs CUDA) 3. Verify model is in eval mode 4. Check for unnecessary gradient tracking ## 🐛 Troubleshooting ### Empty profiling output ```python # Make sure to enable profiling! from lerobot.utils.profiling import enable_profiling enable_profiling() ``` ### Inconsistent timings - Run more iterations (50-100) - Check thermal throttling - Close background apps - Use `--warmup_iterations=10` ### Can't find bottleneck 1. Start with `profile_rtc_comparison.py` 2. Then run `profile_pi0_rtc_detailed.py --enable_rtc_profiling` 3. Compare with/without RTC 4. Check each component separately ## 📖 Full Documentation - **`PROFILING_GUIDE.md`** - Complete reference with examples - **`PROFILING_QUICK_START.md`** - Quick commands and tips ## 🤝 Getting Help If you're still experiencing issues: 1. Run comparison script and save output 2. Run detailed profiling and save output 3. Include: - Policy path - Device type - RTC settings (execution_horizon, etc.) - Hardware specs - Full profiling output ## 🎓 Learning More ### Profiling your own code: ```python from lerobot.utils.profiling import profile_method, enable_profiling enable_profiling() @profile_method def my_function(): # Automatically profiled pass ``` ### RTC internals: ```python from examples.rtc.add_rtc_profiling import monkey_patch_rtc_profiling enable_profiling() monkey_patch_rtc_profiling() # Now RTC methods are profiled policy.predict_action_chunk(...) ``` ## ✨ Next Steps 1. Run `profile_rtc_comparison.py` to establish baseline 2. Use `profile_pi0_rtc_detailed.py` to find bottlenecks 3. Apply optimizations (torch.compile, larger horizon) 4. Re-run comparison to verify improvements 5. Test with real robot using profiled version Happy profiling! 🚀