profile

2026-05-20 02:59:50 +00:00 · 2025-11-18 09:51:50 +01:00
parent 8847e75c55
commit c868777752
8 changed files with 2567 additions and 0 deletions
@@ -0,0 +1,352 @@
+# RTC Profiling Toolkit
+
+A toolkit for profiling Pi0 policies with RTC (Real-Time Chunking) and locating performance bottlenecks.
+
+## 📦 What's Included
+
+### Scripts
+
+1. **`eval_with_real_robot_profiled.py`**
+   - Profiled version of the real robot eval script
+   - Adds timing measurements throughout execution
+   - Works with actual robot hardware
+   - Same usage as original but with profiling output
+
+2. **`profile_rtc_comparison.py`**
+   - Side-by-side comparison of RTC vs no-RTC
+   - No robot needed (uses mock observations)
+   - Shows clear verdict on whether RTC is helping
+   - Great for quick performance checks
+
+3. **`profile_pi0_rtc_detailed.py`**
+   - Most detailed profiling available
+   - Can enable RTC method-level profiling
+   - Provides insights and recommendations
+   - Perfect for deep-dive investigations
+
+4. **`add_rtc_profiling.py`**
+   - Monkey-patching utility for RTC internals
+   - Profiles individual RTC operations
+   - Can be applied without modifying source
+   - Shows exactly where RTC spends time
+
+### Utilities
+
+5. **`src/lerobot/utils/profiling.py`**
+   - Core profiling utilities
+   - Decorators for method profiling
+   - Context managers for code blocks
+   - Statistics collection and reporting
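+
+For readers unfamiliar with this pattern, the sketch below shows one way a decorator, a context manager, and a statistics table can fit together. It uses only the standard library, and the names `timed_section`, `timed`, and `summarize` are hypothetical stand-ins for illustration, not the API of `profiling.py`.
+
+```python
+import time
+from collections import defaultdict
+from contextlib import contextmanager
+from functools import wraps
+
+# Hypothetical stand-ins for illustration; not the lerobot implementation.
+_timings = defaultdict(list)  # section name -> list of durations in seconds
+
+
+@contextmanager
+def timed_section(name):
+    """Record the wall-clock duration of the enclosed block under `name`."""
+    start = time.perf_counter()
+    try:
+        yield
+    finally:
+        _timings[name].append(time.perf_counter() - start)
+
+
+def timed(func):
+    """Decorator that records every call to `func` under the function's name."""
+    @wraps(func)
+    def wrapper(*args, **kwargs):
+        with timed_section(func.__name__):
+            return func(*args, **kwargs)
+    return wrapper
+
+
+def summarize():
+    """Return {name: (call count, mean in ms, total in s)} for all sections."""
+    return {
+        name: (len(d), 1000 * sum(d) / len(d), sum(d))
+        for name, d in _timings.items()
+    }
+
+
+@timed
+def fake_inference():
+    time.sleep(0.01)  # stand-in for a model forward pass
+
+
+for _ in range(3):
+    fake_inference()
+
+with timed_section("preprocessing"):
+    time.sleep(0.005)
+
+for name, (count, mean_ms, total_s) in summarize().items():
+    print(f"{name:<20} {count:>5} {mean_ms:>10.2f} {total_s:>8.3f}")
+```
+
+Recording in a `finally` clause means a section is still counted when the wrapped code raises, which keeps call counts consistent across failed iterations.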
+
+### Documentation
+
+6. **`PROFILING_GUIDE.md`** - Comprehensive guide
+7. **`PROFILING_QUICK_START.md`** - Quick reference
+
+## 🚀 Quick Start
+
+### Step 1: Compare Performance
+
+Run this first to see if RTC is actually slower:
+
+```bash
+uv run examples/rtc/profile_rtc_comparison.py \
+    --policy_path=helper2424/pi05_check_rtc \
+    --device=mps \
+    --num_iterations=50 \
+    --execution_horizon=20
+```
+
+**Example output** (numbers are illustrative and will vary with hardware):
+```
+COMPARISON SUMMARY
+================================================================================
+Metric                         Without RTC        With RTC      Difference
+--------------------------------------------------------------------------------
+Mean time (ms)                       150.23         165.45          +15.22
+Throughput (iter/s)                    6.66           6.05           -0.61
+================================================================================
+VERDICT
+✗ RTC is SLOWER by 10.1%
+  Mean time increased by 15.22 ms
+  
+  Possible reasons:
+  - RTC overhead exceeds benefits at current execution horizon
+  - No torch.compile enabled
+```
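+
+The verdict line follows from simple arithmetic on the two mean times. The sketch below reproduces it from the example numbers above; the `compare` helper is a hypothetical name and only assumes the script reports the relative change against the no-RTC baseline.
+
+```python
+def compare(mean_without_ms, mean_with_ms):
+    """Return (difference in ms, percent change) of RTC relative to the baseline."""
+    diff_ms = mean_with_ms - mean_without_ms
+    pct = 100.0 * diff_ms / mean_without_ms
+    return diff_ms, pct
+
+
+# Mean times taken from the example output above.
+diff_ms, pct = compare(150.23, 165.45)
+verdict = "SLOWER" if diff_ms > 0 else "FASTER"
+print(f"RTC is {verdict} by {abs(pct):.1f}% ({diff_ms:+.2f} ms)")
+# prints: RTC is SLOWER by 10.1% (+15.22 ms)
+```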
+
+### Step 2: Identify Bottleneck
+
+If RTC is slower, find out why:
+
+```bash
+uv run examples/rtc/profile_pi0_rtc_detailed.py \
+    --policy_path=helper2424/pi05_check_rtc \
+    --device=mps \
+    --num_iterations=20 \
+    --execution_horizon=20 \
+    --enable_rtc_profiling
+```
+
+**Example output** (numbers are illustrative and will vary with hardware):
+```
+PROFILING SUMMARY
+================================================================================
+Function                                             Count    Mean (ms)    Total (s)
+------------------------------------------------------------------------------------
+iteration.policy_inference                              20      150.23         3.00
+rtc.denoise_step.guidance_computation                  200       15.67         3.13
+rtc.denoise_step.autograd_correction                   200        8.23         1.65
+iteration.preprocessing                                 20       12.45         0.25
+================================================================================
+
+KEY INSIGHTS
+================================================================================
+Time breakdown:
+  Policy inference:  150.23 ms (87.2%)
+  Preprocessing:     12.45 ms (7.2%)
+  Postprocessing:    2.10 ms (1.2%)
+
+RTC breakdown:
+  Base denoising:    120.45 ms
+  Guidance compute:  15.67 ms
+  Autograd correct:  8.23 ms
+  RTC overhead:      23.90 ms (19.8% of base)
+
+Recommendations:
+  ⚠ RTC autograd overhead is significant
+    → This is expected, but consider increasing execution_horizon
+    → Try torch.compile if not already enabled
+  💡 torch.compile not enabled
+    → Try --use_torch_compile for potential speedup
+================================================================================
+```
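+
+The "RTC overhead" line is the sum of the guidance and autograd times, expressed as a share of base denoising. A minimal sketch of that calculation, using the example numbers above and the thresholds from the "Expected Ranges" table further down (the `rate_overhead` helper is hypothetical):
+
+```python
+# Values copied from the example "RTC breakdown" above.
+base_denoising_ms = 120.45
+guidance_ms = 15.67
+autograd_ms = 8.23
+
+overhead_ms = guidance_ms + autograd_ms
+overhead_pct = 100.0 * overhead_ms / base_denoising_ms
+
+
+def rate_overhead(pct):
+    """Classify the overhead with the 30% and 50% thresholds from this README."""
+    if pct <= 30:
+        return "good"
+    if pct <= 50:
+        return "acceptable"
+    return "poor"
+
+
+print(f"RTC overhead: {overhead_ms:.2f} ms "
+      f"({overhead_pct:.1f}% of base) -> {rate_overhead(overhead_pct)}")
+# prints: RTC overhead: 23.90 ms (19.8% of base) -> good
+```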
+
+### Step 3: Try Optimizations
+
+Based on recommendations:
+
+```bash
+# Try with torch.compile
+uv run examples/rtc/profile_rtc_comparison.py \
+    --policy_path=helper2424/pi05_check_rtc \
+    --device=mps \
+    --num_iterations=50 \
+    --execution_horizon=20 \
+    --use_torch_compile
+
+# Try larger execution horizon
+uv run examples/rtc/profile_rtc_comparison.py \
+    --policy_path=helper2424/pi05_check_rtc \
+    --device=mps \
+    --num_iterations=50 \
+    --execution_horizon=30
+```
+
+### Step 4: Profile Real Robot (Optional)
+
+Test with actual hardware:
+
+```bash
+uv run examples/rtc/eval_with_real_robot_profiled.py \
+    --policy.path=helper2424/pi05_check_rtc \
+    --policy.device=mps \
+    --rtc.enabled=true \
+    --rtc.execution_horizon=20 \
+    --robot.type=so100_follower \
+    --robot.port=/dev/tty.usbmodem58FA0834591 \
+    --robot.cameras="{...}" \
+    --task="Pick up object" \
+    --duration=30
+```
+
+## 🎯 Common Scenarios
+
+### "RTC is 2x slower!"
+
+This usually means one or more of the following:
+- RTC overhead is high and is not offset by its benefits
+- torch.compile is not enabled
+- The execution horizon is too small
+- The inference delay is calculated incorrectly
+
+**Try:**
+1. Add `--use_torch_compile`
+2. Increase `--execution_horizon` to 30 or more
+3. Check the `inference_delay` calculation
+
+### "RTC is only slightly slower"
+
+This is expected. RTC typically adds about 10-30% to a single inference call.
+The benefit shows up during **execution**, not in the timing of one inference:
+- Actions are reused across chunks
+- Overall system latency is reduced
+- The robot gets smoother actions
+
+### "Want to optimize a specific part"
+
+Use the profiling utilities:
+
+```python
+from lerobot.utils.profiling import enable_profiling, profile_section, print_profiling_summary
+
+enable_profiling()
+
+with profile_section("my_custom_operation"):
+    # Your code here
+    pass
+
+print_profiling_summary()
+```
+
+## 📊 Understanding Results
+
+### Key Metrics
+
+**Policy Inference Time**
+- Time for forward pass through model
+- Should be largest component (70-90%)
+- Includes RTC guidance if enabled
+
+**Preprocessing Time**
+- Image normalization, resizing
+- Should be < 20% of total
+- If high: reduce image resolution
+
+**RTC Guidance Overhead**
+- Extra time for RTC guidance computation
+- Typically 10-30% of base inference
+- If > 50%: RTC may not be beneficial at current settings
+
+**Autograd Correction**
+- Time computing gradients for RTC
+- Usually 5-15% of base inference
+- Can be reduced with torch.compile
+
+### Expected Ranges (Apple Silicon MPS)
+
+| Metric | Good | Acceptable | Poor |
+|--------|------|------------|------|
+| Policy inference | <150ms | 150-250ms | >250ms |
+| Preprocessing | <20ms | 20-50ms | >50ms |
+| RTC overhead | <30% | 30-50% | >50% |
+
+## 🔧 Optimization Guide
+
+### If RTC overhead is too high:
+
+1. **Enable compilation:**
+   ```bash
+   --use_torch_compile
+   ```
+   Expected improvement: roughly 20-40% faster, depending on device and backend; re-run the comparison to confirm
+
+2. **Increase execution horizon:**
+   ```bash
+   --execution_horizon=30  # or higher
+   ```
+   This amortizes the RTC cost of each inference over more executed actions
+
+3. **Check guidance weight:**
+   ```python
+   # In config
+   rtc.max_guidance_weight=1.0  # try 0.5, then re-profile to confirm any effect on overhead
+   ```
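+
+The amortization argument for a larger execution horizon can be made concrete with a little arithmetic. The sketch below assumes, for illustration only, that one inference call is made per `execution_horizon` executed actions and that the per-inference RTC overhead stays fixed at the 23.90 ms from the example output; both should be checked against your own measurements.
+
+```python
+def overhead_per_action_ms(overhead_ms, execution_horizon):
+    """RTC overhead attributed to each executed action, under the stated assumptions."""
+    if execution_horizon <= 0:
+        raise ValueError("execution_horizon must be positive")
+    return overhead_ms / execution_horizon
+
+
+# 23.90 ms is the example per-inference RTC overhead reported earlier.
+for horizon in (20, 30, 40):
+    per_action = overhead_per_action_ms(23.90, horizon)
+    print(f"execution_horizon={horizon}: {per_action:.3f} ms of overhead per action")
+```
+
+If the measured overhead grows with the horizon, the per-action saving will be smaller than this estimate suggests.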
+
+### If preprocessing is slow:
+
+1. **Reduce image resolution:**
+   ```python
+   # In robot config
+   cameras={
+       "gripper": {"width": 320, "height": 240}  # instead of 640x480
+   }
+   ```
+
+2. **Use fewer cameras:**
+   - Profile which cameras are essential
+   - Remove unnecessary views
+
+### If inference is generally slow:
+
+1. Use torch.compile (if not already enabled)
+2. Check that the device is correct (MPS vs CUDA)
+3. Verify that the model is in eval mode
+4. Check for unnecessary gradient tracking (RTC's autograd correction still needs gradients)
+
+## 🐛 Troubleshooting
+
+### Empty profiling output
+```python
+# Make sure to enable profiling!
+from lerobot.utils.profiling import enable_profiling
+enable_profiling()
+```
+
+### Inconsistent timings
+- Run more iterations (50-100)
+- Check thermal throttling
+- Close background apps
+- Use `--warmup_iterations=10`
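+
+To see why warmup and more iterations help, the sketch below times a stand-in workload after discarding warmup runs and reports the median and standard deviation alongside the mean. The `benchmark` helper and the `workload` function are hypothetical and use only the standard library. On GPU backends the device usually needs to be synchronized before reading the timer (for example with `torch.cuda.synchronize()` or `torch.mps.synchronize()`), which this CPU-only sketch omits.
+
+```python
+import statistics
+import time
+
+
+def benchmark(fn, warmup_iterations=10, num_iterations=50):
+    """Time `fn` after warmup and return summary statistics in milliseconds."""
+    for _ in range(warmup_iterations):
+        fn()  # warmup runs are executed but not recorded
+
+    samples_ms = []
+    for _ in range(num_iterations):
+        start = time.perf_counter()
+        fn()
+        samples_ms.append(1000 * (time.perf_counter() - start))
+
+    return {
+        "n": len(samples_ms),
+        "mean_ms": statistics.mean(samples_ms),
+        "median_ms": statistics.median(samples_ms),
+        "stdev_ms": statistics.stdev(samples_ms),
+    }
+
+
+def workload():
+    """Stand-in for one policy inference call."""
+    sum(i * i for i in range(10_000))
+
+
+stats = benchmark(workload)
+print(f"n={stats['n']} mean={stats['mean_ms']:.3f} ms "
+      f"median={stats['median_ms']:.3f} ms stdev={stats['stdev_ms']:.3f} ms")
+```
+
+A mean that sits well above the median usually points to a few slow outliers, such as throttling or background load, not a uniformly slow run.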
+
+### Can't find the bottleneck
+1. Start with `profile_rtc_comparison.py`
+2. Then run `profile_pi0_rtc_detailed.py --enable_rtc_profiling`
+3. Compare with/without RTC
+4. Check each component separately
+
+## 📖 Full Documentation
+
+- **`PROFILING_GUIDE.md`** - Complete reference with examples
+- **`PROFILING_QUICK_START.md`** - Quick commands and tips
+
+## 🤝 Getting Help
+
+If you're still experiencing issues:
+
+1. Run the comparison script and save the output
+2. Run the detailed profiling script and save the output
+3. Include:
+   - Policy path
+   - Device type
+   - RTC settings (execution_horizon, etc.)
+   - Hardware specs
+   - Full profiling output
+
+## 🎓 Learning More
+
+### Profiling your own code:
+
+```python
+from lerobot.utils.profiling import profile_method, enable_profiling
+
+enable_profiling()
+
+@profile_method
+def my_function():
+    # Automatically profiled
+    pass
+```
+
+### RTC internals:
+
+```python
+from examples.rtc.add_rtc_profiling import monkey_patch_rtc_profiling
+from lerobot.utils.profiling import enable_profiling
+
+enable_profiling()
+monkey_patch_rtc_profiling()
+
+# Now RTC methods are profiled (`policy` is an already loaded policy)
+policy.predict_action_chunk(...)
+```
+
+## ✨ Next Steps
+
+1. Run `profile_rtc_comparison.py` to establish a baseline
+2. Use `profile_pi0_rtc_detailed.py` to find bottlenecks
+3. Apply optimizations (torch.compile, larger execution horizon)
+4. Re-run the comparison to verify improvements
+5. Test with a real robot using `eval_with_real_robot_profiled.py`
+
+Happy profiling! 🚀
+