# RTC Profiling Toolkit

A complete toolkit for profiling Pi0 with RTC to identify performance bottlenecks.
## 📦 What's Included

### Scripts
- `eval_with_real_robot_profiled.py` - Profiled version of the real robot eval script
  - Adds timing measurements throughout execution
  - Works with actual robot hardware
  - Same usage as the original, but with profiling output
- `profile_rtc_comparison.py` - Side-by-side comparison of RTC vs. no RTC
  - No robot needed (uses mock observations)
  - Gives a clear verdict on whether RTC is helping
  - Great for quick performance checks
- `profile_pi0_rtc_detailed.py` - The most detailed profiling available
  - Can enable RTC method-level profiling
  - Provides insights and recommendations
  - Perfect for deep-dive investigations
- `add_rtc_profiling.py` - Monkey-patching utility for RTC internals
  - Profiles individual RTC operations
  - Can be applied without modifying source code
  - Shows exactly where RTC spends its time
### Utilities

- `src/lerobot/utils/profiling.py` - Core profiling utilities
  - Decorators for method profiling
  - Context managers for code blocks
  - Statistics collection and reporting
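To make the context-manager and statistics pieces concrete, here is a minimal sketch of how such a profiler can be built. This is an illustrative re-implementation, not the actual `lerobot` code; `profile_section` and `summary` here are stand-ins for the real utilities.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated timings (in seconds), keyed by section name.
_stats = defaultdict(list)

@contextmanager
def profile_section(name):
    """Record the wall-clock time of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _stats[name].append(time.perf_counter() - start)

def summary():
    """Return {name: (count, mean_ms, total_s)} for every recorded section."""
    return {
        name: (len(times), 1000 * sum(times) / len(times), sum(times))
        for name, times in _stats.items()
    }
```

The real utilities add a global enable/disable switch and formatted reporting on top of this pattern.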
### Documentation

- `PROFILING_GUIDE.md` - Comprehensive guide
- `PROFILING_QUICK_START.md` - Quick reference
## 🚀 Quick Start

### Step 1: Compare Performance

Run this first to see whether RTC is actually slower:
```bash
uv run examples/rtc/profile_rtc_comparison.py \
    --policy_path=helper2424/pi05_check_rtc \
    --device=mps \
    --num_iterations=50 \
    --execution_horizon=20
```
Expected output:

```
COMPARISON SUMMARY
================================================================================
Metric                   Without RTC      With RTC      Difference
--------------------------------------------------------------------------------
Mean time (ms)                150.23        165.45          +15.22
Throughput (iter/s)             6.66          6.05           -0.61
================================================================================

VERDICT

✗ RTC is SLOWER by 10.1%
  Mean time increased by 15.22 ms

Possible reasons:
  - RTC overhead exceeds benefits at the current execution horizon
  - torch.compile not enabled
```
### Step 2: Identify the Bottleneck

If RTC is slower, find out why:

```bash
uv run examples/rtc/profile_pi0_rtc_detailed.py \
    --policy_path=helper2424/pi05_check_rtc \
    --device=mps \
    --num_iterations=20 \
    --execution_horizon=20 \
    --enable_rtc_profiling
```
Expected output:

```
PROFILING SUMMARY
================================================================================
Function                                  Count    Mean (ms)    Total (s)
------------------------------------------------------------------------------------
iteration.policy_inference                   20       150.23         3.00
rtc.denoise_step.guidance_computation       200        15.67         3.13
rtc.denoise_step.autograd_correction        200         8.23         1.65
iteration.preprocessing                      20        12.45         0.25
================================================================================

KEY INSIGHTS
================================================================================
Time breakdown:
  Policy inference:  150.23 ms (87.2%)
  Preprocessing:      12.45 ms (7.2%)
  Postprocessing:      2.10 ms (1.2%)

RTC breakdown:
  Base denoising:    120.45 ms
  Guidance compute:   15.67 ms
  Autograd correct:    8.23 ms
  RTC overhead:       23.90 ms (19.8% of base)

Recommendations:
  ⚠ RTC autograd overhead is significant
    → This is expected, but consider increasing execution_horizon
    → Try torch.compile if not already enabled
  💡 torch.compile not enabled
    → Try --use_torch_compile for a potential speedup
================================================================================
```
### Step 3: Try Optimizations

Based on the recommendations:

```bash
# Try with torch.compile
uv run examples/rtc/profile_rtc_comparison.py \
    --policy_path=helper2424/pi05_check_rtc \
    --device=mps \
    --num_iterations=50 \
    --execution_horizon=20 \
    --use_torch_compile

# Try a larger execution horizon
uv run examples/rtc/profile_rtc_comparison.py \
    --policy_path=helper2424/pi05_check_rtc \
    --device=mps \
    --num_iterations=50 \
    --execution_horizon=30
```
### Step 4: Profile on a Real Robot (Optional)

Test with actual hardware:

```bash
uv run examples/rtc/eval_with_real_robot_profiled.py \
    --policy.path=helper2424/pi05_check_rtc \
    --policy.device=mps \
    --rtc.enabled=true \
    --rtc.execution_horizon=20 \
    --robot.type=so100_follower \
    --robot.port=/dev/tty.usbmodem58FA0834591 \
    --robot.cameras="{...}" \
    --task="Pick up object" \
    --duration=30
```
## 🎯 Common Scenarios

### "RTC is 2x slower!"

This usually means one of the following:

- RTC overhead is high without delivering its benefits
- torch.compile is not enabled
- The execution horizon is too small
- The inference delay is not calculated correctly

Try:

- Add `--use_torch_compile`
- Increase `--execution_horizon` to 30+
- Check the inference_delay calculation
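On the last point: RTC needs an accurate count of how many control steps elapse while one inference runs. A back-of-the-envelope sanity check looks like this (the function name and numbers are illustrative, not the toolkit's API):

```python
import math

def inference_delay_steps(inference_time_s: float, control_freq_hz: float) -> int:
    """Control steps that pass while a single inference is running."""
    return math.ceil(inference_time_s * control_freq_hz)

# e.g. 165 ms of inference at a 30 Hz control loop:
print(inference_delay_steps(0.165, 30))  # 5
```

If the delay RTC assumes differs much from this, the guidance will be correcting against the wrong actions.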
### "RTC is only slightly slower"

This is expected! RTC overhead is typically about 10-30%. The benefit comes during execution, not from a single inference:

- Actions are reused across chunks
- Overall system latency is reduced
- The robot receives smoother actions
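A quick amortization estimate shows why a modest per-inference overhead can still be acceptable: each inference yields `execution_horizon` actions, so the extra cost per executed action is small. The numbers below are the illustrative ones from the Step 1 comparison output:

```python
mean_no_rtc_ms = 150.23   # mean inference time without RTC
mean_rtc_ms = 165.45      # mean inference time with RTC
execution_horizon = 20    # actions executed per chunk

overhead_per_action_ms = (mean_rtc_ms - mean_no_rtc_ms) / execution_horizon
print(f"{overhead_per_action_ms:.2f} ms of extra cost per executed action")  # 0.76
```

Under a millisecond per action is usually negligible next to the smoothness and latency benefits.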
### "I want to optimize a specific part"

Use the profiling utilities:

```python
from lerobot.utils.profiling import enable_profiling, profile_section, print_profiling_summary

enable_profiling()

with profile_section("my_custom_operation"):
    # Your code here
    pass

print_profiling_summary()
```
## 📊 Understanding Results

### Key Metrics

**Policy inference time**
- Time for the forward pass through the model
- Should be the largest component (70-90%)
- Includes RTC guidance if enabled

**Preprocessing time**
- Image normalization and resizing
- Should be < 20% of the total
- If high: reduce image resolution

**RTC guidance overhead**
- Extra time for the RTC guidance computation
- Typically 10-30% of base inference
- If > 50%: RTC may not be beneficial at the current settings

**Autograd correction**
- Time spent computing gradients for RTC
- Usually 5-15% of base inference
- Can be reduced with torch.compile
### Expected Ranges (Apple Silicon MPS)

| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Policy inference | 100-150 ms | 150-250 ms | >250 ms |
| Preprocessing | <20 ms | 20-50 ms | >50 ms |
| RTC overhead | 10-30% | 30-50% | >50% |
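These thresholds can be applied mechanically. A small helper (hypothetical, using the upper bounds from the table above) that rates a measurement:

```python
def classify(value: float, good_max: float, acceptable_max: float) -> str:
    """Rate a measurement against 'good' and 'acceptable' upper bounds."""
    if value <= good_max:
        return "good"
    if value <= acceptable_max:
        return "acceptable"
    return "poor"

# Policy inference example: good up to 150 ms, acceptable up to 250 ms.
print(classify(165.45, good_max=150, acceptable_max=250))  # acceptable
```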
## 🔧 Optimization Guide

### If RTC overhead is too high

1. Enable compilation with `--use_torch_compile` (expected improvement: 20-40% faster)
2. Increase the execution horizon, e.g. `--execution_horizon=30` or higher, to amortize the RTC cost over more actions
3. Check the guidance weight in the config: `rtc.max_guidance_weight=1.0` (try 0.5 for less overhead)
### If preprocessing is slow

1. Reduce image resolution in the robot config, e.g. `cameras={"gripper": {"width": 320, "height": 240}}` instead of 640x480
2. Use fewer cameras:
   - Profile which cameras are essential
   - Remove unnecessary views
### If inference is generally slow

- Use torch.compile (if not already enabled)
- Check that the device is correct (MPS vs. CUDA)
- Verify the model is in eval mode
- Check for unnecessary gradient tracking
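The last two checks are easy to script. A minimal sketch in plain PyTorch, using a stand-in `nn.Linear` module rather than Pi0 itself:

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # stand-in for the policy network

# 1. Ensure eval mode (disables dropout and batch-norm statistics updates).
model.eval()
assert not model.training

# 2. Disable gradient tracking during inference to skip autograd bookkeeping.
with torch.inference_mode():
    out = model(torch.randn(1, 8))

assert not out.requires_grad
print(out.shape)  # torch.Size([1, 8])
```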
## 🐛 Troubleshooting

### Empty profiling output

Make sure to enable profiling:

```python
from lerobot.utils.profiling import enable_profiling

enable_profiling()
```

### Inconsistent timings

- Run more iterations (50-100)
- Check for thermal throttling
- Close background apps
- Use `--warmup_iterations=10`
### Can't find the bottleneck

- Start with `profile_rtc_comparison.py`
- Then run `profile_pi0_rtc_detailed.py --enable_rtc_profiling`
- Compare with and without RTC
- Check each component separately
## 📖 Full Documentation

- `PROFILING_GUIDE.md` - Complete reference with examples
- `PROFILING_QUICK_START.md` - Quick commands and tips
## 🤝 Getting Help

If you're still experiencing issues:

1. Run the comparison script and save the output
2. Run the detailed profiling and save the output
3. Include:
   - Policy path
   - Device type
   - RTC settings (execution_horizon, etc.)
   - Hardware specs
   - Full profiling output
## 🎓 Learning More

Profiling your own code:

```python
from lerobot.utils.profiling import profile_method, enable_profiling

enable_profiling()

@profile_method
def my_function():
    # Automatically profiled
    pass
```

RTC internals:

```python
from examples.rtc.add_rtc_profiling import monkey_patch_rtc_profiling
from lerobot.utils.profiling import enable_profiling

enable_profiling()
monkey_patch_rtc_profiling()

# Now RTC methods are profiled
policy.predict_action_chunk(...)
```
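The monkey-patching technique itself is plain Python: replace a method on the class with a wrapper that times the original. A generic sketch of the idea (illustrative; `patch_with_timing` and `Demo` are not part of the toolkit):

```python
import functools
import time

def patch_with_timing(cls, method_name, log):
    """Swap cls.method_name for a wrapper that appends (name, elapsed_ms) to log."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def timed(self, *args, **kwargs):
        start = time.perf_counter()
        result = original(self, *args, **kwargs)
        log.append((method_name, (time.perf_counter() - start) * 1000))
        return result

    setattr(cls, method_name, timed)

class Demo:
    def step(self):
        return 42

log = []
patch_with_timing(Demo, "step", log)
print(Demo().step(), log[0][0])  # 42 step
```

Because the patch only rebinds an attribute at runtime, it can be applied and removed without touching the library's source files.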
## ✨ Next Steps

1. Run `profile_rtc_comparison.py` to establish a baseline
2. Use `profile_pi0_rtc_detailed.py` to find bottlenecks
3. Apply optimizations (torch.compile, larger horizon)
4. Re-run the comparison to verify improvements
5. Test with the real robot using the profiled version

Happy profiling! 🚀