Update README

2026-05-15 08:39:49 +00:00 · 2025-11-10 19:04:12 +07:00
parent 9e92337f24
commit 433ccc9603
8 changed files with 321 additions and 1066 deletions
@@ -16,156 +16,161 @@ Real-Time Chunking addresses the challenge of maintaining consistency and reacti

 ## Scripts

-### 1. `real_time_chunking_evaluate.py`
+### 1. `eval_dataset.py`

-Real-time evaluation on physical robots or simulation environments.
+Offline evaluation on dataset samples with detailed visualization and validation.

 **Features:**

- Run policy with RTC on real robot or simulation
- Compare RTC vs non-RTC actions in real-time
- Multi-threaded action execution and inference
+- Compare RTC vs non-RTC predictions on two random dataset samples
+- Validate RTC behavior (delay region, blend region, post-horizon region)
+- Generate debug visualizations:
+  - Denoising step comparisons (x_t, v_t, x1_t, corrections)
+  - Final action predictions comparison
 - Support for torch.compile() optimization
+- Memory-efficient sequential policy loading for large models

 **Usage:**

 ```bash
-# With real robot
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --robot.type=so100 \
-    --task="pick up the cup"
+# Basic usage with SmolVLA policy
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=helper2424/smolvla_check_rtc_last3 \
+    --dataset.repo_id=helper2424/check_rtc \
+    --rtc.execution_horizon=8 \
+    --device=mps \
+    --rtc.max_guidance_weight=10.0 \
+    --seed=10

-# With simulation environment
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --env.type=pusht \
-    --duration=60.0
+# With Pi0.5 policy on CUDA
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=lerobot/pi05_libero_finetuned \
+    --dataset.repo_id=HuggingFaceVLA/libero \
+    --rtc.execution_horizon=8 \
+    --device=cuda

-# Disable verbose comparison (faster)
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --robot.type=so100 \
-    --verbose_rtc_comparison=false
+# With Pi0 policy
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=lerobot/pi0_libero_finetuned \
+    --dataset.repo_id=HuggingFaceVLA/libero \
+    --rtc.execution_horizon=8 \
+    --device=cuda

-# With policy compilation (CUDA only, not MPS)
-uv run python examples/rtc/real_time_chunking_evaluate.py \
-    --policy.path=lerobot/smolvla_base \
-    --robot.type=so100 \
-    --compile_policy=true \
-    --compile_mode=max-autotune
-```
+# With torch.compile for faster inference
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=helper2424/smolvla_check_rtc_last3 \
+    --dataset.repo_id=helper2424/check_rtc \
+    --rtc.execution_horizon=8 \
+    --device=cuda \
+    --use_torch_compile=true \
+    --torch_compile_mode=max-autotune

-**Key Parameters:**
-
- `--policy.path`: Path to pretrained policy
- `--robot.type` or `--env.type`: Robot or environment to use
- `--rtc.execution_horizon`: Number of steps to maintain consistency (default: 10)
- `--rtc.max_guidance_weight`: Maximum guidance weight (default: 1.0)
- `--rtc.prefix_attention_schedule`: Schedule type (ZEROS, ONES, LINEAR, EXP)
- `--verbose_rtc_comparison`: Enable detailed RTC comparison logging (default: true)
- `--duration`: How long to run (seconds, default: 30.0)
- `--fps`: Action execution frequency (Hz, default: 10.0)
-
-### 2. `evaluate_rtc_on_dataset.py`
-
-Offline evaluation on dataset samples to measure RTC effectiveness.
-
-**Features:**
-
- Evaluate RTC on dataset without running robot
- Compare RTC vs non-RTC predictions
- Measure consistency and ground truth alignment
- Simulate different inference delays
- Save detailed metrics to JSON
-
-**Usage:**
-
-```bash
-# Basic evaluation
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=100
-
-# Simulate inference delay (every 3rd step)
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=200 \
-    --skip_steps=3
-
-# Custom RTC configuration
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=100 \
-    --rtc.execution_horizon=12 \
-    --rtc.max_guidance_weight=5.0 \
-    --rtc.prefix_attention_schedule=LINEAR
-
-# Save results to file
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=100 \
-    --output_path=results/rtc_evaluation.json
-
-# Verbose mode with detailed logging
-uv run python examples/rtc/evaluate_rtc_on_dataset.py \
-    --policy.path=lerobot/smolvla_base \
-    --dataset.repo_id=lerobot/pusht \
-    --num_iterations=50 \
-    --verbose=true
+# Enable CUDA graphs (advanced - may cause tensor aliasing errors)
+uv run python examples/rtc/eval_dataset.py \
+    --policy.path=helper2424/smolvla_check_rtc_last3 \
+    --dataset.repo_id=helper2424/check_rtc \
+    --use_torch_compile=true \
+    --torch_compile_backend=inductor \
+    --torch_compile_mode=max-autotune \
+    --torch_compile_disable_cudagraphs=false
 ```

 **Key Parameters:**

 - `--policy.path`: Path to pretrained policy
 - `--dataset.repo_id`: Dataset to evaluate on
- `--num_iterations`: Number of samples to evaluate (default: 100)
- `--skip_steps`: Steps to skip between inferences, simulates inference delay (default: 1)
- `--start_episode`: Episode to start from (default: 0)
- `--output_path`: Path to save results JSON
- `--verbose`: Enable detailed per-sample logging
+- `--rtc.execution_horizon`: Number of steps to maintain consistency (default: 20)
+- `--rtc.max_guidance_weight`: Maximum guidance weight (default: 10.0)
+- `--rtc.prefix_attention_schedule`: Schedule type (ZEROS, ONES, LINEAR, EXP)
+- `--inference_delay`: Inference delay for RTC (default: 4)
+- `--seed`: Random seed for reproducibility (default: 42)
+- `--output_dir`: Directory to save visualizations (default: rtc_debug_output)
 - `--device`: Device to use (cuda, cpu, mps, auto)
+- `--use_torch_compile`: Enable torch.compile() for faster inference

-**Metrics Reported:**
+**Output:**

- **RTC vs Ground Truth MSE**: How close RTC predictions are to actual actions
- **No-RTC vs Ground Truth MSE**: Baseline without RTC
- **RTC Improvement**: Absolute and relative improvement over baseline
- **RTC Consistency**: How well RTC maintains consistency in prefix region
-  - Prefix MSE
-  - Mean/Max error in overlap region
+The script writes several visualization files to the directory given by `--output_dir` (default: `rtc_debug_output/`):

-### 3. `run_dataset_evaluation.sh`
+- `denoising_xt_comparison.png` - Noisy state evolution during denoising
+- `denoising_vt_comparison.png` - Velocity predictions during denoising
+- `denoising_x1t_comparison.png` - Predicted final states during denoising
+- `denoising_correction_comparison.png` - RTC guidance corrections applied
+- `final_actions_comparison.png` - Final action predictions (prev_chunk, no_rtc, rtc)

-Convenience script with multiple evaluation scenarios.
+The script also validates RTC behavior and reports:
+
+- ✅ Delay region [0:inference_delay]: RTC = prev_chunk
+- ✅ Blend region [inference_delay:execution_horizon]: RTC lies between prev_chunk and no_rtc (element-wise)
+- ✅ Post-horizon [execution_horizon:]: RTC = no_rtc
+
+### 2. `eval_with_real_robot.py`
+
+Real-time evaluation on physical robots or simulation environments.
+
+**Features:**
+
+- Run policy with RTC on real robot or simulation
+- Multi-threaded action execution and inference
+- Action queue management with proper timing
+- Latency tracking and adaptive inference delay
+- Support for both robots and gym environments
+- Support for torch.compile() optimization

 **Usage:**

 ```bash
-# Edit the script to set your policy and dataset
-# Then run all examples:
-./examples/rtc/run_dataset_evaluation.sh
+# With real robot
+uv run python examples/rtc/eval_with_real_robot.py \
+    --policy.path=lerobot/smolvla_base \
+    --robot.type=so100 \
+    --task="pick up the cup" \
+    --duration=30.0

-# Or run individual examples from the script
+# With simulation environment
+uv run python examples/rtc/eval_with_real_robot.py \
+    --policy.path=lerobot/smolvla_base \
+    --env.type=pusht \
+    --duration=60.0
+
+# With policy compilation (CUDA only, not MPS)
+uv run python examples/rtc/eval_with_real_robot.py \
+    --policy.path=lerobot/smolvla_base \
+    --robot.type=so100 \
+    --use_torch_compile=true \
+    --torch_compile_mode=max-autotune
 ```

+**Key Parameters:**
+
+- `--policy.path`: Path to pretrained policy
+- `--robot.type` or `--env.type`: Robot or environment to use
+- `--task`: Task description (for VLA models)
+- `--rtc.execution_horizon`: Number of steps to maintain consistency (default: 10)
+- `--rtc.max_guidance_weight`: Maximum guidance weight (default: 1.0)
+- `--rtc.prefix_attention_schedule`: Schedule type (ZEROS, ONES, LINEAR, EXP)
+- `--duration`: How long to run (seconds, default: 30.0)
+- `--fps`: Action execution frequency (Hz, default: 10.0)
+- `--action_queue_size_to_get_new_actions`: Queue size threshold to request new actions (default: 30)
+- `--device`: Device to use (cuda, cpu, mps, auto)
+- `--use_torch_compile`: Enable torch.compile() for faster inference
+
 ## Understanding RTC Parameters

 ### `execution_horizon`

 Number of timesteps from previous chunk to maintain consistency with. Higher values mean more consistency but potentially less reactivity.

-**Typical values:** 8-12 steps
+**Typical values:** 8-12 steps for dataset evaluation, 10 steps for real-time execution

 ### `max_guidance_weight`

 Upper bound on guidance strength. Higher values give stronger consistency but may over-constrain new predictions.

-**Typical values:** 1.0-10.0
+**Typical values:**
+
+- Dataset evaluation: 10.0-100.0 (can be higher for analysis)
+- Real-time execution: 1.0-10.0 (more conservative)

 ### `prefix_attention_schedule`

@@ -178,104 +183,69 @@ How to weight consistency across the overlap region:

 **Recommended:** `EXP`

-### `skip_steps` (evaluation only)
+### `inference_delay`

-Simulates inference delay by evaluating every N-th step. This helps understand how RTC performs with realistic delays.
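+Number of leading timesteps that elapse while inference runs; over these steps the RTC actions are expected to match the previous chunk. In real-time execution it is typically calculated dynamically from the measured inference latency, whereas dataset evaluation uses a fixed value.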

-**Example:** `skip_steps=3` means policy infers every 3 steps, simulating 3x action execution frequency vs inference frequency.
+**Typical values:** 3-5 steps for dataset evaluation

-## Output Format (Dataset Evaluation)
+### `action_queue_size_to_get_new_actions` (real-time only)

-When using `--output_path`, results are saved in JSON format:
+Threshold for requesting new action chunks. Should be higher than `inference_delay + execution_horizon` to ensure smooth operation.

-```json
-{
-  "summary": {
-    "rtc_vs_ground_truth_mse": {
-      "mean": 0.00123,
-      "std": 0.00045,
-      "min": 0.00012,
-      "max": 0.00456
-    },
-    "improvement": {
-      "absolute": 0.00034,
-      "relative_percent": 12.5
-    },
-    ...
-  },
-  "config": {
-    "num_iterations": 100,
-    "skip_steps": 3,
-    "execution_horizon": 10,
-    ...
-  },
-  "detailed_results": [
-    {
-      "sample_idx": 0,
-      "rtc_vs_ground_truth_mse": 0.00112,
-      "no_rtc_vs_ground_truth_mse": 0.00145,
-      ...
-    },
-    ...
-  ]
-}
-```
+**Typical values:** 20-30 steps
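
The sizing rule above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not code from the repository: the helper names `estimate_inference_delay` and `queue_threshold_is_safe` are hypothetical, and converting latency to steps as `ceil(latency * fps)` is an assumption about how a latency-based delay might be derived.

```python
import math


def estimate_inference_delay(latency_s: float, fps: float) -> int:
    """Hypothetical helper: action steps executed while one inference call runs."""
    # Assumption: the delay is the latency expressed in control steps, rounded up.
    return math.ceil(latency_s * fps)


def queue_threshold_is_safe(threshold: int, inference_delay: int, execution_horizon: int) -> bool:
    """Check the rule that the threshold exceeds inference_delay + execution_horizon."""
    return threshold > inference_delay + execution_horizon


if __name__ == "__main__":
    delay = estimate_inference_delay(latency_s=0.35, fps=10.0)
    print("inference_delay:", delay)
    print("threshold 30 safe:", queue_threshold_is_safe(30, delay, execution_horizon=10))
    print("threshold 12 safe:", queue_threshold_is_safe(12, delay, execution_horizon=10))
```

If the check fails for your measured latency, either raise `--action_queue_size_to_get_new_actions` or shorten `--rtc.execution_horizon`.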
+
+## Validation Rules (Dataset Evaluation)
+
+The dataset evaluation script validates that RTC behavior matches expectations:
+
+1. **Delay Region [0:inference_delay]**: RTC actions should equal previous chunk
+   - Ensures consistency during the inference delay period
+
+2. **Blend Region [inference_delay:execution_horizon]**: RTC should be between prev_chunk and no_rtc
+   - Smooth transition from previous plan to new predictions
+
+3. **Post-Horizon [execution_horizon:]**: RTC should equal no_rtc
+   - Full adoption of new predictions after execution horizon
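
The three rules can be illustrated on a one-dimensional trajectory. The following is a simplified sketch under stated assumptions, not the script's implementation: `validate_regions` is a hypothetical name, it works on plain Python lists rather than tensors, and it uses an absolute tolerance for readability where the script uses a relative one.

```python
def validate_regions(rtc, no_rtc, prev_chunk, inference_delay, execution_horizon, tol=1e-2):
    """Return pass/fail flags for the delay, blend, and post-horizon regions."""
    n = min(len(rtc), len(no_rtc), len(prev_chunk))
    delay_end = min(inference_delay, n)
    blend_end = min(execution_horizon, n)

    def close(a, b):
        # Assumption: simple absolute tolerance, chosen only for illustration.
        return abs(a - b) <= tol

    # Rule 1: during the delay, RTC should reproduce the previous chunk.
    delay_ok = all(close(rtc[i], prev_chunk[i]) for i in range(delay_end))
    # Rule 2: in the blend region, RTC lies between the two references element-wise.
    blend_ok = all(
        min(prev_chunk[i], no_rtc[i]) <= rtc[i] <= max(prev_chunk[i], no_rtc[i])
        for i in range(delay_end, blend_end)
    )
    # Rule 3: after the execution horizon, RTC should match the unguided prediction.
    post_ok = all(close(rtc[i], no_rtc[i]) for i in range(blend_end, n))
    return {"delay": delay_ok, "blend": blend_ok, "post_horizon": post_ok}


if __name__ == "__main__":
    prev = [1.0] * 6
    free = [0.0] * 6
    good = [1.0, 1.0, 0.75, 0.25, 0.0, 0.0]
    print(validate_regions(good, free, prev, inference_delay=2, execution_horizon=4))
```

Note that the blend check is symmetric: it does not matter whether `prev_chunk` or `no_rtc` is the larger value at a given step.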

 ## Tips

-1. **Start with dataset evaluation** to understand RTC behavior before running on robot
-2. **Use verbose mode** for debugging unexpected behavior
+1. **Start with dataset evaluation** (`eval_dataset.py`) to understand RTC behavior and tune parameters before running on robot
+2. **Use visualizations** to debug unexpected behavior - check denoising steps and final actions
 3. **Tune execution_horizon** based on your inference latency and action frequency
-4. **Monitor consistency metrics** - very low consistency might indicate execution_horizon is too small
+4. **Monitor validation output** - failures indicate potential implementation issues or misconfigured parameters
 5. **Compare different schedules** - EXP usually works best but LINEAR can be more interpretable

 ## Troubleshooting

-### High RTC vs No-RTC difference but no improvement
+### Validation fails in delay region

- Try reducing `max_guidance_weight`
- Check if `execution_horizon` is too large
+- Check that `prev_chunk_left_over` is properly passed to the policy
+- Verify RTC guidance is being applied during denoising
+- Look at denoising visualizations to see where guidance diverges

-### Poor consistency metrics
+### Validation fails in post-horizon region

- Increase `execution_horizon`
- Check that `skip_steps` is not larger than your action chunk size
- Verify episodes are being reset correctly
+- RTC and no_rtc may be sampling different noise; verify that both runs start from the same seed (`--seed`) so the comparison is like-for-like
+- Check that weights are correctly zeroed out after execution horizon
+- Review prefix_attention_schedule visualization
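
The noise point matters because the post-horizon rule compares two separate sampling runs. As a stand-in for the real setup (the script reseeds with `set_seed(self.cfg.seed)` before initializing each policy), this sketch uses Python's `random` module; `sample_noise` is a hypothetical helper, and the real noise is drawn by the policy, not by this function.

```python
import random


def sample_noise(seed: int, n: int) -> list[float]:
    """Draw n Gaussian samples from a freshly seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    no_rtc_noise = sample_noise(seed=10, n=5)
    rtc_noise = sample_noise(seed=10, n=5)  # reseeded before the second run
    print("same seed, same noise:", no_rtc_noise == rtc_noise)
    print("different seed, same noise:", no_rtc_noise == sample_noise(seed=11, n=5))
```

Without the second reseed, the RTC run would continue from the generator state left by the first run, and the two trajectories could differ even where no guidance is applied.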

-### RTC worse than No-RTC
+### Poor performance on real robot

- RTC may not help if inference is faster than action execution
- Try different `prefix_attention_schedule`
- Ensure `execution_horizon` matches your use case
+- Increase `action_queue_size_to_get_new_actions` if you see warnings
+- Reduce `max_guidance_weight` if robot is too conservative
+- Try different `prefix_attention_schedule` values
+- Enable torch.compile() for faster inference (CUDA only)

-## Examples Results
+### Memory issues with large models

-Example output from dataset evaluation:
-
-```
-================================================================================
-EVALUATION SUMMARY
-================================================================================
-
-Ground Truth Alignment:
-  RTC MSE:        0.001234 ± 0.000456
-  No-RTC MSE:     0.001567 ± 0.000512
-
-RTC Improvement:
-  Absolute:       0.000333
-  Relative:       21.23%
-
-RTC vs No-RTC Difference:
-  MSE:            0.000112 ± 0.000034
-
-RTC Consistency (Prefix Region):
-  MSE:            0.000089 ± 0.000023
-  Mean Error:     0.007654 ± 0.002341
-  Max Error:      0.023456 ± 0.008765
-```
+- The dataset evaluation script loads policies sequentially to minimize memory
+- For real-time execution, only one policy is loaded
+- Use smaller batch sizes if needed

 ## Related Documentation

 - [RTC Implementation](../../src/lerobot/policies/rtc/modeling_rtc.py)
 - [RTC Configuration](../../src/lerobot/policies/rtc/configuration_rtc.py)
+- [Action Queue](../../src/lerobot/policies/rtc/action_queue.py)
 - [Physical Intelligence Paper](https://www.physicalintelligence.company/download/real_time_chunking.pdf)
@@ -16,7 +16,9 @@ Usage:
        --policy.path=helper2424/smolvla_check_rtc_last3 \
        --dataset.repo_id=helper2424/check_rtc \
        --rtc.execution_horizon=8 \
-        --device=mps
+        --device=mps \
+        --rtc.max_guidance_weight=10.0 \
+        --seed=10

    # Basic usage with pi0.5 policy
    uv run python examples/rtc/eval_dataset.py \
@@ -439,6 +441,8 @@ class RTCEvaluator:
        logging.info("Step 2: Generating actions WITHOUT RTC with policy_no_rtc")
        logging.info("=" * 80)

+        set_seed(self.cfg.seed)
+
        # Initialize policy 2
        policy_no_rtc_policy = self._init_policy(
            name="policy_no_rtc",
@@ -470,6 +474,8 @@ class RTCEvaluator:
        logging.info("Step 3: Generating actions WITH RTC with policy_rtc")
        logging.info("=" * 80)

+        set_seed(self.cfg.seed)
+
        # Initialize policy 3
        policy_rtc_policy = self._init_policy(
            name="policy_rtc",
@@ -510,6 +516,11 @@ class RTCEvaluator:
        logging.info("Validating RTC behavior...")
        self.validate_rtc_behavior(rtc_actions, no_rtc_actions, prev_chunk_left_over)

+        # Plot final actions comparison
+        logging.info("=" * 80)
+        logging.info("Plotting final actions comparison...")
+        self.plot_final_actions_comparison(rtc_actions, no_rtc_actions, prev_chunk_left_over)
+
        logging.info("=" * 80)
        logging.info("Evaluation completed successfully")

@@ -527,29 +538,24 @@ class RTCEvaluator:
            no_rtc_actions: Final actions from non-RTC policy (batch, time, action_dim)
            prev_chunk_left_over: Previous chunk used as ground truth (time, action_dim)
        """
-        if rtc_actions is None or no_rtc_actions is None:
-            logging.warning("  ⚠ Cannot validate: missing action predictions")
-            return
+        # Remove batch dimension if present and move to CPU
+        rtc_actions_t = rtc_actions.squeeze(0).cpu() if len(rtc_actions.shape) == 3 else rtc_actions.cpu()
+        no_rtc_actions_t = (
+            no_rtc_actions.squeeze(0).cpu() if len(no_rtc_actions.shape) == 3 else no_rtc_actions.cpu()
+        )
+        prev_chunk = prev_chunk_left_over.cpu()

-        # Convert to numpy for comparison (remove batch dimension if present)
-        rtc_actions_np = (
-            rtc_actions.squeeze(0).cpu().numpy() if len(rtc_actions.shape) == 3 else rtc_actions.cpu().numpy()
-        )
-        no_rtc_actions_np = (
-            no_rtc_actions.squeeze(0).cpu().numpy()
-            if len(no_rtc_actions.shape) == 3
-            else no_rtc_actions.cpu().numpy()
-        )
-        prev_chunk = prev_chunk_left_over.cpu().numpy()
+        logging.info(f"  rtc_actions shape: {rtc_actions_t.shape}")
+        logging.info(f"  no_rtc_actions shape: {no_rtc_actions_t.shape}")
+        logging.info(f"  prev_chunk shape: {prev_chunk.shape}")

        # Determine chunk length for comparison
-        chunk_len = min(rtc_actions_np.shape[0], no_rtc_actions_np.shape[0], prev_chunk.shape[0])
+        chunk_len = min(rtc_actions_t.shape[0], no_rtc_actions_t.shape[0], prev_chunk.shape[0])
        inference_delay = self.cfg.inference_delay
        execution_horizon = self.cfg.rtc.execution_horizon

        # Tolerance for floating point comparison
-        rtol = 1e-3  # Relative tolerance
-        atol = 1e-3  # Absolute tolerance
+        rtol = 1e-2  # Relative tolerance (atol falls back to the torch.allclose default of 1e-8)

        validation_passed = True
        warnings = []
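
With `atol` no longer passed, `torch.allclose` falls back to its default absolute tolerance of 1e-8, so the comparison `|a - b| <= atol + rtol * |b|` becomes almost purely relative. The scalar sketch below (a hypothetical `is_close` helper, not part of the script) shows the consequence for reference values at or near zero.

```python
def is_close(a: float, b: float, rtol: float = 1e-2, atol: float = 1e-8) -> bool:
    """Scalar form of the closeness test |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


if __name__ == "__main__":
    # A 0.5% deviation around 1.0 is accepted by the relative term.
    print(is_close(1.005, 1.0))
    # The same check against a zero reference has no relative slack at all.
    print(is_close(1e-4, 0.0))
    # Restoring an absolute tolerance makes the near-zero case pass again.
    print(is_close(1e-4, 0.0, atol=1e-3))
```

If the validation reports failures only on action dimensions that hover around zero, this tolerance choice is a likely cause.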
@@ -558,19 +564,26 @@ class RTCEvaluator:
        logging.info(f"    Chunk length: {chunk_len}")
        logging.info(f"    Inference delay: {inference_delay}")
        logging.info(f"    Execution horizon: {execution_horizon}")
-        logging.info(f"    Tolerance: rtol={rtol}, atol={atol}")
+        logging.info(f"    Tolerance: rtol={rtol}")

        # ============================================================================
        # Rule 1: During delay [0:inference_delay], RTC should equal prev_chunk
        # ============================================================================
        if inference_delay > 0:
            delay_end = min(inference_delay, chunk_len)
-            rtc_delay = rtc_actions_np[:delay_end]
+            rtc_delay = rtc_actions_t[:delay_end]
            prev_delay = prev_chunk[:delay_end]

-            if not np.allclose(rtc_delay, prev_delay, rtol=rtol, atol=atol):
-                max_diff = np.max(np.abs(rtc_delay - prev_delay))
-                mean_diff = np.mean(np.abs(rtc_delay - prev_delay))
+            logging.info(f"  rtc_delay: {rtc_delay.shape}")
+            logging.info(f"  prev_delay: {prev_delay.shape}")
+
+            if not torch.allclose(rtc_delay, prev_delay, rtol=rtol):
+                max_diff = torch.max(torch.abs(rtc_delay - prev_delay)).item()
+                mean_diff = torch.mean(torch.abs(rtc_delay - prev_delay)).item()
+                logging.info(f"  rtc_delay: {rtc_delay}")
+                logging.info(f"  prev_delay: {prev_delay}")
+                logging.info(f"  max_diff: {max_diff}")
+                logging.info(f"  mean_diff: {mean_diff}")
                warnings.append(
                    f"    ⚠ VALIDATION FAILED: During delay [0:{delay_end}], "
                    f"RTC does NOT equal prev_chunk!\n"
@@ -589,26 +602,26 @@ class RTCEvaluator:
        blend_end = min(execution_horizon, chunk_len)

        if blend_end > blend_start:
-            rtc_blend = rtc_actions_np[blend_start:blend_end]
+            rtc_blend = rtc_actions_t[blend_start:blend_end]
            prev_blend = prev_chunk[blend_start:blend_end]
-            no_rtc_blend = no_rtc_actions_np[blend_start:blend_end]
+            no_rtc_blend = no_rtc_actions_t[blend_start:blend_end]

            # Check if RTC is between prev_chunk and no_rtc (element-wise)
            # For each element, check if it's between the min and max of prev_chunk and no_rtc
-            min_bound = np.minimum(prev_blend, no_rtc_blend) - atol
-            max_bound = np.maximum(prev_blend, no_rtc_blend) + atol
+            min_bound = torch.minimum(prev_blend, no_rtc_blend)
+            max_bound = torch.maximum(prev_blend, no_rtc_blend)

-            within_bounds = np.logical_and(rtc_blend >= min_bound, rtc_blend <= max_bound)
+            within_bounds = torch.logical_and(rtc_blend >= min_bound, rtc_blend <= max_bound)

-            if not np.all(within_bounds):
-                violations = np.sum(~within_bounds)
-                total_elements = within_bounds.size
+            if not torch.all(within_bounds):
+                violations = torch.sum(~within_bounds).item()
+                total_elements = within_bounds.numel()
                violation_pct = 100.0 * violations / total_elements

                # Find max violation
-                lower_violations = np.maximum(0, min_bound - rtc_blend)
-                upper_violations = np.maximum(0, rtc_blend - max_bound)
-                max_violation = np.max(np.maximum(lower_violations, upper_violations))
+                lower_violations = torch.clamp(min_bound - rtc_blend, min=0.0)
+                upper_violations = torch.clamp(rtc_blend - max_bound, min=0.0)
+                max_violation = torch.max(torch.maximum(lower_violations, upper_violations)).item()

                warnings.append(
                    f"    ⚠ VALIDATION FAILED: In blend region [{blend_start}:{blend_end}], "
@@ -626,12 +639,15 @@ class RTCEvaluator:
        # Rule 3: After execution horizon [execution_horizon:], RTC should equal no_rtc
        # ============================================================================
        if execution_horizon < chunk_len:
-            rtc_after = rtc_actions_np[execution_horizon:chunk_len]
-            no_rtc_after = no_rtc_actions_np[execution_horizon:chunk_len]
+            rtc_after = rtc_actions_t[execution_horizon:chunk_len]
+            no_rtc_after = no_rtc_actions_t[execution_horizon:chunk_len]

-            if not np.allclose(rtc_after, no_rtc_after, rtol=rtol, atol=atol):
-                max_diff = np.max(np.abs(rtc_after - no_rtc_after))
-                mean_diff = np.mean(np.abs(rtc_after - no_rtc_after))
+            logging.info(f"  rtc_after: {rtc_after}")
+            logging.info(f"  no_rtc_after: {no_rtc_after}")
+
+            if not torch.allclose(rtc_after, no_rtc_after, rtol=rtol):
+                max_diff = torch.max(torch.abs(rtc_after - no_rtc_after)).item()
+                mean_diff = torch.mean(torch.abs(rtc_after - no_rtc_after)).item()
                warnings.append(
                    f"    ⚠ VALIDATION FAILED: After execution horizon [{execution_horizon}:{chunk_len}], "
                    f"RTC does NOT equal no_rtc!\n"
@@ -661,6 +677,103 @@ class RTCEvaluator:
            logging.error("")
            logging.error("  Please check the implementation of RTC guidance.")

+    def plot_final_actions_comparison(self, rtc_actions, no_rtc_actions, prev_chunk_left_over):
+        """Plot final action predictions comparison on a single chart.
+
+        Args:
+            rtc_actions: Final actions from RTC policy
+            no_rtc_actions: Final actions from non-RTC policy
+            prev_chunk_left_over: Previous chunk used as ground truth
+        """
+        # Remove batch dimension if present
+        rtc_actions_plot = rtc_actions.squeeze(0).cpu() if len(rtc_actions.shape) == 3 else rtc_actions.cpu()
+        no_rtc_actions_plot = (
+            no_rtc_actions.squeeze(0).cpu() if len(no_rtc_actions.shape) == 3 else no_rtc_actions.cpu()
+        )
+        prev_chunk_plot = prev_chunk_left_over.cpu()
+
+        # Create figure with 6 subplots, one per action dimension (only the first 6 dimensions are plotted)
+        fig, axes = plt.subplots(6, 1, figsize=(16, 12))
+        fig.suptitle("Final Action Predictions Comparison (Raw)", fontsize=16)
+
+        # Plot each action dimension
+        for dim_idx, ax in enumerate(axes):
+            # Plot previous chunk (ground truth) in red
+            RTCDebugVisualizer.plot_waypoints(
+                [ax],
+                prev_chunk_plot[:, dim_idx : dim_idx + 1],
+                start_from=0,
+                color="red",
+                label="Previous Chunk (Ground Truth)",
+                linewidth=2.5,
+                alpha=0.8,
+            )
+
+            # Plot no-RTC actions in blue
+            RTCDebugVisualizer.plot_waypoints(
+                [ax],
+                no_rtc_actions_plot[:, dim_idx : dim_idx + 1],
+                start_from=0,
+                color="blue",
+                label="No RTC",
+                linewidth=2,
+                alpha=0.7,
+            )
+
+            # Plot RTC actions in green
+            RTCDebugVisualizer.plot_waypoints(
+                [ax],
+                rtc_actions_plot[:, dim_idx : dim_idx + 1],
+                start_from=0,
+                color="green",
+                label="RTC",
+                linewidth=2,
+                alpha=0.7,
+            )
+
+            # Add vertical lines for inference delay and execution horizon
+            inference_delay = self.cfg.inference_delay
+            execution_horizon = self.cfg.rtc.execution_horizon
+
+            if inference_delay > 0:
+                ax.axvline(
+                    x=inference_delay - 1,
+                    color="orange",
+                    linestyle="--",
+                    alpha=0.5,
+                    label=f"Inference Delay ({inference_delay})",
+                )
+
+            if execution_horizon > 0:
+                ax.axvline(
+                    x=execution_horizon,
+                    color="purple",
+                    linestyle="--",
+                    alpha=0.5,
+                    label=f"Execution Horizon ({execution_horizon})",
+                )
+
+            ax.set_ylabel(f"Dim {dim_idx}", fontsize=10)
+            ax.grid(True, alpha=0.3)
+
+            # Set evenly spaced integer x-axis ticks (every step only when the chunk is short)
+            max_len = max(rtc_actions_plot.shape[0], no_rtc_actions_plot.shape[0], prev_chunk_plot.shape[0])
+            ax.set_xticks(range(0, max_len, max(1, max_len // 20)))  # Show ~20 ticks
+            ax.set_xlim(-0.5, max_len - 0.5)
+
+            # Add legend only to first subplot
+            if dim_idx == 0:
+                ax.legend(loc="best", fontsize=9)
+
+        axes[-1].set_xlabel("Step", fontsize=10)
+
+        # Save figure
+        output_path = os.path.join(self.cfg.output_dir, "final_actions_comparison.png")
+        fig.tight_layout()
+        fig.savefig(output_path, dpi=150)
+        logging.info(f"Saved final actions comparison to {output_path}")
+        plt.close(fig)
+
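
The tick expression `range(0, max_len, max(1, max_len // 20))` used in the plotting code is easy to misread, so here it is in isolation. `tick_positions` is a hypothetical wrapper written for this note; only the expression inside it comes from the diff.

```python
def tick_positions(max_len: int, target: int = 20) -> list[int]:
    """Integer tick positions produced by the stride used in the plotting code."""
    step = max(1, max_len // target)  # integer division keeps the stride at least 1
    return list(range(0, max_len, step))


if __name__ == "__main__":
    for max_len in (8, 39, 50, 100):
        ticks = tick_positions(max_len)
        print(max_len, "steps ->", len(ticks), "ticks, stride", ticks[1] - ticks[0])
```

Because the stride is rounded down, the tick count is not fixed at 20: chunks shorter than 40 steps get one tick per step, and longer chunks land somewhere between 20 and 39 ticks.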
    def plot_tracked_data(self, rtc_tracked_steps, no_rtc_tracked_steps, prev_chunk_left_over, num_steps):
        # Create side-by-side figures for denoising visualization
        fig_xt, axs_xt = self._create_figure("x_t Denoising: No RTC (left) vs RTC (right)")
@@ -828,6 +941,13 @@ class RTCEvaluator:
                margin = y_range * 0.1
                ax.set_ylim(ylim[0] - margin, ylim[1] + margin)

+            # Set evenly spaced integer x-axis ticks (every step only when the chunk is short)
+            xlim = ax.get_xlim()
+            max_len = int(xlim[1]) + 1
+            if max_len > 0:
+                ax.set_xticks(range(0, max_len, max(1, max_len // 20)))  # Show ~20 ticks
+                ax.set_xlim(-0.5, max_len - 0.5)
+

@parser.wrap()
 def main(cfg: RTCEvalConfig):