Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation

CVPR 2026
1Shanghai Jiao Tong University, 2Southeast University, 3USTC, 4Shanghai Innovation Institute, 5Noematrix Ltd.
* Equal contribution. ‡ Corresponding authors.

A systematic study revealing how wide FoV fisheye cameras enhance robotic manipulation through superior spatial localization and scene generalization.

Abstract

The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that:

  • Spatial Localization: Wide FoV significantly enhances localization, but this benefit is critically contingent on the visual complexity of the environment.
  • Scene Generalization: Fisheye-trained policies unlock superior scene generalization when trained with sufficient environmental diversity.
  • Hardware Generalization: We identify scale overfitting as the root cause of transfer failures and propose Random Scale Augmentation (RSA) to improve performance.

Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning.

Research Questions

Overview

Study Design Overview

Figure 3: Overview of the four factors analyzed to address our research questions: (a) Camera Model; (b) Scene Complexity; (c) Scene Diversity; (d) Camera Parameters.

Experimental Setup

Real-World Platform

Real-World Hardware Setup

Figure 1: Our hardware platform comprises a Flexiv Rizon 4 7-axis robot arm and a DH AG-160-95 adaptive gripper. High-quality demonstrations are collected via teleoperation with a Meta Quest 3 headset.

Simulation Environment

Simulation Rendering Pipeline

Figure 2: To enable reliable fisheye benchmarking, we implemented a two-stage projection pipeline within the MuJoCo physics engine. This process generates fisheye views from intermediate panoramic representations, allowing for precise lens parameter control.
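The second stage of this pipeline — sampling a fisheye view from the intermediate panorama — can be sketched as below. This is a minimal illustration assuming an equidistant fisheye model (r = f·θ) and an equirectangular panorama; the function name, sign conventions, and nearest-neighbor sampling are our own simplifications, not the paper's implementation.

```python
import numpy as np

def fisheye_from_equirect(pano, out_size=256, fov_deg=180.0):
    """Sample a fisheye view from an equirectangular panorama.

    Assumes an equidistant projection model (radius proportional to
    the polar angle theta), optical axis along +Z of the panorama frame.
    """
    h_p, w_p, _ = pano.shape
    half_fov = np.radians(fov_deg) / 2.0

    # Normalized image-plane coordinates in [-1, 1]
    v, u = np.mgrid[0:out_size, 0:out_size]
    x = (u + 0.5) / out_size * 2.0 - 1.0
    y = (v + 0.5) / out_size * 2.0 - 1.0
    r = np.sqrt(x**2 + y**2)

    # Equidistant model: radius maps linearly to the polar angle
    theta = r * half_fov            # angle from the optical axis
    phi = np.arctan2(y, x)          # azimuth around the axis

    # Direction vector for each output pixel (optical axis = +Z)
    dx = np.sin(theta) * np.cos(phi)
    dy = np.sin(theta) * np.sin(phi)
    dz = np.cos(theta)

    # Convert direction to equirectangular (longitude, latitude) pixels
    lon = np.arctan2(dx, dz)                 # [-pi, pi]
    lat = np.arcsin(np.clip(dy, -1, 1))      # [-pi/2, pi/2]
    px = ((lon / np.pi + 1.0) / 2.0 * (w_p - 1)).astype(int)
    py = ((lat / (np.pi / 2) + 1.0) / 2.0 * (h_p - 1)).astype(int)

    out = pano[py, px]
    out[r > 1.0] = 0  # mask pixels outside the fisheye image circle
    return out
```

Because the lens model is applied analytically at the sampling stage, varying the focal length or projection function only changes the `theta`-to-radius mapping, which is what enables precise lens parameter control.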

Scaling Simulation Capabilities

We adapt Robomimic and MimicGen benchmarks to evaluate our policies across 6 challenging manipulation tasks. To support RQ2, we utilize 32 distinct background textures to provide sufficient environmental diversity during training.

RQ1: Spatial Localization

Can the wider FoV of fisheye cameras help policy localization?

In this section, we investigate whether the wider FoV of fisheye cameras enhances policy localization. Given our wrist-view-only configuration, the policy must rely on background cues for spatial reasoning.

Hypothesis: The fisheye's wider FoV enables superior policy localization by integrating a greater density of background features, leading to a strong positive dependency on the visual richness of the scene.

We validate this by comparing task performance in feature-poor versus feature-rich backgrounds. The results confirm that rich backgrounds are critical: the fisheye camera shows an average gain of +0.39 in the real world.

Simulation Performance (SR)

Camera Feature Avg. Success Rate
Pinhole (Single) Poor / Rich 0.31 / 0.34 (+0.03)
Fisheye (Single) Poor / Rich 0.57 / 0.66 (+0.09)

Table 1: Performance comparison in simulation.

Spatial Awareness Probing

Configuration Trans. Error (cm) ↓ Rot. Error (°) ↓
Pinhole Poor 9.80 9.160
Fisheye Rich 1.73 1.578

Table 2: Spatial probing errors in Pick Cup.

Conclusion: To maximize performance, data should be collected in visually complex and feature-rich environments.

RQ2: Scene Generalization

How do fisheye cameras affect generalization to novel backgrounds?

In robotic manipulation, the motion of a wrist-mounted camera naturally induces background shifts, which act as a form of implicit data augmentation. We investigate the Scaling Law of policy generalization by increasing the number of unique training scenes (N) while holding the total data volume fixed, isolating the impact of environmental diversity.

Hypothesis: Fisheye-trained policies can more effectively utilize scene diversity to improve generalization, exhibiting a steeper performance scaling curve as the number of unique training scenes (N) increases.

We validate this through zero-shot evaluation on distinct unseen scenes in both simulation and real-world environments. Our results demonstrate that fisheye cameras exhibit significantly greater scaling potential compared to conventional cameras; for instance, the real-world fisheye policy achieves near-perfect scores with just eight diverse training scenes.

Scaling Law Results

Figure 4: Success rate and normalized score vs. the number of training scenes (N).

Conclusion: Maximizing environmental diversity during data collection is essential to unlock the full generalization capabilities of fisheye cameras.

RQ3: Hardware Generalization

Can policies maintain performance when deployed on new fisheye lenses?

The varied distortion profiles of fisheye lenses make cross-camera transfer a significant challenge. Policies trained on a specific lens often overfit to the absolute pixel scale of objects to determine distance. When deployed on a new lens with different intrinsic parameters, these scales change, causing the policy to miscalculate depth—either undershooting or overshooting the target—leading to catastrophic failures.

Hypothesis: The primary bottleneck for cross-camera transfer is "Scale Overfitting." This can be mitigated by using Random Scale Augmentation (RSA) to force the policy to learn relative spatial relationships rather than absolute pixel sizes.

Random Scale Augmentation (RSA)

To address scale sensitivity, we introduce RSA, a strategy that compels the network to learn scale-invariant features. During training, RSA samples a random scale factor s from a uniform distribution (e.g., 0.7 to 1.3). Scale factors greater than 1.0 produce a "zoom-out" effect: the image is resized down and the surrounding canvas is padded with black. This prevents the network from memorizing absolute sizes and instead teaches it relative cues, such as the scale of a target object relative to the robot's gripper.
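A minimal sketch of this augmentation follows. The function name and the dependency-free nearest-neighbor resizing are our own illustrative choices; the paper's pipeline may use a different resampling backend.

```python
import numpy as np

def random_scale_augment(img, s_range=(0.7, 1.3), s=None, rng=None):
    """Sketch of Random Scale Augmentation (RSA).

    s > 1.0: "zoom out" -- image content is shrunk and the surrounding
             canvas is padded with black.
    s < 1.0: "zoom in"  -- a center s-fraction is cropped and resized
             back to the original resolution.
    """
    if s is None:
        rng = rng or np.random.default_rng()
        s = rng.uniform(*s_range)
    h, w = img.shape[:2]

    if s >= 1.0:
        # Zoom out: content occupies a smaller region, rest stays black
        nh, nw = int(round(h / s)), int(round(w / s))
        ys = (np.arange(nh) * h / nh).astype(int)
        xs = (np.arange(nw) * w / nw).astype(int)
        small = img[ys][:, xs]                    # shrink content
        out = np.zeros_like(img)                  # black canvas
        top, left = (h - nh) // 2, (w - nw) // 2
        out[top:top + nh, left:left + nw] = small
    else:
        # Zoom in: center-crop an s-fraction, resize back to full frame
        ch, cw = int(round(h * s)), int(round(w * s))
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = img[top:top + ch, left:left + cw]
        ys = (np.arange(h) * ch / h).astype(int)
        xs = (np.arange(w) * cw / w).astype(int)
        out = crop[ys][:, xs]                     # resize up
    return out
```

Applied per training frame with a freshly sampled s, the augmentation decouples an object's pixel size from its metric distance, forcing the policy to anchor depth estimates on relative cues instead.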

Random Scale Augmentation Strategy

Figure 5: Comparison between standard Random Crop Augmentation (fixed scale) and our proposed Random Scale Augmentation (RSA).

Zero-shot Cross-Camera Success Rate (Simulation)

Evaluation Setting Lens Characteristic Baseline (Standard Aug.) Ours (RSA)
Seen Param Training Baseline 0.56 0.67
Param 1 Decreased FoV 0.30 0.41
Param 2 Alternative Projection 0.41 0.43
Param 3 Geometric Scale Shift 0.15 0.57
Param 4 Increased Distortion 0.17 0.40
Param 5 Extreme Focal Length 0.01 0.06

Table 3: Comparison of success rates across six tasks in simulation for unseen camera parameters.

Real-World Hardware Generalization (Normalized Score)

Physical Lens FOV Angle Induced Scale Shift Baseline Score Ours (RSA) Score
Seen Camera 180° 1.0x (Seen) 1.0000 1.0000
Narrow Lens 150° ~1.2x (Zoom In) 0.5000 0.9500
Wide Lens 220° ~0.8x (Zoom Out) 0.0025 0.6000

Table 4: Zero-shot cross-camera transfer on physical hardware using distinct lenses.

Conclusion: Standard fisheye policies are highly sensitive to absolute object scale. Learning relative scale via strong data augmentation like RSA is essential to ensure policies are robust to hardware variations and can effectively leverage datasets from diverse lens sources.

Experiments


Main Results

We employ a normalized, multi-stage scoring metric for real-world evaluation to provide a more granular assessment of policy capability than binary success rates. Each evaluation setup consists of 20 trials, and we report the cumulative normalized score.

Task 1: Pick Cup

  • Task Description: The goal is to pick up a cup from a random starting position and place it upright onto a designated coaster.
  • Evaluation Protocol: We test the policy under two environmental settings: Feature-Poor (solid-colored background) and Feature-Rich (patterned cloths with diverse textures).
  • Score Metric:
    • Stage 1 (0.00 pts): Failed to grasp or place the cup.
    • Stage 2 (0.50 pts): Placed on coaster but toppled over.
    • Stage 3 (1.00 pts): Successfully placed upright on the coaster.
Pick Cup Task
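Under the staged metric above, the reported per-setup score reduces to an average of per-trial stage points. A minimal sketch, using the Pick Cup weights (0.00 / 0.50 / 1.00) and hypothetical trial outcomes:

```python
def normalized_score(stage_points, trials):
    """Average per-trial score for one evaluation setup.

    stage_points: mapping from the stage a trial reached to its score,
                  e.g. Pick Cup: {0: 0.0, 1: 0.5, 2: 1.0}.
    trials:       list of stage indices reached, one per rollout.
    """
    return sum(stage_points[t] for t in trials) / len(trials)

# Example: 20 Pick Cup rollouts (outcome counts are hypothetical)
pick_cup_points = {0: 0.0, 1: 0.5, 2: 1.0}
trials = [2] * 14 + [1] * 4 + [0] * 2
score = normalized_score(pick_cup_points, trials)
print(score)  # (14 * 1.0 + 4 * 0.5 + 2 * 0.0) / 20 = 0.8
```

The same function covers Fold Towel (four stages at 0.25 points each) and Hang Chinese Knot (binary) by swapping the `stage_points` mapping.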

Task 2: Fold Towel

  • Task Description: This task requires performing two consecutive folds on a deformable towel. The robot must grasp a corner, fold it diagonally, and then repeat for the second corner.
  • Evaluation Protocol: Focuses on the policy's ability to handle deformable objects and maintain localization across multi-step sequences.
  • Score Metric: Four stages (0.25 pts each) corresponding to: 1. Grasping the first corner, 2. Completing the first fold, 3. Grasping the second corner, and 4. Completing the final fold.
Fold Towel Task

Task 3: Hang Chinese Knot

  • Task Description: Requires precise rotational manipulation to hang a Chinese knot onto a designated hook on a stand.
  • Evaluation Protocol: Since the initial grasping phase is successfully completed by most baselines, we focus solely on the precise placement required to secure the knot.
  • Score Metric:
    • Stage 1 (0.00 pts): Failed to hang (e.g., dropped or missed the hook).
    • Stage 2 (1.00 pts): Successfully secured the knot onto the hook.
Hang Chinese Knot Task

RQ1: Spatial Localization

Hypothesis: The exceptionally wide Field of View (FoV) of wrist-mounted fisheye cameras enhances policy localization by integrating a greater density of static environmental features as visual anchors. Consequently, we expect policy performance to exhibit a strong positive dependency on the visual complexity of the training scene.

Quantitative Results

Task Camera Setup Feature-Poor Feature-Rich Performance Gain
Pick Cup Fisheye (State-free) 0.525 0.800 +0.275
Fold Towel Fisheye (State-free) 0.100 0.700 +0.600
Hang Chinese Knot Fisheye (State-free) 0.200 0.500 +0.300

*Note: All results are based on state-free policies (without proprioception) to isolate the localization capability of the visual encoder.

Qualitative Rollouts

We demonstrate the robustness of fisheye-based policies across different tasks and environmental settings.

Task 1: Pick Cup

Pinhole + Poor Scene
(Score: 0.125)

Fisheye + Poor Scene
(Score: 0.525)

Fisheye + Rich Scene
(Score: 0.800)

Task 2: Fold Towel

Pinhole + Rich Scene
(Score: 0.316)

Fisheye + Rich Scene
(Score: 0.700)

Fisheye + Rich + State
(Score: 0.917)

Task 3: Hang Chinese Knot

Fisheye + Poor Scene
(Score: 0.200)

Fisheye + Rich Scene
(Score: 0.500)

Fisheye + Rich + State
(Score: 0.700)

Conclusion: These results confirm that the fisheye camera's wide contextual view implicitly encodes the robot's spatial relationship with the environment. This renders explicit proprioceptive state redundant in feature-rich environments, allowing the policy to rely exclusively on vision for high-precision manipulation.

RQ2: Scene Generalization

Hypothesis: Fisheye-trained policies can more effectively utilize scene diversity to improve generalization, exhibiting a steeper performance scaling curve as the number of unique training scenes (N) increases. To isolate the impact of environmental diversity from data volume, we maintain a Fixed Total Data Volume (e.g., 200 trajectories for real-world tasks). Any performance gain is thus attributable solely to the increased diversity of the visual data.

Scaling Analysis (Real-World Pick Cup)

Number of Scenes (N) Pinhole Avg. Score Fisheye Avg. Score
N=1 0.081 0.556
N=2 0.106 0.638
N=4 0.238 0.869
N=6 0.225 0.913
N=8 0.181 0.988

*Note: Scores represent the normalized performance on unseen test backgrounds. Fisheye policies achieve near-perfect performance as scene diversity increases.

Performance Scaling on Unseen Backgrounds

Visual comparison of Pinhole vs. Fisheye policies on the same unseen test background as training diversity (N) increases from 1 to 8.

Pinhole Policies (N = 1, 2, 4, 6, 8)

N=1

N=2

N=4

N=6

N=8

Fisheye Policies (N = 1, 2, 4, 6, 8)

N=1

N=2

N=4

N=6

N=8 (Robust)

Conclusion: The wider FoV of the fisheye camera acts as a potent implicit data augmentation, enabling the policy to better leverage scene diversity for robust cross-scene generalization. Maximizing scene diversity is essential to unlock the full potential of fisheye-based robotic learning.

RQ3: Hardware Generalization

Research Question: How well do policies trained on one fisheye camera transfer to a new, unseen fisheye lens with different intrinsic parameters?

Hypothesis: The primary failure mode for cross-camera transfer is Scale Overfitting. Policies trained on a specific lens overfit to absolute pixel scales. This can be mitigated by Random Scale Augmentation (RSA), which compels the network to learn scale-invariant relative spatial relationships.

Scale Sensitivity Analysis (Mechanism Verification)

We first simulate geometric domain shifts by applying center-crop scale factors (S) during inference to mimic changes in focal length. The baseline policy exhibits a characteristic "inverted-V" performance drop, confirming that it relies heavily on absolute object scales.

Scale Factor (S) Effect Baseline (Fixed Scale) Ours (RSA)
S = 0.70 Zoom In 0.000 0.900
S = 0.85 Moderate Zoom In 0.950 1.000
S = 1.00 Training Scale 1.000 1.000
S = 1.15 Moderate Zoom Out 0.750 0.975
S = 1.30 Zoom Out 0.650 1.000

Real-World Cross-Camera Verification

We deploy the policies on physical lenses with different fields of view (FoV). RSA demonstrates a broad generalization plateau, maintaining robust performance even when the lens scale deviates significantly.

Physical Lens Type Induced Scale Shift Baseline Score Ours (RSA) Score
Seen Camera (180°) 1.0x (Training Scale) 1.0000 1.0000
Narrow Lens (150°) ~1.2x (Zoom In) 0.5000 0.9500
Wide Lens (220°) ~0.8x (Zoom Out) 0.0025 0.6000
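The "Induced Scale Shift" column is consistent with a simple back-of-the-envelope model. Assuming an equidistant projection (r = f·θ) and a fixed image-circle radius R on the sensor, the focal length is f = R / (FoV/2), so apparent scale varies with the inverse FoV ratio. This is an approximation for intuition, not the paper's calibration procedure:

```python
def induced_scale_shift(fov_train_deg, fov_deploy_deg):
    """Approximate apparent-scale change between equidistant fisheye
    lenses sharing the same image-circle radius R.

    Equidistant model: r = f * theta, and the image circle spans the
    half-FoV, so f = R / (fov / 2). Apparent scale is proportional to
    f, hence to the inverse ratio of the FoVs.
    """
    return fov_train_deg / fov_deploy_deg

print(round(induced_scale_shift(180, 150), 2))  # narrow lens: 1.2x (zoom in)
print(round(induced_scale_shift(180, 220), 2))  # wide lens: 0.82x (zoom out)
```

These ratios match the ~1.2x and ~0.8x shifts reported for the 150° and 220° lenses relative to the 180° training camera.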

Failure Mode Analysis vs. RSA Robustness

Visualizing the "Depth-Scale Ambiguity": how scale shifts lead to depth misinterpretation in standard policies, and how RSA resolves it.

Baseline Policy (Scale Overfitting)

Zoom In (S=0.7):
Undershooting (Perceived as closer)

Zoom In (S=0.85):
Undershooting (Perceived as closer)

Seen Scale (S=1.00):
Success

Zoom Out (S=1.15):
Overshooting/Collision

Zoom Out (S=1.30):
Overshooting/Collision

Ours (Random Scale Augmentation)

Zoom In (S=0.70): Accurate Depth

Zoom In (S=0.85): Accurate Depth

Seen Scale (S=1.00): Accurate Depth

Zoom Out (S=1.15): Accurate Depth

Zoom Out (S=1.30): Accurate Depth

Real-World Cross-Camera Verification (Baseline Policy)

Zoom In (S=0.70):
Undershooting (Perceived as closer)

Zoom Out (S=1.30):
Overshooting (Perceived as farther)

Real-World Cross-Camera Verification (Ours: Random Scale Augmentation)

Zoom In: Accurate Depth

Zoom Out: Accurate Depth

BibTeX

@inproceedings{xue2026rethinking,
  title={Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation},
  author={Xue, Han and Nan, Min and Liu, Xiaotong and Chen, Wendi and Fang, Yuan and Lv, Jun and Lu, Cewu and Wen, Chuan},
  booktitle={CVPR},
  year={2026},
}