Towards Intrinsic-Aware Monocular 3D Object Detection
Abstract
Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsics govern how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not merely a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision–language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics. This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras. Extensive experiments show that MonoIA achieves new state-of-the-art results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.18% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).
Overall Pipeline
MonoIA is a unified intrinsic-aware detection framework built upon two designs. The Intrinsic Encoder leverages the knowledge of an LLM and CLIP to convert numeric intrinsics into semantically meaningful embeddings that capture their perceptual and geometric effects, providing a strong prior for generalization across cameras. The Intrinsic Adaptation Module bridges this semantic knowledge with visual perception through a lightweight Connector and hierarchical fusion, enabling the detector to interpret visual features in an intrinsic-aware manner and maintain consistent 3D detection under diverse camera settings.
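The two designs above can be illustrated with a minimal sketch: first, numeric intrinsics are rendered as a natural-language description (the input an LLM/CLIP encoder would consume), and second, the resulting embedding modulates visual features channel-wise. The prompt wording, function names, and the FiLM-style scale-and-shift modulation are assumptions for illustration only; the paper's actual prompt format and fusion design are not reproduced here.

```python
import numpy as np

def intrinsics_to_text(fx, width):
    """Render numeric intrinsics as a natural-language description.
    Hypothetical prompt format: the real encoder's prompt may differ."""
    # Horizontal field of view implied by the focal length (pinhole model).
    hfov = 2.0 * np.degrees(np.arctan(width / (2.0 * fx)))
    return (f"A camera with focal length {fx:.0f} pixels and a horizontal "
            f"field of view of {hfov:.1f} degrees; longer focal lengths make "
            f"objects appear larger and the perspective flatter.")

def modulate_features(feat, emb, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM-style modulation (an assumed stand-in for the Intrinsic
    Adaptation Module): project the intrinsic embedding to a per-channel
    scale (gamma) and shift (beta), then apply them to the feature map.

    feat: (C, H, W) visual features; emb: (D,) intrinsic embedding.
    """
    gamma = emb @ w_gamma + b_gamma          # (C,) per-channel scale
    beta = emb @ w_beta + b_beta             # (C,) per-channel shift
    return feat * gamma[:, None, None] + beta[:, None, None]

# Example: KITTI-like intrinsics (fx ≈ 721.5 px, 1242-px-wide images).
prompt = intrinsics_to_text(fx=721.5, width=1242)

rng = np.random.default_rng(0)
C, D = 4, 8
feat = rng.standard_normal((C, 5, 5))
emb = rng.standard_normal(D)
out = modulate_features(feat, emb,
                        rng.standard_normal((D, C)), np.zeros(C),
                        rng.standard_normal((D, C)), np.zeros(C))
```

In a full detector this modulation would be applied at several feature levels (the hierarchical fusion mentioned above), with the projection weights learned end to end.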
Generalization on Synthetic Intrinsics
MonoIA achieves the highest AP3D across all focal lengths and maintains strong robustness even under extrapolated intrinsics.
Standard KITTI Results
MonoIA achieves state-of-the-art performance on both the KITTI leaderboard and the KITTI validation set, demonstrating that our intrinsic-aware design also improves standard 3D object detection benchmarks.
Multi-Dataset Training Results
Our intrinsic-aware design helps bridge inter-dataset discrepancies, especially differences in focal length, so MonoIA naturally supports multi-dataset training and further improves overall performance.
Qualitative Results
BibTeX
@inproceedings{zhang2026towards,
title={Towards Intrinsic-Aware Monocular 3D Object Detection},
author={Zhang, Zhihao and Kumar, Abhinav and Liu, Xiaoming},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}