Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection

Michigan State University
CVPR 2026

In Mono3D, 3D attributes are inter-correlated through the 3D-to-2D projection. Our MonoCoP is an adaptive framework that learns when and how to leverage inter-attribute correlations.

Abstract

Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image. Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity. Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection. However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors. Therefore, neither parallel nor sequential prediction is optimal. In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs. A Chain-of-Prediction (CoP) explores inter-attribute correlations through feature-level learning, propagation, and aggregation, while an Uncertainty-Guided Selector (GS) dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty. By combining their strengths, MonoCoP achieves state-of-the-art (SOTA) performance on KITTI, nuScenes, and Waymo, significantly improving depth accuracy, particularly for distant and challenging objects.

Overall Pipeline

MonoCoP consists of two main components: a Chain-of-Prediction (CoP) and an Uncertainty-Guided Selector (GS) module. The CoP module predicts 3D attributes (depth, size, and orientation) sequentially, leveraging the correlation between attributes. The GS module dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty.
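The control flow above can be sketched in a few lines of Python. This is a hypothetical toy illustration, not the paper's implementation: the linear "heads", the propagation step, the weights, and the uncertainty threshold are all stand-in assumptions meant only to contrast the parallel and chained paradigms and the per-object switch between them.

```python
# Toy sketch of MonoCoP's two prediction paradigms (hypothetical, not the paper's code).

def head(weights, feat):
    """A toy linear 'head': dot product of weights with a feature vector."""
    return sum(w * f for w, f in zip(weights, feat))

# Hypothetical learned weights for each attribute head.
W_SIZE, W_ORI, W_DEPTH = [0.5, 0.1], [0.2, 0.4], [0.3, 0.3]

def parallel_predict(feat):
    """Parallel paradigm: each 3D attribute is predicted independently."""
    return {
        "size": head(W_SIZE, feat),
        "orientation": head(W_ORI, feat),
        "depth": head(W_DEPTH, feat),
    }

def cop_predict(feat):
    """Chain-of-Prediction: each attribute's feature is propagated and
    aggregated into the next prediction (size -> orientation -> depth)."""
    size = head(W_SIZE, feat)
    feat_o = [f + 0.1 * size for f in feat]   # aggregate size into the feature
    ori = head(W_ORI, feat_o)
    feat_d = [f + 0.1 * ori for f in feat_o]  # aggregate orientation as well
    depth = head(W_DEPTH, feat_d)
    return {"size": size, "orientation": ori, "depth": depth}

def monocop_predict(feat, uncertainty, threshold=0.5):
    """Uncertainty-guided selection: fall back to parallel prediction when
    the predicted uncertainty is high (e.g., occluded/truncated objects),
    so errors in early attributes cannot propagate down the chain."""
    if uncertainty > threshold:
        return parallel_predict(feat)
    return cop_predict(feat)
```

The key design point is that the switch is made per object: confident detections benefit from the inter-attribute correlations exploited by the chain, while uncertain ones avoid the chain's error propagation.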

KITTI Results

MonoCoP sets a new state of the art in monocular 3D object detection on both the KITTI test leaderboard and the KITTI validation set.

Waymo Results

MonoCoP outperforms previous methods by a large margin on the Waymo validation set.

nuScenes Results

MonoCoP also achieves state-of-the-art performance on the nuScenes validation set.

Qualitative Results

BibTeX

@inproceedings{zhang2025unleashing,
    title={Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection},
    author={Zhang, Zhihao and Kumar, Abhinav and Ganesan, Girish Chandar and Liu, Xiaoming},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2026}
}