Monocular camera calibration is a key prerequisite for numerous 3D vision applications. Despite considerable advancements, existing methods often hinge on specific assumptions and struggle to generalize across varied real-world scenarios, and their performance is further limited by insufficient training data. Recently, diffusion models trained on expansive datasets have been shown to generate diverse, high-quality images, suggesting that such models have a strong capacity to understand varied visual information. In this work, we leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation. Specifically, we reformulate the problem of estimating the four degrees of freedom (4-DoF) of camera intrinsic parameters as a dense incident map generation task. The map specifies the angle of incidence for each pixel in the RGB image, and its format aligns well with the paradigm of diffusion models. The camera intrinsics can then be derived from the incident map with a simple, non-learning RANSAC algorithm during inference. Moreover, to further enhance performance, we jointly estimate a depth map that provides additional geometric information for the incident map estimation. Extensive experiments on multiple test datasets demonstrate that our model achieves state-of-the-art performance, reducing prediction errors by up to 40%. Furthermore, the experiments show that the precise camera intrinsics and depth maps estimated by our pipeline can greatly benefit practical applications such as 3D reconstruction from a single in-the-wild image.
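To make the intrinsics-recovery step concrete, the following is a minimal sketch (not the authors' implementation) of how 4-DoF pinhole intrinsics could be fit to a dense incident map with RANSAC. It assumes the incident map is given as a unit ray direction per pixel, so that `u = fx * (dx/dz) + cx` and `v = fy * (dy/dz) + cy`; the function name and parameters are hypothetical.

```python
import numpy as np

def intrinsics_from_incident_map(rays, iters=200, thresh=0.5, seed=0):
    """Recover (fx, fy, cx, cy) from a dense incident map via RANSAC.

    rays: (H, W, 3) array of unit ray directions, one per pixel.
    Under a pinhole model, u = fx * (dx/dz) + cx, so (fx, cx) is found
    by robustly fitting a line to (dx/dz, u) pairs; likewise (fy, cy).
    """
    H, W, _ = rays.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    x = rays[..., 0] / rays[..., 2]                   # dx/dz per pixel
    y = rays[..., 1] / rays[..., 2]                   # dy/dz per pixel
    rng = np.random.default_rng(seed)

    def ransac_line(t, p):
        """Robustly fit p ~ a*t + b from 2-point minimal samples."""
        t, p = t.ravel(), p.ravel()
        best, best_inliers = (1.0, 0.0), -1
        for _ in range(iters):
            i, j = rng.choice(t.size, size=2, replace=False)
            if abs(t[i] - t[j]) < 1e-9:
                continue  # degenerate sample
            a = (p[i] - p[j]) / (t[i] - t[j])
            b = p[i] - a * t[i]
            inliers = np.abs(a * t + b - p) < thresh  # pixel-space residual
            if inliers.sum() > best_inliers:
                best_inliers = inliers.sum()
                # refine with least squares on the inlier set
                A = np.stack([t[inliers], np.ones(inliers.sum())], axis=1)
                best = tuple(np.linalg.lstsq(A, p[inliers], rcond=None)[0])
        return best

    fx, cx = ransac_line(x, us)
    fy, cy = ransac_line(y, vs)
    return fx, fy, cx, cy
```

On a synthetic, noise-free incident map generated from known intrinsics, this recovers the parameters to sub-pixel accuracy; the RANSAC loop matters only when the predicted map contains outlier rays, which is the situation the paper's non-learning solver is designed for.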