Building three-dimensional representations of objects or scenes from a set of images has been widely studied for years and has gained additional attention since the spread of NeRF-based approaches. However, an underestimated prerequisite is knowledge of the camera poses or, more specifically, an estimate of the extrinsic calibration parameters. Although excellent general-purpose Structure-from-Motion methods are available as a pre-processing step, their computational load is high and they require many frames to guarantee sufficient overlap between views. This paper introduces KRONC, a novel approach that infers view poses by leveraging prior knowledge about the object to be reconstructed and its representation through semantic keypoints. Focusing on vehicle scenes, KRONC estimates the position of the views as the solution to a lightweight optimization problem whose objective is the convergence of each keypoint's back-projections to a single point. To validate the method, a dedicated dataset of real-world car scenes has been collected. Experiments confirm KRONC's ability to generate excellent camera pose estimates starting from a very coarse initialization. Results are comparable with those of Structure-from-Motion methods, with large savings in computation. Code and data will be made publicly available.
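The convergence objective can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera with known intrinsics `K`, the pose convention `x_cam = R @ x_world + t`, and hypothetical helper names (`backproject_ray`, `closest_point_to_rays`, `convergence_cost`). For one semantic keypoint seen in several views, it back-projects each detection to a ray, finds the least-squares point nearest to all rays, and returns the residual that a pose optimizer would drive toward zero.

```python
import numpy as np


def backproject_ray(K, R, t, uv):
    """World-space ray (origin, unit direction) through pixel uv.

    Assumed convention (not taken from the paper): x_cam = R @ x_world + t.
    """
    origin = -R.T @ t  # camera centre in world coordinates
    d_cam = np.linalg.solve(K, np.array([uv[0], uv[1], 1.0]))
    d_world = R.T @ d_cam
    return origin, d_world / np.linalg.norm(d_world)


def closest_point_to_rays(origins, dirs):
    """Point minimizing the summed squared distance to all rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, dirs):
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)


def convergence_cost(origins, dirs):
    """Return the nearest point and the summed squared ray distances.

    The cost is zero only when all back-projections meet in one point.
    """
    p = closest_point_to_rays(origins, dirs)
    cost = 0.0
    for o, d in zip(origins, dirs):
        r = (np.eye(3) - np.outer(d, d)) @ (p - o)
        cost += float(r @ r)
    return p, cost
```

With correct poses the rays for a keypoint intersect and the cost vanishes; with a perturbed pose the rays become skew and the cost grows, which is the signal such an optimization could minimize over the extrinsics.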