Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparse views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features, improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model, which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.
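To make the "camera as a bundle of rays" view concrete, the following is a minimal sketch of how a classical pinhole camera can be converted into one ray per image patch. This assumes a standard pinhole model with intrinsics `K` and a world-to-camera pose `(R, t)`; the function name and conventions here are illustrative and are not the paper's actual ray parameterization.

```python
import numpy as np

def camera_to_rays(K, R, t, image_size, patches):
    """Illustrative sketch: turn a pinhole camera (K, R, t) into a bundle
    of world-space rays, one through each image-patch center.
    Convention assumed here: x_cam = R @ x_world + t.
    """
    H, W = image_size
    ph, pw = patches
    # pixel coordinates of patch centers
    us = (np.arange(pw) + 0.5) * (W / pw)
    vs = (np.arange(ph) + 0.5) * (H / ph)
    uu, vv = np.meshgrid(us, vs)
    pix = np.stack([uu.ravel(), vv.ravel(), np.ones(ph * pw)], axis=0)  # 3 x N
    # back-project pixels to directions in the camera frame, rotate to world
    dirs_world = R.T @ (np.linalg.inv(K) @ pix)
    dirs_world /= np.linalg.norm(dirs_world, axis=0, keepdims=True)
    # all rays share the camera center as origin (in world coordinates)
    center = (-R.T @ t).reshape(3, 1)
    origins = np.repeat(center, ph * pw, axis=1)
    return origins.T, dirs_world.T  # each N x 3
```

The inverse mapping (recovering `(K, R, t)` from a predicted ray bundle) is what lets a per-patch prediction be converted back into a conventional camera, e.g. by solving for the common origin and a best-fit rotation.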