Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, which preserves the locality and shift-equivariance of the image processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation by re-rendering the input video, and thus to train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences that are traditionally challenging for optimization-based pose estimation techniques.
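The weighted least-squares SE(3) fit to a scene flow field can be illustrated with a minimal sketch. The abstract does not say which solver is used, so the code below assumes the standard closed-form weighted Procrustes (Kabsch) solution via SVD; the function name `weighted_rigid_fit`, the synthetic data, and the per-point weights are all hypothetical and chosen only for illustration. Given 3D points, their scene flow, and per-point confidence weights, it recovers the rotation `R` and translation `t` minimizing the weighted squared residual.

```python
import numpy as np

def weighted_rigid_fit(src, dst, weights):
    """Closed-form weighted least-squares rigid fit (weighted Kabsch).

    Finds R (3x3 rotation) and t (3,) minimizing
        sum_i w_i * || R @ src[i] + t - dst[i] ||^2
    This is an assumed, textbook solver, not necessarily the paper's.
    """
    w = weights / weights.sum()                 # normalize weights to sum to 1
    src_mean = (w[:, None] * src).sum(axis=0)   # weighted centroids
    dst_mean = (w[:, None] * dst).sum(axis=0)
    src_c = src - src_mean
    dst_c = dst - dst_mean
    # 3x3 weighted cross-covariance: sum_i w_i * src_c[i] dst_c[i]^T
    H = (w[:, None] * src_c).T @ dst_c
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Synthetic check: points moved by a known rigid transform.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))              # 3D points in the first frame
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.05])
flow = points @ R_true.T + t_true - points      # noise-free 3D scene flow
weights = rng.uniform(0.1, 1.0, size=100)       # hypothetical per-point confidences

R_est, t_est = weighted_rigid_fit(points, points + flow, weights)
```

The fitted transform describes how points move in the camera frame; for a static scene, the camera's own motion is the inverse of this transform. Because every step is a differentiable linear-algebra operation, a solver of this kind can in principle sit inside an end-to-end trained network, with the weights predicted per point so that unreliable flow vectors are down-weighted.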