Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes. Project page: https://stereo4d.github.io
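The core fusion step described above — combining camera poses, stereo depth, and temporal 2D tracks into world-consistent 3D motion trajectories — can be illustrated with a minimal sketch. This is not the paper's actual pipeline (which includes substantial filtering and refinement); the function name and array shapes here are assumptions for illustration only:

```python
import numpy as np

def lift_tracks_to_3d(tracks_2d, depths, intrinsics, cam_to_world):
    """Illustrative sketch: lift 2D pixel tracks into world-space 3D trajectories.

    tracks_2d:    (T, N, 2) pixel coordinates (x, y) per frame per track
    depths:       (T, H, W) per-frame depth maps (e.g. from stereo matching)
    intrinsics:   (3, 3) camera intrinsic matrix K
    cam_to_world: (T, 4, 4) camera-to-world pose per frame
    returns:      (T, N, 3) world-space points forming long-term trajectories
    """
    T, N, _ = tracks_2d.shape
    K_inv = np.linalg.inv(intrinsics)
    out = np.zeros((T, N, 3))
    for t in range(T):
        # Sample depth at each tracked pixel (nearest-neighbor for simplicity).
        x = tracks_2d[t, :, 0].round().astype(int)
        y = tracks_2d[t, :, 1].round().astype(int)
        z = depths[t, y, x]
        # Back-project pixels to 3D points in the camera frame.
        pix = np.stack(
            [tracks_2d[t, :, 0], tracks_2d[t, :, 1], np.ones(N)], axis=0
        )
        cam_pts = (K_inv @ pix) * z
        # Transform camera-frame points into a shared world frame,
        # making trajectories comparable across time.
        cam_h = np.concatenate([cam_pts, np.ones((1, N))], axis=0)
        world = cam_to_world[t] @ cam_h
        out[t] = world[:3].T
    return out
```

Because every frame's points land in one shared world frame, a static 3D point yields a constant trajectory and only genuinely moving scene content traces out motion — the property that makes such trajectories usable as supervision for 3D motion prediction.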