Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inherently challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend the existing 3D reconstruction formulation to 4D. Specifically, in addition to predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks that separate moving elements from the static background and thereby offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D