Gaussian Splatting (GS) has significantly improved scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, primarily rely on camera parameters provided by COLMAP, and even utilize sparse point clouds generated by COLMAP for initialization, which lack accuracy and are time-consuming to obtain. This sometimes results in poor dynamic scene representation, especially in scenes with large object movements or extreme camera conditions, e.g., small translations combined with large rotations. Some studies simultaneously optimize the estimation of camera parameters and the scene, supervised by additional information such as depth and optical flow obtained from off-the-shelf models. Using this unverified information as ground truth can reduce robustness and accuracy, a problem that frequently occurs for long monocular videos (e.g., those with hundreds of frames or more). We propose a novel approach that learns a high-fidelity 4D GS scene representation with self-calibration of camera parameters. It extracts 2D point features that robustly represent 3D structure, and uses them for the subsequent joint optimization of camera parameters and 3D structure toward overall 4D scene optimization. We demonstrate the accuracy and time efficiency of our method through extensive quantitative and qualitative experiments on several standard benchmarks. The results show significant improvements over state-of-the-art methods for 4D novel view synthesis. The source code will be released soon at https://github.com/fangli333/SC-4DGS.