We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), conditioned on one or more images of a general scene together with a set of camera poses and timestamps. To overcome the limited availability of 4D training data, we advocate joint training on 3D data (with camera pose), 4D data (pose+time), and video data (time but no pose), and propose a new architecture that enables this. We further advocate calibrating SfM-posed data with monocular metric depth estimators to enable metric-scale camera control. For model evaluation, we introduce new metrics that enrich and overcome shortcomings of current evaluation schemes, and demonstrate state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, while also adding the ability to handle temporal dynamics. 4DiM further enables improved panorama stitching, pose-conditioned video-to-video translation, and several other tasks. For an overview see https://4d-diffusion.github.io